The community member has a question about the default chunking strategy and the vectorization process. They are using a MarkdownNodeParser with a chunk size of 1024 tokens, and they want to use a multilingual embedder like sentence-transformers/paraphrase-multilingual-mpnet-base-v2, which has a maximum sequence length of 128 tokens.
In the comments, another community member suggests not using a model with only a 128-token window for embeddings, as that is very small. They also point out that the MarkdownNodeParser does not have a chunk size; it simply chunks according to markdown elements. The community member recommends chaining the MarkdownNodeParser with a normal text splitter, as sketched below.
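A minimal sketch of that chaining suggestion, assuming a recent llama_index release where MarkdownNodeParser and SentenceSplitter live under llama_index.core.node_parser; the chunk_size and chunk_overlap values mirror the documented defaults mentioned in the question:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter

# MarkdownNodeParser splits on markdown structure (headers, code blocks, ...);
# it has no chunk_size, so large sections pass through as large nodes.
markdown_parser = MarkdownNodeParser()

# A second, size-aware splitter caps node length so nodes fit the embedder.
text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

documents = SimpleDirectoryReader("./docs").load_data()

# Transformations run in order: markdown-aware split first, then size-based split.
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[markdown_parser, text_splitter],
)
```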
Another community member asks for a good multilingual embedder for the German language, and a third community member suggests trying the https://huggingface.co/jinaai/jina-embeddings-v2-base-de model.
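As an illustration only (not shown in the thread), the suggested Jina model could be wired in as the embedding model roughly like this, assuming the llama-index-embeddings-huggingface integration is installed; trust_remote_code=True is needed because the Jina v2 models ship custom modeling code:

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# jina-embeddings-v2-base-de supports long inputs (up to 8192 tokens),
# so whole 1024-token nodes can be embedded without truncation.
Settings.embed_model = HuggingFaceEmbedding(
    model_name="jinaai/jina-embeddings-v2-base-de",
    trust_remote_code=True,
)
```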
General question: according to the documentation, the default chunking strategy is automatically enabled with chunk_size=1024 and chunk_overlap=20. If I parse with node_parser = MarkdownNodeParser() and transformations = [node_parser], does each node contain 1024 tokens? Is this assumption correct? If yes, the next step is vectorization. I want to leverage a multilingual embedder like sentence-transformers/paraphrase-multilingual-mpnet-base-v2, which I think has a maximum sequence length of 128 tokens. The vectorization runs fine, but does this mean that for each node containing 1024 tokens, only 128 tokens are captured in the vector?
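To check the truncation concern directly, a quick sketch using the sentence-transformers library (assuming it is installed) prints the model's maximum sequence length; any tokens beyond that limit are silently cut off before the embedding is computed:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

# Prints 128: the model only attends to the first 128 tokens of each input,
# so a 1024-token node is truncated and the remainder never reaches the vector.
print(model.max_seq_length)
```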