Embeddings

At a glance

The community member has a question about the default chunking strategy and the vectorization process. They are using a MarkdownNodeParser with an assumed chunk size of 1024 tokens and want to use a multilingual embedder such as sentence-transformers/paraphrase-multilingual-mpnet-base-v2, which has a maximum sequence length of 128 tokens.

In the comments, another community member advises against using a model like that for embeddings, as 128 tokens is very small. They also note that the MarkdownNodeParser does not have a chunk size; it simply chunks according to markdown elements. The community member recommends chaining the MarkdownNodeParser with a normal text splitter.

Another community member asks for a good multilingual embedder for the German language, and a third community member suggests trying the https://huggingface.co/jinaai/jina-embeddings-v2-base-de model.

General question: according to the documentation, the default chunking strategy is automatically enabled with chunk_size=1024 and chunk_overlap=20. If I parse with node_parser = MarkdownNodeParser() and transformations = [node_parser], does each node contain 1024 tokens? Is this assumption correct? If yes, the next step is vectorization. I want to leverage a multilingual embedder like sentence-transformers/paraphrase-multilingual-mpnet-base-v2, which I think has a maximum length of 128 tokens. The vectorization runs fine, but does this mean that each node containing 1024 tokens is captured as a vector of only its first 128 tokens?
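For context, the 128-token limit is easy to confirm; a minimal sketch, assuming the sentence-transformers package is installed:

```python
# Minimal check of the embedder's input window: sentence-transformers
# silently truncates anything longer than max_seq_length before encoding.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)
print(model.max_seq_length)  # 128 -> a 1024-token node is cut to its first 128 tokens
```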
I would not use a model like that for embeddings, unless your data makes sense at a 128 chunk size. 128 is tiny tiny

The markdown node parser has no chunk size; it just chunks according to markdown elements. You might want to chain that with a normal text splitter
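A sketch of that chaining using llama_index's IngestionPipeline; the import paths follow recent llama_index.core releases, so adjust them to your installed version:

```python
# Chain the markdown-aware parser with a token-capped splitter so the
# final chunks fit the embedder's 128-token window.
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),  # split on markdown structure (headings, sections) first
        SentenceSplitter(chunk_size=128, chunk_overlap=20),  # then enforce a size cap
    ]
)

nodes = pipeline.run(documents=[Document(text="# Title\n\nSome markdown body text...")])
```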
What's a good one for the German language?
Try the https://huggingface.co/jinaai/jina-embeddings-v2-base-de model
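If you go with that model, a hypothetical wiring into llama_index via HuggingFaceEmbedding (requires the llama-index-embeddings-huggingface package) could look like:

```python
# Sketch: set the suggested German model as the global embed model.
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="jinaai/jina-embeddings-v2-base-de",
    trust_remote_code=True,  # the jina v2 models ship custom modeling code
)
```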