Embeddings

General question: according to the documentation, the default chunking strategy is enabled automatically with chunk_size=1024 and overlap=20. If I parse with
node_parser = MarkdownNodeParser()
transformations = [node_parser]
does each node then contain 1024 tokens? Is that assumption correct? If yes, the next step is vectorization. I want to leverage a multilingual embedder like sentence-transformers/paraphrase-multilingual-mpnet-base-v2, which I think has a maximum input length of 128 tokens. The vectorization runs fine, but does this mean that for EACH node of 1024 tokens, only 128 tokens are captured in the vector?
I would not use a model like that for embeddings unless your data makes sense at a 128-token chunk size. 128 is tiny tiny

The markdown node parser has no chunk size; it just splits nodes according to markdown structure. You might want to chain it with a normal text splitter, as in the sketch below
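A minimal sketch of that chaining, assuming a recent llama-index where the parsers live under llama_index.core. The chunk_size=100 here is an illustrative value chosen to stay under the embedder's 128-token cap, since SentenceSplitter counts tokens with its own tokenizer (tiktoken by default) rather than the embedder's:

```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter

# First split on markdown structure, then enforce a token budget
# small enough for the 128-token embedder.
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(),
        SentenceSplitter(chunk_size=100, chunk_overlap=20),
    ]
)

nodes = pipeline.run(documents=documents)  # `documents` loaded elsewhere
```

Running MarkdownNodeParser first keeps section boundaries intact; SentenceSplitter then only splits the sections that exceed the token budget.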
What's a good one, also for the German language?