Why is metadata length tied to chunk

Why is metadata length tied to chunk size? I would expect chunk size to apply only to the text chunk itself.
ValueError: Metadata length (407) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
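The error arises because, when metadata is included, the splitter reserves room for the serialized metadata inside every chunk's budget before splitting the text. A hypothetical sketch of that budget check (the real SentenceSplitter counts tokens and is structured differently, but the arithmetic is the same idea):

```python
# Hypothetical sketch of the budget check behind the ValueError above.
# Because metadata is prepended to each chunk for the embedding/LLM text,
# its length is reserved up front, shrinking the room left for the text.
def effective_chunk_size(chunk_size: int, metadata_str: str) -> int:
    metadata_len = len(metadata_str)  # the real splitter counts tokens, not chars
    if metadata_len > chunk_size:
        raise ValueError(
            f"Metadata length ({metadata_len}) is longer than chunk size "
            f"({chunk_size}). Consider increasing the chunk size or "
            "decreasing the size of your metadata to avoid this."
        )
    return chunk_size - metadata_len

# Text is then split into pieces of at most `budget` units,
# so metadata + text together still fit within chunk_size.
budget = effective_chunk_size(128, "file_name: report.pdf")
```

So metadata that is longer than the chunk size leaves no room for text at all, which is exactly the ValueError in the question.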
Because metadata is shown to the LLM and embedding models by default.
@Logan M well, if that's the case, why doesn't setting include_metadata to False turn off that behavior?
SentenceSplitter(chunk_overlap=0, include_metadata=False, chunk_size=128)
eh, good question, it probably should
I'm also curious to know why metadata inclusion is the default behavior for creating an embedding. I can understand adding a few key items of metadata to an embedding, but not all of it; that would make the embedding very noisy, especially if you have a lot of it.
Mostly because the metadata typically contains useful info needed for querying (e.g. a filename). It was a design decision at one point, and it's kind of stuck like that for now, I think.

Following that link above, you can configure this pretty granularly
OK, now I can see where you can exclude the embed metadata_keys as part of Document instantiation. Yes, very granular indeed. So I think I'll try creating some Documents now. So what does the boolean flag include_metadata do in the SentenceSplitter class?
So this line actually worked:
nodes = splitter.get_nodes_from_documents(docs, show_progress=True)
But I had to use the excluded_embed_metadata_keys param when creating the Documents.
This line works as well: nodes = pipeline.run(documents=docs, show_progress=True). Again, only because I excluded the metadata from the embeddings.
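To illustrate what the exclusion lists are assumed to do, here is a minimal stand-in class (NOT the real llama_index Document, whose API is richer): excluded keys are filtered out when the content is serialized for one consumer but not the other.

```python
# Minimal stand-in (not the real llama_index Document) showing how
# excluded_embed_metadata_keys / excluded_llm_metadata_keys are assumed
# to filter metadata when serializing content for each consumer.
class Doc:
    def __init__(self, text, metadata=None,
                 excluded_embed_metadata_keys=None,
                 excluded_llm_metadata_keys=None):
        self.text = text
        self.metadata = metadata or {}
        self.excluded_embed_metadata_keys = excluded_embed_metadata_keys or []
        self.excluded_llm_metadata_keys = excluded_llm_metadata_keys or []

    def get_content(self, mode="embed"):
        excluded = (self.excluded_embed_metadata_keys if mode == "embed"
                    else self.excluded_llm_metadata_keys)
        meta = "\n".join(f"{k}: {v}" for k, v in self.metadata.items()
                         if k not in excluded)
        return f"{meta}\n\n{self.text}" if meta else self.text

doc = Doc(
    "The quick brown fox.",
    metadata={"file_name": "fox.txt", "summary": "A very long summary..."},
    excluded_embed_metadata_keys=["summary"],  # keep the embedding text lean
)
print(doc.get_content(mode="embed"))  # only file_name survives
print(doc.get_content(mode="llm"))    # the LLM still sees everything
```

Excluding bulky keys from the embed text is what shrank the metadata length below the chunk size in the calls above.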
I believe it sets whether nodes should inherit the metadata from documents or not. I think there is a small bug, because if that's set to False, it shouldn't be considering the metadata length when chunking, imo.
I looked at the source code and I think I see where the error is.
Hello @Chris S., put all the unwanted metadata keys in excluded_llm_metadata_keys. To see what the LLM will end up reading, you can do:
from llama_index.schema import MetadataMode
node.get_content(metadata_mode=MetadataMode.LLM)