
Updated 4 months ago

Why is metadata length tied to chunk size?

At a glance

The post discusses an issue where metadata length is tied to the chunk size, causing a ValueError when the metadata length exceeds the chunk size. Community members explain that this happens because metadata is shown to the LLM and embedding models by default. They explore ways to configure this behavior, such as the include_metadata flag on the SentenceSplitter class and excluding specific metadata keys from the embeddings. The discussion suggests that including all metadata in the embeddings by default may not be the optimal design choice, since it can make the embeddings noisy, especially when there is a lot of metadata. As workarounds, the community members suggest passing parameters such as excluded_llm_metadata_keys and excluded_embed_metadata_keys when creating documents.

Useful resources
Why is metadata length tied to chunk size? I would expect chunk size to apply only to the text chunk itself.
ValueError: Metadata length (407) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
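The error arises because the splitter reserves room for the metadata inside each chunk: the budget left for the actual text is the chunk size minus the metadata length. Below is a simplified sketch of that check in plain Python (not the actual LlamaIndex source; the real library counts tokens, and the function name here is hypothetical):

```python
def effective_chunk_size(chunk_size: int, metadata_len: int) -> int:
    """Toy illustration of why metadata length is tied to chunk size.

    The splitter subtracts the metadata length from the chunk budget,
    because the metadata is prepended to every chunk before it is sent
    to the embedding model or the LLM.
    """
    if metadata_len > chunk_size:
        raise ValueError(
            f"Metadata length ({metadata_len}) is longer than chunk size "
            f"({chunk_size}). Consider increasing the chunk size or "
            "decreasing the size of your metadata to avoid this."
        )
    return chunk_size - metadata_len

print(effective_chunk_size(128, 50))   # → 78 (room left for the text itself)
```

With metadata of length 407 and a chunk size of 128, the budget goes negative, which is exactly the ValueError quoted above.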
12 comments
because metadata is shown to the LLM and embedding models by default
@Logan M well if that's the case, why doesn't setting include_metadata to False turn off that behavior?
SentenceSplitter(chunk_overlap=0, include_metadata=False, chunk_size=128)
eh, good question, it probably should
Also curious to know why metadata inclusion is the default behavior when creating an embedding. I can understand adding a few key items of metadata to an embedding, but not all of it; that would make the embedding very noisy, especially if you have a lot of it.
Mostly because the metadata typically contains useful info needed for querying (i.e like a filename). It was a design decision at one point and it's kind of stuck like that for now I think

Following that link above, you can configure this pretty granularly
OK, now I can see where you can exclude the embed metadata_keys as part of Document instantiation. Yes, very granular indeed. So I think I'll try creating some Documents now. So what does the boolean flag include_metadata do in the SentenceSplitter class?
So this line actually worked:
nodes = splitter.get_nodes_from_documents(docs, show_progress=True)
But I had to use the excluded_embed_metadata_keys param when creating the Documents.
This line works as well: nodes = pipeline.run(documents=docs, show_progress=True). Again, only because I excluded the metadata from the embeddings.
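The effect of excluding keys from the embed text can be pictured with a small stand-alone sketch. This is not LlamaIndex code; build_embed_text is a hypothetical helper that loosely mirrors how the library prepends the non-excluded metadata to a node's text before embedding:

```python
def build_embed_text(text, metadata, excluded_embed_metadata_keys=()):
    # Keep only metadata keys that were not excluded from the embeddings.
    kept = {k: v for k, v in metadata.items()
            if k not in excluded_embed_metadata_keys}
    # Prepend the surviving metadata as a "key: value" header, like the
    # string a node hands to the embedding model.
    header = "\n".join(f"{k}: {v}" for k, v in kept.items())
    return f"{header}\n\n{text}" if header else text

doc_meta = {"file_name": "report.pdf", "raw_html": "<html>...</html>"}
print(build_embed_text("Quarterly revenue grew 12%.", doc_meta,
                       excluded_embed_metadata_keys=("raw_html",)))
```

Excluding the bulky raw_html key here keeps the embed text short, which is the same trick that makes the splitter's metadata-length check pass.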
I believe it's for setting whether nodes should inherit the metadata from documents or not. I think there is a small bug, because if that's set to False, it shouldn't be considering the metadata length when chunking imo
I looked at the source code and I think I see where the error is.
Hello @Chris S., put all the unwanted metadata in excluded_llm_metadata_keys. To see what the LLM will end up reading, you can do:
from llama_index.schema import MetadataMode
node.get_content(metadata_mode=MetadataMode.LLM)
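To make the MetadataMode idea concrete, here is a toy re-implementation of the key filtering (not the library's code; the enum and get_content function below are simplified stand-ins for llama_index.schema.MetadataMode and the node method of the same name):

```python
from enum import Enum

class MetadataMode(Enum):
    # Toy stand-in for llama_index.schema.MetadataMode
    LLM = "llm"
    EMBED = "embed"
    ALL = "all"
    NONE = "none"

def get_content(text, metadata, excluded_llm_keys=(), excluded_embed_keys=(),
                metadata_mode=MetadataMode.ALL):
    """Return the text as a given consumer (LLM or embed model) would see it."""
    if metadata_mode is MetadataMode.NONE:
        return text
    if metadata_mode is MetadataMode.LLM:
        excluded = set(excluded_llm_keys)
    elif metadata_mode is MetadataMode.EMBED:
        excluded = set(excluded_embed_keys)
    else:  # ALL: include every metadata key
        excluded = set()
    header = "\n".join(f"{k}: {v}" for k, v in metadata.items()
                       if k not in excluded)
    return f"{header}\n\n{text}" if header else text

meta = {"file_name": "report.pdf", "raw_html": "<html>...</html>"}
# The LLM view drops raw_html; the embed view here keeps everything.
print(get_content("Some chunk text.", meta,
                  excluded_llm_keys=("raw_html",),
                  metadata_mode=MetadataMode.LLM))
```

Printing the LLM view like this is a quick way to verify that the keys you excluded really are hidden from the model.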