At a glance
The community member is trying to implement a small-to-big retriever, but is encountering an error message related to the metadata length being longer than the chunk size. The community members discuss that the metadata is embedded by default and can impact the splitting process. They suggest setting excluded_embed_metadata_keys and excluded_llm_metadata_keys to ignore specific metadata. However, the community member still encounters issues with the metadata length, even after trying to exclude all metadata from the embedding. The community members suggest using a custom node postprocessor or subclassing the node parser to ignore metadata. They also note that the excluded_embed_metadata_keys attribute is not used in the get_nodes_from_documents method, and that include_metadata refers to nodes inheriting metadata from their parent documents.
There is something I don't understand:
The metadata is not embedded, so it shouldn't impact the splitting process. However, I'm trying to implement a small-to-big retriever, and with a small chunk size I get this error message:

"Metadata length (130) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this."

Can you explain why, and how to keep the metadata from affecting the splitting process?

Here is the piece of code I use :

from llama_index.core.node_parser import SentenceSplitter
from tqdm import tqdm

sub_chunk_sizes = [128, 256, 512]
sub_node_parsers = [
    SentenceSplitter.from_defaults(chunk_size=c, chunk_overlap=20)
    for c in sub_chunk_sizes
]

all_nodes = []
for base_node in tqdm(base_nodes):
    for parser in sub_node_parsers:
        sub_nodes = parser.get_nodes_from_documents([base_node])
        all_nodes.extend(sub_nodes)
24 comments
Metadata is embedded though. And it does impact the splitting process.

Set excluded_embed_metadata_keys and excluded_llm_metadata_keys to ignore specific metadata

The parsers use the longest type of metadata when splitting
Attachment
image.png
oh!! So by default it embeds the metadata?! That's a weird default behaviour, though. I'll check how to turn it off. Is there a way to include only some of the metadata in the embedding (like the document name, for instance)?
Thanks for the answer bro πŸ™‚
yea you can set which metadata is seen by the embeddings and which metadata is seen by the LLM

document.excluded_llm_metadata_keys = ["key1", ..]
document.excluded_embed_metadata_keys = ["key1", ..]
So you can control which data is used for embeddings/retrieval and which is used during response synthesis
that's awesome! Thanks a lot, it's very helpful 🙌
@Logan M, I tried your method to remove the metadata to embed from the documents.
However, when I try to embed my sub-nodes (I'm trying to build a small-to-big retriever), I still get a message saying:

Metadata length (108) is close to chunk size (128). Resulting chunks are less than 50 tokens. Consider increasing the chunk size or decreasing the size of your metadata to avoid this.

Any idea why and what can I do about it please? πŸ™Œ

I even tried to remove all the metadata from the embedding, cf.:

sub_nodes[0].excluded_embed_metadata_keys

Out[50]:

['reference',
'url',
'issu_jurisprudence',
'cour',
'type_chambre',
'chambre',
'date_execution',
'reference_jurisprudence',
'publie',
'origine',
'date_execution_date_format',
'url']


sub_nodes[0].metadata

Out[51]:

{'reference': 'Cour de cassation, civile, Chambre civile 2, 6 janvier 2022, 20-12.220, InΓ©dit',
'url': 'https://www.legifrance.gouv.fr/juri/id/JURITEXT000045009675',
'issu_jurisprudence': 'rejet',
'cour': 'cour de cassation',
'type_chambre': 'civile 2',
'chambre': 'civile',
'date_execution': '6 janvier 2022',
'reference_jurisprudence': '20-12.220',
'publie': 'InΓ©dit',
'origine': 'jurisprudence judiciaire',
'date_execution_date_format': '2022-01-06T00:00:00'}
I think you need to set both excluded_embed_metadata_keys and excluded_llm_metadata_keys
yep, I've set both :
I wanted to keep some metadata for my llm though. So :

sub_nodes[0].excluded_llm_metadata_keys

Out[52]:

['url', 'date_execution_date_format']
it seems like the only thing that works is removing the metadata from both the LLM and the embedding... However, I would like to use some of the metadata in my RAG.
Process the nodes, then set the exclude keys to what you actually want before inserting the nodes ?
@Logan M The only way around it I found was to set all the metadata keys in both excluded_llm_metadata_keys and excluded_embed_metadata_keys, and then index the nodes in my vector store (Qdrant).
Once they are embedded and inserted in my Qdrant collection, is there a way to add some of the metadata back for my LLM when I retrieve the relevant chunks?
You can write a custom node postprocessor

https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/root.html#custom-node-postprocessor

index.as_query_engine(node_postprocessors=[MyPostProcessor()])
@Logan M
This is the way I create the documents from my pandas dataframe. The goal is to create a small to big retriever. But I can't make the metadata being taken into account by the llm but not embedded... (setting up both excluded_embed_metadata_keys and excluded_llm_metadata_keys allowed me to index my documents but I have some issues with my VectorIndexAutoRetriever then)

Did I do something wrong that created the error :
"Error at chunk index 4: Metadata length (239) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this."

My processing code attached :
Let me check the knowledge base for you. One moment! :books:
Sorry @tatanfort, I am experiencing especially high traffic right now, please try again. :technologist:
@kapa.ai how to make IngestionPipeline with subnodes (i.e small to big) ?
I'm not entirely sure that's possible with the ingestion pipeline 🤔 but tbh it's a good problem to figure out
@Logan M any idea on this one? To be honest, I racked my brain about it, and it seems like if I don't set BOTH excluded_embed_metadata_keys AND excluded_llm_metadata_keys to all the keys, they are always taken into account in the embedding, and while creating my sub-nodes (for the small-to-big retriever) I keep getting the error:

"Error at chunk index 4: Metadata length (239) is longer than chunk size (128). Consider increasing the chunk size or decreasing the size of your metadata to avoid this." ...

I would just like to remove the metadata from the embedding but keep it with the retrieved chunks given to the LLM :/
One alternative is just to subclass the node parser to ignore metadata (it's just one function to override)

That or a PR to add an ignore metadata mode
@Logan M I dug into the code, and it seems like the attribute excluded_embed_metadata_keys is actually never used in the get_nodes_from_documents method; there is just a parameter include_metadata, True by default. Is that normal?
Attachment
Capture_decran_2024-01-03_a_17.28.22.png
that is normal. include_metadata refers to nodes inheriting metadata from their parent documents
Pls upvote if it helps