If I'm not mistaken, metadata extraction

If I'm not mistaken, metadata extraction occurs AFTER node creation, so it does not take the token limit into account. Shouldn't metadata extraction be integrated within the text splitters instead?
When a document is split into nodes, the metadata is taken into account.
In https://github.com/jerryjliu/llama_index/blob/main/llama_index/node_parser/simple.py there is a call to get_nodes_from_document for each document, which asks the text splitter to create nodes while accounting for the metadata known so far (at this stage the metadata is either a custom file_metadata or was created by the text splitter/reader -- for instance the PDF reader will create a page metadata entry):
Plain Text
nodes = get_nodes_from_document(
    document,
    self._text_splitter,
    self._include_metadata,
    ... )

Then, once all nodes have been generated, the metadata extractor kicks in:
Plain Text
self._metadata_extractor.process_nodes(all_nodes)


but maybe I'm wrong? 🥴
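For reference, here is that two-step flow wired up end to end -- a minimal sketch only, assuming the 0.7/0.8-era LlamaIndex API (SimpleNodeParser, MetadataExtractor, TitleExtractor); names and defaults may differ in other releases:
Python
from llama_index import Document
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import MetadataExtractor, TitleExtractor

parser = SimpleNodeParser.from_defaults(
    text_splitter=TokenTextSplitter(chunk_size=512, chunk_overlap=20),
    metadata_extractor=MetadataExtractor(extractors=[TitleExtractor()]),
)

# Step 1: the text splitter creates nodes, budgeting only for the metadata
# known so far (custom file_metadata or reader-provided fields like the page).
# Step 2: the MetadataExtractor then runs over the finished nodes, so whatever
# it adds is not counted against the chunk_size budget.
nodes = parser.get_nodes_from_documents([Document(text="...")])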
ohhh this is for metadata extraction
Yea, that's fair. I think on one hand, it's easier to extract metadata from smaller nodes. Plus, most of the metadata extraction stuff is mainly useful for embeddings (and OpenAI embeddings have an 8K token limit)
Like, I think most of it is toggled to be only used for embeddings too? I'd have to double check
Well, I don't think it will be a problem for generating embeddings, because most of the time people use a 512/1024 chunk size and the additional metadata won't break the 8k limit.

But it could be more problematic when querying the index with models like gpt-3.5-turbo that only have 4k ... with a top_k >= 4 and chunks > 1k tokens it could lead to a "token limit reached" error/warning more often ...

I'm not sure how this can be fixed, because you can't predict the token size of the additional metadata ... so far the most sensible thing to do would be to warn the user/dev that they have to take this into account ...
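A back-of-the-envelope check makes the concern concrete. The helper below is hypothetical (not part of LlamaIndex); it just counts tokens with tiktoken:
Python
import tiktoken

def estimate_prompt_tokens(chunk_texts, query, template_overhead=200):
    """Rough token count for a prompt built from top_k retrieved chunks
    (with their injected metadata already concatenated) plus the query and
    a flat allowance for the prompt template."""
    enc = tiktoken.get_encoding("cl100k_base")
    return (
        sum(len(enc.encode(t)) for t in chunk_texts)
        + len(enc.encode(query))
        + template_overhead
    )

# e.g. top_k=4 chunks of ~1024 tokens each, plus ~100 tokens of extracted
# metadata per chunk, is already ~4 * 1124 + 200 = 4696 tokens -- over the
# 4k context window of gpt-3.5-turbo before the answer is even generated.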
Right, but what I was getting at is I think most of the metadata extracted is configured to be only used with embeddings, not the LLM 🤔 I agree though, tricky problem if it does get used in the LLM

But I see by default it will get used for both, lame
https://github.com/jerryjliu/llama_index/blob/d3029e8ef22c53d858467e761b4d11eb0f7c9abc/llama_index/node_parser/extractors/metadata_extractors.py#L104
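One possible mitigation, sketched under the assumption of the 0.7/0.8-era node schema: after extraction, mark the extracted keys as excluded for the LLM so they are still embedded but never injected into the completion prompt (the key names below are the ones TitleExtractor and QuestionsAnsweredExtractor typically add; other extractors use other keys):
Python
# Keep extracted metadata for embeddings but out of the LLM prompt.
for node in nodes:
    node.excluded_llm_metadata_keys = [
        "document_title",                     # added by TitleExtractor
        "questions_this_excerpt_can_answer",  # added by QuestionsAnsweredExtractor
    ]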
Yeah, I was wondering about best practices for this. Do people concat metadata with text, or simply use it as filtering parameters (or secondary vectors) in their vector DB?
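For the "concat with text" option, this is roughly what node rendering does -- again a sketch assuming the 0.7/0.8-era schema, where get_content() decides which metadata keys get prepended depending on the destination:
Python
from llama_index.schema import TextNode, MetadataMode

node = TextNode(
    text="The transformer relies entirely on self-attention ...",
    metadata={"document_title": "Attention Is All You Need", "page_label": "3"},
    excluded_llm_metadata_keys=["page_label"],  # keep for embeddings/filters only
)

print(node.get_content(metadata_mode=MetadataMode.LLM))    # document_title + text
print(node.get_content(metadata_mode=MetadataMode.EMBED))  # all metadata + text
print(node.get_content(metadata_mode=MetadataMode.NONE))   # raw text only

The "filtering parameters" approach instead keeps metadata out of the text entirely and pushes it to the vector store as structured filters; the two patterns are complementary and can be combined.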