If I'm not mistaken, metadata extraction

If I'm not mistaken, metadata extraction occurs AFTER node creation, so it does not take the token limit into account. Shouldn't metadata extraction be integrated within the text splitters instead?
When a document is split into nodes, the metadata is taken into account.
In https://github.com/jerryjliu/llama_index/blob/main/llama_index/node_parser/simple.py there is a call to get_nodes_from_document for each document, which asks the text splitter to create nodes while accounting for the metadata known so far (at this stage the metadata is either a custom file_metadata or was created by the text splitter/reader -- for instance the PDF reader will create a page metadata entry):
Plain Text
nodes = get_nodes_from_document(
    document,
    self._text_splitter,
    self._include_metadata,
    ... )

Then, once all nodes have been generated, the metadata extractor kicks in:
Plain Text
self._metadata_extractor.process_nodes(all_nodes)


but maybe I'm wrong? 🥴
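For reference, here is that two-step flow wired up end to end -- a minimal sketch only, assuming the 0.7/0.8-era LlamaIndex API (SimpleNodeParser, MetadataExtractor, TitleExtractor); names and defaults may differ in other releases:
Python
from llama_index import Document
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index.node_parser.extractors import MetadataExtractor, TitleExtractor

parser = SimpleNodeParser.from_defaults(
    text_splitter=TokenTextSplitter(chunk_size=512, chunk_overlap=20),
    metadata_extractor=MetadataExtractor(extractors=[TitleExtractor()]),
)

# Step 1: the text splitter creates nodes, budgeting only for the metadata
# known so far (custom file_metadata or reader-provided fields like the page).
# Step 2: the MetadataExtractor then runs over the finished nodes, so whatever
# it adds is not counted against the chunk_size budget.
nodes = parser.get_nodes_from_documents([Document(text="...")])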
ohhh this is for metadata extraction
Yea, that's fair. I think on one hand, it's easier to extract metadata from smaller nodes. Plus, most of the metadata extraction stuff is mainly useful for embeddings (and OpenAI embeddings have an 8K token limit)
Like, I think most of it is toggled to be only used for embeddings too? I'd have to double check
Well, I don't think it will be a problem for generating embeddings, because most of the time people use a 512/1024 chunk size and the additional metadata won't break the 8k limit.

But it could be more problematic when querying the index with models like gpt-3.5-turbo that only have 4k ... with a top_k >= 4 and chunks > 1k tokens it could lead to a "token limit reached" error/warning more often ...

I'm not sure how this can be fixed, because you can't predict the token size of the additional metadata ... so far the most sensible thing to do would be to warn the user/dev that they have to take this into account ...
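A back-of-the-envelope check makes the concern concrete. The helper below is hypothetical (not part of LlamaIndex); it just counts tokens with tiktoken:
Python
import tiktoken

def estimate_prompt_tokens(chunk_texts, query, template_overhead=200):
    """Rough token count for a prompt built from top_k retrieved chunks
    (with their injected metadata already concatenated) plus the query and
    a flat allowance for the prompt template."""
    enc = tiktoken.get_encoding("cl100k_base")
    return (
        sum(len(enc.encode(t)) for t in chunk_texts)
        + len(enc.encode(query))
        + template_overhead
    )

# e.g. top_k=4 chunks of ~1024 tokens each, plus ~100 tokens of extracted
# metadata per chunk, is already ~4 * 1124 + 200 = 4696 tokens -- over the
# 4k context window of gpt-3.5-turbo before the answer is even generated.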
Right, but what I was getting at is I think most of the metadata extracted is configured to be only used with embeddings, not the LLM 🤔 I agree though, tricky problem if it does get used in the LLM

But I see by default it will get used for both, lame
https://github.com/jerryjliu/llama_index/blob/d3029e8ef22c53d858467e761b4d11eb0f7c9abc/llama_index/node_parser/extractors/metadata_extractors.py#L104
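One possible mitigation, sketched under the assumption of the 0.7/0.8-era node schema: after extraction, mark the extracted keys as excluded for the LLM so they are still embedded but never injected into the completion prompt (the key names below are the ones TitleExtractor and QuestionsAnsweredExtractor typically add; other extractors use other keys):
Python
# Keep extracted metadata for embeddings but out of the LLM prompt.
for node in nodes:
    node.excluded_llm_metadata_keys = [
        "document_title",                     # added by TitleExtractor
        "questions_this_excerpt_can_answer",  # added by QuestionsAnsweredExtractor
    ]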
Yeah, I was wondering about best practices for this. Do people concat metadata with text, or simply use it as filtering parameters (or secondary vectors) in their vector DB?
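For the "concat with text" option, this is roughly what node rendering does -- again a sketch assuming the 0.7/0.8-era schema, where get_content() decides which metadata keys get prepended depending on the destination:
Python
from llama_index.schema import TextNode, MetadataMode

node = TextNode(
    text="The transformer relies entirely on self-attention ...",
    metadata={"document_title": "Attention Is All You Need", "page_label": "3"},
    excluded_llm_metadata_keys=["page_label"],  # keep for embeddings/filters only
)

print(node.get_content(metadata_mode=MetadataMode.LLM))    # document_title + text
print(node.get_content(metadata_mode=MetadataMode.EMBED))  # all metadata + text
print(node.get_content(metadata_mode=MetadataMode.NONE))   # raw text only

The "filtering parameters" approach instead keeps metadata out of the text entirely and pushes it to the vector store as structured filters; the two patterns are complementary and can be combined.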