The community member is new to using LlamaIndex and has questions about how it handles metadata extraction. They want to know if LlamaIndex converts metadata (like title, summary, Q&A, entities) into text and appends it to the original content before embedding, or if it embeds the metadata separately. They also want to know how to query to retrieve information based on the stored metadata.
The comments explain that metadata is stored as text and that both the embedding model and the language model can access it (this is configurable). The metadata therefore influences retrieval, and metadata filters are also available. When asked about filtering documents by publication year from a natural-language query, the community member is told that this requires a custom LLM/function call. LlamaIndex's AutoRetriever attempts to automate this, but a custom solution tailored to their documents and language model will likely be more accurate and easier to debug.
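To make the "stored as text, configurable per model" point concrete, here is a minimal plain-Python sketch. It is not LlamaIndex's implementation: the `Node` class, the `render` method, and the `key: value` layout are hypothetical and only mimic the idea. In LlamaIndex itself, the corresponding settings are, as far as I know, `excluded_embed_metadata_keys` and `excluded_llm_metadata_keys` on documents and nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a document chunk with metadata."""
    text: str
    metadata: dict = field(default_factory=dict)
    # Keys hidden from the embedding model / from the LLM (assumed names).
    excluded_embed_keys: list = field(default_factory=list)
    excluded_llm_keys: list = field(default_factory=list)

    def render(self, mode: str) -> str:
        """Return the text seen by the embedding model ("embed") or the LLM ("llm")."""
        if mode == "embed":
            excluded = self.excluded_embed_keys
        elif mode == "llm":
            excluded = self.excluded_llm_keys
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Metadata is turned into "key: value" lines and placed before the content.
        lines = [f"{k}: {v}" for k, v in self.metadata.items() if k not in excluded]
        if not lines:
            return self.text
        return "\n".join(lines) + "\n\n" + self.text


node = Node(
    text="Transformers use attention.",
    metadata={"title": "Attention Basics", "year": 2025},
    excluded_embed_keys=["year"],  # the year is kept out of the embedded text
)
embed_text = node.render("embed")  # title + content, no year
llm_text = node.render("llm")      # title + year + content
```

In this sketch the metadata is not embedded separately. It becomes part of the single string that is embedded, which is why it can influence similarity search.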
When working with Metadata Extraction, I don’t understand how LlamaIndex uses the extracted metadata (how it is stored and how it is queried to retrieve information). Does LlamaIndex convert metadata (such as title, summary, question-answer pairs, entities) into text and append it to the original content before embedding, or does it embed the metadata separately?
Additionally, how can I query to retrieve the correct information based on the stored metadata? I would greatly appreciate any help.
Thank you for your answer. Suppose I have documents with a metadata field for the publication year. How can I accurately filter documents published in 2025 with a query like, "Answer the question based on documents published in 2025"? I am aware of the Metadata Filtering supported by databases, but I would need to define the filters beforehand in LlamaIndex. My question is: does LlamaIndex have any tool to automatically extract metadata-related information from a query and then apply filtering, or would I need to write a custom tool (e.g., define a function call) for this purpose?
Yeah, that'd be a custom LLM/function call to infer that.
There is an AutoRetriever in the framework that attempts to automate this for you, but imo you'll get better accuracy (and it'll be easier to debug) if you build it around the scope of your documents and LLM instead.
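A rough sketch of what such a custom step could look like, in plain Python. Everything here is an assumption for illustration: the function names, the dict-based document shape, and the `year` key are hypothetical, and a regex stands in for the LLM/function call that would infer the filter in a real setup.

```python
import re
from typing import Optional


def infer_year_filter(query: str) -> Optional[dict]:
    """Stand-in for the LLM/function call: pull a 4-digit year out of the query."""
    match = re.search(r"\b(19|20)\d{2}\b", query)
    if match is None:
        return None  # no year mentioned, so no filter is applied
    return {"key": "year", "value": int(match.group(0))}


def apply_filter(docs: list, flt: Optional[dict]) -> list:
    """Keep only documents whose metadata matches the inferred filter."""
    if flt is None:
        return list(docs)
    return [d for d in docs if d["metadata"].get(flt["key"]) == flt["value"]]


docs = [
    {"text": "Report A", "metadata": {"year": 2024}},
    {"text": "Report B", "metadata": {"year": 2025}},
    {"text": "Report C", "metadata": {"year": 2025}},
]

query = "Answer the question based on documents published in 2025"
year_filter = infer_year_filter(query)
matching = apply_filter(docs, year_filter)  # only the 2025 documents remain
```

In practice the inferred key/value pair would be translated into whatever filter object your vector store or retriever accepts, rather than filtering a Python list. Keeping the inference step as its own small function is what makes it easy to inspect and debug.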