I'm a bit confused about how metadata extractors work:
  1. Is the metadata just used for retrieval or is it sent to the LLM as well?
  2. If the former, do we explicitly have to tell the index query engine to consider the metadata?
I'm constructing my service_context using the metadata extractors like this:
Plain Text
# imports assumed for the llama_index 0.9.x APIs used in this thread
from llama_index import ServiceContext
from llama_index.extractors import (
    SummaryExtractor, QuestionsAnsweredExtractor, TitleExtractor, KeywordExtractor
)
from llama_index.logger import LlamaLogger
from llama_index.schema import MetadataMode

summary_extractor = SummaryExtractor(summaries=["prev", "self", "next"], llm=llm)
questions_answered_extractor = QuestionsAnsweredExtractor(
    questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
)
title_extractor = TitleExtractor(llm=llm, nodes=5)
keyword_extractor = KeywordExtractor(llm=llm, keywords=10)

transformations = [
    node_parser,
    summary_extractor,
    questions_answered_extractor,
    title_extractor,
    keyword_extractor,
]

llama_logger = LlamaLogger()
service_context = ServiceContext.from_defaults(
    callback_manager=callback_manager,
    llm=llm,
    embed_model=embedding_model,
    node_parser=node_parser,
    llama_logger=llama_logger,
    transformations=transformations,
)


Which successfully builds my nodes with metadata like:
Plain Text
{
   "page_label":"1",
   "file_name":"....pdf",
   "db_document_id":"...",
   "patient_id":"...",
   "conversation_id":"...",
   "section_summary":"The key topics and entities in this section are:\n\n1. Patient Information:\n- Name: ...",
   "questions_this_excerpt_can_answer":"1. What is the primary insurance provider for ...",
   "document_title":"Insurance Data for Susan Ardmore Underwood",
   "excerpt_keywords":"SS, Date of Birth, Phone, Address, Zip, City, Employer, ...",
   "_node_content":"...",
   "_node_type":"TextNode",
   "document_id":"...",
   "doc_id":"...",
   "ref_doc_id":"..."
}


But I'm not sure how this metadata gets used. The service_context above is included both in the construction of my index query engine and in my QueryEngineTool.
Check out this guide; it will explain quite a bit.

By default, both the embeddings and LLM are seeing your metadata
https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents.html#advanced-metadata-customization
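(To make the default concrete, here's a minimal sketch of the two metadata views, using the llama_index 0.9-era imports that appear elsewhere in this thread; the field values are made up:)
Plain Text
from llama_index.schema import Document, MetadataMode

doc = Document(
    text="example text",
    metadata={"file_name": "fake.txt", "category": "demo"},
    excluded_llm_metadata_keys=["file_name"],   # hidden from the LLM prompt
    excluded_embed_metadata_keys=["category"],  # hidden from the embedding text
)
print(doc.get_content(metadata_mode=MetadataMode.LLM))    # omits file_name
print(doc.get_content(metadata_mode=MetadataMode.EMBED))  # omits category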
Thanks @Logan M! Was super confused since my logs suggested they were being sent (which I want!) but this was misleading me.
(Attachment: image.png)
oh mendable πŸ˜…
@Logan M on a related note, what's the right way to use excluded_llm_metadata_keys?

Mine are still showing as being sent to the LLM
(Attachment: image.png)
That looks right to me 🤔 How are you using the documents? Just throwing them into from_documents() or something else?
I'm using fetch_and_read_document to read the documents and then build an index like this:

Plain Text
llama_index_docs = []
index = VectorStoreIndex.from_documents(
    [],
    storage_context=storage_context,
    service_context=service_context,
    show_progress=True,
)
for doc in conversation.documents:
    try:
        llama_index_doc = fetch_and_read_document(doc)
        logger.info(f"Adding doc {conversation.documents.index(doc)+1} of {len(conversation.documents)} to index")

        for d in llama_index_doc:
            d.metadata['patient_id'] = str(doc.patient_id)
            d.metadata['conversation_id'] = str(doc.conversation_id)
            d.excluded_llm_metadata_keys = ["page_label", "file_name", "db_document_id", "patient_id", "conversation_id"]
            logger.info(f"Inserting document {doc.id} into index")
            index.insert(d)
            llama_index_docs.append(d)
    except Exception:
        # except clause assumed here; the original paste truncated it
        logger.exception(f"Failed to index document {doc.id}")


I tried moving excluded_llm_metadata_keys from fetch_and_read_document to this section where I'm building the index, but this is also not working. Same result.
The only one that's excluded is the extractor that has metadata_mode=MetadataMode.EMBED up in my service_context above.
(Attachment: image.png)
Ohhhh, you are printing the payload of the retrieve event, not the actual input to the LLM
Sorry for not being clear... hopefully I'm doing this right. I thought by logging each of the CBEventTypes in the chat trace, I'd be able to see exactly what gets sent to the LLM. At least, that's how I interpreted the CBEventType.LLM event.
Plain Text
if event_type not in (CBEventType.EMBEDDING, CBEventType.AGENT_STEP):
    logger.info(f"\n\nEvent type {event_type}")
    if payload is not None:
        logger.info(f"\nHas the following payload:\n\n{json.dumps(payload, default=custom_serializer)}")


I'm seeing the should-be-excluded metadata throughout the chat trace:
Plain Text
**********
Trace: chat
    |_CBEventType.AGENT_STEP ->  6.944333 seconds
      |_CBEventType.LLM ->  2.012103 seconds
      |_CBEventType.FUNCTION_CALL ->  4.133827 seconds
        |_CBEventType.QUERY ->  4.13283 seconds
          |_CBEventType.LLM ->  2.221637 seconds
          |_CBEventType.SUB_QUESTION ->  1.215887 seconds
            |_CBEventType.QUERY ->  1.214849 seconds
              |_CBEventType.RETRIEVE ->  0.350428 seconds // [1] Includes it (attached)
                |_CBEventType.EMBEDDING ->  0.217844 seconds
              |_CBEventType.SYNTHESIZE ->  0.86362 seconds
                |_CBEventType.TEMPLATING ->  0.000141 seconds // [2] Includes it (attached)
                |_CBEventType.LLM ->  0.842278 seconds // [3] Includes it (attached)
          |_CBEventType.SYNTHESIZE ->  0.660717 seconds
            |_CBEventType.TEMPLATING ->  0.000109 seconds
            |_CBEventType.LLM ->  0.657789 seconds
      |_CBEventType.LLM ->  0.0 seconds
**********
(Attachments: 1.png, 2.png, 3.png, the payloads referenced as [1]-[3] above)
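(For reference, a sketch of logging only what is actually handed to the LLM, assuming the EventPayload keys from llama_index 0.9's callback schema:)
Plain Text
from llama_index.callbacks.schema import CBEventType, EventPayload

# log just the LLM input, instead of every event's payload
if event_type == CBEventType.LLM and payload is not None:
    messages = payload.get(EventPayload.MESSAGES) or payload.get(EventPayload.PROMPT)
    logger.info(f"LLM input: {messages}")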
Is there a better way to debug this?

I already include LI's debugging/logging utilities (I think) in my chat engine, but it doesn't print exactly what gets sent to the LLM:
Plain Text
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_handlers.append(llama_debug)
callback_manager = CallbackManager(callback_handlers)
...
llama_logger = LlamaLogger()
service_context = ServiceContext.from_defaults(
    callback_manager=callback_manager,
    llm=llm,
    embed_model=embedding_model,
    node_parser=node_parser,
    llama_logger=llama_logger,
    transformations=transformations,
)
Instead of relying on callbacks for this, try this:

Plain Text
import llama_index

llama_index.set_global_handler("simple")


This will print exact LLM inputs and outputs
WOW that's way easier πŸ™ Unfortunately, the metadata is still showing:
(Attachment: image.png)
ok, let me replicate this
ok, so in a minimal example, I can tell this works

Plain Text
>>> import llama_index
>>> llama_index.set_global_handler("simple")
>>> from llama_index import Document, VectorStoreIndex
>>> document = Document(text='test', metadata={'file_name': 'fake.txt'})
>>> document.excluded_llm_metadata_keys = ['file_name']
>>> index = VectorStoreIndex.from_documents([document])
>>> query_engine = index.as_query_engine()
>>> response = query_engine.query("What is the file name?")
** Messages: **
system: You are an expert Q&A system that is trusted around the world.
Always answer the query using the provided context information, and not prior knowledge.
Some rules to follow:
1. Never directly reference the given context in your answer.
2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.
user: Context information is below.
---------------------
test
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What is the file name?
Answer: 
**************************************************
** Response: **
assistant: The file name is "test".
**************************************************


>>> 


Note that the metadata does not get shown to the LLM
So the real question is -- what am I doing differently that you aren't? 😅
I'm not explicitly doing document = Document(text='test', metadata={'file_name': 'fake.txt'}) like you / the Docs do

My fetch_and_read_document function returns List[LlamaIndexDocument], which is a list of Document objects:
Plain Text
from llama_index.schema import Document as LlamaIndexDocument


When I call fetch_and_read_document and iterate over each Document, VS Code lets me access the excluded_llm_metadata_keys property, so I'd expect it to work. The only other major difference is that I'm creating the index first and then inserting nodes afterwards:
Plain Text
llama_index_docs = []
index = VectorStoreIndex.from_documents(
    [],
    storage_context=storage_context,
    service_context=service_context,
    show_progress=True,
)
for doc in conversation.documents:
    try:
        llama_index_doc = fetch_and_read_document(doc)
        logger.info(f"Adding doc {conversation.documents.index(doc)+1} of {len(conversation.documents)} to index")

        for d in llama_index_doc:
            d.metadata['patient_id'] = str(doc.patient_id)
            d.metadata['conversation_id'] = str(doc.conversation_id)
            d.excluded_llm_metadata_keys = ["page_label", "file_name", "db_document_id", "patient_id", "conversation_id"]
            logger.info(f"Inserting document {doc.id} into index")
            index.insert(d)
            llama_index_docs.append(d)
    except Exception:
        # except clause assumed here; the original paste truncated it
        logger.exception(f"Failed to index document {doc.id}")


Any recommendations?
ok, let me modify my example a bit. I just want to replicate the issue

One question while I go and do this: did you start your index from scratch before this? If you have nodes inserted from before making this change, the old nodes won't have the new settings.
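(A quick way to check that: a sketch that inspects what the index actually stored, assuming the default in-memory docstore; with a remote vector store the docstore may be empty:)
Plain Text
# spot-check the exclusion settings on the stored nodes
for node_id, node in index.docstore.docs.items():
    print(node_id, node.excluded_llm_metadata_keys)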
I'm starting from scratch every time
so, slightly expanded implementation that still works

Plain Text
>>> import llama_index
>>> llama_index.set_global_handler("simple")

>>> from llama_index import SimpleDirectoryReader, VectorStoreIndex
>>> document = SimpleDirectoryReader("./docs/examples/data/paul_graham").load_data()[0]
>>> document.metadata
{'file_path': 'docs/examples/data/paul_graham/paul_graham_essay.txt', 'file_name': 'paul_graham_essay.txt', 'file_type': 'text/plain', 'file_size': 75042, 'creation_date': '2023-10-04', 'last_modified_date': '2023-10-04', 'last_accessed_date': '2023-12-13'}
>>> document.excluded_llm_metadata_keys = list(document.metadata.keys())

>>> index = VectorStoreIndex.from_documents([document])
>>> query_engine = index.as_query_engine()
>>> response = query_engine.query("What is the file name?")


And the resulting print to the terminal shows zero metadata πŸ€”
oh wait, I need to use insert
one more check
still works lol
I'm still not 100% sure what the difference between my code and yours actually is. It feels equivalent? The only thing I can think of is maybe you have some old version of llama-index installed?
like if you run my code, does it work for you?
I know you are using the sub-question query engine, but all that does is run a normal query_engine.query() under the hood -- just like my sample code
Yes, your approach works in my python shell

Trying to incorporate it, but the metadata is still being included. Check this out: I updated my fetch_and_read_document and it successfully added metadata fields to the excluded_llm_metadata_keys property. I also deleted all of my nodes in my local DB, so there's only 6 documents and 6 nodes in total.

But they're still being included in my chat
(Attachments: image.png, image.png)
hmm, I have a feeling something about the transformations you have set up is not respecting the settings on the original input document
Try parsing the nodes outside of llama-index, setting the excluded keys, and then passing the nodes in directly

Plain Text
from llama_index.ingestion import IngestionPipeline

pipeline = IngestionPipeline(transformations=[...])
nodes = pipeline.run(documents=documents)
for node in nodes:
  node.excluded_llm_metadata_keys = [...]

index = VectorStoreIndex(nodes, service_context=service_context, storage_context=storage_context)
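(Before inserting, it's easy to sanity-check the pipeline output; the only assumption here is the MetadataMode import from llama_index.schema:)
Plain Text
from llama_index.schema import MetadataMode

# the LLM view of a node should now omit the excluded keys
print(nodes[0].get_content(metadata_mode=MetadataMode.LLM))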
Not sure what you mean. DM'd you some more details if you can take a look? πŸ™
Fixed it! Thanks @Logan M!!

The issue was that I had my metadata extraction / transformations logic in my service_context, and was simply using service_context when building my index. Moving this logic to where I parse and build my index, and then using the IngestionPipeline to build the nodes, did the trick!
Plain Text
transformations = [
    node_parser,
    summary_extractor,
    ...
]

pipeline = IngestionPipeline(transformations=transformations)

for doc in conversation.documents:
    try:
        llama_index_doc = fetch_and_read_document(doc)
        logger.info(f"Adding doc {conversation.documents.index(doc)+1} of {len(conversation.documents)} to index")

        nodes = pipeline.run(
            documents=llama_index_doc,
            in_place=True,
            show_progress=True,
        )

        for node in nodes:
            node.metadata['patient_id'] = str(doc.patient_id)
            node.metadata['conversation_id'] = str(doc.conversation_id)
            node.excluded_llm_metadata_keys = ["page_label", ...]

        index.insert_nodes(nodes)
        llama_index_docs.extend(nodes)  # extend with all nodes, not append the last loop variable
    except Exception:
        # except clause assumed here; the original paste truncated it
        logger.exception(f"Failed to index document {doc.id}")
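(Why this works, at least as the thread suggests: with the transformations living in the service_context, inserting a document re-ran them and produced fresh nodes that never carried the exclusions; running the IngestionPipeline first means excluded_llm_metadata_keys is set on the exact nodes that get inserted. A final end-to-end check, reusing the "simple" handler from earlier; the query string is just a placeholder:)
Plain Text
import llama_index

llama_index.set_global_handler("simple")  # prints the exact LLM input, as above

# the printed prompt should no longer contain the excluded metadata keys
response = index.as_query_engine().query("What is the patient's primary insurance provider?")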