In the query response, you can get the list of source_nodes. Is there a parameter for retrieving the file/document the source node came from?
If you set the extra_info of each document to contain the filename, that will also show up in the source nodes.

More details here:

https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_documents.html
I followed this doc. All of the node.extra_info values come back None even when there's source text.
Works for me

Plain Text
>>> from llama_index import Document, GPTVectorStoreIndex
>>> doc = Document("this is some text", extra_info={'test_key': 'test_val'})
>>> index = GPTVectorStoreIndex.from_documents([doc])
>>> response = index.as_query_engine().query('hello world')
>>> response.source_nodes[0].node.extra_info
{'test_key': 'test_val'}
>>> 
It also works if you set doc.extra_info directly
The instructions show creating a lambda function for filenames. It's not clear how you do that with the SimpleDirectoryReader.
I essentially did this:
Plain Text
from llama_index import SimpleDirectoryReader
filename_fn = lambda filename: {'file_name': filename}

# automatically sets the extra_info of each document according to filename_fn
documents = SimpleDirectoryReader('./data', file_metadata=filename_fn)
Almost!

Plain Text
>>> from llama_index import SimpleDirectoryReader
>>> filename_fn = lambda filename: {'file_name': filename}
>>> documents = SimpleDirectoryReader('./paul_graham', file_metadata=filename_fn).load_data()
>>> documents[0].extra_info
{'file_name': 'paul_graham/paul_graham_essay.txt'}
>>> 
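The file_metadata callable isn't limited to the bare filename; it can return any dict, and everything in it lands in each document's extra_info (and thus in the source nodes). A minimal sketch, assuming only the standard library — the field names here are illustrative, not part of the llama_index API:

```python
from pathlib import Path

def file_metadata(filename: str) -> dict:
    # Hypothetical richer metadata function: everything returned here
    # is attached to the document's extra_info.
    p = Path(filename)
    return {
        "file_name": p.name,               # just the basename
        "file_path": str(p),               # path as passed by the reader
        "file_type": p.suffix.lstrip("."), # extension without the dot
    }
```

Pass it as the file_metadata argument in place of the lambda above.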
Should that be available in the source nodes in a query response?
I did this to create the index:
Plain Text
# Read in Documents
filename_fn = lambda filename: {'file_name': filename}
documents = []
print("Reading documents.")
for file_path in file_dirs:
    documents.extend(SimpleDirectoryReader(
        input_dir=file_path,
        file_metadata=filename_fn,
        recursive=True).load_data()
    )
        
print("Building index.")
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

Then, in the query, I do this:
Plain Text
evaluator = ResponseEvaluator(service_context=service_context)
response = query_engine.query(query)
return {
    "query": query,
    "response": str(response),
    "source_documents": [x.node.extra_info for x in response.source_nodes],
    "source_text": self._source_text(response.source_nodes),
    "evaluation": evaluator.evaluate_source_nodes(response)
}


But source_documents always shows [None, None, ...]
It should be 🤔 or at least it is for me
I will double check my sanity here. This should work lol
Thanks. I thought I was doing everything the same but with the source_documents portion added in.
hmmm yea it works for me in a test script 😅 Not sure what the difference is here...
Plain Text
from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex

filename_fn = lambda filename: {'file_name': filename}
documents = SimpleDirectoryReader(
    input_dir="./paul_graham",
    file_metadata=filename_fn,
    recursive=True).load_data()

index = GPTVectorStoreIndex.from_documents(documents)

response = index.as_query_engine().query("what did the author do growing up?")

print(str(response))
print([x.node.extra_info for x in response.source_nodes])
Output

Plain Text
Growing up, the author wrote short stories, programmed on an IBM 1401, built a microcomputer with a Heathkit, wrote simple games and a word processor on a TRS-80, and studied philosophy in college.
[{'file_name': 'paul_graham/paul_graham_essay.txt'}, {'file_name': 'paul_graham/paul_graham_essay.txt'}]
Hmmm...I load the index after persisting it. Any chance that's an issue?
hmm, I will check, I'll add a save/load part to my test
added this before running the query, still works for me

Plain Text
index.storage_context.persist(persist_dir='./nodes_index')

from llama_index import StorageContext, load_index_from_storage
index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./nodes_index"))
Maybe try with a fresh venv? Not sure why it's not working on your end 😅

Plain Text
python -m venv venv
source venv/bin/activate
pip install llama-index
I have a query function that does this:
Plain Text
index_dir = os.path.join(self.indexes_dir, index_id)

# Load index from requested docs
storage_context = StorageContext.from_defaults(persist_dir=index_dir)
service_context = self.create_service_context(**kwargs)
index = load_index_from_storage(
    storage_context=storage_context,
    service_context=service_context,
)
query_engine = index.as_query_engine()
responses = [self._query(x, query_engine, service_context) for x in queries]

the _query function looks like this:
Plain Text
evaluator = ResponseEvaluator(service_context=service_context)
response = query_engine.query(query)
return {
    "query": query,
    "response": str(response),
    "source_documents": [x.node.extra_info for x in response.source_nodes],
    "source_text": self._source_text(response.source_nodes),
    "evaluation": evaluator.evaluate_source_nodes(response)
}

Do you see anything wrong?
Nah that looks right to me. And no extra_info I'm guessing?

Actually, we can confirm that the documents were ingested properly. If you run nodes = index.docstore.docs, you'll get a dict of every node in the index.

From there, you can verify that the nodes look correct
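A quick way to sweep that dict for nodes that lost their metadata — a sketch that only assumes docstore.docs maps node ids to objects with an extra_info attribute:

```python
def nodes_missing_metadata(docs: dict) -> list:
    """Return the ids of nodes whose extra_info is missing or None.

    `docs` is the mapping returned by index.docstore.docs.
    """
    return [
        node_id for node_id, node in docs.items()
        if getattr(node, "extra_info", None) is None
    ]
```

If ingestion kept the metadata, nodes_missing_metadata(index.docstore.docs) should come back empty.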
oh good. that was my next question
They're all showing as None. Is there a method to look at the files to see if the data is there but not being ingested properly (vs. it not being stored in the first place)?
Actually, I just looked in the docstore.json. It's all None there too.
When you call from_documents(), are you 100% sure each document has an extra_info field filled in?
Sounds like it might not be for some reason
I just did this for 1 doc. Let me show what it prints
Here is the output. It looks like extra_info disappears after from_documents()
bruh how is this possible 😅
why can't I replicate this...
Here is the code matching up to those print statements:
Plain Text
print("Printing documents...")
pprint(documents)
        
print("Building index.")
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

print("Printing Nodes")
pprint(index.docstore.docs)
I know I've asked before, but you are super sure you have a recent version of llama-index? pip show llama-index can check that
I will run exactly this as a sanity check lol
So, I'm running this inside a class. Any chance there's some global variable or something getting nerfed?
mmm I dont think so... what does your service_context look like again?
Why does it work for me lol

Plain Text
>>> from llama_index import Document, GPTVectorStoreIndex
>>> documents = [Document('text', extra_info={'test': 'val'})]
>>> documents[0]
Document(text='text', doc_id='03b4c6e9-2bd2-4687-8980-f388eeebd6d7', embedding=None, doc_hash='1d3f05b1647ad55d6c09b356fe5d1fe670be262d5c3ea0ccda070e365a94809b', extra_info={'test': 'val'})
>>> index = GPTVectorStoreIndex.from_documents(documents)
>>> print(index.docstore.docs)
{'faf195d4-1295-425b-acb9-4289dcbc1c33': Node(text='text', doc_id='faf195d4-1295-425b-acb9-4289dcbc1c33', embedding=None, doc_hash='1d3f05b1647ad55d6c09b356fe5d1fe670be262d5c3ea0ccda070e365a94809b', extra_info={'test': 'val'}, node_info={'start': 0, 'end': 4, '_node_type': <NodeType.TEXT: '1'>}, relationships={<DocumentRelationship.SOURCE: '1'>: '03b4c6e9-2bd2-4687-8980-f388eeebd6d7'})}
>>> 
node_parser = SimpleNodeParser(text_splitter=splitter, include_extra_info=False, ...
set that bad boy to True lol
I copy pasted some shit from somewhere
sanity restored !
I'll just delete it
Thank you for working through my stupidity. I don't even understand why that's a flag.
or where I copied someone setting it to false
No idea why that's a flag either haha glad we figured it out though!
I had started diving into the code that the lambda is called through and was like "IT'S JUST extra_info = str(filepath), WHY IS IT VANISHING!!!"
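For anyone hitting this thread later: the culprit was include_extra_info=False on SimpleNodeParser, which controls whether document metadata gets copied onto the parsed nodes. A toy stand-in (not the real parser) showing the effect of the flag:

```python
def parse_document(text: str, extra_info: dict, include_extra_info: bool = True):
    # Toy stand-in for a node parser: chunk the text, and copy the
    # document's metadata onto each chunk only when the flag is set.
    chunk_size = 512
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [
        {"text": c, "extra_info": extra_info if include_extra_info else None}
        for c in chunks
    ]
```

With include_extra_info=False, every node comes back with extra_info set to None — exactly the symptom above.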