@Logan M
I thought I could achieve this by customizing the Response Synthesizer, is my understanding correct? It seems like it could be done by processing the full text fetched from a node using the customized Response Synthesizer, is that right?
If you add the name of each source document to the Metadata of each document, you can access that from
response.source_nodes
For example
document.metadata["name"] = "name"
index = VectorStoreIndex.from_documents([document])
response = index.as_query_engine().query(query_str)
print(response.source_nodes[0].node.metadata["name"])
@Logan M Thank you for your reply! Is there a way to just retrieve the nodes of the queried document without generating a response? Because I want to retrieve the original document's text first and then generate a response.
You can use a custom node-postprocessor to modify the nodes before calling the LLM (i.e. fetch the original document text)
Or, you can do this to just retrieve the nodes
nodes = index.as_retriever().retrieve(query_str)
@Logan M Thank you. It seems like I can achieve what I want to do with .as_retriever().retrieve(query_str), but I would like to use the custom node-postprocessor you mentioned to do it more smartly. However, I haven't been able to find an example of how to create a custom node-postprocessor. Could you please guide me? Of course, I would be most happy if it's an example of fetching the original document text.
Retrieving the original document text is quite tricky, since once documents are inserted, they are broken into chunks and the original document is no longer stored in one piece
The only way to recover it is to use index.ref_doc_info to get a mapping of each input doc id to a list of the node ids it created. Then you can use index.docstore.get_document(node_id) for every node ID, and then maybe you have the original document... very complex haha
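A rough sketch of that recovery path, for anyone following along. It assumes `index.ref_doc_info` maps each input doc id to an object with a `node_ids` list and that nodes expose `get_content()`; both may differ between llama-index versions. `rebuild_documents` is a hypothetical helper name, not part of the library. Since chunks can overlap, the joined text is only an approximation of the original document.

```python
from typing import Any, Dict


def rebuild_documents(index: Any, separator: str = "\n") -> Dict[str, str]:
    """Approximate each source document by joining the text of its nodes."""
    rebuilt = {}
    for ref_doc_id, info in index.ref_doc_info.items():
        # Assumption: info.node_ids lists the chunks in insertion order.
        nodes = [index.docstore.get_document(node_id) for node_id in info.node_ids]
        # Overlapping chunks will repeat some text at the seams.
        rebuilt[ref_doc_id] = separator.join(node.get_content() for node in nodes)
    return rebuilt
```

Calling `rebuild_documents(index)` would return a dict keyed by the input doc id.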
@Logan M Thank you. It seems a bit challenging, but I'll try to follow and try each step. Also, thank you for the example of the custom node postprocessor. I'll try to write one that retrieves the original text. Thank you again.
Hmm, one quick hack is also inserting the original document text into the metadata.
Then, it will show up in the nodes metadata
You could do something like this for each document before creating the index
document.metadata = {"orig_text": document.text}
document.excluded_llm_metadata_keys = ["orig_text"]
document.excluded_embed_metadata_keys = ["orig_text"]
The first line sets the metadata. The other two lines ensure the embed model and LLM don't have the entire document used as input to them
Ah, I see. So, the document.metadata is carried over to each node when indexed, right? Indeed, if that's the case, it would be easy to retrieve from the metadata! Although the original text would be duplicated in every chunk of the document... It seems convenient.
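Because every chunk carries the same `orig_text`, several retrieved nodes can point back to one document. A minimal sketch of collapsing those duplicates, assuming each retrieved item exposes `.node.metadata` the way `response.source_nodes` does earlier in the thread; `unique_original_texts` is a hypothetical helper name.

```python
from typing import Any, Iterable, List


def unique_original_texts(
    nodes_with_score: Iterable[Any], key: str = "orig_text"
) -> List[str]:
    """Return each distinct original text once, in retrieval order."""
    seen = set()
    texts = []
    for item in nodes_with_score:
        text = item.node.metadata.get(key)
        # Skip nodes without the key and texts already collected.
        if text is not None and text not in seen:
            seen.add(text)
            texts.append(text)
    return texts
```

It could be called on the retriever output, for example `unique_original_texts(index.as_retriever().retrieve(query_str))`.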
@Logan M I tried your metadata solution, but I got the following error.
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/llm/src/llm/test_llamaindex.py", line 56, in <module>
    main()
  File "/Users/user/PycharmProjects/llm/src/llm/test_llamaindex.py", line 45, in main
    index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/indices/base.py", line 96, in from_documents
    nodes = service_context.node_parser.get_nodes_from_documents(
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/node_parser/simple.py", line 91, in get_nodes_from_documents
    nodes = get_nodes_from_document(
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/node_parser/node_utils.py", line 54, in get_nodes_from_document
    text_splits = get_text_splits_from_document(
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/node_parser/node_utils.py", line 34, in get_text_splits_from_document
    text_splits = text_splitter.split_text_with_overlaps(
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/langchain_helpers/text_splitter.py", line 161, in split_text_with_overlaps
    raise ValueError(
ValueError: Effective chunk size is non positive after considering metadata
@Logan M At first, I used SimpleDirectoryReader to load documents. Could that be the reason?
I checked
print(document.get_content(metadata_mode=MetadataMode.LLM))
and it prints the full text, so it looks like .excluded_llm_metadata_keys did not work.
I'm sorry for the misunderstanding, it seems that the orig_text is indeed excluded.
However, I'm still getting the "ValueError: Effective chunk size is non positive after considering metadata."
Ah shoot, it looks like the node parser is using all the metadata, instead of following the exclude rules
I can patch this at some point today
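To see why the error fires, here is an illustration of the arithmetic, not the library's actual code. It assumes the splitter subtracts the token count of the metadata string from the chunk size and rejects a result that is zero or negative; whitespace-separated words stand in for real tokens, and `effective_chunk_size` is a hypothetical function name.

```python
def effective_chunk_size(chunk_size: int, metadata_str: str) -> int:
    """Tokens left for document text once the metadata string is accounted for."""
    # Assumption: whitespace-separated words stand in for the real tokenizer.
    metadata_tokens = len(metadata_str.split())
    remaining = chunk_size - metadata_tokens
    if remaining <= 0:
        raise ValueError(
            "Effective chunk size is non positive after considering metadata"
        )
    return remaining
```

With a short metadata string there is plenty of room left, but once the whole document text sits in the metadata and is counted in full, the remainder drops to zero or below for any document longer than one chunk.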
@Logan M Thank you!
And, I also have some questions about the node_postprocessors. I feel like the node_postprocessor is not working because when I define the node_postprocessors as follows, the print inside the postprocess_nodes function is not output during the query. Am I using it incorrectly?
class MetadataPostprocessor:
    """Metadata Node postprocessor."""
    print("MetadataPostprocessor")

    def postprocess_nodes(
        self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle]
    ) -> List[NodeWithScore]:
        """Postprocess nodes."""
        print("postprocess_nodes")
        return nodes

retriever = index.as_retriever()
query_engine = RetrieverQueryEngine.from_args(
    retriever,
    response_mode='compact',
    service_context=service_context,
    node_postprocessors=[MetadataPostprocessor()],
)
response = query_engine.query(query)
What version of llama index do you have? pip show llama-index
@Logan M I am using version 0.7.0
Ah. Just checked and that's my fault lol, just merged a fix! It will be in the next release, or you can install from source to get it
@Logan M Thank you so much! I installed from source and it worked.
One issue I encountered was a TypeError: MetadataPostprocessor.postprocess_nodes() missing 1 required positional argument: 'query_bundle'. So, I removed 'query_bundle: Optional[QueryBundle]' from the arguments of postprocess_nodes. As a result, it seems to be working correctly.
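An alternative to removing the argument is to give `query_bundle` a default of `None`, so the method works whether or not the caller passes it. The sketch below does that and also swaps in the original document text from the `orig_text` metadata key used earlier in the thread. It is duck-typed rather than tied to a llama-index base class; `OrigTextPostprocessor` is a hypothetical name, and it assumes each item has `.node.metadata` and a writable `.node.text`, which may not hold in every version.

```python
from typing import Any, List, Optional


class OrigTextPostprocessor:
    """Keep one node per source document and swap in the original text."""

    def __init__(self, key: str = "orig_text") -> None:
        self.key = key

    def postprocess_nodes(
        self, nodes: List[Any], query_bundle: Optional[Any] = None
    ) -> List[Any]:
        seen = set()
        kept = []
        for item in nodes:
            orig = item.node.metadata.get(self.key)
            if orig is None:
                kept.append(item)  # nothing to swap in; pass through unchanged
                continue
            if orig in seen:
                continue  # another chunk of a document that was already kept
            seen.add(orig)
            # Assumption: the node's `text` attribute is writable.
            item.node.text = orig
            kept.append(item)
        return kept
```

It would be passed as `node_postprocessors=[OrigTextPostprocessor()]`. Note that a full document may be much longer than a chunk, so the swapped-in text can exceed what the LLM prompt has room for.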