I am currently trying to do a basic task

Hiroki Kawai

I am currently trying to do a basic task: creating a chatbot on LlamaIndex that indexes a group of texts (multiple text files) and responds based on that information. When indexing, I want to split the text into chunks of a certain size, but when responding, I want to quote a longer span of the original text (the entire contents of each text file). How can I do this?

23 comments

Hiroki Kawai

@Logan M
I thought I could achieve this by customizing the Response Synthesizer, is my understanding correct? It seems like it could be done by processing the full text fetched from a node using the customized Response Synthesizer, is that right?

Logan M

If you add the name of each source document to the metadata of each document, you can access it from response.source_nodes.

For example

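from llama_index import VectorStoreIndex  # import path as of llama-index 0.7.x

# Tag the document with a name before indexing
document.metadata["name"] = "name"
index = VectorStoreIndex.from_documents([document])
response = index.as_query_engine().query(query_str)
# Each source node carries the metadata of its parent document
print(response.source_nodes[0].node.metadata["name"])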

Hiroki Kawai

@Logan M Thank you for your reply! Is there a way to just retrieve the nodes of the queried document without generating a response? Because I want to retrieve the original document's text first and then generate a response.

Logan M

You can use a custom node postprocessor to modify the nodes before calling the LLM (i.e., fetch the original document text)

Or, you can do this to just retrieve the nodes

nodes = index.as_retriever().retrieve(query_str)

Hiroki Kawai

@Logan M Thank you. It seems like I can achieve what I want to do with .as_retriever().retrieve(query_str), but I would like to use the custom node-postprocessor you mentioned to do it more smartly. However, I haven't been able to find an example of how to create a custom node-postprocessor. Could you please guide me? Of course, I would be most happy if it's an example of fetching the original document text.

Logan M

Retrieving the original document text is quite tricky, since once documents are inserted, they are broken into chunks and the original document is no longer kept in one piece

The only way to recover it is to use index.ref_doc_info to get a mapping of each input doc id to a list of the node ids it created. Then you can use index.docstore.get_document(node_id) for every node ID, and then maybe you have the original document... very complex haha
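The recovery route described above could be sketched roughly like this. It is a minimal sketch using plain-Python stand-ins so that it runs without LlamaIndex installed: `ref_doc_to_node_ids` stands in for the doc-ID-to-node-IDs mapping that index.ref_doc_info exposes, and `get_node_text` stands in for calling index.docstore.get_document(node_id) and reading its text. Both names, and the sample data, are hypothetical. It also assumes the node IDs are listed in their original order and that chunks do not overlap; with chunk overlap, the joined text would repeat the overlapping spans.

```python
from typing import Callable, Dict, List


def rebuild_documents(
    ref_doc_to_node_ids: Dict[str, List[str]],
    get_node_text: Callable[[str], str],
    separator: str = "",
) -> Dict[str, str]:
    """Join the chunk texts of each source document back together."""
    rebuilt: Dict[str, str] = {}
    for doc_id, node_ids in ref_doc_to_node_ids.items():
        # Look up every chunk that came from this document, in listed order
        rebuilt[doc_id] = separator.join(get_node_text(n) for n in node_ids)
    return rebuilt


# Hypothetical stand-in data: two input files split into three chunks
node_texts = {"n1": "Hello ", "n2": "world.", "n3": "Second file."}
mapping = {"doc-a": ["n1", "n2"], "doc-b": ["n3"]}

documents = rebuild_documents(mapping, node_texts.__getitem__)
print(documents["doc-a"])
```

Whether the result matches the input file exactly depends on how the splitter treated whitespace and overlap, which is why the metadata route is usually less fragile.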

Logan M

Here's a link to an example I wrote for someone, as a node postprocessor

https://discord.com/channels/1059199217496772688/1124662099105292339/1124722929607901214

Hiroki Kawai

@Logan M Thank you. It seems a bit challenging, but I'll try to follow and try each step. Also, thank you for the example of the custom node postprocessor. I'll try to write one that retrieves the original text. Thank you again.

Logan M

Hmm, one quick hack is also to insert the original document text into the metadata.

Then, it will show up in each node's metadata

You could do something like this for each document before creating the index

document.metadata = {"orig_text": document.text}
document.excluded_llm_metadata_keys = ["orig_text"]
document.excluded_embed_metadata_keys = ["orig_text"]

The first line sets the metadata. The other two lines ensure the embed model and LLM don't have the entire document used as input to them 😅

Hiroki Kawai

Ah, I see. So, the document.metadata is carried over to each node when indexed, right? Indeed, if that's the case, it would be easy to retrieve from the metadata! The original text would be duplicated in every chunk of the document, though... Still, it seems convenient.
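Since every chunk carries a copy of its parent's full text, several retrieved chunks can point at the same document. A minimal sketch of collecting each original text only once is shown below. The two dataclasses are hypothetical stand-ins that only mimic the `.node.metadata` access pattern used on response.source_nodes earlier in this thread, so the snippet runs without LlamaIndex; the `orig_text` key matches the metadata hack above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class StandInNode:
    """Hypothetical stand-in for a chunk with inherited metadata."""
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class StandInNodeWithScore:
    """Hypothetical stand-in for a retrieved (node, score) pair."""
    node: StandInNode
    score: float = 0.0


def unique_original_texts(
    source_nodes: Iterable[StandInNodeWithScore], key: str = "orig_text"
) -> List[str]:
    """Return each distinct original text once, in order of first retrieval."""
    seen = set()
    texts: List[str] = []
    for item in source_nodes:
        text = item.node.metadata.get(key)
        if text is not None and text not in seen:
            seen.add(text)
            texts.append(text)
    return texts


# Two chunks from file A and one from file B
retrieved = [
    StandInNodeWithScore(StandInNode({"orig_text": "full text A"}), 0.9),
    StandInNodeWithScore(StandInNode({"orig_text": "full text B"}), 0.8),
    StandInNodeWithScore(StandInNode({"orig_text": "full text A"}), 0.7),
]
originals = unique_original_texts(retrieved)
```

The same loop could be applied to the list returned by index.as_retriever().retrieve(query_str), assuming those items expose `.node.metadata` in the same way.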

Hiroki Kawai

@Logan M I tried your metadata solution, but I got the following error.

Traceback (most recent call last):
  File "/Users/user/PycharmProjects/llm/src/llm/test_llamaindex.py", line 56, in <module>
    main()
  File "/Users/user/PycharmProjects/llm/src/llm/test_llamaindex.py", line 45, in main
    index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/indices/base.py", line 96, in from_documents
    nodes = service_context.node_parser.get_nodes_from_documents(
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/node_parser/simple.py", line 91, in get_nodes_from_documents
    nodes = get_nodes_from_document(
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/node_parser/node_utils.py", line 54, in get_nodes_from_document
    text_splits = get_text_splits_from_document(
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/node_parser/node_utils.py", line 34, in get_text_splits_from_document
    text_splits = text_splitter.split_text_with_overlaps(
  File "/Users/user/.pyenv/versions/miniforge3-4.10.1-5/envs/llm/lib/python3.10/site-packages/llama_index/langchain_helpers/text_splitter.py", line 161, in split_text_with_overlaps
    raise ValueError(
ValueError: Effective chunk size is non positive after considering metadata

Hiroki Kawai

@Logan M At first, I used SimpleDirectoryReader to load documents. Could that be the reason?

Hiroki Kawai

I checked with
print(document.get_content(metadata_mode=MetadataMode.LLM))
and it prints the full text, so it looks like excluded_llm_metadata_keys did not work.

Hiroki Kawai

I'm sorry for the misunderstanding, it seems that the orig_text is indeed excluded.

Hiroki Kawai

However, I'm still getting the "ValueError: Effective chunk size is non positive after considering metadata."

Logan M

Ah shoot, it looks like the node parser is using all the metadata, instead of following the exclude rules 😅
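To see why a large metadata value triggers this error, here is a rough sketch of the kind of check that the message suggests. It is an assumption about the splitter's logic, not its actual code: the idea is that the space taken by the metadata string is subtracted from the chunk size, and nothing is left once the metadata holds the whole document. The function name and the whitespace-based token count are hypothetical; a real splitter would use a proper tokenizer.

```python
from typing import Callable


def effective_chunk_size(
    chunk_size: int,
    metadata_str: str,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> int:
    """Chunk budget left for document text once the metadata is accounted for."""
    num_extra_tokens = count_tokens(metadata_str)
    effective = chunk_size - num_extra_tokens
    if effective <= 0:
        # Same message as in the traceback above
        raise ValueError(
            "Effective chunk size is non positive after considering metadata"
        )
    return effective


# A short metadata string leaves almost the whole budget for text
small = effective_chunk_size(1024, "name: report.txt")

# A metadata value holding an entire document uses up the budget
try:
    effective_chunk_size(1024, "orig_text: " + "word " * 5000)
    failed = False
except ValueError:
    failed = True
```

Under this reading, the exclude lists only affect what the LLM and embedding model see, so the splitter still counted `orig_text` until the patch.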

Logan M

I can patch this at some point today

Logan M

Made a PR here, should merge soon
https://github.com/jerryjliu/llama_index/pull/6744

Hiroki Kawai

@Logan M Thank you!
And, I also have some questions about the node_postprocessors. I feel like the node_postprocessor is not working because when I define the node_postprocessors as follows, the print inside the postprocess_nodes function is not output during the query. Am I using it incorrectly?

class MetadataPostprocessor:
    """Metadata Node postprocessor."""
    print("MetadataPostprocessor")

    def postprocess_nodes(
        self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle]
    ) -> List[NodeWithScore]:
        """Postprocess nodes."""
        print("postprocess_nodes")
        return nodes


retriever = index.as_retriever()
query_engine = RetrieverQueryEngine.from_args(
    retriever,
    response_mode='compact',
    service_context=service_context,
    node_postprocessors=[MetadataPostprocessor()],
)
response = query_engine.query(query)

Logan M

What version of llama index do you have? You can check with pip show llama-index

Hiroki Kawai

@Logan M I am using version 0.7.0

Logan M

Ah, just checked and that's my fault lol. Just merged a fix! It will be in the next release, or you can install from source to get it 🙏

Hiroki Kawai

@Logan M Thank you so much! I installed from source and it worked.
One issue I encountered was a TypeError: MetadataPostprocessor.postprocess_nodes() missing 1 required positional argument: 'query_bundle'. So, I removed 'query_bundle: Optional[QueryBundle]' from the arguments of postprocess_nodes. As a result, it seems to be working correctly.
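An alternative to removing the argument is to give it a default value, so the method works whether or not the caller passes a query bundle. This is a minimal sketch in plain Python; the type hints are loosened to `object` so it runs without LlamaIndex, and whether any given release passes `query_bundle` is an assumption to verify against the installed version.

```python
from typing import List, Optional


class MetadataPostprocessor:
    """Node postprocessor that tolerates both call styles."""

    def postprocess_nodes(
        self,
        nodes: List[object],
        query_bundle: Optional[object] = None,  # default avoids the TypeError
    ) -> List[object]:
        """Return the nodes unchanged; a real version would edit them here."""
        return nodes


postprocessor = MetadataPostprocessor()

# Caller that passes only the nodes (the case that raised the TypeError)
without_bundle = postprocessor.postprocess_nodes(["node-1", "node-2"])

# Caller that also passes a query bundle
with_bundle = postprocessor.postprocess_nodes(["node-1"], query_bundle="query")
```

Keeping the parameter with a default means the class should not need another change if a later release starts passing the query bundle.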
