Updated last year

What is the best document processing strategy?

@kapa.ai What is the best document processing strategy?
Please stand by @Vimwork1 as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
The best document processing strategy in the context of LlamaIndex involves understanding your data, identifying issues and corner cases, and using LLMs to extract contextual information relevant to the document. This helps in better retrieval and disambiguation of similar-looking passages, especially in long documents.

Here are some key steps:

  1. Understanding Your Data: Start with a discovery phase in which you explore your data and identify issues and corner cases as you interact with the system. This is part of the development pathway of an LLM application.
  2. Metadata Extraction: Use LLMs to extract contextual information relevant to the document. This helps the retrieval and language models disambiguate similar-looking passages. An example of this process can be found in a notebook in the LlamaIndex documentation.
  3. Customizing Metadata Format: Metadata is injected into the actual text of each document/node when it is sent to the LLM or embedding model. The format of this metadata is controlled by three attributes: Document.metadata_seperator, Document.metadata_template, and Document.text_template.
Remember, the best strategy may vary depending on the specific requirements of your application and the nature of your data.
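
As a rough illustration of the third step, the sketch below mimics in plain Python how a metadata separator, a metadata template, and a text template can combine into the string a model receives. The function name render_with_metadata and its default template strings are assumptions made for this example. It is not LlamaIndex's implementation, so check the library's own defaults before relying on them.

```python
def render_with_metadata(
    text: str,
    metadata: dict,
    metadata_template: str = "{key}: {value}",
    metadata_separator: str = "\n",
    text_template: str = "{metadata_str}\n\n{content}",
) -> str:
    """Combine metadata and text into one string (illustrative only)."""
    # Format each metadata entry, then join the entries with the separator.
    metadata_str = metadata_separator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
    )
    # Place the metadata block and the document text into the outer template.
    return text_template.format(metadata_str=metadata_str, content=text)


rendered = render_with_metadata(
    text="LlamaIndex connects LLMs to external data.",
    metadata={"filename": "intro.md", "category": "overview"},
    metadata_template="{key}=>{value}",
    metadata_separator=" | ",
    text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
)
print(rendered)
```

Printing the rendered string like this is a quick way to check what the embedding model or LLM would actually see for a document.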
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
@kapa.ai How to debug source documents and nodes text chosen for the response?
Let me check the knowledge base for you. One moment! :books:
----------------------
To debug the source documents and node text chosen for a response, inspect the source_nodes attribute of the response object returned by the query engine. It lists the nodes that were retrieved and used to build the answer. The response_mode parameter of the as_query_engine method controls how the answer is synthesized from those nodes; "tree_summarize" is one such mode.

Here's an example of how to do this:

Plain Text
import logging

# set logging to DEBUG for more detailed outputs
logging.basicConfig(level=logging.DEBUG)
query_engine = index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("What did the author do growing up?")


You can then display the response:

Plain Text
from IPython.display import Markdown, display
display(Markdown(f"<b>{response}</b>"))


To get the source text of the first node in the response, you can use:

Plain Text
print(response.source_nodes[0].node.get_content())


This will print the source text of the first node that contributed to the response.
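
To look at every source node rather than only the first, a small helper can summarize each one. The sketch below is an illustration, not part of LlamaIndex: describe_source_nodes and the stand-in objects are hypothetical, and the helper only assumes that each entry in source_nodes exposes score, node.metadata, and node.get_content().

```python
from types import SimpleNamespace


def describe_source_nodes(response, max_chars: int = 80) -> list:
    """Return one summary line per source node attached to a response."""
    lines = []
    for rank, source in enumerate(response.source_nodes, start=1):
        text = source.node.get_content()
        # Collapse whitespace and truncate so each node fits on one line.
        preview = " ".join(text.split())[:max_chars]
        score = "n/a" if source.score is None else f"{source.score:.3f}"
        lines.append(
            f"[{rank}] score={score} metadata={source.node.metadata} text={preview!r}"
        )
    return lines


# Stand-in objects so the sketch runs without an index; a real response
# returned by query_engine.query(...) would be passed in instead.
def fake_source(text, score, metadata):
    node = SimpleNamespace(metadata=metadata, get_content=lambda: text)
    return SimpleNamespace(node=node, score=score)


fake_response = SimpleNamespace(
    source_nodes=[
        fake_source("I wrote short stories\nand programmed.", 0.8312, {"filename": "essay.txt"}),
        fake_source("Later I studied painting.", None, {}),
    ]
)

summary = describe_source_nodes(fake_response)
for line in summary:
    print(line)
```

Seeing the score, metadata, and a text preview side by side makes it easier to spot when an irrelevant chunk was retrieved.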

To see how the retrieval method affects which nodes are chosen, you can also compare vector store query modes such as "svm", "linear_regression", and "logistic_regression", provided your vector store supports them. Here's an example:

Plain Text
query_modes = [
    "svm",
    "linear_regression",
    "logistic_regression",
]
for query_mode in query_modes:
    # vector_store_query_mode selects how the vector store ranks candidate nodes
    query_engine = index.as_query_engine(vector_store_query_mode=query_mode)
    response = query_engine.query("What did the author do growing up?")
    print(f"Query mode: {query_mode}")
    display(Markdown(f"<b>{response}</b>"))


This will print the response for each query mode, allowing you to compare the results and debug accordingly.

----------------------
@kapa.ai how to save metadata with source nodes?
----------------------
In LlamaIndex, you can save metadata with source nodes in a couple of ways:

  1. During Node Creation: When you create a node, you can pass a metadata dictionary. The metadata is stored on the node and can be used later for purposes such as retrieval or response generation. The example below creates two nodes with metadata and links them through relationships:
Plain Text
from llama_index.schema import TextNode, NodeRelationship, RelatedNodeInfo

# metadata is stored on each node alongside its text
node1 = TextNode(text="<text_chunk>", id_="<node_id>", metadata={"key": "val"})
node2 = TextNode(text="<text_chunk>", id_="<node_id>", metadata={"key": "val"})
# set relationships
node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=node2.node_id)
node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id=node1.node_id)
nodes = [node1, node2]


In the above example, the RelatedNodeInfo class can also store additional metadata if needed:

Plain Text
node2.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=node1.node_id, metadata={"key": "val"})


  2. During Document Creation: Any information set in the metadata dictionary of each document will show up in the metadata of each source node created from the document. Here's an example of how to do this:
Plain Text
from llama_index import Document

document = Document(
    text='text', 
    metadata={
        'filename': '<doc_file_name>', 
        'category': '<category>'
    }
)


In this example, the metadata dictionary includes a filename and category. This metadata will be associated with each source node created from the document.
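
The inheritance described above can be pictured with a plain-Python sketch. ExampleDocument, ExampleNode, and split_into_nodes are hypothetical stand-ins, not LlamaIndex classes, and the fixed-size split is only an assumption for illustration; the point is that every chunk carries its own copy of the document's metadata.

```python
from dataclasses import dataclass, field


@dataclass
class ExampleDocument:
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ExampleNode:
    text: str
    metadata: dict = field(default_factory=dict)


def split_into_nodes(document: ExampleDocument, chunk_size: int) -> list:
    """Cut the text into fixed-size chunks, copying the document metadata."""
    return [
        ExampleNode(
            text=document.text[start:start + chunk_size],
            metadata=dict(document.metadata),  # each node gets its own copy
        )
        for start in range(0, len(document.text), chunk_size)
    ]


document = ExampleDocument(
    text="abcdefghij",
    metadata={"filename": "notes.txt", "category": "demo"},
)
nodes = split_into_nodes(document, chunk_size=4)
for node in nodes:
    print(node.text, node.metadata)
```

Because each node holds a copy, changing one node's metadata later does not alter the document or its sibling nodes in this sketch.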

----------------------
@kapa.ai How to improve chances that links between source wiki documents will be used to retrieve the right nodes?
----------------------
One way to make links between source wiki documents available at retrieval time is to record them in the relationships attribute of the Document and Node objects in LlamaIndex. This attribute is a dictionary that maps a relationship type to other Documents/Nodes.

Here's an example of how to set relationships between nodes:

Plain Text
from llama_index.schema import TextNode, NodeRelationship, RelatedNodeInfo

node1 = TextNode(text="<text_chunk>", id_="<node_id>")
node2 = TextNode(text="<text_chunk>", id_="<node_id>")
# set relationships
node1.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=node2.node_id)
node2.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id=node1.node_id)
nodes = [node1, node2]


In this example, node2 is set to be the next node of node1, and node1 is set to be the previous node of node2. This kind of relationship setting can be used to represent links between wiki documents.

Additionally, you can store additional metadata in the RelatedNodeInfo class if needed:

Plain Text
node2.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=node1.node_id, metadata={"key": "val"})


In this example, node2 is set to be the child of node1, and additional metadata is stored in the relationship.

Setting up these relationships and metadata records the links on the nodes themselves. Whether they influence which nodes are retrieved depends on the retriever or node postprocessor in your query pipeline, so check that it actually reads them.
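
One way recorded links could be put to use is to expand the retrieved set with the nodes it links to before the answer is synthesized. The sketch below shows that idea in plain Python. The function expand_with_linked_nodes and the wiki_links table are hypothetical, not a LlamaIndex API; in practice the link table would presumably be built from each node's relationships or metadata.

```python
def expand_with_linked_nodes(retrieved_ids, links, max_extra=5):
    """Return retrieved ids followed by ids they link to, without duplicates."""
    expanded = list(retrieved_ids)
    seen = set(retrieved_ids)
    for node_id in retrieved_ids:
        for linked_id in links.get(node_id, []):
            if linked_id in seen:
                continue
            # Stop once the cap on additional nodes has been reached.
            if len(expanded) - len(retrieved_ids) >= max_extra:
                return expanded
            seen.add(linked_id)
            expanded.append(linked_id)
    return expanded


# Hypothetical link table: page id -> ids of pages it links to.
wiki_links = {
    "install": ["configure", "troubleshoot"],
    "configure": ["install", "deploy"],
    "deploy": [],
}

result = expand_with_linked_nodes(["install", "configure"], wiki_links)
print(result)
```

The max_extra cap keeps heavily linked pages from flooding the context window with loosely related text.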

----------------------
@kapa.ai how to print names of all properties of a document
----------------------
To print the names of all properties of a Document object in LlamaIndex, you can use Python's built-in dir() function or vars() function. Here's how you can do it:

Plain Text
from llama_index import Document

# Create a document
document = Document(text="This is a document")

# Get all attribute names
attribute_names = dir(document)

# Print all attribute names (avoid naming the loop variable "property",
# which would shadow the Python built-in)
for name in attribute_names:
    print(name)


This will print the names of all attributes of the Document object, including methods and double-underscore names. If you want to get only the instance variables, you can use the vars() function:

Plain Text
# Get instance variables
variables = vars(document)

# Print instance variables
for variable in variables:
    print(variable)


This will print the names of all instance variables of the Document object.

Please note that these functions give you the names of all attributes or instance variables, not just the ones you've set. If you only want the metadata you've set, check the metadata attribute of the Document object, which is a dictionary containing that metadata.
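
Since dir() also returns methods and private names, a small filter can narrow the list to public data attributes. The sketch below uses a hypothetical ExampleDocument class so that it runs without LlamaIndex; public_attribute_names is an illustrative helper, and it assumes that reading each attribute with getattr() has no side effects.

```python
def public_attribute_names(obj) -> list:
    """List attribute names that are neither private nor callable."""
    return [
        name
        for name in dir(obj)
        if not name.startswith("_") and not callable(getattr(obj, name))
    ]


class ExampleDocument:
    """Hypothetical stand-in so the example runs without LlamaIndex."""

    def __init__(self, text, metadata):
        self.text = text
        self.metadata = metadata

    def get_content(self):
        return self.text


doc = ExampleDocument(text="This is a document", metadata={"category": "demo"})
names = public_attribute_names(doc)
print(names)
print(sorted(vars(doc)))
```

On this stand-in both calls list the same two names; on a real Document object the filtered dir() output may also include properties defined on the class, which vars() does not report.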

----------------------