## Question regarding node postprocessing and window

Question regarding node postprocessing and window:


1) Can the node parser "window" function be performed on nodes, or only on documents? I have run the operation on nodes, but the "window" only includes the text from the current node. (Would this require a custom parser?)
2) When running consecutive postprocessing functions, can the 'window' text be considered by rerankers, rather than the original 'text'?


I.e., I would like to process as follows:

1) docsplitter = CustomJSONNodeParser (which yields one node per segment, with the text / start / end / speaker)
2) WindowNodeParser = include x "text" from surrounding nodes

Then at retrieval time:

1) Retrieve the top-k 10 nodes
2) Treat the node text as the "window" metadata
3) GPT rerank based on the "window" metadata

Example of how the nodes are restructured now using SentenceWindowNodeParser:

Plain Text

from llama_index.core.node_parser import SentenceWindowNodeParser

# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="text",
)

base_nodes = node_parser.get_nodes_from_documents(md_nodes)
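At query time, the usual companion to this parser is a postprocessor that swaps each node's text for its "window" metadata before reranking or synthesis (LlamaIndex ships MetadataReplacementPostProcessor for this). A stand-alone sketch of that logic, using a hypothetical Node stand-in rather than the library type:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a LlamaIndex TextNode
    text: str
    metadata: dict = field(default_factory=dict)


def replace_text_with_window(nodes, key="window"):
    # Swap each node's text for its window metadata, falling back
    # to the original text when no window was recorded.
    for node in nodes:
        node.text = node.metadata.get(key, node.text)
    return nodes


nodes = [Node("sentence B.", {"window": "sentence A. sentence B. sentence C."})]
replace_text_with_window(nodes)
print(nodes[0].text)  # -> sentence A. sentence B. sentence C.
```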
13 comments
You can definitely rerank based on whatever text you want if you implement a custom node postprocessor 🙂 (it's very straightforward)
https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/root.html#id2
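As a sketch of what such a postprocessor could look like, the toy example below reranks on the "window" metadata instead of node.text. The Node class and the word-overlap score are hypothetical stand-ins for illustration; in LlamaIndex you would subclass BaseNodePostprocessor and put this logic in `_postprocess_nodes`:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a LlamaIndex TextNode
    text: str
    metadata: dict = field(default_factory=dict)


def overlap_score(query: str, text: str) -> int:
    # Toy relevance score: number of shared lowercase words
    return len(set(query.lower().split()) & set(text.lower().split()))


class WindowReranker:
    """Rerank on the 'window' metadata rather than the node text."""

    def __init__(self, top_n: int = 2):
        self.top_n = top_n

    def postprocess_nodes(self, nodes, query: str):
        # Score each node on its window (falling back to text)
        # and keep the top_n highest-scoring nodes.
        scored = sorted(
            nodes,
            key=lambda n: overlap_score(query, n.metadata.get("window", n.text)),
            reverse=True,
        )
        return scored[: self.top_n]


nodes = [
    Node("alpha", {"window": "alpha beta gamma"}),
    Node("delta", {"window": "delta epsilon"}),
    Node("beta", {"window": "beta gamma delta"}),
]
top = WindowReranker(top_n=1).postprocess_nodes(nodes, "beta gamma")
print(top[0].text)  # -> alpha
```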
I ended up implementing a custom window processor for the JSON nodes:

Plain Text
def create_window_for_nodes_simplified(md_nodes, x):
    md_nodes_window = []  # New list of nodes with a populated window

    for i, node in enumerate(md_nodes):
        # Calculate the range of indices for the surrounding nodes
        start_idx = max(0, i - x)
        end_idx = min(len(md_nodes), i + x + 1)

        # Extract the text from the surrounding nodes
        window_texts = [md_nodes[j].text for j in range(start_idx, end_idx)]

        # Concatenate the surrounding texts into a single window
        window_text = ' '.join(window_texts)

        # Note: this modifies the node in place (no copy is made);
        # use copy.deepcopy(node) first if the originals must stay untouched
        node.metadata['window'] = window_text

        # Add the updated node to the new list
        md_nodes_window.append(node)

    return md_nodes_window

# Example usage:
# md_nodes is a list of TextNode objects; x is the number of surrounding nodes to include
x = 1
md_nodes_window = create_window_for_nodes_simplified(md_nodes, x)

# Debug: print the 'window' for each node in the modified list for verification
for node in md_nodes_window:
    print(f"ID: {node.id_}, Window: {node.metadata['window']}")
Should probably use the "relationship" ID in case the nodes get out of order, but... it seems to at least solve the first step for now.
If there's a much better way to do this lmk, and really appreciate all of the help. It's a lot to wrap my tiny head around.
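One way to make the window robust to reordering, as hinted above, is to follow explicit prev/next links instead of list positions (LlamaIndex stores analogous links in node.relationships). A stand-alone sketch with hypothetical `prev_id`/`next_id` fields on a stand-in Node class:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Hypothetical stand-in: explicit links instead of list order
    id_: str
    text: str
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def build_window(node, nodes_by_id, x=1):
    # Walk up to x steps backwards along the prev links
    before, cur = [], node
    for _ in range(x):
        if cur.prev_id is None:
            break
        cur = nodes_by_id[cur.prev_id]
        before.insert(0, cur.text)
    # Walk up to x steps forwards along the next links
    after, cur = [], node
    for _ in range(x):
        if cur.next_id is None:
            break
        cur = nodes_by_id[cur.next_id]
        after.append(cur.text)
    return " ".join(before + [node.text] + after)


a = Node("a", "one", next_id="b")
b = Node("b", "two", prev_id="a", next_id="c")
c = Node("c", "three", prev_id="b")
nodes_by_id = {n.id_: n for n in (a, b, c)}
print(build_window(b, nodes_by_id))  # -> one two three
```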
Sorry @Logan M You are so helpful, I just have one more question that would prevent me from scratching my head for another 20 minutes.

I am using docstore and instantiating my vector index with

Plain Text
vector_index = VectorStoreIndex(md_nodes_window, storage_context=storage_context)


in my retriever I am referencing
Plain Text
    retriever = VectorIndexRetriever(
        index=vector_index,
        similarity_top_k=vector_top_k,
    )

If I update the base nodes (say to increase the window)

When I update my storage_context alone, it does not seem to take effect until I reinstantiate vector_index.

For testing, how can I update the nodes / docstore / storage context without updating vector_index?
Hmm, you are modifying the metadata of nodes right, not the actual node.text contents? And you are using the base vector store or some other integration?
Correct, I'm adding a new metadata field 'window' and putting the concatenated text there; the original text is unaltered. It does seem to work now with the reranker, where it uses the 'window' as the node content.
Attachment: Z.png
However, it seems like I have to rebuild my vector index whenever I change md_nodes_window in order for the results to be reflected.
I re-run the following:

Plain Text
## Add to docstore

from llama_index.core.storage.docstore import SimpleDocumentStore

docstore = SimpleDocumentStore()
docstore.add_documents(md_nodes_window)

Plain Text
storage_context = StorageContext.from_defaults(docstore=docstore)


But this has no effect until I re-run

Plain Text
vector_index = VectorStoreIndex(md_nodes_window, storage_context=storage_context)
Right, instead, you should modify the nodes already in the docstore

index.docstore.docs returns an ID->node dictionary

You can use index.docstore.add_documents(nodes) to insert your modified nodes. This will overwrite the existing entries, assuming the node IDs are the same
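The overwrite-by-ID behaviour described above can be sketched with a toy docstore. This ToyDocstore is a hypothetical stand-in for SimpleDocumentStore's ID -> node mapping, not the library class; the point is that re-adding nodes with the same IDs replaces them in place, so no index rebuild is needed as long as the embedded node.text is unchanged:

```python
class ToyDocstore:
    # Hypothetical stand-in for SimpleDocumentStore's id -> node map
    def __init__(self):
        self.docs = {}

    def add_documents(self, nodes):
        for node in nodes:
            self.docs[node["id"]] = node  # same id -> overwrite in place


store = ToyDocstore()
store.add_documents([{"id": "n1", "metadata": {"window": "small window"}}])
store.add_documents([{"id": "n1", "metadata": {"window": "bigger window"}}])
print(len(store.docs))                          # -> 1
print(store.docs["n1"]["metadata"]["window"])   # -> bigger window
```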
Thank you so much...