But I have a bug where index.as_retriever is retrieving nodes that are not in the index.

For example, I created an index that has 97 nodes, but when I do top_k=100, it returns 100 nodes.

Could use another set of eyes 🙏😅

Plain Text
Query:  What's the X-ray details?
Creating tool: xray
Index xray_f30dc4df-d3f1-4926-9820-31d5867d0aa1 has 97 nodes
Tool xray has the description: A set of XRAY medical record for patient **************, Date of Birth: **************, that were exported from the user's electronic medical record system.
Total Nodes count: 97
Total Nodes: {...}
Filtered Nodes count: 60
BM25 Nodes count: 60
BM25 Node 1 of 60...

base_retriever Index ID: xray_f30dc4df-d3f1-4926-9820-31d5867d0aa1
Base Nodes count: 100 # ---> HERE'S THE ISSUE
Base Node 1 of 100...
Did you check the remaining 3 nodes? Where is it getting them from? 😄
when I saw that my retriever's accuracy was 0/15, I literally wanted to cry 😄
No idea. I've logged just about every line of code possible
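One quick way to find the strays is a set difference between the retrieved node ids and the docstore's ids. A minimal, self-contained sketch — the ids here are made up for illustration; in the real run they would come from `index.storage_context.docstore.docs` and `base_retriever.retrieve(...)`:

```python
# Illustrative stand-ins for the real id sets
docstore_ids = {f"node-{i}" for i in range(97)}    # the 97 nodes the index knows about
retrieved_ids = {f"node-{i}" for i in range(100)}  # the 100 nodes the retriever returned

# Any id that was retrieved but is absent from the docstore is a stray
strays = retrieved_ids - docstore_ids
print(sorted(strays))  # ['node-97', 'node-98', 'node-99']
```

Inspecting the metadata of those stray nodes should show which other index (or patient) they actually belong to.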
I see you are using PGVectorStore

I think you need to either

a) make a new table for each index in pgvector
b) or properly insert nodes so that the index store is keeping track of what index has what nodes (example below)

Plain Text
# Imports assumed for 0.9.x-era llama-index APIs
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores.utils import metadata_dict_to_node

rebuilt_nodes = [metadata_dict_to_node(vector.metadata, vector.text) for vector in topic_vectors]

# Create a new StorageContext for the rebuilt index
rebuilt_storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    fs=fs,
)

# store_nodes_override=True populates the docstore/index store too
rebuilt_index = VectorStoreIndex(
    nodes=rebuilt_nodes,  # was `nodes`, which is undefined at this point
    service_context=service_context,
    storage_context=rebuilt_storage_context,
    store_nodes_override=True,
)

rebuilt_index.storage_context.persist(persist_dir=persist_dir, fs=fs)
actually, that might add nodes twice to the vector store (assuming they are already in there) :PSadge:
ok, other option
actually not sure -- this workflow is a little weird 😅
what I had above is pretty close
Step 1 is reliable at rebuilding indices from an existing vector_store without re-inserting nodes twice. I use it in my production code today.

It's Step 3 that's bonkers to me. How does index.storage_context.docstore.docs result in 97 nodes, but index.as_retriever(**kwargs) fetches 100?
What I thought would be a hacky test has turned into this mess 🤪
because it's not retrieving from the docstore, it's retrieving from the vector store 🤔
Normally, with remote integrations like pgvector, the docstore and index store aren't used. All the data is in the vector store

In the case where you have a vector store that doesn't store text (i.e. the default), it uses the index store to keep track of node ids that belong to the index
here, the index store is being missed entirely
but it's pretty janky to incorporate with your current setup outside of the index -- essentially you need to rely on the base constructor VectorStoreIndex(nodes=nodes, ..., store_nodes_override=True) to populate all 3 stores
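The scoping problem can be modeled without llama-index at all: one shared vector store holding every node, plus an index-store mapping that records which ids belong to which logical index. A toy sketch (the names and counts are illustrative; the real classes behave analogously):

```python
# Toy model: one shared "vector store" holding all 100 nodes,
# plus an "index store" that scopes each logical index to its own ids.
vector_store = {f"node-{i}": f"text {i}" for i in range(100)}

index_store = {
    "xray_index": {f"node-{i}" for i in range(97)},  # this index should only see 97
}

def retrieve(index_id, top_k, scoped=True):
    """Return up to top_k node ids, optionally scoped via the index store."""
    candidates = vector_store.keys()
    if scoped:
        candidates = [nid for nid in candidates if nid in index_store[index_id]]
    return list(candidates)[:top_k]

print(len(retrieve("xray_index", 100, scoped=False)))  # 100 -- the observed bug
print(len(retrieve("xray_index", 100, scoped=True)))   # 97  -- what index-store scoping fixes
```

When the index store is skipped, the retriever's candidate set is the whole table, which is exactly the 97-vs-100 mismatch above.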
Is my proposed setup possible? To have 1 main index per patient, and then N-many topic-based sub-indices that all share the same vector_store?
Or is there a way to use a retriever that filters on the docstore?
hmmm, there is a way to filter, but I have a feeling it will be slow for pgvector to be filtering
Plain Text
from llama_index.indices.vector_store.retrievers import (
    VectorIndexRetriever,
)

retriever = VectorIndexRetriever(
    index=index,
    node_ids=list(docstore.docs.keys()),
    similarity_top_k=5,
    filters=filters,
    ...
)
It still retrieves 100 nodes instead of 97
Plain Text
base_retriever = VectorIndexRetriever(
    index=index,
    node_ids=list(index.storage_context.docstore.docs.keys()),
    similarity_top_k=top_k,  # set to 100 for testing
    filters=filters,
    # ...
)
base_nodes = base_retriever.retrieve(query_str)
print(f"Base Nodes count: {len(base_nodes)}") # 100
then I'm really not sure what the issue is. Requires some intense debugging I think
This prompted another idea, which worked! But as you foreshadowed, it's un-usably slow: takes ~44 seconds (< 1s is normal) for each topic index
Plain Text
nodes = index.storage_context.docstore.docs
filtered_nodes = [
    node for node in nodes.values()
    if node.metadata.get('patient_id') == patient_id
    and node.metadata.get('node_type') == 'child'
]
print(f"Filtered Nodes count: {len(filtered_nodes)}")  # 60
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="node_id", value=str(node.node_id), operator=FilterOperator.EQ)
        for node in filtered_nodes
    ],
    condition=FilterCondition.OR,
)
kwargs = {"similarity_top_k": 100, "filters": filters}
base_retriever = index.as_retriever(**kwargs)  # retrieving with this returns 60 nodes!!
yeaaaa lol glad it kind of works though
it's because it's filtering a column of JSON blobs
which is... not efficient lol
Did this approach last night, which is why I tried to just create a whole separate topic-based index instead :/
Is this something you create on the fly for each query?
Was hoping not to
yea makes sense
Creating the topic-based sub-indices only takes a few seconds using the rebuilt_index approach
Was hoping to just use them directly once they're created
Didn't want to have to filter them again
what if you just used the in-memory vector store once you had the nodes for the rebuilt_index?
Or, I guess that only works (quickly) if the pg vector store is also returning the node embeddings
If you had the nodes+embeddings, it would be very fast to create an in-memory vector index
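With nodes and embeddings already in hand, in-memory top-k retrieval is just a cosine-similarity sort. A minimal plain-Python sketch (the embeddings are made up for illustration; real ones would come back from the vector store alongside the nodes):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_emb, node_embs, k):
    """node_embs: {node_id: embedding}. Returns the k most similar node ids."""
    ranked = sorted(node_embs, key=lambda nid: cosine(query_emb, node_embs[nid]), reverse=True)
    return ranked[:k]

# Toy embeddings for three nodes
embs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(top_k([1.0, 0.1], embs, 2))  # ['a', 'c']
```

For a few dozen pre-filtered nodes this runs in microseconds, which is why an in-memory index over the filtered set avoids the ~44-second pgvector JSON-filter path.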
That was the plan with Step 2

Rebuilding these 4 indices and saving them to a dict takes ~4 seconds
Plain Text
tools_dict = {
        "xray": ['x-ray', 'xray', 'xr', 'radiograph', 'cxr', 'kub', 'axr', 'dxr', 'film'],
        "ct-scan": [' ct ', 'ct_', 'computed tomography', 'cat scan', 'ct scan'],
        "mri": ['mri', 'magnetic resonance imaging', 'nmr imaging', 'nmri'],
        "all": [],
    }
...
# Store the tool_to_index dictionary in the patient_to_tools dictionary
patient_to_tools[patient_id] = tool_to_index
I can access each topic_index like this patient_to_tools[patient_id]["xray"], but I can't create a query engine out of it because doing so retrieves against all of the patient's nodes, not just the ones saved in memory for the topic_index
Ultimately, my goal was to have a query agent pick a query engine tool that best corresponds to the user query, and in my case, I'm currently getting bad answers for certain topics like xrays (mostly retrieval issues). This was to be my hack workaround to this problem
Ok, might have gotten it working as originally intended. Will share updated solution soon
Solution for posterity:
Plain Text
# Imports assumed for 0.9.x-era llama-index APIs
from llama_index import VectorStoreIndex
from llama_index.retrievers import BM25Retriever
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

async def index_to_query_engine(conversation_docs: List, index: VectorStoreIndex) -> BaseQueryEngine:
    top_k = 100
    patient_id = str(conversation_docs[0].patient_id)

    filters = MetadataFilters(
        filters=[
            ExactMatchFilter(key="patient_id", value=patient_id),
            ExactMatchFilter(key="node_type", value="child"),
        ]
    )
    kwargs = {"similarity_top_k": top_k, "filters": filters}

    # Pre-filter nodes since BM25Retriever doesn't support filters
    nodes = index.storage_context.docstore.docs  # NOTE: 97 nodes in the index
    filtered_nodes = [
        node for node in nodes.values()
        if node.metadata.get('patient_id') == patient_id
        and node.metadata.get('node_type') == 'child'
    ]  # NOTE: 60 nodes that match the filters

    bm25_retriever = BM25Retriever.from_defaults(nodes=filtered_nodes, similarity_top_k=top_k)
    bm25_nodes = bm25_retriever.retrieve(query_str)  # Returns 60 nodes (despite requesting 100), as expected

    # Creating a fresh in-memory index over the filtered nodes solves the issue!
    index = VectorStoreIndex(filtered_nodes, service_context=service_context)
    base_retriever = index.as_retriever(**kwargs)
    base_nodes = base_retriever.retrieve(query_str)  # Returns 60 nodes!!!
Nice! Clever solution