But I have a bug where index.as_retriever is retrieving nodes that are not in the index.

For example, I created an index that has 97 nodes, but when I do top_k=100, it returns 100 nodes.

Could use another set of eyes 🙏😅

Plain Text
Query:  What's the X-ray details?
Creating tool: xray
Index xray_f30dc4df-d3f1-4926-9820-31d5867d0aa1 has 97 nodes
Tool xray has the description: A set of XRAY medical record for patient **************, Date of Birth: **************, that were exported from the user's electronic medical record system.
Total Nodes count: 97
Total Nodes: {...}
Filtered Nodes count: 60
BM25 Nodes count: 60
BM25 Node 1 of 60...

base_retriever Index ID: xray_f30dc4df-d3f1-4926-9820-31d5867d0aa1
Base Nodes count: 100 # ---> HERE'S THE ISSUE
Base Node 1 of 100...
Did you check the remaining 3 nodes? Where is it getting them from? 😄
when I saw that my retriever's accuracy was 0/15, I literally wanted to cry 😄
No idea. I've logged just about every line of code possible
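One quick way to find the strays is a set difference between the retrieved node ids and the docstore's ids. A minimal, self-contained sketch — the ids here are made up for illustration; in the real run they would come from `index.storage_context.docstore.docs` and `base_retriever.retrieve(...)`:

```python
# Illustrative stand-ins for the real id sets
docstore_ids = {f"node-{i}" for i in range(97)}    # the 97 nodes the index knows about
retrieved_ids = {f"node-{i}" for i in range(100)}  # the 100 nodes the retriever returned

# Any id that was retrieved but is absent from the docstore is a stray
strays = retrieved_ids - docstore_ids
print(sorted(strays))  # ['node-97', 'node-98', 'node-99']
```

Inspecting the metadata of those stray nodes should show which other index (or patient) they actually belong to.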
I see you are using PGVectorStore

I think you need to either

a) make a new table for each index in pgvector
b) or properly insert nodes so that the index store is keeping track of what index has what nodes (example below)

Plain Text
# Imports assumed for 0.9.x-era llama-index APIs
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores.utils import metadata_dict_to_node

rebuilt_nodes = [metadata_dict_to_node(vector.metadata, vector.text) for vector in topic_vectors]

# Create a new StorageContext for the rebuilt index
rebuilt_storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    fs=fs,
)

# store_nodes_override=True populates the docstore/index store too
rebuilt_index = VectorStoreIndex(
    nodes=rebuilt_nodes,  # was `nodes`, which is undefined at this point
    service_context=service_context,
    storage_context=rebuilt_storage_context,
    store_nodes_override=True,
)

rebuilt_index.storage_context.persist(persist_dir=persist_dir, fs=fs)
actually, that might add nodes twice to the vector store (assuming they are already in there) :PSadge:
ok, other option
actually not sure -- this workflow is a little weird 😅
what I had above is pretty close
Step 1 is reliable at rebuilding indices from an existing vector_store without re-inserting nodes twice. I use it in my production code today.

It's Step 3 that's bonkers to me. How does index.storage_context.docstore.docs result in 97 nodes, but index.as_retriever(**kwargs) fetches 100?
What I thought would be a hacky test has turned into this mess 🤪
because it's not retrieving from the docstore, it's retrieving from the vector store 🤔
Normally, with remote integrations like pgvector, the docstore and index store aren't used. All the data is in the vector store

In the case where you have a vector store that doesn't store text (i.e. the default), it uses the index store to keep track of node ids that belong to the index
here, the index store is being missed entirely
but it's pretty janky to incorporate with your current setup outside of the index -- essentially you need to rely on the base constructor VectorStoreIndex(nodes=nodes, ..., store_nodes_override=True) to populate all 3 stores
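The scoping problem can be modeled without llama-index at all: one shared vector store holding every node, plus an index-store mapping that records which ids belong to which logical index. A toy sketch (the names and counts are illustrative; the real classes behave analogously):

```python
# Toy model: one shared "vector store" holding all 100 nodes,
# plus an "index store" that scopes each logical index to its own ids.
vector_store = {f"node-{i}": f"text {i}" for i in range(100)}

index_store = {
    "xray_index": {f"node-{i}" for i in range(97)},  # this index should only see 97
}

def retrieve(index_id, top_k, scoped=True):
    """Return up to top_k node ids, optionally scoped via the index store."""
    candidates = vector_store.keys()
    if scoped:
        candidates = [nid for nid in candidates if nid in index_store[index_id]]
    return list(candidates)[:top_k]

print(len(retrieve("xray_index", 100, scoped=False)))  # 100 -- the observed bug
print(len(retrieve("xray_index", 100, scoped=True)))   # 97  -- what index-store scoping fixes
```

When the index store is skipped, the retriever's candidate set is the whole table, which is exactly the 97-vs-100 mismatch above.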
Is my proposed setup possible? To have 1 main index per patient, and then N-many topic-based sub-indices that all share the same vector_store?
Or is there a way to use a retriever that filters on the docstore?
hmmm, there is a way to filter, but I have a feeling it will be slow for pgvector to be filtering
Plain Text
from llama_index.indices.vector_store.retrievers import (
    VectorIndexRetriever,
)

retriever = VectorIndexRetriever(
    index=index,
    node_ids=list(docstore.docs.keys()),
    similarity_top_k=5,
    filters=filters,
    ...
)
It still retrieves 100 nodes instead of 97
Plain Text
base_retriever = VectorIndexRetriever(
    index=index,
    node_ids=list(index.storage_context.docstore.docs.keys()),
    similarity_top_k=top_k,  # set to 100 for testing
    filters=filters,
    # ...
)
base_nodes = base_retriever.retrieve(query_str)
print(f"Base Nodes count: {len(base_nodes)}") # 100
then I'm really not sure what the issue is. Requires some intense debugging I think
This prompted another idea, which worked! But as you foreshadowed, it's un-usably slow: takes ~44 seconds (< 1s is normal) for each topic index
Plain Text
nodes = index.storage_context.docstore.docs
filtered_nodes = [
    node for node in nodes.values()
    if node.metadata.get('patient_id') == patient_id
    and node.metadata.get('node_type') == 'child'
]
print(f"Filtered Nodes count: {len(filtered_nodes)}")  # 60
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="node_id", value=str(node.node_id), operator=FilterOperator.EQ)
        for node in filtered_nodes
    ],
    condition=FilterCondition.OR,
)
kwargs = {"similarity_top_k": 100, "filters": filters}
base_retriever = index.as_retriever(**kwargs)  # retrieving with this returns 60 nodes!!
yeaaaa lol glad it kind of works though
it's because it's filtering a column of JSON blobs
which is... not efficient lol
Did this approach last night, which is why I tried to just create a whole separate topic-based index instead :/
Is this something you create on the fly for each query?
Was hoping not to
yea makes sense
Creating the topic-based sub-indices only takes a few seconds using the rebuilt_index approach
Was hoping to just use them directly once they're created
Didn't want to have to filter them again
what if you just used the in-memory vector store once you had the nodes for the rebuilt_index?
Or, I guess that only works (quickly) if the pg vector store is also returning the node embeddings
If you had the nodes+embeddings, it would be very fast to create an in-memory vector index
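With nodes and embeddings already in hand, in-memory top-k retrieval is just a cosine-similarity sort. A minimal plain-Python sketch (the embeddings are made up for illustration; real ones would come back from the vector store alongside the nodes):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_emb, node_embs, k):
    """node_embs: {node_id: embedding}. Returns the k most similar node ids."""
    ranked = sorted(node_embs, key=lambda nid: cosine(query_emb, node_embs[nid]), reverse=True)
    return ranked[:k]

# Toy embeddings for three nodes
embs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(top_k([1.0, 0.1], embs, 2))  # ['a', 'c']
```

For a few dozen pre-filtered nodes this runs in microseconds, which is why an in-memory index over the filtered set avoids the ~44-second pgvector JSON-filter path.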
That was the plan with Step 2

Rebuilding these 4 indices and saving them to a dict takes ~4 seconds
Plain Text
tools_dict = {
        "xray": ['x-ray', 'xray', 'xr', 'radiograph', 'cxr', 'kub', 'axr', 'dxr', 'film'],
        "ct-scan": [' ct ', 'ct_', 'computed tomography', 'cat scan', 'ct scan'],
        "mri": ['mri', 'magnetic resonance imaging', 'nmr imaging', 'nmri'],
        "all": [],
    }
...
# Store the tool_to_index dictionary in the patient_to_tools dictionary
patient_to_tools[patient_id] = tool_to_index
I can access each topic_index like this patient_to_tools[patient_id]["xray"], but I can't create a query engine out of it because doing so retrieves against all of the patient's nodes, not just the ones saved in memory for the topic_index
Ultimately, my goal was to have a query agent pick a query engine tool that best corresponds to the user query, and in my case, I'm currently getting bad answers for certain topics like xrays (mostly retrieval issues). This was to be my hack workaround to this problem
Ok, might have gotten it working as originally intended. Will share updated solution soon
Solution for posterity:
Plain Text
# Imports assumed for 0.9.x-era llama-index APIs
from llama_index import VectorStoreIndex
from llama_index.retrievers import BM25Retriever
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

async def index_to_query_engine(conversation_docs: List, index: VectorStoreIndex) -> BaseQueryEngine:
    top_k = 100
    patient_id = str(conversation_docs[0].patient_id)

    filters = MetadataFilters(
        filters=[
            ExactMatchFilter(key="patient_id", value=patient_id),
            ExactMatchFilter(key="node_type", value="child"),
        ]
    )
    kwargs = {"similarity_top_k": top_k, "filters": filters}

    # Pre-filter nodes since BM25Retriever doesn't support filters
    nodes = index.storage_context.docstore.docs  # NOTE: 97 nodes in the index
    filtered_nodes = [
        node for node in nodes.values()
        if node.metadata.get('patient_id') == patient_id
        and node.metadata.get('node_type') == 'child'
    ]  # NOTE: 60 nodes that match the filters

    bm25_retriever = BM25Retriever.from_defaults(nodes=filtered_nodes, similarity_top_k=top_k)
    bm25_nodes = bm25_retriever.retrieve(query_str)  # Returns 60 nodes (despite requesting 100), as expected

    # Creating a fresh in-memory index over the filtered nodes solves the issue!
    index = VectorStoreIndex(filtered_nodes, service_context=service_context)
    base_retriever = index.as_retriever(**kwargs)
    base_nodes = base_retriever.retrieve(query_str)  # Returns 60 nodes!!!
Nice! Clever solution