Did you check the remaining 3 nodes? Where is it getting them from?
when I saw that my retriever's accuracy was 0/15, I literally wanted to cry
No idea. I've logged just about every line of code possible
I see you are using PGVectorStore
I think you need to either
a) make a new table for each index in pgvector
b) properly insert nodes so that the index store keeps track of which index has which nodes (example below)
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores.utils import metadata_dict_to_node

rebuilt_nodes = [metadata_dict_to_node(vector.metadata, vector.text) for vector in topic_vectors]
# Create a new StorageContext for the rebuilt index
rebuilt_storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    fs=fs,
)
# store_nodes_override=True also writes the nodes into the docstore/index store
rebuilt_index = VectorStoreIndex(
    nodes=rebuilt_nodes,
    service_context=service_context,
    storage_context=rebuilt_storage_context,
    store_nodes_override=True,
)
rebuilt_index.storage_context.persist(persist_dir=persist_dir, fs=fs)
actually, that might add nodes twice to the vector store (assuming they are already in there) :PSadge:
actually not sure -- this workflow is a little weird
what I had above is pretty close
Step 1 is reliable at rebuilding indices from an existing vector_store without re-inserting nodes. I use it in my production code today.
It's Step 3 that's bonkers to me. How does index.storage_context.docstore.docs result in 97 nodes, but index.as_retriever(**kwargs) fetch 100?
What I thought would be a hacky test has turned into this mess
because it's not retrieving from the docstore, it's retrieving from the vector store
Normally, with remote integrations like pgvector, the docstore and index store aren't used. All the data is in the vector store
In the case where you have a vector store that doesn't store text (i.e. the default), it uses the index store to keep track of node ids that belong to the index
here, the index store is being missed entirely
but it's pretty janky to incorporate with your current setup outside of the index -- essentially you need to rely on the base constructor VectorStoreIndex(nodes=nodes, ..., store_nodes_override=True)
to populate all 3 stores
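e.g. a quick way to see the two paths (just a sketch, the query string is a placeholder):
# the two views come from different places (sketch)
docstore_nodes = index.storage_context.docstore.docs       # local docstore -- your 97
retriever = index.as_retriever(similarity_top_k=100)
vector_nodes = retriever.retrieve("placeholder query")     # hits pgvector directly -- your 100
# with store_nodes_override=True at construction time, the docstore + index store
# get the same nodes, so the two views line up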
Is my proposed setup possible? To have one main index per patient, and then N-many topic-based sub-indices that all share the same vector_store?
Or is there a way to use a retriever that filters on the docstore?
hmmm, there is a way to filter, but I have a feeling it will be slow for pgvector to be filtering
from llama_index.indices.vector_store.retrievers import (
VectorIndexRetriever,
)
retriever = VectorIndexRetriever(
    index=index,
    node_ids=list(docstore.docs.keys()),  # restrict candidates to nodes the docstore knows about
    similarity_top_k=5,
    filters=filters,
    ...
)
It still retrieves 100 nodes instead of 97
base_retriever = VectorIndexRetriever(
index=index,
node_ids=list(index.storage_context.docstore.docs.keys()),
similarity_top_k=top_k, # set to 100 for testing
filters=filters,
# ...
)
base_nodes = base_retriever.retrieve(query_str)
print(f"Base Nodes count: {len(base_nodes)}") # 100
then I'm really not sure what the issue is. Requires some intense debugging I think
This prompted another idea, which worked! But as you foreshadowed, it's unusably slow: takes ~44 seconds (< 1s is normal) for each topic index
nodes = index.storage_context.docstore.docs
filtered_nodes = [node for node in nodes.values()
                  if node.metadata.get('patient_id') == patient_id
                  and node.metadata.get('node_type') == 'child']
print(f"Filtered Nodes count: {len(filtered_nodes)}")  # 60

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="node_id", value=str(node.node_id), operator=FilterOperator.EQ)
        for node in filtered_nodes
    ],
    condition=FilterCondition.OR,
)
kwargs = {"similarity_top_k": 100, "filters": filters}
base_retriever = index.as_retriever(**kwargs)  # 60!!
yeaaaa lol glad it kind of works though
it's because it's filtering a column of JSON blobs
which is... not efficient lol
Did this approach last night, which is why I tried to just create a whole separate topic-based index instead :/
Is this something you create on the fly for each query?
Creating the topic-based sub-indices only takes a few seconds using the rebuilt_index approach
Was hoping to just use them directly once they're created
Didn't want to have to filter them again
what if you just used the in-memory vector store once you had the nodes for the rebuilt_index?
Or, I guess that only works (quickly) if the pg vector store is also returning the node embeddings
If you had the nodes+embeddings, it would be very fast to create an in-memory vector index
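something like this, assuming the nodes come back with embeddings attached (node.embedding set) so nothing gets re-embedded:
from llama_index import VectorStoreIndex

# sketch: throwaway in-memory index (default SimpleVectorStore) from nodes that
# already carry embeddings -- if node.embedding is None this re-embeds everything
topic_index = VectorStoreIndex(
    nodes=rebuilt_nodes,
    service_context=service_context,
)
topic_retriever = topic_index.as_retriever(similarity_top_k=5)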
That was the plan with Step 2
Rebuilding these 4 indices and saving them to a dict takes ~4 seconds
tools_dict = {
"xray": ['x-ray', 'xray', 'xr', 'radiograph', 'cxr', 'kub', 'axr', 'dxr', 'film'],
"ct-scan": [' ct ', 'ct_', 'computed tomography', 'cat scan', 'ct scan'],
"mri": ['mri', 'magnetic resonance imaging', 'nmr imaging', 'nmri'],
"all": [],
}
...
# Store the tool_to_index dictionary in the patient_to_tools dictionary
patient_to_tools[patient_id] = tool_to_index
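(the elided bit is just a loop over tools_dict -- roughly this, simplified, and the keyword matching shown here is illustrative only:)
# simplified sketch of the elided loop
tool_to_index = {}
for topic, keywords in tools_dict.items():
    topic_nodes = [
        node for node in rebuilt_nodes
        if not keywords or any(kw in node.text.lower() for kw in keywords)
    ]
    # each topic index shares the same pgvector-backed storage context
    tool_to_index[topic] = VectorStoreIndex(
        nodes=topic_nodes,
        service_context=service_context,
        storage_context=rebuilt_storage_context,
        store_nodes_override=True,
    )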
I can access each topic_index like this: patient_to_tools[patient_id]["xray"], but I can't create a query engine out of it, because doing so retrieves against all of the patient's nodes, not just the ones saved in memory for the topic_index
Ultimately, my goal was to have a query agent pick the query engine tool that best corresponds to the user query, and in my case I'm currently getting bad answers for certain topics like X-rays (mostly retrieval issues). This was to be my hacky workaround for that problem
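(For context, the agent layer is the standard query-engine-tools setup -- rough sketch, the names and descriptions here are placeholders:)
from llama_index.agent import OpenAIAgent
from llama_index.tools import QueryEngineTool, ToolMetadata

# sketch: one tool per topic query engine (topic_query_engines is a placeholder dict)
tools = [
    QueryEngineTool(
        query_engine=engine,
        metadata=ToolMetadata(
            name=topic,
            description=f"Answers questions about the patient's {topic} results",
        ),
    )
    for topic, engine in topic_query_engines.items()
]
agent = OpenAIAgent.from_tools(tools, verbose=True)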
Ok, might have gotten it working as originally intended. Will share updated solution soon
Solution for posterity:
async def index_to_query_engine(conversation_docs: List[str], index: VectorStoreIndex) -> BaseQueryEngine:
    top_k = 100
    patient_id = str(conversation_docs[0].patient_id)
    filters = MetadataFilters(
        filters=[
            ExactMatchFilter(key="patient_id", value=patient_id),
            ExactMatchFilter(key="node_type", value="child"),
        ]
    )
    kwargs = {"similarity_top_k": top_k, "filters": filters}

    # Pre-filter nodes since BM25Retriever doesn't support filters
    nodes = index.storage_context.docstore.docs  # NOTE: 97 nodes in the index
    filtered_nodes = [node for node in nodes.values()
                      if node.metadata.get('patient_id') == patient_id
                      and node.metadata.get('node_type') == 'child']  # NOTE: 60 nodes that match the filters
    bm25_retriever = BM25Retriever.from_defaults(nodes=filtered_nodes, similarity_top_k=top_k)
    bm25_nodes = bm25_retriever.retrieve(query_str)  # Returns 60 nodes (despite requesting 100), as expected

    index = VectorStoreIndex(filtered_nodes, service_context=service_context)  # Creating a new index on the filtered nodes solves the issue!
    base_retriever = index.as_retriever(**kwargs)
    base_nodes = base_retriever.retrieve(query_str)  # Returns 60 nodes!!!
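(Not shown above: the return at the end. A minimal version, assuming you only want the vector retriever rather than fusing it with BM25, would be something like:)
    from llama_index.query_engine import RetrieverQueryEngine
    # wrap the filtered retriever as the query engine this function returns
    return RetrieverQueryEngine.from_args(base_retriever, service_context=service_context)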