
Hi, I have two sets of nodes and want to find corresponding relationships between the two sets based on the similarity (distance) between nodes. I created two VectorStoreIndexes and persisted them in the same ./storage. My first attempt is to retrieve all nodes from index1, use each node from index1 to query index2, and keep or discard matches based on the results. I am stuck on the step to "retrieve all nodes from index1", which seems like it should have been straightforward. I even tried to get them from index1.vector_store._data.text_id_to_ref_doc_id.keys(), however that seems to contain all nodes in index1 and index2. Do you have any suggestions here?
Yea, if you persist to the same storage, the docstore/vector store will have ALL nodes, but the index store keeps track of which node ids belong to each index

If you just want to calculate the distance between nodes, no need to use an index

Plain Text
from llama_index.embeddings import OpenAIEmbedding
from llama_index.embeddings.base import similarity

embed_model = OpenAIEmbedding()
embed_1 = embed_model.get_text_embedding("text from node 1")
embed_2 = embed_model.get_text_embedding("text from node 2")

score = similarity(embed_1, embed_2)
Otherwise, try persisting to different directories and use index.docstore.docs to get the nodes from each index
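For reference, a minimal sketch of that two-directory approach might look like this (directory names and the docs1/docs2 document lists are assumptions, not from the original messages):

Plain Text
from llama_index import StorageContext, VectorStoreIndex, load_index_from_storage

# build and persist each index to its own directory (paths are hypothetical)
index1 = VectorStoreIndex.from_documents(docs1)
index1.storage_context.persist(persist_dir="./storage_1")
index2 = VectorStoreIndex.from_documents(docs2)
index2.storage_context.persist(persist_dir="./storage_2")

# later: reload each index from its own directory
index1 = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage_1"))
index2 = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage_2"))

# docstore.docs maps node_id -> node for the nodes persisted in that directory only
nodes1 = list(index1.docstore.docs.values())
nodes2 = list(index2.docstore.docs.values())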
The problem is these nodes aren't simple texts; they are more like JSON with properties plus some text, so I was hoping LlamaIndex has something ready to help here.
Embeddings only work with text, so you need some way to convert it to a meaningful textual representation

The default json loader does an ok-ish job of this, and you can still use a similar approach

Plain Text
from llama_index import SimpleDirectoryReader
from llama_index.embeddings import OpenAIEmbedding
from llama_index.embeddings.base import similarity

embed_model = OpenAIEmbedding()

docs1 = SimpleDirectoryReader("./docs_1").load_data()
docs2 = SimpleDirectoryReader("./docs_2").load_data()

embed_1 = embed_model.get_text_embedding(docs1[0].get_content())
embed_2 = embed_model.get_text_embedding(docs2[0].get_content())
score = similarity(embed_1, embed_2)
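To match every node in one set against the other with this approach, one option (a sketch, not from the original message; docs1, docs2, embed_model, and similarity are as above) is to score all pairs and keep the best match for each node:

Plain Text
# embed every document in each set once
embeds1 = [embed_model.get_text_embedding(d.get_content()) for d in docs1]
embeds2 = [embed_model.get_text_embedding(d.get_content()) for d in docs2]

# for each node in set 1, keep the index and score of its closest node in set 2
matches = []
for i, e1 in enumerate(embeds1):
    scores = [similarity(e1, e2) for e2 in embeds2]
    best = max(range(len(scores)), key=lambda j: scores[j])
    matches.append((i, best, scores[best]))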
The score(s) were not what we are after. There are downstream tasks after this step, so I'm going to try persisting into different directories. Thanks Logan.
@Logan M I played with my setup that persists two VectorStoreIndexes into one directory, and found that retriever.retrieve("text") often throws an exception (pasted in a second message).

It turns out that retriever.retrieve actually calls .query() on the underlying ._vector_store, which, per what you said above, may be matching across all nodes, so it can return nodes that are not in the index passed when constructing the retriever. Is this a bug?

-----------------------------------------
KeyError Traceback (most recent call last)
Cell In[25], line 1
----> 1 retriever.retrieve("thing")

...
File ~/miniconda3/envs/llm/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers/retriever.py:153, in VectorIndexRetriever._get_nodes_with_embeddings(self, query_bundle_with_embeddings)
151 query = self._build_vector_store_query(query_bundle_with_embeddings)
152 query_result = self._vector_store.query(query, **self._kwargs)
--> 153 return self._build_node_list_from_query_result(query_result)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers/retriever.py:116, in VectorIndexRetriever._build_node_list_from_query_result(self, query_result)
114 assert isinstance(self._index.index_struct, IndexDict)
115 print('query_result.ids:', query_result.ids)
--> 116 node_ids = [
117 self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
118 ]
119 nodes = self._docstore.get_nodes(node_ids)
120 query_result.nodes = nodes

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/llama_index/indices/vector_store/retrievers/retriever.py:117, in <listcomp>(.0)
114 assert isinstance(self._index.index_struct, IndexDict)
115 print('query_result.ids:', query_result.ids)
116 node_ids = [
--> 117 self._index.index_struct.nodes_dict[idx] for idx in query_result.ids
118 ]
119 nodes = self._docstore.get_nodes(node_ids)
120 query_result.nodes = nodes

KeyError: '10aec5a4-aa16-470a-9f3d-b2b1972de3f5'
When you persist two indexes into one directory, are you setting index ids?

Could have sworn we solved this at some point πŸ˜…
Since the index store should be passing in the node ids to the query method of the vector store
Yes, I set index ids for each of the indexes. I tried 0.8.16 and got the same error.
Can you share the code? Curious what's going on lol
@Logan M I made a minimal example that produces error for me on 0.8.16. Thanks for the help!
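The minimal example itself isn't preserved in this archive; a hypothetical setup along these lines (names and paths are illustrative, and the import paths follow the 0.8-era layout seen in the traceback) should reproduce the same KeyError: two indexes with explicit ids persisted into one directory, with the retriever constructed directly:

Plain Text
from llama_index import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.indices.vector_store.retrievers import VectorIndexRetriever

# two indexes, tagged with ids, persisted into the same directory
index1 = VectorStoreIndex.from_documents(docs1)
index1.set_index_id("index_1")
index2 = VectorStoreIndex.from_documents(docs2, storage_context=index1.storage_context)
index2.set_index_id("index_2")
index1.storage_context.persist(persist_dir="./storage")

# reload one index by id, then build the retriever by hand
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index1 = load_index_from_storage(storage_context, index_id="index_1")
retriever = VectorIndexRetriever(index=index1, similarity_top_k=2)

# the shared vector store can return node ids that belong to index2,
# which are missing from index1's nodes_dict -> KeyError
retriever.retrieve("thing")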
Ohhh you are creating the retriever yourself, this is the issue

use index.as_retriever(...) or index.as_query_engine(...) so that the node_ids properly get passed to the retriever
Since the vector store has all the node ids, and you stored multiple indexes, it needs some way of knowing which node ids actually belong to the current index

Under the hood this is done by passing node ids to the retriever

But since you are creating the retriever yourself, this step is missed
Hope that makes some sense
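A minimal sketch of the fix, assuming index1 was loaded as above (the explicit node_ids line shows roughly what as_retriever passes along under the hood; treat it as an approximation, not the exact library internals):

Plain Text
# let the index build the retriever so node ids are scoped to this index
retriever = index1.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("thing")

# roughly equivalent manual construction: pass the index's own node ids explicitly
from llama_index.indices.vector_store.retrievers import VectorIndexRetriever

retriever = VectorIndexRetriever(
    index=index1,
    node_ids=list(index1.index_struct.nodes_dict.values()),
    similarity_top_k=2,
)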
Thanks, this solves the problem. Although, the first argument to the VectorIndexRetriever ctor is the index, so I thought the retriever would have had access to the node ids in the index anyway. Anyhow, I am unblocked, thank you Logan.