
Updated 3 months ago

issue

issue:
query() returns ids when using SimpleVectorStore,
but with FaissVectorStore, query() returns ref_doc_ids

Q: how can i retrieve nodes when using a faiss query - e.g. using the returned ref_doc_ids? (e.g. get_nodes does not work w ref_doc_id - is there another func?)

Context: the vectordir was created using the following code:
Attachment
image.png
32 comments
hmm super weird.

Faiss isn't super widely used, so I guess this went unnoticed.

I can try and look into this soon
Hey @Logan M, some more context

i have a 9mb vectorstore and loading takes longer when using simplevectorstore than faiss
Attachment
image.png
yea that makes sense, faiss is pretty optimized
ultimately it returns the ref_doc_ids and respective similarities
Attachment
image.png
which is great

i just need to map them to the node content now
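A minimal sketch of that mapping step, using plain dicts instead of library calls. It assumes (not confirmed in this thread) that the ids returned by the Faiss store are keys into the index struct's nodes_dict, whose values are node ids, and that node text can be looked up by node id. `resolve_hits` and the toy data are hypothetical names for illustration only.

```python
from typing import Dict, List, Tuple


def resolve_hits(
    ids: List[str],
    similarities: List[float],
    nodes_dict: Dict[str, str],
    node_texts: Dict[str, str],
) -> List[Tuple[str, float, str]]:
    """Resolve vector-store hits to (node_id, similarity, text) triples."""
    resolved = []
    for vector_id, score in zip(ids, similarities):
        # Hop 1: vector id (assumed to be a position, as a string) -> node id
        node_id = nodes_dict.get(vector_id)
        if node_id is None or node_id not in node_texts:
            continue  # stale id: nothing recorded for this vector
        # Hop 2: node id -> text
        resolved.append((node_id, score, node_texts[node_id]))
    return resolved


# Toy data standing in for a real index struct and docstore (assumptions)
nodes_dict = {"0": "node-a", "1": "node-b"}
node_texts = {"node-a": "How do I reset my password?", "node-b": "Cancel my order"}

hits = resolve_hits(["1", "0"], [0.91, 0.42], nodes_dict, node_texts)
```

Ids that have no entry in either mapping are skipped rather than raising, which seems safer when the vector store and docstore can drift apart.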
OH you are using vector_store.query() directly

What happens if you use something like index.as_retriever(similarity_top_k=2).retrieve("query")
that works nicely to retrieve ids even w faiss
Attachment
image.png
Attachment
image.png
a) speed of load - it'd be great to only load the faiss vectorstore
b) querying using a pre-obtained raw embedding
not sure what you mean by a)? It needs to load the docstore/indexstore/faiss to operate, since faiss can't store text

b) This is possible actually

from llama_index import QueryBundle

index.as_retriever(similarity_top_k=2).retrieve(QueryBundle("not used", embedding=[...]))
ok - re: a) -> so it seems at minimum, i'd need the docstore in memory to retrieve the text associated with a ref_doc_id?

wondering how i'd load what's required to resolve a TextNode with minimum time spent (e.g. maybe i can just load the docstore alone & not build the full index?)
I mean, the docstore and vector store are the heaviest components, and both need to be loaded.

In comparison, loading the entire index has minimum impact on top of the above 2

We do have some remote support for the docstore/indexstore (redis, mongodb), which might help with the load times
gotcha

so here i loaded the json docstore manually
Attachment
image.png
Attachment
image.png
and it was super fast <0.1s
Any way to do this style of loading more idiomatically/non-breaking w llama?
compared to >2s here:
Attachment
image.png
if i dive deeper and find out i'll let ya know why it's timing like this on my end
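A rough sketch of the manual docstore load described above, using only the standard library. The JSON layout is an assumption (a top-level "docstore/data" object keyed by node id, each entry holding a "__data__" payload with a "text" field); check a persisted docstore file before relying on it. `load_node_texts` is a hypothetical helper, not a llama_index function.

```python
import json
from pathlib import Path
from typing import Dict


def load_node_texts(docstore_path: str, data_key: str = "docstore/data") -> Dict[str, str]:
    """Read a persisted docstore JSON file and return {node_id: text}.

    The key names used here are assumptions about the on-disk layout.
    """
    raw = json.loads(Path(docstore_path).read_text(encoding="utf-8"))
    texts: Dict[str, str] = {}
    for node_id, entry in raw.get(data_key, {}).items():
        payload = entry.get("__data__", {})
        if isinstance(payload, str):
            # Defensive: handle a payload that was serialized as a JSON string
            payload = json.loads(payload)
        texts[node_id] = payload.get("text", "")
    return texts
```

This skips building the index entirely, so it only helps when all that is needed is id-to-text lookup; anything that goes through the retriever still needs the full index loaded.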
do you know if index.refresh is possible with Faiss?
Attachment
image.png
Attachment
image.png
Hmm, does faiss support delete yet in their api? 😅 if so, then it's probably possible to make it work
@Logan M here's what i did to enable delete:
Attachment
image.png
then had to add this delete_by_value
Attachment
image.png
interesting 🤔 I think there might be a better way to do this. I'll try to look into it 🙂
Constraint: my ref doc id's are positions/indexes in my vectorstore - so that makes it easier
# Excerpt from inside a method: `self` and `ref_doc_id` come from the enclosing scope.
import faiss
import numpy as np

persist_path = 'intents/intents_llamaindex-faiss/vector_store.json'
# index = faiss.read_index("intents/intents_llamaindex-faiss/vector_store.json")
index = faiss.read_index(persist_path)

# Reconstruct all vectors from the index
all_vectors = np.array([index.reconstruct(i) for i in range(index.ntotal)])

# Define the index of the item to delete
index_of_item_to_delete = int(ref_doc_id)
# Bail out if the position is out of range (the last valid position is len - 1)
if index_of_item_to_delete >= len(all_vectors):
    return

# Remove vector from all_vectors at the specified position
modified_vectors = np.delete(all_vectors, index_of_item_to_delete, axis=0)

# Create a new IndexFlat with the modified set of vectors
d = self._faiss_index.d  # Dimension of your vectors
new_index = faiss.IndexFlatL2(d)
new_index.add(modified_vectors)

# self.persist()
faiss.write_index(new_index, persist_path)
print('saved new index w deleted vector')

here's this in case this saves any time
then in data_structs.py:

def delete_by_value(self, doc_id: str) -> None:
    """Delete a Node by value."""
    keys_to_delete = [key for key, value in self.nodes_dict.items() if value == doc_id]
    for key in keys_to_delete:
        del self.nodes_dict[key]
    if not keys_to_delete:
        print(f"Value {doc_id} not found in nodes_dict.")

then this
@Logan M If you don't mind keep me posted on the better way to do it if/when!
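One caveat with the rebuild approach above, sketched in plain Python: when a flat index is rebuilt without one vector, every vector after it moves down one position. If the keys of nodes_dict are positions (as the constraint above suggests), they probably need to be shifted too, not just deleted. `rekey_after_delete` is a hypothetical helper, and the string-position key format is an assumption.

```python
from typing import Dict


def rekey_after_delete(nodes_dict: Dict[str, str], deleted_pos: int) -> Dict[str, str]:
    """Return a mapping that matches a flat index rebuilt without `deleted_pos`.

    Keys are assumed to be vector positions stored as strings.
    """
    rekeyed: Dict[str, str] = {}
    for key, node_id in nodes_dict.items():
        pos = int(key)
        if pos == deleted_pos:
            continue  # drop the entry for the removed vector
        # Positions after the deleted one shift down by one
        new_pos = pos - 1 if pos > deleted_pos else pos
        rekeyed[str(new_pos)] = node_id
    return rekeyed


before = {"0": "node-a", "1": "node-b", "2": "node-c"}
after = rekey_after_delete(before, 1)
```

Without this step, "2" would keep pointing at node-c while the rebuilt index only has positions 0 and 1, so later lookups for position 1 would miss it.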