issue

BBlake

issue:
query() returns id when using simplevectorstore
but when faissvectorstore, query() returns ref_doc_ids

Q: how can i retrieve nodes when using a faiss query - e.g. using the returned ref_doc_ids? (e.g. get_nodes does not work w ref_doc_id - is there another func?)

Context: the vectordir was created using the following code:

Attachment

LLogan M

hmm super weird.

Faiss isn't super used, so I guess this went unnoticed.

I can try and look into this soon

BBlake

Hey @Logan M, some more context

i have a 9mb vectorstore and loading takes longer when using simplevectorstore than faiss

Attachment

LLogan M

yea that makes sense, faiss is pretty optimized

BBlake

ultimately it returns the ref_doc_ids and respective similarities

Attachment

BBlake

which is great

i just need to map them to the node content now
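
A minimal sketch of that mapping step, in plain Python. It assumes the query result exposes parallel lists of ids and similarities, and that some id-to-text lookup is already in memory. The `node_text_by_id` dict and the `resolve_hits` helper are hypothetical stand-ins for the docstore, not llama_index names.

```python
from typing import Dict, List, Tuple


def resolve_hits(
    ids: List[str],
    similarities: List[float],
    node_text_by_id: Dict[str, str],
) -> List[Tuple[str, float, str]]:
    """Pair each returned id with its similarity and its text."""
    hits = []
    for node_id, score in zip(ids, similarities):
        text = node_text_by_id.get(node_id)
        if text is None:
            # The id is not in the lookup, so skip it rather than fail.
            continue
        hits.append((node_id, score, text))
    return hits


# Hypothetical data: ids are stringified positions, as in this thread.
node_text_by_id = {"0": "reset my password", "1": "cancel my order"}
hits = resolve_hits(["1", "0", "7"], [0.12, 0.35, 0.90], node_text_by_id)
```

The order of the query result is preserved, and ids that cannot be resolved are dropped instead of raising.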

LLogan M

OH you are using vector_store.query() directly

What happens if you use something like index.as_retriever(similarity_top_k=2).retrieve("query")

BBlake

that works nicely to retrieve ids even w faiss

Attachment

BBlake

Issue:

BBlake

Attachment

BBlake

a) speed of load - it'd be great to only load the faiss vectorstore
b) querying using a pre-obtained raw embedding

LLogan M

not sure what you mean by a) ? It needs to load the docstore/indexstore/faiss to operate, since faiss can't store text

b) This is possible actually

from llama_index import QueryBundle

index.as_retriever(similarity_top_k=2).retrieve(QueryBundle("not used", embedding=[...]))

BBlake

ok - re: a) -> so it seems at minimum, i'd need the docstore in memory to retrieve the text associated with a ref_doc_id?

wondering how i'd load what's required to resolve a TextNode with minimum time spent (e.g. maybe i can just load the docstore alone & not build the full index?)

LLogan M

I mean, the docstore and vector store are the heaviest components, and both need to be loaded.

In comparison, loading the entire index has minimal impact on top of the above 2

We do have some remote support for the docstore/indexstore (redis, mongodb), which might help with the load times

BBlake

gotcha

so here i loaded the json docstore manually

Attachment

BBlake

Attachment

BBlake

and it was super fast <0.1s
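
For reference, a manual load along these lines can be timed with the standard library alone. This is a sketch under assumptions: the `docstore/data` top-level key and the `__data__` / `text` nesting are guesses at the persisted JSON layout and may differ between llama_index versions, so check your own file first. The helper names and the sample file are made up for illustration.

```python
import json
import os
import tempfile
import time
from typing import Dict, Tuple


def load_node_texts(path: str) -> Dict[str, str]:
    """Read a persisted docstore JSON file and return {node_id: text}."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    # Assumed layout: {"docstore/data": {node_id: {"__data__": {"text": ...}}}}
    nodes = raw.get("docstore/data", {})
    return {
        node_id: entry.get("__data__", {}).get("text", "")
        for node_id, entry in nodes.items()
    }


def timed_load(path: str) -> Tuple[Dict[str, str], float]:
    """Return the lookup plus the wall-clock seconds the load took."""
    start = time.perf_counter()
    texts = load_node_texts(path)
    return texts, time.perf_counter() - start


# Demo on a tiny hypothetical file written to a temporary directory.
sample = {"docstore/data": {"0": {"__data__": {"text": "hello"}}}}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "docstore.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    texts, seconds = timed_load(sample_path)
```

Timing the same file through both paths with `time.perf_counter` is one way to see where the extra seconds go.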

BBlake

Any way to do this style of loading more idiomatically/non-breaking w llama?

BBlake

compared to >2s here:

Attachment

LLogan M

hmm but that is basically how we load it?

https://github.com/jerryjliu/llama_index/blob/39ff38293b5d4d1fc33d8fbadacf9bb57861ccbe/llama_index/storage/kvstore/simple_kvstore.py#L75C1-L75C1

Wonder where the overhead is 🤔

BBlake

if i dive deeper and find out i'll let ya know why it's timing like this on my end

BBlake

do you know if index.refresh is possible with Faiss?

Attachment

BBlake

Attachment

LLogan M

Hmm, does faiss support delete yet in their api? 😅 if so, then it's probably possible to make it work

BBlake

@Logan M here's what i did to enable delete:

BBlake

Attachment

BBlake

then had to add this delete_by_value

Attachment

LLogan M

interesting 🤔 I think there might be a better way to do this. I'll try to look into it 🙂

BBlake

Constraint: my ref doc id's are positions/indexes in my vectorstore - so that makes it easier

BBlake

import faiss
import numpy as np

def delete(self, ref_doc_id: str) -> None:
    # Patched delete on the faiss vector store; ref_doc_id is a position in the index
    persist_path = 'intents/intents_llamaindex-faiss/vector_store.json'
    index = faiss.read_index(persist_path)
    # Reconstruct all vectors from the index
    all_vectors = np.array([index.reconstruct(i) for i in range(index.ntotal)])
    # Define the index of the item to delete
    index_of_item_to_delete = int(ref_doc_id)

    # Nothing to delete if the position is out of range
    # (the earlier `>= len(all_vectors) - 1` check also skipped the last vector)
    if index_of_item_to_delete >= len(all_vectors):
        return
    # Remove vector from all_vectors at the specified position
    modified_vectors = np.delete(all_vectors, index_of_item_to_delete, axis=0)
    # Create a new IndexFlat with the modified set of vectors
    d = self._faiss_index.d  # Dimension of your vectors
    new_index = faiss.IndexFlatL2(d)

    new_index.add(modified_vectors)
    # self.persist()
    faiss.write_index(new_index, persist_path)
    print('saved new index w deleted vector')

here's this in case this saves any time

BBlake

then in data_structs.py:

    def delete_by_value(self, doc_id: str) -> None:
        """Delete a Node by value."""
        keys_to_delete = [key for key, value in self.nodes_dict.items() if value == doc_id]
        for key in keys_to_delete:
            del self.nodes_dict[key]
        if not keys_to_delete:
            print(f"Value {doc_id} not found in nodes_dict.")
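
One thing to watch with this approach: rebuilding the faiss index without one vector shifts every later vector down by one position, so a mapping keyed by position needs renumbering as well as deletion. A minimal sketch of that bookkeeping, assuming `nodes_dict` maps stringified positions to node ids (an assumption based on the constraint above, not a confirmed llama_index detail). `drop_position` is a hypothetical helper name.

```python
from typing import Dict


def drop_position(nodes_dict: Dict[str, str], position: int) -> Dict[str, str]:
    """Return a new mapping with `position` removed and later keys shifted down."""
    renumbered = {}
    for key, node_id in nodes_dict.items():
        idx = int(key)
        if idx == position:
            continue  # this entry is the one being deleted
        # Entries after the deleted slot move down by one.
        new_idx = idx - 1 if idx > position else idx
        renumbered[str(new_idx)] = node_id
    return renumbered


# Hypothetical mapping of faiss positions to node ids.
nodes_dict = {"0": "node-a", "1": "node-b", "2": "node-c"}
updated = drop_position(nodes_dict, 1)
```

Returning a new dict keeps the original untouched, which makes it easier to swap the mapping in only after the rebuilt index has been written successfully.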

then this

BBlake

@Logan M If you don't mind keep me posted on the better way to do it if/when!

LLogan M

For sure!
