
Updated 5 months ago

Hi

Hi,
We are using OpensearchVectorStore (along with appropriate params for OpensearchVectorClient).
The index we are using is GPTVectorStoreIndex.
We are able to create the index successfully (GPTVectorStoreIndex.from_documents(docs, service_context=service_context, storage_context=storage_context)).
The documents are visible in OpenSearch.

However, we want to use the refresh_ref_docs function so that both insert and update operations are handled.
I am assigning a unique doc_id to each document (I'm using filename_as_id=True in SimpleDirectoryReader).

But calling index.refresh_ref_docs(docs) gives a 404 Not Found error saying that it cannot find the given doc_id in the index. Shouldn't refresh_ref_docs also add a doc to the index if it is not already there?
14 comments
refresh_ref_docs won't actually work with vector store integrations.

It depends on the docstore layer for managing which documents have been inserted, and which nodes belong to each inserted document
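To make that dependency concrete, here is a minimal sketch of the bookkeeping a refresh needs. It is not llama-index's actual implementation: the `refresh` function, the plain dicts standing in for the docstore and vector store, and the use of a SHA-256 content hash are all assumptions for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refresh(docstore: dict, vector_store: dict, docs: dict) -> list:
    """Hypothetical refresh: docs maps doc_id -> text.

    docstore maps doc_id -> hash of the last indexed content.
    Returns one bool per input doc: True if it was inserted or updated.
    """
    refreshed = []
    for doc_id, text in docs.items():
        new_hash = content_hash(text)
        old_hash = docstore.get(doc_id)
        if old_hash is None:
            # Unknown doc_id: plain insert.
            vector_store[doc_id] = text
            docstore[doc_id] = new_hash
            refreshed.append(True)
        elif old_hash != new_hash:
            # Known doc_id with changed content: delete old entry, re-insert.
            del vector_store[doc_id]
            vector_store[doc_id] = text
            docstore[doc_id] = new_hash
            refreshed.append(True)
        else:
            # Unchanged: nothing to do.
            refreshed.append(False)
    return refreshed
```

Without the docstore record there is no way to tell a new document from a changed one, which is why a refresh against a bare vector store goes straight to a delete that can fail.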

You can explicitly enable the docstore when creating/using your index by setting the override kwarg

Plain Text
VectorStoreIndex.from_documents(docs, service_context=service_context, storage_context=storage_context, store_nodes_override=True)


But then you'll have to manage the docstore and index store yourself

Plain Text
# saving
index.storage_context.docstore.persist(persist_path="storage/docstore.json")
index.storage_context.index_store.persist(persist_path="storage/index_store.json")
...
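# loading (import paths assume the legacy llama_index package layout)
from llama_index import StorageContext
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore

storage_context = StorageContext.from_defaults(
  docstore=SimpleDocumentStore.from_persist_dir("./storage"),
  index_store=SimpleIndexStore.from_persist_dir("./storage"),
  vector_store=vector_store,  # the OpensearchVectorStore created earlier
)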
Yeah thanks, it worked. Basically we have to give some docstore to our storage context. @Logan M is there any way that we can get docStore from OpenSearch itself? So for docStore, do we have something like OpensearchDocStore (like for vector store we have OpensearchVectorStore)?
@Logan M unfortunately, when I change an already indexed document, I get a 404 Not Found error in opensearch-py when llama-index tries to delete the old doc from OpenSearch (to re-index it with the new content). If the same files (without any change) are passed to refresh_ref_docs, it returns all False (which is the expected behaviour). Adding a new file also works. It only errors out on file updates, inside the OpenSearch doc delete function somewhere in llama-index. I suspect this is because the doc_id llama-index assigns is different from the _id that OpenSearch creates on its own, and the delete function expects _id but we give it the doc_id.
I made a change in opensearch.py: when I uncomment the line with the hardcoded _id (not doc_id) copied from OpenSearch Dashboards, it works and deletes the document (and then the new version gets indexed == doc updated successfully). Is there something I am missing, or does llama-index need a small patch?! :p
Attachment
image.png
Could be possible! Although it would need to be implemented

Check out some of the other integrations. It would need a docstore and index_store

It starts with a kvstore, and then the docstore and index_store are light abstractions on top of that
https://github.com/jerryjliu/llama_index/tree/main/llama_index/storage/kvstore
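As a rough picture of that layering, here is a hypothetical sketch: a tiny key-value store with a docstore wrapped around it. The class and method names (`InMemoryKVStore`, `KVDocStore`, `put`, `get`, `delete`) are made up for illustration and are not the llama-index base classes; an OpenSearch-backed version would swap the dict operations for calls against an index.

```python
from typing import Optional


class InMemoryKVStore:
    """Hypothetical key-value store; data is grouped by collection name."""

    def __init__(self) -> None:
        self._data = {}

    def put(self, key: str, val: dict, collection: str = "default") -> None:
        self._data.setdefault(collection, {})[key] = val

    def get(self, key: str, collection: str = "default") -> Optional[dict]:
        return self._data.get(collection, {}).get(key)

    def delete(self, key: str, collection: str = "default") -> bool:
        # Returns True only if the key existed.
        return self._data.get(collection, {}).pop(key, None) is not None


class KVDocStore:
    """Hypothetical docstore: a thin layer that namespaces keys in a kvstore."""

    def __init__(self, kvstore: InMemoryKVStore, namespace: str = "docstore") -> None:
        self._kv = kvstore
        self._collection = f"{namespace}/data"

    def add_document(self, doc_id: str, doc: dict) -> None:
        self._kv.put(doc_id, doc, collection=self._collection)

    def get_document(self, doc_id: str) -> Optional[dict]:
        return self._kv.get(doc_id, collection=self._collection)

    def delete_document(self, doc_id: str) -> bool:
        return self._kv.delete(doc_id, collection=self._collection)
```

An index store could reuse the same kvstore with a different namespace, so only the kvstore needs to know about the backend.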
Ah, I think the delete function is not implemented correctly here 😅

It should delete all nodes that are under a specific doc_id, rather than deleting individual nodes by node_id
The doc_id is a field in the metadata. Is it possible for opensearch to do that sort of filtered delete?
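For what it's worth, opensearch-py exposes a `delete_by_query` call that removes every document matching a query, which is the kind of filtered delete described here. A minimal sketch follows; the helper name `delete_nodes_for_doc` and the default field path `metadata.doc_id` are assumptions, and depending on the index mapping the term query may need a keyword sub-field instead.

```python
def delete_nodes_for_doc(client, index_name: str, doc_id: str,
                         field: str = "metadata.doc_id"):
    """Delete every node whose metadata points at the given doc_id.

    `client` is expected to be an opensearchpy.OpenSearch instance (or any
    object with a compatible delete_by_query method). The field path is an
    assumption about how the nodes were indexed; adjust it to your mapping.
    """
    body = {"query": {"term": {field: doc_id}}}
    return client.delete_by_query(index=index_name, body=body)
```

Because the match is on the doc_id stored in the metadata, the caller never needs to know the per-node _id values that OpenSearch holds.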
@Logan M I checked the documentation for the _os_client.delete function. It actually expects '_id' as the argument (not doc_id). '_id' is created by OpenSearch (and is always the same as metadata.node__content.'id', as visible in the OpenSearch dashboard).
In the local docstore.json, this is the same as the key of each object (and is also equal to _data.id_).

I am attaching screenshots of both the OpenSearch dashboard and docstore.json for your reference and better understanding.

So in short, we need to pass this '_id' field to _os_client.delete instead of doc_id. Is there some way to access it? I couldn't figure it out.
Attachments
image.png
image.png
Yeah, what I'm saying is that all the vector stores are set up to delete by doc_id, not by node id (or id_, same thing).
So the OpenSearch vector store should be doing the same, but it's not right now.
If we want to delete by node id, that needs to be another function (one that every vector store should probably have).
@ravitheja