You are using the ingestion pipeline, so as long as your input documents have the same document ids for the same document, it will upsert properly. The ingestion pipeline is already inserting into your docstore and vector store, no need to do this twice.
You can also manually manage your data.
index.delete(ref_doc_id)
will delete using the original input document ids
You can also delete using the vector store and docstore directly
vector_store.delete(ref_doc_id)
docstore.delete_ref_doc(ref_doc_id)
You can also delete nodes (not every vector store implements this yet, qdrant does tho)
vector_store.delete_nodes(node_ids=[...])
for node_id in node_ids:
    docstore.delete_document(node_id)
Logan, this is good input... Just to understand how the doc_ids system works: if I load a document, say a .txt file, then modify it, then reload it, will the id be the same automatically, so that it will be upserted?
if you set the id to be the same (like setting it to be a file name), then yes
yes* (if you use an ingestion pipeline with docstore + vector store)
Thank you for sharing the details. I was wondering how I can get the ref doc id based on the filename. For instance, the user selects a file from the UI (I have the filename); how can I use it to get the ref doc id? I am struggling to find a method in the ingestion pipeline where I can fetch the ref doc id for that filename and then execute the ops.
Is there a way to get the association of filename to ref_doc_id?
When I try to access the ref_doc_info property on the index object (a VectorStoreIndex instance), it gives me this error: NotImplementedError: Vector store integrations that store text in the vector store are not supported by ref_doc_info yet.
If I initialize the vector store index with store_nodes_override=True, it gives me an empty dict {} when I access the ref_doc_info property.
The vector store index initialization looks like this:
index = VectorStoreIndex.from_vector_store(
    self.vector_store,
    Settings.embed_model,
    store_nodes_override=True
)
I am using qdrant vector db in my application
IngestionPipeline(
    transformations=[self.parser],
    docstore=self.docstore,
    vector_store=self.qdrant_search.vector_store,
    cache=IngestionCache(
        cache=RedisCache.from_host_and_port(
            host=self.config.redis_config.host,
            port=self.config.redis_config.port
        ),
        collection="redis_cache",
    ),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
No need to set store_nodes_override; the ingestion pipeline is already doing all the work in your setup.
You just need to make sure your input documents have consistent IDs
If you are using simple directory reader, you can do something like SimpleDirectoryReader(..., filename_as_id=True)
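For the filename-to-ref_doc_id question above, one option is to never look the id up at all and instead recompute it from the filename. Below is a minimal sketch, assuming you set each document's id yourself at ingestion time; `doc_id_for` is a hypothetical helper, not part of LlamaIndex, and the normalisation rule is only an illustration.

```python
from pathlib import PurePosixPath


def doc_id_for(filename: str) -> str:
    """Hypothetical helper: derive a stable document id from a file name.

    Assumption: the same function is called when the document is ingested
    (to set its id) and later when the user picks the file in the UI.
    """
    # Use forward slashes everywhere so Windows-style paths map to the same id.
    unified = filename.replace("\\", "/")
    # PurePosixPath collapses "./" prefixes and duplicate slashes.
    return PurePosixPath(unified).as_posix()


if __name__ == "__main__":
    # Both spellings of the same file produce the same id.
    print(doc_id_for("./docs/report.txt"))
    print(doc_id_for("docs\\report.txt"))
```

Because the id is a pure function of the filename, the value computed in the UI handler matches the ref_doc_id stored at ingestion, and can be passed straight to the delete calls shown earlier in the thread.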
@Logan M Still facing one issue. The docstore is not updated after deletion. The vector store gets updated automatically once the nodes are deleted. Do I need to explicitly call a method to refresh the docstore, to ensure the references are removed from it after deletion?
# Delete from docstore
logger.info(f"Deleting document with ids from docstore: {ref_doc_ids_to_delete}")
for ref_doc_id in ref_doc_ids_to_delete:
    try:
        logger.info("{} | {} | {}".format(self.ingestion.docstore._ref_doc_collection, self.ingestion.docstore._metadata_collection, self.ingestion.docstore._node_collection))
        await self.ingestion.docstore.adelete_ref_doc(ref_doc_id, raise_error=True)
    except Exception as e:
        logger.warning(f"Docstore deletion warning for {ref_doc_id}: {str(e)}")
By deletion, what do you mean in this case?
The above code seems mostly correct?
What I intend to do is the following:
I have a ref_doc_ids_to_delete list. I want to delete those ref doc ids from both the docstore and the vector store, and the change should be reflected in the Qdrant DB (collection) and in the Redis docstore (the metadata and doc collections should be updated).
I have changed docstore_strategy=DocstoreStrategy.UPSERTS to docstore_strategy=DocstoreStrategy.UPSERTS_AND_DELETE.
The current issue is that, after deleting the ref doc ids, the vector store shows the updated state when I check the collection items.
However, when I check the Redis collection (docstore), the old file references still exist.
Docstore Initialization
self.docstore = RedisDocumentStore.from_host_and_port(
    host=self.config.redis_config.host, port=self.config.redis_config.port, namespace="xxx"
)
self.vector_store = QdrantVectorStore(**vector_store_config)
self.index = VectorStoreIndex.from_vector_store(
    self.vector_store,
    Settings.embed_model,
    store_nodes_override=True,
)
I have tried everything but nothing seems to work; any help would be greatly appreciated.
I found one issue. The docstores linked with the ingestion pipeline object and with the Qdrant vector store index object are different:
<llama_index.storage.docstore.redis.base.RedisDocumentStore object at 0x7f989f7985e0>
<llama_index.core.storage.docstore.simple_docstore.SimpleDocumentStore object at 0x7f98a4636820>
The combination of QdrantVectorStore and RedisDocumentStore is not working together. After reviewing the code, it seems that VectorStoreIndex does not support key-value docstores, hence the storage context is overridden by the default docstore (SimpleDocumentStore). Please correct me if I am wrong @Logan M
That's not correct -- as long as you attach the vector store and docstore to the ingestion pipeline (and you save the docstore somewhere!), it should work fine.
It's pretty hard to debug without seeing some minimum version of your code flow
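Since the thread shows the index and the pipeline holding different docstore objects, one way to avoid deleting from the wrong one is to route every deletion through the stores attached to the pipeline itself. The sketch below is duck-typed and assumes the pipeline object exposes `vector_store` and `docstore` attributes, and that those support the `delete(ref_doc_id)` and `delete_ref_doc(ref_doc_id, raise_error=True)` calls quoted earlier in this thread; `delete_ref_docs` is a hypothetical helper name.

```python
import logging
from typing import Any, Iterable, List

logger = logging.getLogger(__name__)


def delete_ref_docs(pipeline: Any, ref_doc_ids: Iterable[str]) -> List[str]:
    """Delete each ref doc from the stores attached to `pipeline`.

    Hypothetical helper. Returns the ids for which a deletion raised,
    so the caller can retry or report them.
    """
    failed: List[str] = []
    for ref_doc_id in ref_doc_ids:
        try:
            # Remove the embedded nodes first, then the docstore bookkeeping.
            pipeline.vector_store.delete(ref_doc_id)
            pipeline.docstore.delete_ref_doc(ref_doc_id, raise_error=True)
        except Exception as exc:
            logger.warning("Deletion failed for %s: %s", ref_doc_id, exc)
            failed.append(ref_doc_id)
    return failed
```

Taking the stores from the pipeline, rather than from a separately built index, means the docstore being cleaned is the same one the pipeline consults on the next run.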
Hello! I am reading through this thread and the original GitHub issue. The GitHub issue seems to suggest that upserts are effectively just deletes followed by re-inserts.
Reading this thread, it sounds like IngestionPipeline should handle upserts automatically if the input documents have consistent IDs.
So I wonder if anyone could help me clarify which is true, and if I'm doing something wrong.
I am setting a deterministic id_ on each Document in my list as I pass it into pipeline.run(documents=documents). This results in the doc_id and ref_doc_id properties reflecting that custom ID, but the actual id in Weaviate is different, and therefore the document is duplicated in the vector store every time I run the ingestion.
Here is a sample:
storage_context = StorageContext.from_defaults(vector_store=vector_store)
pipeline = IngestionPipeline(
    transformations=transformers,
    vector_store=vector_store,
)
nodes = pipeline.run(documents=documents)
VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    show_progress=True,
    embed_model=Settings.embed_model
)
An upsert is a delete + reinsert, yes
The ingestion pipeline does handle this, assuming you attached both a docstore and vector store to it
I see, so if I do not use a doc store then I cannot expect it to work that way
I am not using a doc store... so I will take that as my answer. Thank you!
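To make the "upsert is a delete + reinsert" point concrete, here is a toy model in plain Python. It is not LlamaIndex code and makes no claim about the library's internals; `ToyPipeline` and its dict-based stores are hypothetical stand-ins. It only illustrates why a record keyed by document id (the docstore's role here) is needed to tell "new", "changed" and "unchanged" documents apart.

```python
import hashlib
from typing import Dict, List


class ToyPipeline:
    """Toy model of upsert-by-document-id (illustration only)."""

    def __init__(self) -> None:
        # Stand-in for a docstore: doc_id -> hash of the last ingested text.
        self.docstore: Dict[str, str] = {}
        # Stand-in for a vector store: doc_id -> stored chunks.
        self.vector_store: Dict[str, List[str]] = {}

    def run(self, doc_id: str, text: str) -> str:
        new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        old_hash = self.docstore.get(doc_id)
        if old_hash == new_hash:
            # Same id, same content: nothing to do.
            return "skipped"
        status = "inserted" if old_hash is None else "updated"
        # Upsert = delete whatever was stored for this id, then reinsert.
        self.vector_store.pop(doc_id, None)
        self.vector_store[doc_id] = text.split("\n\n")
        self.docstore[doc_id] = new_hash
        return status


if __name__ == "__main__":
    pipeline = ToyPipeline()
    print(pipeline.run("notes.txt", "first version"))
    print(pipeline.run("notes.txt", "first version"))
    print(pipeline.run("notes.txt", "second version"))
```

Without the id-to-hash record, `run` would have no way to know that "notes.txt" had been seen before, and every call would simply add another copy, which matches the duplication described above.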