The `delete_ref_doc` method is used to delete a document and its associated nodes from the index and docstore. By default, the `delete_from_docstore` parameter is set to `False` when calling `delete_ref_doc`. This means that the document will be removed from the index structure but not from the docstore itself. To ensure complete deletion, you should set this parameter to `True`:

```python
index.delete_ref_doc("doc_id_0", delete_from_docstore=True)
```
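To make the effect of the flag concrete, here is a minimal toy model. `ToyIndex` and its attributes are hypothetical stand-ins written for this illustration, not LlamaIndex classes; the sketch only mirrors the behavior described above, where the index entry is always removed and the docstore entry is removed only on request.

```python
class ToyIndex:
    """Hypothetical stand-in for an index with a separate docstore (not LlamaIndex code)."""

    def __init__(self):
        self.index_struct = {}  # ref_doc_id -> list of node ids
        self.docstore = {}      # ref_doc_id -> document text

    def insert(self, ref_doc_id, text):
        self.index_struct[ref_doc_id] = [f"{ref_doc_id}-node-0"]
        self.docstore[ref_doc_id] = text

    def delete_ref_doc(self, ref_doc_id, delete_from_docstore=False):
        # The index entry is always removed.
        self.index_struct.pop(ref_doc_id, None)
        # The docstore entry is removed only when explicitly requested.
        if delete_from_docstore:
            self.docstore.pop(ref_doc_id, None)


index = ToyIndex()
index.insert("doc_id_0", "hello")
index.insert("doc_id_1", "world")

index.delete_ref_doc("doc_id_0")                             # default: docstore untouched
index.delete_ref_doc("doc_id_1", delete_from_docstore=True)  # full removal

print(sorted(index.index_struct))  # []
print(sorted(index.docstore))      # ['doc_id_0']
```

The leftover `doc_id_0` entry is the kind of orphan you would see in the docstore when the flag is left at its default.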
If you are using `SimpleDocumentStore`, you need to call `docstore.persist()` after deletion to save the changes to disk.

If you are using the `delete_nodes` method, you also need to specify `delete_from_docstore=True` to remove the nodes from the docstore.

To delete a document completely, set `delete_from_docstore=True` when calling the `delete_ref_doc` method. The deletion then proceeds in these steps:

1. The document is deleted from the vector store using `self._vector_store.delete(ref_doc_id, **delete_kwargs)`.
2. It is removed from the index structure using `self._delete_from_index_struct(ref_doc_id)`.
3. If `delete_from_docstore=True`, the document is also deleted from the docstore using `self._delete_from_docstore(ref_doc_id)`.

```python
index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)
```
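As a rough illustration of why the persist call mentioned above matters, the sketch below uses a hypothetical JSON-backed `ToyDocStore` (not `SimpleDocumentStore` itself). It assumes the store keeps its data in memory and only writes to disk when `persist` is called, so a deletion is invisible to anything that reloads the file until then.

```python
import json
import os
import tempfile


class ToyDocStore:
    """Hypothetical in-memory store with explicit persistence."""

    def __init__(self, docs=None):
        self.docs = dict(docs or {})

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def persist(self, persist_path):
        with open(persist_path, "w", encoding="utf-8") as fh:
            json.dump(self.docs, fh)

    @classmethod
    def from_persist_path(cls, persist_path):
        with open(persist_path, encoding="utf-8") as fh:
            return cls(json.load(fh))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docstore.json")
    store = ToyDocStore({"doc_id_0": "hello"})
    store.persist(path)

    store.delete("doc_id_0")                     # in-memory deletion only
    stale = ToyDocStore.from_persist_path(path)  # reloaded before persisting

    store.persist(path)                          # write the deletion to disk
    fresh = ToyDocStore.from_persist_path(path)  # reloaded after persisting

print("doc_id_0" in stale.docs)  # True
print("doc_id_0" in fresh.docs)  # False
```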
When the vector store stores text (`self._vector_store.stores_text` is `True`), or if `self._store_nodes_override` is `True`, the deletion from the docstore is only performed when `delete_from_docstore` is explicitly set to `True`.

Use the `delete_ref_doc` method with `delete_from_docstore=True` to ensure complete deletion from the index, vector store, and docstore:

```python
def delete_document_by_filename(self, filename):
    # First, we need to find the document ID associated with the filename.
    # This assumes you've stored the filename in the metadata of the document.
    results = self.chroma_collection.get(
        where={"filename": filename}
    )

    if not results or len(results['ids']) == 0:
        print(f"No document found with filename: {filename}")
        return

    doc_id = results['ids'][0]

    # Delete from the vector store (ChromaDB)
    self.chroma_collection.delete(ids=[doc_id])

    # Delete from the index and docstore
    self.vector_store_index.delete_ref_doc(doc_id, delete_from_docstore=True)

    print(f"Document with filename '{filename}' has been deleted.")

    # Optionally, persist changes to the docstore
    self.docstore.persist(persist_path=f"{self.persistence_path}/docstore.json")
```
This function uses the `delete_ref_doc` method of the `VectorStoreIndex` to delete the document from the index. The `delete_from_docstore=True` parameter ensures it's also removed from the docstore. You can call it like this:

```python
self.delete_document_by_filename("example.txt")
```
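One thing worth checking in a function like this is which ID is passed to `delete_ref_doc`. The sketch below assumes that each entry in the Chroma collection is a node rather than a whole document, and that the source document's ID is stored in the entry metadata under a `ref_doc_id` key. Both are assumptions about your setup, and the helper `ref_doc_ids_from_results` is hypothetical. It works on a plain dict shaped like the result of `collection.get(...)`, so it runs without Chroma installed.

```python
def ref_doc_ids_from_results(results):
    """Collect unique ref_doc_id values from a Chroma-style get() result dict."""
    seen = []
    for metadata in results.get("metadatas") or []:
        ref_doc_id = (metadata or {}).get("ref_doc_id")  # assumed metadata key
        if ref_doc_id and ref_doc_id not in seen:
            seen.append(ref_doc_id)
    return seen


# Sample data standing in for chroma_collection.get(where={"filename": "example.txt"})
sample = {
    "ids": ["node-a", "node-b", "node-c"],
    "metadatas": [
        {"filename": "example.txt", "ref_doc_id": "doc-1"},
        {"filename": "example.txt", "ref_doc_id": "doc-1"},
        {"filename": "example.txt", "ref_doc_id": "doc-2"},
    ],
}

print(ref_doc_ids_from_results(sample))  # ['doc-1', 'doc-2']
```

If these assumptions hold for your collection, the returned IDs, rather than `results['ids'][0]`, would be the candidates to pass to `delete_ref_doc`.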
Check the `delete_from_docstore` parameter: when deleting documents, make sure you're setting the `delete_from_docstore` parameter to `True`. If you're not doing this, the documents will be removed from the index structure but not from the docstore itself. Make sure you're using the `delete_ref_doc` method like this:

```python
index.delete_ref_doc(doc_id, delete_from_docstore=True)
```
When the vector store stores text (`self._vector_store.stores_text` is `True`), or if `self._store_nodes_override` is `True`, the deletion from the docstore is only performed when `delete_from_docstore` is explicitly set to `True`.

If you are using `SimpleDocumentStore`, you need to call `docstore.persist()` after deletion to save the changes to disk.

To delete several documents, call `delete_ref_doc` for each ID with `delete_from_docstore=True`:

```python
for doc_id in list_of_doc_ids:
    self.vector_store_index.delete_ref_doc(doc_id, delete_from_docstore=True)
```

Then persist the docstore:

```python
self.docstore.persist(persist_path=f"{self.persistence_path}/docstore.json")
```

Finally, delete the same IDs from the Chroma collection:

```python
self.chroma_collection.delete(ids=list_of_doc_ids)
```
The `delete_documents` function is defined as an async function, but it's not being awaited in the FastAPI route where it's called. This means that the deletion process might not complete before the response is sent back to the client. Await the `delete_documents` function call in your FastAPI route:

```python
await self.delete_documents(file_name)
```
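The effect of a missing `await` can be reproduced with plain `asyncio`, without FastAPI. In this minimal sketch the function names are illustrative stand-ins for your route and deletion function. Calling a coroutine function without awaiting it only creates a coroutine object, so its body never runs.

```python
import asyncio

deleted = []


async def delete_documents(file_name):
    await asyncio.sleep(0)  # stands in for real I/O
    deleted.append(file_name)


async def route_without_await(file_name):
    coro = delete_documents(file_name)  # coroutine object is created but never run
    coro.close()  # closed here only to silence the "never awaited" warning
    return "ok"


async def route_with_await(file_name):
    await delete_documents(file_name)
    return "ok"


asyncio.run(route_without_await("a.txt"))
after_missing_await = list(deleted)

asyncio.run(route_with_await("b.txt"))
after_await = list(deleted)

print(after_missing_await)  # []
print(after_await)          # ['b.txt']
```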
If an exception occurs during deletion (for example, a `NotFoundError` is raised), the subsequent steps, including persisting the docstore, might not be executed.

Make sure `self.persistence_path` is correctly set and points to the right directory where your docstore should be persisted.

The `persist` method is called on `self.docstore`, but it's not clear from the provided code where this `docstore` object is initialized or how it relates to the `vector_store_index`.

To address these points:

1. Await the `delete_documents` function in your FastAPI route:

```python
try:
    await self.delete_documents(file_name)
except NotFoundError as e:
    raise HTTPException(status_code=404, detail=str(e))
```
2. Review the `delete_documents` function.
3. Ensure that `self.docstore` is properly initialized and connected to your `vector_store_index`. You might need to access the docstore through the `vector_store_index`.
4. After the deletion process, you might want to refresh or rebuild your index to ensure it's fully up-to-date.
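To illustrate the concern in step 3, here is a toy sketch with hypothetical `ToyDocStore` and `ToyIndex` classes (not LlamaIndex types). It assumes the index holds its own docstore object and that a second, separately created store exists elsewhere in the application. Deleting through the index only changes the store the index holds, so persisting the other instance would write the old state back.

```python
class ToyDocStore:
    """Hypothetical docstore: a thin wrapper around a dict."""

    def __init__(self):
        self.docs = {}


class ToyIndex:
    """Hypothetical index that owns a reference to its docstore."""

    def __init__(self, docstore):
        self.docstore = docstore

    def delete_ref_doc(self, ref_doc_id, delete_from_docstore=False):
        if delete_from_docstore:
            self.docstore.docs.pop(ref_doc_id, None)


# The store the index actually uses.
index_store = ToyDocStore()
index_store.docs["doc1"] = "text"
index = ToyIndex(index_store)

# A second instance, e.g. one loaded independently from the same file.
separate_store = ToyDocStore()
separate_store.docs["doc1"] = "text"

index.delete_ref_doc("doc1", delete_from_docstore=True)

print("doc1" in index.docstore.docs)  # False
print("doc1" in separate_store.docs)  # True
```

Persisting via the object reachable from the index avoids this mismatch.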
By implementing these changes, you should be able to ensure that documents are properly deleted from both the ChromaDB collection and the docstore when the `delete_documents` function is called.

```python
from typing import List

from llama_index.core import Document  # import path assumed; it varies by LlamaIndex version


async def refresh_documents(self, documents: List[Document]) -> List[bool]:
    self.logger.info("Starting document refresh process")

    try:
        # Refresh the documents in the vector store index
        refreshed_docs = self.vector_store_index.refresh_ref_docs(
            documents,
            update_kwargs={"delete_kwargs": {"delete_from_docstore": True}}
        )

        # Process the documents through the ingestion pipeline
        processed_nodes = await self.ingestion_pipeline.arun(
            documents=documents, show_progress=True
        )

        # Update the Chroma collection
        for node in processed_nodes:
            self.chroma_collection.upsert(
                ids=[node.id_],
                embeddings=[node.embedding],
                documents=[node.text],
                metadatas=[node.metadata]
            )

        # Persist changes to the docstore
        self.docstore.persist(persist_path=f"{self.persistence_path}/docstore.json")

        self.logger.info(f"Refresh complete. {sum(refreshed_docs)} documents were updated or added.")
        return refreshed_docs

    except Exception as e:
        self.logger.error(f"Error during refresh: {str(e)}")
        raise
```
The `refresh_documents` function does the following: it uses the `refresh_ref_docs` method of the `VectorStoreIndex` to update existing documents and add new ones. This method returns a list of booleans indicating which documents were refreshed or added.

You can call it like this:

```python
documents_to_refresh = [
    Document(text="Updated content 1", id_="doc1"),
    Document(text="New content", id_="doc3"),
]
refreshed = await self.refresh_documents(documents_to_refresh)
```
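The list of booleans described above can be pictured with a small toy model. This sketch assumes that a refresh compares a stored content hash per document ID and reports `True` for new or changed documents. The helper `toy_refresh` is hypothetical and LlamaIndex's actual logic may differ in detail.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_refresh(stored_hashes, documents):
    """Return one bool per (doc_id, text) pair: True if the doc was added or updated."""
    refreshed = []
    for doc_id, text in documents:
        new_hash = content_hash(text)
        changed = stored_hashes.get(doc_id) != new_hash  # missing or different hash
        if changed:
            stored_hashes[doc_id] = new_hash
        refreshed.append(changed)
    return refreshed


stored = {
    "doc1": content_hash("Original content 1"),
    "doc2": content_hash("Content 2"),
}

result = toy_refresh(
    stored,
    [
        ("doc1", "Updated content 1"),  # existing ID, changed text
        ("doc2", "Content 2"),          # existing ID, unchanged text
        ("doc3", "New content"),        # new ID
    ],
)

print(result)       # [True, False, True]
print(sum(result))  # 2
```

Summing the list, as the log message in `refresh_documents` does, gives the number of documents that were updated or added.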
A few notes:

1. The function is `async` to match the asynchronous nature of your FastAPI setup. Make sure to `await` it when calling.
2. The `refresh_ref_docs` method is used instead of `refresh` as it's the more up-to-date method according to the LlamaIndex documentation.
3. The `update_kwargs` parameter in `refresh_ref_docs` ensures that documents are properly deleted from the docstore when updated.
4. This assumes your `Document` objects have an `id_` attribute. If they don't, you'll need to modify the function to generate or retrieve appropriate IDs.

The issue is likely that you're calling `delete_ref_doc` on the `vector_store_index` before deleting from the `storage_context.docstore`. The `delete_ref_doc` method might already be removing the document from the docstore, causing the subsequent `delete_ref_doc` call on the docstore to fail.

Key changes and explanations:

1. Removed the separate `self.storage_context.docstore.delete_ref_doc(id)` call, as `delete_ref_doc` on the `vector_store_index` should handle both index and docstore deletion when `delete_from_docstore=True`.
2. Added error handling around the `delete_ref_doc` call to catch and log any issues.
3. Moved the Chroma deletion after the index deletion to ensure consistency.
4. Changed `self.docstore.persist()` to `self.storage_context.docstore.persist()` to ensure we're persisting the correct docstore instance.

If you're still seeing issues after these changes, you might want to add some debugging statements to check the state of your docstore before and after deletion:
```python
def verify_deletion(self, doc_id: str) -> bool:
    docstore = self.storage_context.docstore

    # Check the ref-doc bookkeeping used by the index.
    # Note: this uses the docstore's get_ref_doc_info, since VectorStoreIndex
    # does not appear to expose a get_document method.
    if docstore.get_ref_doc_info(doc_id) is not None:
        print(f"Document {doc_id} is still tracked as a ref doc by the index.")
        return False
    print(f"Document {doc_id} is no longer tracked as a ref doc by the index.")

    # Check docstore
    if doc_id in docstore.docs:
        print(f"Document {doc_id} still exists in docstore.")
        return False
    print(f"Document {doc_id} not found in docstore.")

    # Check Chroma collection
    chroma_results = self.chroma_collection.get(ids=[doc_id])
    if chroma_results['ids']:
        print(f"Document {doc_id} still exists in Chroma collection.")
        return False
    print(f"Document {doc_id} not found in Chroma collection.")

    print(f"Document {doc_id} has been successfully deleted from all stores.")
    return True
```
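The four key changes listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original corrected code: `FakeDocStore`, `FakeIndex`, and `FakeCollection` are hypothetical stand-ins so the ordering can be run without LlamaIndex or Chroma, and the free function `delete_documents` takes its collaborators as arguments instead of reading them from `self`.

```python
import logging

logger = logging.getLogger("deletion_sketch")


class FakeDocStore:
    def __init__(self, docs):
        self.docs = dict(docs)
        self.persist_calls = 0

    def persist(self):
        self.persist_calls += 1


class FakeIndex:
    def __init__(self, docstore, fail_on=()):
        self.docstore = docstore
        self.fail_on = set(fail_on)

    def delete_ref_doc(self, ref_doc_id, delete_from_docstore=False):
        if ref_doc_id in self.fail_on:
            raise ValueError(f"cannot delete {ref_doc_id}")
        if delete_from_docstore:
            self.docstore.docs.pop(ref_doc_id, None)


class FakeCollection:
    def __init__(self, ids):
        self.ids = set(ids)

    def delete(self, ids):
        self.ids -= set(ids)


def delete_documents(index, collection, doc_ids):
    deleted = []
    for doc_id in doc_ids:
        try:
            # Single call handles index and docstore; no separate docstore delete.
            index.delete_ref_doc(doc_id, delete_from_docstore=True)
        except Exception as exc:
            logger.warning("Failed to delete %s: %s", doc_id, exc)
            continue
        deleted.append(doc_id)

    # Collection deletion happens after the index deletion, for successes only.
    if deleted:
        collection.delete(ids=deleted)

    # Persist the docstore instance that the index actually uses.
    index.docstore.persist()
    return deleted


docstore = FakeDocStore({"a": "A", "b": "B", "c": "C"})
index = FakeIndex(docstore, fail_on={"b"})
collection = FakeCollection({"a", "b", "c"})

deleted = delete_documents(index, collection, ["a", "b", "c"])
print(deleted)                 # ['a', 'c']
print(sorted(docstore.docs))   # ['b']
print(sorted(collection.ids))  # ['b']
```

Skipping the collection delete for failed IDs keeps the two stores in step, which is the consistency concern behind moving the Chroma deletion after the index deletion.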