
Updated last year


Hi all, whenever I attempt to remove a document from my index using index.delete_ref_doc(document_id, delete_from_docstore=True), it does not fully remove the document from the docstore. It seems like the docstore/metadata collection still contains an arbitrary (?) _id, as well as a doc_hash property. I checked the mongo_docstore, mongodb_kvstore and keyval_docstore files but cannot find out why this behaviour would occur. Any advice?
For context, I'm using a MongoDB index store / docstore and a Weaviate vector store.

The document is properly deleted from all other places as well
57 comments
Hmmm, I was not able to replicate this, at least locally

Plain Text
>>> from llama_index import Document, VectorStoreIndex
>>> doc1 = Document.example()
>>> doc2 = Document.example()
>>> doc1.doc_id = "doc1"
>>> doc2.doc_id = "doc2"
>>> index = VectorStoreIndex.from_documents([doc1, doc2])
>>> index.storage_context.persist(persist_dir="./storage_1")
>>> index.delete_ref_doc("doc1", delete_from_docstore=True)
>>> index.storage_context.persist(persist_dir="./storage_2")


doc1 is not present anywhere 🤔

Looking at the code, it seems almost impossible that the entry doesn't get deleted from the metadata collection 🤯
Plain Text
def delete_ref_doc(self, ref_doc_id: str, raise_error: bool = True) -> None:
    """Delete a ref_doc and all it's associated nodes."""
    ref_doc_info = self.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        if raise_error:
            raise ValueError(f"ref_doc_id {ref_doc_id} not found.")
        else:
            return
  
    for doc_id in ref_doc_info.node_ids:
        self.delete_document(doc_id, raise_error=False, remove_ref_doc_node=False)
  
    self._kvstore.delete(ref_doc_id, collection=self._metadata_collection)
    self._kvstore.delete(ref_doc_id, collection=self._ref_doc_collection)


The only possibility is that the first check returned early 🤔
Hmmmmm let me set raise_error to True and check what's going on.
raise_error is set to True, yet no exception is raised
The vector store is emptied. With that logic I can somewhat rule out that ref_doc_info is None, right?
I think so 🤔 The code I sent above is the delete function in the docstore

But when you call delete on an index, here's what gets run
Plain Text
def delete_ref_doc(
    self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any
) -> None:
    """Delete a document and it's nodes by using ref_doc_id."""
    ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
        return

    self.delete_nodes(
        ref_doc_info.node_ids,
        delete_from_docstore=False,
        **delete_kwargs,
    )

    if delete_from_docstore:
        self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)
I have my logging on debug, I think the warning would already show up too
But why would ref_doc_info be None? Unless this was storage that you created before we properly tracked this? 🥴
When I call the delete, I also delete another record using the same MongoDB client, which works as expected. So that rules out MongoDB behaving unexpectedly too
The StorageContext gets created at runtime and the only thing that changes per request is the WeaviateVectorStore, to avoid the problem I had earlier, where Weaviate would store all entries under the same index name
so get_ref_doc_info relies on the ref_doc_collection. This always gets updated when you insert, but only if node.ref_doc_id is not None

Then on delete, it pulls from that same ref doc collection, and then goes about deleting things 🤔

Really not sure where things are going wrong here
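
To make the dependency on the ref doc collection concrete, here is a toy in-memory model of the bookkeeping described above. It is only a sketch: `TinyDocstore` and its method names are hypothetical and are not the llama_index classes. It assumes metadata is keyed by node id and that a node is registered under its ref doc only when `ref_doc_id` is set.

```python
import hashlib
from typing import Dict, List, Optional


class TinyDocstore:
    """Hypothetical in-memory stand-in for a key-value docstore."""

    def __init__(self) -> None:
        self.data: Dict[str, dict] = {}
        self.metadata: Dict[str, dict] = {}
        self.ref_doc_info: Dict[str, List[str]] = {}

    def add_node(self, node_id: str, text: str, ref_doc_id: Optional[str] = None) -> None:
        self.data[node_id] = {"text": text}
        self.metadata[node_id] = {
            "doc_hash": hashlib.sha256(text.encode("utf-8")).hexdigest()
        }
        # The ref doc collection is only updated when a ref_doc_id is present.
        if ref_doc_id is not None:
            self.ref_doc_info.setdefault(ref_doc_id, []).append(node_id)

    def delete_ref_doc(self, ref_doc_id: str) -> bool:
        node_ids = self.ref_doc_info.get(ref_doc_id)
        if node_ids is None:
            # Nothing is tracked for this id, so nothing gets deleted.
            return False
        for node_id in node_ids:
            self.data.pop(node_id, None)
            self.metadata.pop(node_id, None)
        del self.ref_doc_info[ref_doc_id]
        return True


store = TinyDocstore()
store.add_node("n1", "tracked text", ref_doc_id="doc1")
store.add_node("n2", "untracked text")  # no ref_doc_id, so never registered

deleted_tracked = store.delete_ref_doc("doc1")
deleted_untracked = store.delete_ref_doc("doc2")
```

In this toy model, a node that was never registered in `ref_doc_info` keeps its metadata record (including `doc_hash`) after the delete call. Whether that is what happened in this thread is not established.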
logs are:

Plain Text
api_1       | DEBUG:root:New index name: QApp_4d1c9227_7319_4a6d_8928_8151e13299f3
api_1       | INFO:llama_index.indices.loading:Loading indices with ids: ['4d1c9227-7319-4a6d-8928-8151e13299f3']
api_1       | DEBUG:root:About to remove child_nodes with ids :['47f8a0c4-cc7e-469f-9b4d-8b08884e783a', 'bd039b1b-4bf4-40bc-9341-8acfc793283d', '9ddd8bcc-4594-404d-a307-ee78664381b1', 'af9e0d12-ad13-4a97-8cd7-f546f4565d59', '86c2fb67-e111-4feb-ba99-4dd1d7c69f70']
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/54360120-a9db-4a5e-b738-8abfffc99614 HTTP/1.1" 204 0
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/09a35b71-5406-4c0c-b73c-09b2cbb56144 HTTP/1.1" 204 0
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/e8bdf8b0-47d9-42b7-9860-51ecc30a33cd HTTP/1.1" 204 0
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/2ed4142f-fe95-494e-8c6a-f45932db1ffb HTTP/1.1" 204 0
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1       | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/453c85d7-68ec-46e7-ba56-7355b3acc54d HTTP/1.1" 204 0
api_1       | DEBUG:fsspec.local:open file: /app/src/storage/graph_store.json
Which does show it's only sending out delete requests to weaviate
But I'm not getting any logs from urllib3 related to Mongodb whenever I create documents either, so idk how to take that really
are you able to inspect the mongodb database after you create the index? Does the ref doc collection make sense?
Yeah I got mongodb compass running here
wait, which one is the ref_doc_collection?
I renamed my collections 😅
is that the one for the docstore?
Yea, the docstore has three collections under it

Should be f"{self._namespace}/ref_doc_info"
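
A quick way to check this is to compare the collection names that exist in the database with the three names the docstore is expected to use. The sketch below assumes the `{namespace}/data`, `{namespace}/ref_doc_info` and `{namespace}/metadata` naming quoted in this thread, and assumes a default namespace of `"docstore"`. The helper names are hypothetical.

```python
from typing import Iterable, List


def expected_docstore_collections(namespace: str = "docstore") -> List[str]:
    """Collection names the docstore is expected to write to (assumed naming)."""
    return [
        f"{namespace}/data",
        f"{namespace}/ref_doc_info",
        f"{namespace}/metadata",
    ]


def missing_docstore_collections(
    existing: Iterable[str], namespace: str = "docstore"
) -> List[str]:
    """Return the expected collection names that are not in `existing`."""
    present = set(existing)
    return [
        name
        for name in expected_docstore_collections(namespace)
        if name not in present
    ]


# With pymongo, `existing` could come from
# MongoClient(uri)[db_name].list_collection_names(); here it is hard-coded.
existing = ["docstore/data"]
missing = missing_docstore_collections(existing)
```

If `missing` is not empty after documents have been inserted, the writes are probably not reaching the database or namespace being inspected.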
Mine only has one?
I'm not doing anything ridiculous here right?
Plain Text
def _setup_mongo_docstore(app: FastAPI):
    """Setup MongoDB connection."""
    mongodb_client = MongoDocumentStore.from_uri(
        _determine_mongodb_uri(),
        db_name=settings.db_docstore_collection,
    )
    app.state.mongo_docstore = mongodb_client
For clarity's sake
Yea that looks right...
storage context
Plain Text
def _setup_storage_context(app: FastAPI):
    """Setup LLamaIndex StorageContext class"""
    storage_context = StorageContext.from_defaults(
        docstore=app.state.mongo_docstore,
        vector_store=app.state.vector_store,
        index_store=app.state.mongo_indexstore,
    )
    app.state.storage_context = storage_context
Those are my only references in the project to the docstore
Plain Text
def __init__(
    self,
    kvstore: BaseKVStore,
    namespace: Optional[str] = None,
) -> None:
    """Init a KVDocumentStore."""
    self._kvstore = kvstore
    self._namespace = namespace or DEFAULT_NAMESPACE
    self._node_collection = f"{self._namespace}/data"
    self._ref_doc_collection = f"{self._namespace}/ref_doc_info"
    self._metadata_collection = f"{self._namespace}/metadata"


It uses those three collections, and calls self._kvstore.put(key, val, collection=....) for each collection type 😅

put() for mongodb is this

Plain Text
def put(
    self,
    key: str,
    val: dict,
    collection: str = DEFAULT_COLLECTION,
) -> None:
    """Put a key-value pair into the store.
    
    Args:
        key (str): key
        val (dict): value
        collection (str): collection name
    
    """
    val = val.copy()
    val["_id"] = key
    self._db[collection].replace_one(
        {"_id": key},
        val,
        upsert=True,
    )
How is it not creating more than one collection 🥴
Yeah I checked this out as well, this is the Mongodbkvstore right?
Yup 👍
And it's used by the KVDocumentStore
When is it supposed to initialize those collections?
I guess the moment the kvdocumentstore gets initialized?
when it calls add_documents() it inserts into all those collections

But there isn't any initialization really, it just starts throwing stuff into them (I'm assuming mongodb just handles that? Or is supposed to?)
Yeah I think that's how it works. The moment a new key gets referenced to the client it will insert, or create a new collection and insert
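
The "create on first insert" behaviour can be illustrated with a small in-memory stand-in. This is a sketch only: `LazyKVStore` is a hypothetical class, not the MongoDB key-value store, and its default collection name `"data"` is made up. Its `put` mirrors the copy, set `_id`, and upsert steps from the `put()` shown earlier in the thread.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class LazyKVStore:
    """Hypothetical in-memory key-value store with lazily created collections."""

    def __init__(self) -> None:
        # A collection appears the first time something is written to it.
        self._db: Dict[str, Dict[str, dict]] = defaultdict(dict)

    def put(self, key: str, val: dict, collection: str = "data") -> None:
        val = val.copy()  # do not mutate the caller's dict
        val["_id"] = key
        self._db[collection][key] = val  # insert or replace (upsert)

    def get(self, key: str, collection: str = "data") -> Optional[dict]:
        # .get on the outer dict so that reads never create a collection
        return self._db.get(collection, {}).get(key)

    def delete(self, key: str, collection: str = "data") -> bool:
        return self._db.get(collection, {}).pop(key, None) is not None

    def collection_names(self) -> List[str]:
        return sorted(self._db)


kv = LazyKVStore()
names_before = kv.collection_names()
kv.put("doc1", {"doc_hash": "abc"}, collection="docstore/metadata")
names_after = kv.collection_names()
```

In this model, a store that only ever shows one collection has only ever received writes for that one collection name.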
I'm gonna get some groceries real quick, I'll be back to debug in ~40 mins tops 😅
haha sounds good!
Pfffft I've been browsing through the source code but I cannot find any reason for this behaviour to occur
Maybe you need to try inserting nodes directly into the docstore? Or setting a break point and debugging? 😅

Does mongodb atlas have some limit on the number of collections?
Nope, I quit MongoDB Atlas for now. It's just MongoDB running locally in a container rn
damn 😅 At this point I'd be using pdb.set_trace() inside of the llama-index code to make sure the docstore is actually calling the functions I think it is

Does the MongoDB container have any logs for insertions?
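
As an alternative to stepping through with pdb, a thin wrapper can record which methods are actually called on a store. The sketch below is generic Python: `CallRecorder` and `FakeKVStore` are hypothetical names. Wrapping the real key-value store this way (for example by swapping the docstore's `_kvstore` attribute) is an untested assumption.

```python
from typing import Any, Dict, List, Tuple


class CallRecorder:
    """Wrap any object and record each method call made through the wrapper."""

    def __init__(self, target: Any) -> None:
        self._target = target
        self.calls: List[Tuple[str, tuple, Dict[str, Any]]] = []

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes not found on the wrapper itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args: Any, **kwargs: Any) -> Any:
            self.calls.append((name, args, kwargs))
            return attr(*args, **kwargs)

        return wrapper


class FakeKVStore:
    """Hypothetical stand-in for a key-value store with a delete method."""

    def __init__(self) -> None:
        self.deleted: List[Tuple[str, str]] = []

    def delete(self, key: str, collection: str = "data") -> bool:
        self.deleted.append((collection, key))
        return True


recorder = CallRecorder(FakeKVStore())
recorder.delete("doc1", collection="docstore/metadata")
```

After the delete path has run, `recorder.calls` lists every call with its arguments, which shows whether a delete was issued for the metadata collection at all.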
Hmm I don't think so
But I might be able to find something, sec
Never mind, I just attempted to merge some code and now everything is broken 😂
see you in two hours
rip :PepeHands:
Okay I'm back to business
Mongodb logs in my container indicate nothing is even accessing the database, which is kinda weird
Somehow all basic functionality broke, I'll try again tomorrow because this isn't leading anywhere
Thanks for your insights so far!
Dang, sorry to hear that 😅 Hopefully more luck tomorrow!