Hmmm, I was not able to replicate this, at least locally
>>> from llama_index import Document, VectorStoreIndex
>>> doc1 = Document.example()
>>> doc2 = Document.example()
>>> doc1.doc_id = "doc1"
>>> doc2.doc_id = "doc2"
>>> index = VectorStoreIndex.from_documents([doc1, doc2])
>>> index.storage_context.persist(persist_dir="./storage_1")
>>> index.delete_ref_doc("doc1", delete_from_docstore=True)
>>> index.storage_context.persist(persist_dir="./storage_2")
doc1 is not present anywhere
Looking at the code, it feels almost impossible that the metadata collection doesn't get deleted from 🤯
def delete_ref_doc(self, ref_doc_id: str, raise_error: bool = True) -> None:
    """Delete a ref_doc and all it's associated nodes."""
    ref_doc_info = self.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        if raise_error:
            raise ValueError(f"ref_doc_id {ref_doc_id} not found.")
        else:
            return
    for doc_id in ref_doc_info.node_ids:
        self.delete_document(doc_id, raise_error=False, remove_ref_doc_node=False)
    self._kvstore.delete(ref_doc_id, collection=self._metadata_collection)
    self._kvstore.delete(ref_doc_id, collection=self._ref_doc_collection)
The only possibility is that the first check returned early
Hmmmmm let me set raise_error to True and check what's going on.
raise_error is set to True, yet no exception is raised
The vector store gets emptied, so by that logic I can more or less rule out that ref_doc_info is None, right?
I think so. The code I sent above is the delete function in the docstore
But when you call delete on an index, here's what gets run
def delete_ref_doc(
    self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any
) -> None:
    """Delete a document and it's nodes by using ref_doc_id."""
    ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
        return
    self.delete_nodes(
        ref_doc_info.node_ids,
        delete_from_docstore=False,
        **delete_kwargs,
    )
    if delete_from_docstore:
        self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)
I have my logging on debug, I think the warning would already show up too
But why would ref_doc_info be None? Unless this was storage that you created before we properly tracked this? 🥴
When I call the delete, I also delete another record using the same mongodb client, which works as expected. So that rules out MongoDB behaving unexpectedly too
The StorageContext gets created at runtime and the only thing that changes per request is the WeaviateVectorStore, to avoid the problem I had earlier, where weaviate would store all entries under the same index name
so get_ref_doc_info relies on the ref_doc_collection. This always gets updated when you insert, but only if node.ref_doc_id is not None
Then on delete, it pulls from that same ref doc collection, and then goes about deleting things
Really not sure where things are going wrong here
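For anyone following along, here's a rough toy model of the flow described above. To be clear, this is not the llama_index code: TinyDocStore, add_node and the three plain dicts are made-up stand-ins for the data / ref_doc_info / metadata collections, and the metadata handling is simplified. It only shows how a node inserted without a ref_doc_id never lands in ref_doc_info, so a later delete by that id returns early and leaves everything else in place.

```python
from typing import Dict, Optional


class TinyDocStore:
    """Toy stand-in for a KV-backed docstore; hypothetical, not llama_index code."""

    def __init__(self) -> None:
        self.data: Dict[str, dict] = {}          # plays the role of "<namespace>/data"
        self.ref_doc_info: Dict[str, dict] = {}  # "<namespace>/ref_doc_info"
        self.metadata: Dict[str, dict] = {}      # "<namespace>/metadata"

    def add_node(self, node_id: str, ref_doc_id: Optional[str]) -> None:
        self.data[node_id] = {"ref_doc_id": ref_doc_id}
        self.metadata[node_id] = {"ref_doc_id": ref_doc_id}
        # The ref doc is only tracked when the node carries a ref_doc_id.
        if ref_doc_id is not None:
            info = self.ref_doc_info.setdefault(ref_doc_id, {"node_ids": []})
            info["node_ids"].append(node_id)
            self.metadata[ref_doc_id] = {}

    def delete_ref_doc(self, ref_doc_id: str, raise_error: bool = True) -> bool:
        info = self.ref_doc_info.get(ref_doc_id)
        if info is None:
            if raise_error:
                raise ValueError(f"ref_doc_id {ref_doc_id} not found.")
            return False  # early return: none of the deletes below run
        for node_id in info["node_ids"]:
            self.data.pop(node_id, None)
            self.metadata.pop(node_id, None)
        self.metadata.pop(ref_doc_id, None)
        self.ref_doc_info.pop(ref_doc_id, None)
        return True


store = TinyDocStore()
store.add_node("n1", ref_doc_id="doc1")  # tracked in ref_doc_info
store.add_node("n2", ref_doc_id=None)    # never reaches ref_doc_info

tracked_deleted = store.delete_ref_doc("doc1", raise_error=False)
untracked_deleted = store.delete_ref_doc("doc2", raise_error=False)
print(tracked_deleted, untracked_deleted, sorted(store.metadata))
```

If the real ref_doc_info collection is empty or missing in Compass, the second case is the one to suspect: the delete is a silent no-op when raise_error is False.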
logs are:
api_1 | DEBUG:root:New index name: QApp_4d1c9227_7319_4a6d_8928_8151e13299f3
api_1 | INFO:llama_index.indices.loading:Loading indices with ids: ['4d1c9227-7319-4a6d-8928-8151e13299f3']
api_1 | DEBUG:root:About to remove child_nodes with ids :['47f8a0c4-cc7e-469f-9b4d-8b08884e783a', 'bd039b1b-4bf4-40bc-9341-8acfc793283d', '9ddd8bcc-4594-404d-a307-ee78664381b1', 'af9e0d12-ad13-4a97-8cd7-f546f4565d59', '86c2fb67-e111-4feb-ba99-4dd1d7c69f70']
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/54360120-a9db-4a5e-b738-8abfffc99614 HTTP/1.1" 204 0
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/09a35b71-5406-4c0c-b73c-09b2cbb56144 HTTP/1.1" 204 0
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/e8bdf8b0-47d9-42b7-9860-51ecc30a33cd HTTP/1.1" 204 0
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/2ed4142f-fe95-494e-8c6a-f45932db1ffb HTTP/1.1" 204 0
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "POST /v1/graphql HTTP/1.1" 200 127
api_1 | DEBUG:urllib3.connectionpool:http://weaviate:8080 "DELETE /v1/objects/QApp_4d1c9227_7319_4a6d_8928_8151e13299f3/453c85d7-68ec-46e7-ba56-7355b3acc54d HTTP/1.1" 204 0
api_1 | DEBUG:fsspec.local:open file: /app/src/storage/graph_store.json
Which does show it's only sending out delete requests to weaviate
But I'm not getting any urllib3 logs related to MongoDB when I create documents either, so idk what to make of that really
are you able to inspect the mongodb database after you create the index? Does the ref doc collection make sense?
Yeah I got mongodb compass running here
wait, which one is the ref_doc_collection?
I renamed my collections
is that the one for the docstore?
Yea, the docstore has three collections under it
Should be f"{self._namespace}/ref_doc_info"
I'm not doing anything ridiculous here right?
def _setup_mongo_docstore(app: FastAPI):
    """Setup MongoDB connection."""
    mongodb_client = MongoDocumentStore.from_uri(
        _determine_mongodb_uri(),
        db_name=settings.db_docstore_collection,
    )
    app.state.mongo_docstore = mongodb_client
storage context
def _setup_storage_context(app: FastAPI):
    """Setup LLamaIndex StorageContext class"""
    storage_context = StorageContext.from_defaults(
        docstore=app.state.mongo_docstore,
        vector_store=app.state.vector_store,
        index_store=app.state.mongo_indexstore,
    )
    app.state.storage_context = storage_context
Those are my only references in the project to the docstore
def __init__(
    self,
    kvstore: BaseKVStore,
    namespace: Optional[str] = None,
) -> None:
    """Init a KVDocumentStore."""
    self._kvstore = kvstore
    self._namespace = namespace or DEFAULT_NAMESPACE
    self._node_collection = f"{self._namespace}/data"
    self._ref_doc_collection = f"{self._namespace}/ref_doc_info"
    self._metadata_collection = f"{self._namespace}/metadata"
It uses those three collections, and calls self._kvstore.put(key, val, collection=...) for each collection type
put() for mongodb is this
def put(
    self,
    key: str,
    val: dict,
    collection: str = DEFAULT_COLLECTION,
) -> None:
    """Put a key-value pair into the store.

    Args:
        key (str): key
        val (dict): value
        collection (str): collection name
    """
    val = val.copy()
    val["_id"] = key
    self._db[collection].replace_one(
        {"_id": key},
        val,
        upsert=True,
    )
How is it not creating more than one collection 🥴
Yeah I checked this out as well, this is the MongoDBKVStore right?
And it's used by the KVDocumentStore
When is it supposed to initialize those collections?
I guess the moment the KVDocumentStore gets initialized?
when it calls add_documents() it inserts into all those collections
But there isn't any initialization really, it just starts throwing stuff into them (I'm assuming mongodb just handles that? Or is supposed to?)
Yeah I think that's how it works. The moment a new key gets referenced to the client it will insert, or create a new collection and insert
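A small in-memory sketch of that "collection appears on first write" behaviour, in case it helps when comparing against what Compass shows. FakeDB and expected_collections are hypothetical helpers, not pymongo or llama_index APIs, and the "docstore" namespace is just a placeholder; the put() here only mimics the upsert-by-_id idea from the snippet above.

```python
from typing import Dict, List


class FakeDB:
    """In-memory stand-in for a database that creates collections on first write."""

    def __init__(self) -> None:
        self._collections: Dict[str, Dict[str, dict]] = {}

    def list_collection_names(self) -> List[str]:
        return sorted(self._collections)

    def put(self, key: str, val: dict, collection: str) -> None:
        doc = dict(val)  # copy so the caller's dict is not mutated
        doc["_id"] = key
        # setdefault creates the collection lazily; the assignment upserts by key.
        self._collections.setdefault(collection, {})[key] = doc


def expected_collections(namespace: str) -> List[str]:
    """The three collection names a KV docstore would derive from its namespace."""
    suffixes = ("data", "ref_doc_info", "metadata")
    return sorted(f"{namespace}/{suffix}" for suffix in suffixes)


db = FakeDB()
before = db.list_collection_names()

# Only two of the three collections ever receive a write here.
db.put("node-1", {"text": "hello"}, collection="docstore/data")
db.put("node-1", {"doc_hash": "abc"}, collection="docstore/metadata")

after = db.list_collection_names()
missing = [name for name in expected_collections("docstore") if name not in after]
print(before, after, missing)
```

The point of the comparison at the end: a collection that never shows up means no put() ever targeted it, which narrows the search to the insert path rather than the delete path.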
I'm gonna get some groceries real quick, I'll be back to debug in ~40 mins tops
Pfffft I've been browsing through the source code but I cannot find any reason for this behaviour to occur
Maybe you need to try inserting nodes directly into the docstore? Or setting a break point and debugging?
Does mongodb atlas have some limit on the number of collections?
Nope, I dropped MongoDB Atlas for now, it's just mongodb running locally in a container rn
damn
At this point I'd be using pdb.set_trace() inside of the llama-index code to make sure the docstore is actually calling the functions I think it is
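If editing the installed package feels too invasive, a lighter option might be to wrap the method from the outside and record every call. This is only a sketch: trace_method and DummyKVStore are made up for illustration, and it assumes the target object allows attribute assignment.

```python
import functools
from typing import Any, Callable, List, Tuple


def trace_method(
    obj: Any, method_name: str, calls: List[Tuple[str, tuple, dict]]
) -> Callable:
    """Replace obj.<method_name> with a wrapper that records each call."""
    original = getattr(obj, method_name)  # bound method, so no explicit self needed

    @functools.wraps(original)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        calls.append((method_name, args, kwargs))
        return original(*args, **kwargs)

    setattr(obj, method_name, wrapper)
    return original  # keep this around to restore the method later


class DummyKVStore:
    """Hypothetical stand-in for the real key-value store."""

    def __init__(self) -> None:
        self.rows: dict = {}

    def put(self, key: str, val: dict, collection: str = "data") -> None:
        self.rows[(collection, key)] = val


kv = DummyKVStore()
calls: list = []
trace_method(kv, "put", calls)
kv.put("k1", {"a": 1}, collection="docstore/data")
print(calls)
```

Pointed at the docstore's _kvstore attribute (the one set in the __init__ shown earlier), an empty calls list after an insert would mean put() is never reached at all.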
Does the mongodb container have any logs for insertions?
But I might be able to find something, sec
Nevermind I just attempted to merge some code and now everything is broken
Okay I'm back to business
Mongodb logs in my container indicate nothing is even accessing the database, which is kinda weird
Somehow all basic functionality broke, I'll try again tomorrow because this isn't leading anywhere
Thanks for your insights so far!
Dang, sorry to hear that
Hopefully more luck tomorrow!