Updated 4 months ago

Hey folks, sorry if this is a simple question: I'm loading docs/nodes from the docstore, filtering/modifying them, and sometimes need to call adelete_document to delete nodes from the docstore. I use the returned nodes' node_id field to delete, but I am getting this error message:
list.remove(x): x not in list

Am I supposed to use a different field here and not node_id?
Is that an error from your own code or from inside the framework?
@Logan M It appears to be coming from llama_index if i'm not mistaken
Plain Text
ag:dev:     await mongo_storage_context.docstore.adelete_document(
rag:dev:   File "/opt/homebrew/lib/python3.11/site-packages/llama_index/core/storage/docstore/keyval_docstore.py", line 459, in adelete_document
rag:dev:     await self._aremove_ref_doc_node(doc_id)
rag:dev:   File "/opt/homebrew/lib/python3.11/site-packages/llama_index/core/storage/docstore/keyval_docstore.py", line 427, in _aremove_ref_doc_node
rag:dev:     ref_doc_obj.node_ids.remove(doc_id)
rag:dev: ValueError: list.remove(x): x not in list
oh hmm. I wonder if this is because ref_doc_id is not being set properly for nodes when inserting
The way i insert nodes to the docstore is
Plain Text
storage_context.docstore.async_add_documents(
    nodes=batch, batch_size=len(batch), allow_update=True)

Is this incorrect?
However, for deletion, i first load from docstore and get all the nodes using
Plain Text
docs = storage_context.docstore.docs.values()

And then make a unique set of the node_ids i need to delete for and use those to call deletion
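The "unique set of node_ids" step could look roughly like the sketch below. The helper name `unique_node_ids` and the `should_delete` predicate are hypothetical, and the `SimpleNamespace` objects only stand in for real nodes; in the thread's setup, `nodes` would be `storage_context.docstore.docs.values()`.

```python
from types import SimpleNamespace


def unique_node_ids(nodes, should_delete):
    """Collect node_id values to delete, de-duplicated, in first-seen order."""
    seen = {}
    for node in nodes:
        if should_delete(node):
            # A dict keeps insertion order and drops repeated ids.
            seen.setdefault(node.node_id, None)
    return list(seen)


if __name__ == "__main__":
    # Stand-in nodes; real nodes expose node_id and metadata in a similar way.
    nodes = [
        SimpleNamespace(node_id="n1", metadata={"source": "a.pdf"}),
        SimpleNamespace(node_id="n2", metadata={"source": "b.csv"}),
        SimpleNamespace(node_id="n1", metadata={"source": "a.pdf"}),
    ]
    ids = unique_node_ids(nodes, lambda n: n.metadata["source"] == "a.pdf")
    print(ids)
```

Keeping the ids in first-seen order makes it easier to match log output against the order in which the nodes were loaded.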
@Logan M Do you think there's something wrong with the way I'm trying to delete old nodes? I might have to add a workaround, as my use case specifically needs to "upsert" if the source was parsed and stored previously.
Hi @Logan M, sorry to double ping, but this is currently a major issue for our app. Aside from implementing our own MongoDB client for deletion, is there a native workaround for this issue in LlamaIndex?
You might have to give me more context/a way to duplicate this issue in a minimal example. I can't help much without having it on my end to replicate
@Logan M My apologies. We store the nodes by calling
Plain Text
storage_context.docstore.async_add_documents(
    nodes=batch, batch_size=len(batch), allow_update=True)

We delete by first calling
Plain Text
docs = storage_context.docstore.docs.values()

to get all the nodes from the docstore, and then call this to delete them, after mapping each node to its unique node_id and looping.
Plain Text
await storage_context.docstore.adelete_document(
    doc_id=node_id)

It seems to be reproducible for us with PDFs but CSVs are able to delete fine (Maybe because I only store IndexNodes for CSVs). If you need more context, please let me know.
and I'm guessing batch is a list of nodes?
Yes, batch is a list of nodes
One thing to mention is that not all nodes fail to delete. Often 10-20% of the nodes delete successfully, but the rest fail with the error list.remove(x): x not in list.

Maybe I need to add a try/except block so that even if a node's id isn't in the ref doc's node list, it still deletes what it can while logging the ones it couldn't?
that's why having a reproducible case helps 😅 I'm wondering how it even gets to this state.

Yes, adding a try/except is an easy bandaid
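A minimal sketch of that try/except bandaid, assuming the only failure to tolerate is the ValueError from the traceback above. `delete_nodes_best_effort` and `FakeDocstore` are hypothetical names; the fake store is only there so the sketch runs without MongoDB, and with a real docstore you would pass `storage_context.docstore` instead.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def delete_nodes_best_effort(docstore, node_ids):
    """Try to delete every id; return (deleted, failed) instead of aborting."""
    deleted, failed = [], []
    # dict.fromkeys de-duplicates while preserving order.
    for node_id in dict.fromkeys(node_ids):
        try:
            await docstore.adelete_document(doc_id=node_id)
        except ValueError as exc:
            # e.g. "list.remove(x): x not in list" from the traceback above
            logger.warning("Could not delete node %s: %s", node_id, exc)
            failed.append(node_id)
        else:
            deleted.append(node_id)
    return deleted, failed


class FakeDocstore:
    """Hypothetical stand-in so the sketch runs without MongoDB."""

    def __init__(self, failing_ids):
        self.failing_ids = set(failing_ids)
        self.deleted = []

    async def adelete_document(self, doc_id):
        if doc_id in self.failing_ids:
            raise ValueError("list.remove(x): x not in list")
        self.deleted.append(doc_id)


if __name__ == "__main__":
    store = FakeDocstore(failing_ids={"b"})
    deleted, failed = asyncio.run(
        delete_nodes_best_effort(store, ["a", "b", "a", "c"])
    )
    print(deleted, failed)
```

Note that ids returned in `failed` may still be present in the docstore, since the exception can interrupt the delete partway through, so they are worth logging and retrying after an upgrade.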
This works fine for me
Plain Text
from llama_index.core.schema import Document
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.storage.docstore.mongodb import MongoDocumentStore

async def main():
    docstore = MongoDocumentStore.from_host_and_port("localhost", 27017)
    
    document = Document.example()
    document.metadata = {}
    nodes = TokenTextSplitter(chunk_size=25, chunk_overlap=0)([document])

    print(f"Adding {len(nodes)} nodes to the docstore.")

    await docstore.async_add_documents(nodes, batch_size=len(nodes), allow_update=True)
    
    nodes_dict = docstore.docs

    print(f"Retrieved {len(nodes_dict)} nodes from the docstore. Now deleting them.")

    for id_, node in nodes_dict.items():
        await docstore.adelete_document(id_)

if __name__ == "__main__":
    import asyncio 
    asyncio.run(main())
We also recently made the delete logic a lot less complicated and cleaned it up.

I'm wondering if you still encounter this issue on the latest version of llama-index-core,
because I see we already check for this explicitly:

Plain Text
if doc_id in ref_doc_obj.node_ids:  # sanity check
  ref_doc_obj.node_ids.remove(doc_id)
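To see why that guard matters: `list.remove` raises ValueError when the value is absent, so an unguarded call fails for any node id missing from its ref doc's list. A small sketch with a hypothetical stand-in class (`RefDocInfoSketch` is not the library's actual type):

```python
from dataclasses import dataclass, field


@dataclass
class RefDocInfoSketch:
    """Hypothetical stand-in for the object that tracks a ref doc's node ids."""

    node_ids: list = field(default_factory=list)


def remove_node_id(ref_doc, doc_id):
    """Remove doc_id if present; return whether anything was removed."""
    if doc_id in ref_doc.node_ids:  # same membership check as the snippet above
        ref_doc.node_ids.remove(doc_id)
        return True
    return False


if __name__ == "__main__":
    ref = RefDocInfoSketch(node_ids=["n1", "n2"])
    print(remove_node_id(ref, "n1"))       # id was present, so it is removed
    print(remove_node_id(ref, "missing"))  # id is absent, so no exception is raised
    try:
        ref.node_ids.remove("missing")     # the unguarded call from the traceback
    except ValueError as exc:
        print(f"unguarded remove failed: {exc}")
```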
Hmm, gotcha. I'll try updating my version; I'm currently on 0.10.26. Will come back and let you know if that solves the issue!
Thanks! latest is v0.10.29
Ope, yeah, I think going to version 0.10.29 fixed it lol
Thank you for the help, I really appreciate it
Yea for sure! It was actually this thread that made me remember to add that sanity check a few days ago lol so thanks for reporting!
No problem, happy to do my part!