Hey folks, sorry if this is a simple


Hey folks, sorry if this is a simple question: I'm loading docs/nodes from the docstore, filtering/modifying them, and sometimes need to call adelete_document to delete nodes from the docstore. I use each returned node's node_id field for the deletion, but I am getting this error message:
list.remove(x): x not in list

Am I supposed to use a different field here instead of node_id?

22 comments

Logan M

Is that an error from your own code or from inside the framework?

kimchinosys

@Logan M It appears to be coming from llama_index, if I'm not mistaken:

Plain Text

rag:dev:     await mongo_storage_context.docstore.adelete_document(
rag:dev:   File "/opt/homebrew/lib/python3.11/site-packages/llama_index/core/storage/docstore/keyval_docstore.py", line 459, in adelete_document
rag:dev:     await self._aremove_ref_doc_node(doc_id)
rag:dev:   File "/opt/homebrew/lib/python3.11/site-packages/llama_index/core/storage/docstore/keyval_docstore.py", line 427, in _aremove_ref_doc_node
rag:dev:     ref_doc_obj.node_ids.remove(doc_id)
rag:dev: ValueError: list.remove(x): x not in list

Logan M

oh hmm. I wonder if this is because ref_doc_id is not being set properly for nodes when inserting
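One way to test this hypothesis is to compare each node's ref_doc_id against what the docstore records for that reference document. The sketch below is only a diagnostic idea, not something from the thread: `find_untracked_nodes` is a hypothetical helper, and it assumes the docstore exposes `docs` and `get_ref_doc_info()` the way llama_index's key-value docstores do. A non-empty result would be consistent with the ValueError in the traceback.

```python
from typing import Any, List


def find_untracked_nodes(docstore: Any) -> List[str]:
    """Return IDs of nodes that their ref doc entry does not list (hypothetical helper)."""
    untracked = []
    for node_id, node in docstore.docs.items():
        ref_doc_id = getattr(node, "ref_doc_id", None)
        if ref_doc_id is None:
            # No parent document recorded, so there is nothing to cross-check.
            continue
        ref_info = docstore.get_ref_doc_info(ref_doc_id)
        # Flag nodes whose ref doc entry is missing or does not list them.
        if ref_info is None or node_id not in ref_info.node_ids:
            untracked.append(node_id)
    return untracked
```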

kimchinosys

The way I insert nodes into the docstore is:

Plain Text

await storage_context.docstore.async_add_documents(
    batch, batch_size=len(batch), allow_update=True)

Is this incorrect?

kimchinosys

However, for deletion, I first load from the docstore and get all the nodes using:

Plain Text

docs = storage_context.docstore.docs.values()

Then I build a unique set of the node_ids I need to delete and call the deletion for each of them.
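A rough sketch of that collection step, assuming the nodes to delete are picked by a metadata field. The `file_name` key and the `collect_node_ids` helper are illustrative assumptions, not details from the thread.

```python
from typing import Any, Iterable, Set


def collect_node_ids(nodes: Iterable[Any], file_name: str) -> Set[str]:
    """Gather the unique node_id values of nodes parsed from one source file."""
    return {
        node.node_id
        for node in nodes
        # "file_name" is an assumed metadata key; use whatever your loader sets.
        if node.metadata.get("file_name") == file_name
    }
```

It would be called with something like `collect_node_ids(storage_context.docstore.docs.values(), "report.pdf")`, where the file name is a placeholder.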

kimchinosys

@Logan M Do you think there's something wrong with the way I'm trying to delete old nodes? I might have to add a workaround, as my use case specifically needs to "upsert" if the source was parsed and stored previously.

kimchinosys

Hi @Logan M, sorry to double ping, but this is currently a major issue for our app. Aside from implementing our own MongoDB client for deletion, is there a native workaround for this issue in LlamaIndex?

Logan M

You might have to give me more context/a way to duplicate this issue in a minimal example. I can't help much without having it on my end to replicate

kimchinosys

@Logan M My apologies. We store the nodes by calling

Plain Text

await storage_context.docstore.async_add_documents(
    batch, batch_size=len(batch), allow_update=True)

We delete by first calling

Plain Text

docs = storage_context.docstore.docs.values()

to get all the nodes from the docstore. We then map each node to its unique node_id and loop over those IDs, calling this to delete each one:

Plain Text

await storage_context.docstore.adelete_document(doc_id=node_id)

It seems to be reproducible for us with PDFs, but nodes from CSVs delete fine (maybe because I only store IndexNodes for CSVs). If you need more context, please let me know.

Logan M

And I'm guessing batch is a list of nodes?

kimchinosys

Yes, batch is a list of nodes

kimchinosys

One thing to mention is that not all nodes fail to delete. Often 10-20% of the nodes delete successfully, but the rest fail with the error list.remove(x): x not in list.

Maybe I need to add a try/except block so that, even if a node isn't in the ref list, it still deletes what it can while logging the ones it couldn't?

Logan M

That's why having a reproducible case helps 😅 I'm wondering how it even gets into this state.

Yes, adding a try/except is an easy bandaid
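A minimal sketch of that bandaid, assuming the docstore exposes the async `adelete_document(doc_id=...)` call used earlier in the thread. The `delete_nodes_best_effort` helper and the logger setup are hypothetical names for illustration.

```python
import logging
from typing import Any, Iterable, List

logger = logging.getLogger(__name__)


async def delete_nodes_best_effort(docstore: Any, node_ids: Iterable[str]) -> List[str]:
    """Delete what can be deleted; return the IDs that failed (hypothetical helper)."""
    failed = []
    for node_id in node_ids:
        try:
            await docstore.adelete_document(doc_id=node_id)
        except ValueError as exc:
            # Covers "list.remove(x): x not in list" from the traceback above.
            logger.warning("Could not delete node %s: %s", node_id, exc)
            failed.append(node_id)
    return failed
```

The returned list makes it possible to retry or inspect the stragglers later instead of silently dropping them.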

Logan M

This works fine for me

Plain Text

import asyncio

from llama_index.core.schema import Document
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.storage.docstore.mongodb import MongoDocumentStore

async def main():
    # Assumes a MongoDB instance listening on localhost:27017.
    docstore = MongoDocumentStore.from_host_and_port("localhost", 27017)

    # Split the built-in example document into small nodes.
    document = Document.example()
    document.metadata = {}
    nodes = TokenTextSplitter(chunk_size=25, chunk_overlap=0)([document])

    print(f"Adding {len(nodes)} nodes to the docstore.")

    await docstore.async_add_documents(nodes, batch_size=len(nodes), allow_update=True)

    # Everything currently stored, keyed by node ID.
    nodes_dict = docstore.docs

    print(f"Retrieved {len(nodes_dict)} nodes from the docstore. Now deleting them.")

    for id_ in nodes_dict:
        await docstore.adelete_document(id_)

if __name__ == "__main__":
    asyncio.run(main())

Logan M

We also recently simplified and cleaned up the delete logic.

I'm wondering if you still encounter this issue on the latest version of llama-index-core.

Logan M

Because I see we already check for this explicitly:

Plain Text

if doc_id in ref_doc_obj.node_ids:  # sanity check
    ref_doc_obj.node_ids.remove(doc_id)
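For context on why this check matters: Python's `list.remove()` raises `ValueError` when the value is absent, which matches the message in the traceback. Here is a small standalone illustration; the `remove_if_present` name is made up for this example and is not the library's code.

```python
from typing import List


def remove_if_present(items: List[str], value: str) -> bool:
    """Remove value from items if it is there; report whether anything changed."""
    if value in items:  # same kind of guard as the sanity check above
        items.remove(value)
        return True
    return False


node_ids = ["a", "b"]
remove_if_present(node_ids, "b")    # removes "b"
remove_if_present(node_ids, "zzz")  # no-op instead of raising ValueError
```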

kimchinosys

Hmm, gotcha. I'll try updating my version; I'm currently on 0.10.26. I'll come back and let you know if that solves the issue!

Logan M

Thanks! latest is v0.10.29
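To check which version is installed, the standard library's `importlib.metadata` can be queried. This is a rough sketch: the helper names are hypothetical, `parse_version` only handles plain numeric versions such as 0.10.26, and the 0.10.29 default is taken from this thread.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain dotted version like '0.10.29' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def installed_core_version() -> Optional[str]:
    """Return the installed llama-index-core version, or None if not installed."""
    try:
        return version("llama-index-core")
    except PackageNotFoundError:
        return None


def needs_upgrade(current: Optional[str], minimum: str = "0.10.29") -> bool:
    """True when a version is installed and it is older than the minimum."""
    return current is not None and parse_version(current) < parse_version(minimum)
```

Calling `needs_upgrade(installed_core_version())` then tells you whether the environment is older than the release mentioned here.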

kimchinosys

Ope, yeah, I think going to version 0.10.29 fixed it lol

kimchinosys

Thank you for the help, I really appreciate it.

Logan M

Yea for sure! It was actually this thread that made me remember to add that sanity check a few days ago lol so thanks for reporting!

kimchinosys

No problem, happy to do my part!
