
Updated 7 months ago

delete nodes from vector

I am trying to delete nodes from my vector index based on the file they were created from; that is, I want to delete all the nodes of one specific document. I found this in the documentation:
delete_ref_doc(ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any) -> None
but I don't know how to use it, or whether it is even the correct approach.
Instantiate the vector store, then call its delete or adelete method with the node ID.
I have similar code..
But I don't have a node ID; all I have is the name of the PDF that I want to remove from the vector store.

One way I can think of is to iterate through all the nodes and check whether file_name in each node's metadata matches the name of the PDF I want to delete; if it does, delete that node.
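The metadata scan described above can be sketched at the document level instead of node by node, assuming the index exposes a mapping from source-document ID to a record holding that document's metadata. In LlamaIndex, `index.ref_doc_info` is expected to provide such a mapping and `SimpleDirectoryReader` normally sets a `file_name` metadata key, but both should be checked against the installed version. `DocInfo` and `find_ref_doc_ids` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class DocInfo:
    """Hypothetical stand-in for the per-document record an index keeps."""
    node_ids: List[str] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)


def find_ref_doc_ids(ref_doc_info: Dict[str, DocInfo], file_name: str) -> List[str]:
    """Return the IDs of every source document whose metadata names file_name."""
    return [
        doc_id
        for doc_id, info in ref_doc_info.items()
        if info.metadata.get("file_name") == file_name
    ]


# Toy data: one PDF loaded as two documents, plus an unrelated file.
ref_doc_info = {
    "doc-1": DocInfo(["n1", "n2"], {"file_name": "handbook.pdf"}),
    "doc-2": DocInfo(["n3"], {"file_name": "handbook.pdf"}),
    "doc-3": DocInfo(["n4"], {"file_name": "policy.pdf"}),
}

to_delete = find_ref_doc_ids(ref_doc_info, "handbook.pdf")
print(to_delete)  # ['doc-1', 'doc-2']

# With a real index (assumption, not verified against your version):
# for doc_id in find_ref_doc_ids(vector_index.ref_doc_info, "handbook.pdf"):
#     vector_index.delete_ref_doc(doc_id, delete_from_docstore=True)
```

If two files share a name, this returns the IDs for both, which matches the "delete both" behaviour asked for later in the thread.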
How are you adding the documents into the vector store?
The ingestion process
When you add documents, the document ids related to that document are returned
In my case I store these ids to delete them
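# Import path assumes llama_index >= 0.10; older releases use `from llama_index import ...`.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load every file under the docs directory.
documents = SimpleDirectoryReader(
    input_dir="/content/handbook-bge-embeddings/docs"
).load_data()

vector_index = VectorStoreIndex.from_documents(documents)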

I am using this code to create a vector index from the PDF files in the docs directory, and then persisting the vector index.

How can I store the doc IDs in this process?
Also, after creating the vector index, if I want to add a new document to it, I am using this:
nodes = parser.get_nodes_from_documents(documents2)
vector_index.insert_nodes(nodes)
OK, I also found this option, which uses the filename as the ID when loading the documents for the vector index:
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
But now, for one PDF file, the doc IDs look like file_name_part1.... I thought there would be only one doc ID per PDF file, but that's not the case.
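Since one PDF yields several doc IDs here, all IDs belonging to a file can still be collected by their shared prefix. The sketch below assumes the IDs follow the pattern `<file path>_part_<n>`, which should be confirmed by printing a few real IDs first; `ids_for_file` and the sample paths are hypothetical.

```python
import re
from typing import Iterable, List


def ids_for_file(doc_ids: Iterable[str], file_path: str) -> List[str]:
    """Select IDs of the form '<file_path>_part_<n>' (assumed pattern)."""
    pattern = re.compile(re.escape(file_path) + r"_part_\d+")
    return [doc_id for doc_id in doc_ids if pattern.fullmatch(doc_id)]


# Sample IDs, made up for illustration.
doc_ids = [
    "data/handbook.pdf_part_0",
    "data/handbook.pdf_part_1",
    "data/handbook.pdf.bak_part_0",
    "data/policy.pdf_part_0",
]

matching = ids_for_file(doc_ids, "data/handbook.pdf")
print(matching)  # ['data/handbook.pdf_part_0', 'data/handbook.pdf_part_1']
```

Matching the whole ID rather than using a plain `startswith` check avoids picking up files whose names merely begin with the same text.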
You're using their high-level API.
I don't know what from_documents returns, to be honest.
What I do is separate the ingestion phase from the QA phase,
using the low-level API.
I call vector_store.add(nodes), and this method returns the IDs of the inserted nodes for later removal.
I don't see their high-level approach as production code... not sure, though.
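The bookkeeping described here, where the IDs returned at ingestion are kept for later removal, could look roughly like the sketch below. It uses SQLite from the standard library as the side store; the table name and the helpers `open_registry`, `record_nodes`, and `pop_nodes` are hypothetical, and the node IDs are placeholders for whatever `vector_store.add(nodes)` returns.

```python
import sqlite3
from typing import Iterable, List


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small table mapping file names to node IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_registry ("
        "file_name TEXT NOT NULL, node_id TEXT PRIMARY KEY)"
    )
    return conn


def record_nodes(conn: sqlite3.Connection, file_name: str, node_ids: Iterable[str]) -> None:
    """Remember which node IDs were created from file_name."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO node_registry (file_name, node_id) VALUES (?, ?)",
            [(file_name, node_id) for node_id in node_ids],
        )


def pop_nodes(conn: sqlite3.Connection, file_name: str) -> List[str]:
    """Return the node IDs recorded for file_name and forget them."""
    with conn:
        rows = conn.execute(
            "SELECT node_id FROM node_registry WHERE file_name = ? ORDER BY node_id",
            (file_name,),
        ).fetchall()
        conn.execute("DELETE FROM node_registry WHERE file_name = ?", (file_name,))
    return [row[0] for row in rows]


conn = open_registry()
record_nodes(conn, "handbook.pdf", ["n1", "n2"])  # placeholder IDs
record_nodes(conn, "policy.pdf", ["n3"])

stale = pop_nodes(conn, "handbook.pdf")
print(stale)  # ['n1', 'n2']
# Each ID in `stale` would then go to the vector store's own delete call.
```

Keying the table on the file name means the PDF name alone is enough to recover every node ID at deletion time.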
Not going to read this whole thread lol, but delete_ref_doc deletes by input document ID.

Documents are broken into many nodes, and it will delete all nodes associated with a parent id
Didn't know this
Yes, several nodes
But if I pass in a single (random) node that belongs to a document, will it work?
A random node ID
Will it delete the other nodes?
Currently I store all the node ids in a separate database and delete one by one
Assuming node.ref_doc_id points to the parent document, it will work fine
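The relationship discussed above can be illustrated with a toy in-memory model: every node carries the ID of its parent document, so any one node is enough to find and remove all of its siblings. `ToyNode` and `ToyStore` are hypothetical classes written for this sketch and are not LlamaIndex's implementation; the point is only that the node's `ref_doc_id`, not the node ID itself, is what the delete is keyed on.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyNode:
    node_id: str
    ref_doc_id: str  # ID of the parent document this node was split from


class ToyStore:
    """In-memory stand-in for a vector store, for illustration only."""

    def __init__(self, nodes: List[ToyNode]) -> None:
        self.nodes: Dict[str, ToyNode] = {n.node_id: n for n in nodes}

    def delete_ref_doc(self, ref_doc_id: str) -> List[str]:
        """Remove every node whose parent is ref_doc_id; return their IDs."""
        removed = [nid for nid, n in self.nodes.items() if n.ref_doc_id == ref_doc_id]
        for nid in removed:
            del self.nodes[nid]
        return removed

    def delete_document_of(self, node_id: str) -> List[str]:
        """Resolve an arbitrary node to its parent, then delete the whole document."""
        return self.delete_ref_doc(self.nodes[node_id].ref_doc_id)


store = ToyStore([
    ToyNode("n1", "doc-a"),
    ToyNode("n2", "doc-a"),
    ToyNode("n3", "doc-b"),
])

removed = store.delete_document_of("n2")
print(removed)              # ['n1', 'n2']
print(sorted(store.nodes))  # ['n3']
```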
Well, I have the doc name (the name of the PDF file), and I want to remove all nodes associated with that doc from the vector index. How can I do that?
I don't know its doc ID.
Harder to do. Depends on how you inserted the document in the vector store. What if you have 2 documents with the same name?
That will not be the case, and if it is, then delete both.
What is your vector index?