how does one get the document

mardadi

how does one get the document_id from a node created from sentence parser... i want to store this so i can delete related nodes later when reindexing... anyone have sample code?


Logan M

node.ref_doc_id
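
A minimal sketch of how the ids could be collected for storage, assuming the nodes expose `ref_doc_id` as in the answer above. `collect_ref_doc_ids` is a hypothetical helper name, and the `SimpleNamespace` objects are stand-ins for real nodes so the snippet runs without LlamaIndex installed.

```python
from types import SimpleNamespace


def collect_ref_doc_ids(nodes):
    """Return the distinct ref_doc_id values found on the nodes, in first-seen order."""
    seen = {}
    for node in nodes:
        ref_id = getattr(node, "ref_doc_id", None)
        if ref_id is not None:
            seen.setdefault(ref_id, None)
    return list(seen)


# Stand-in nodes; in real code these would come from the sentence parser.
example_nodes = [
    SimpleNamespace(node_id="n1", ref_doc_id="doc-a"),
    SimpleNamespace(node_id="n2", ref_doc_id="doc-a"),
    SimpleNamespace(node_id="n3", ref_doc_id="doc-b"),
]
print(collect_ref_doc_ids(example_nodes))
```

The resulting list is what would be saved alongside the file record and passed to the delete call later.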

mardadi

thanks 🙏

mardadi

follow up... when i do vector_store.data.delete(doc_id), it fails to delete the rows in vecs whose node ref_doc_id matches the value passed in... no error message

Logan M

what vector db are you using? Some have (unfortunately) not implemented the delete method properly, or aren't able to

Logan M

oh wait

Logan M

vector_store.data.delete() ?

Logan M

Isn't it vector_store.delete(ref_doc_id) ?

mardadi

supabase with vecs

mardadi

the doc_id value i'm sending is essentially the ref_doc_id, saved in my db as doc_id

mardadi

oh let me try

mardadi

.delete without data

Logan M

it seems like it should work, at least judging by the source code

mardadi

so it should work with supabase / vecs?

mardadi

it didn't, but will debug again

Logan M

it should. I can point you towards the source code if that helps

Logan M

delete() is here
https://github.com/run-llama/llama_index/blob/4c43e681e0f59549649a015951185c4d99a59730/llama-index-integrations/vector_stores/llama-index-vector-stores-supabase/llama_index/vector_stores/supabase/base.py#L131

And it's getting the nodes to delete using this helper function
https://github.com/run-llama/llama_index/blob/4c43e681e0f59549649a015951185c4d99a59730/llama-index-integrations/vector_stores/llama-index-vector-stores-supabase/llama_index/vector_stores/supabase/base.py#L113

mardadi

here's what i'm doing

mardadi

import json

doc_id = doc.get('doc_id')

print(f"reindexing doc {doc_id}")

try:
    # Attempt to delete records
    response = vector_store.delete(doc_id)
    print(f"vector delete response = {response}")

except Exception as e:
    # catch any unexpected errors
    error_response = {
        'status': 'error',
        'message': str(e)
    }
    print(json.dumps(error_response))

mardadi

doc_id prints correctly

mardadi

but the response on delete is "None"

mardadi

and deletion doesn't work

mardadi

permissions are wide open on the database

mardadi

🤷‍♂️

mardadi

no error

mardadi

2024-03-01 22:15:38 documents = [{'id': 66, 'doc_id': '859796f2-dce6-45b5-b9ef-3d371e27dadf', 'doc_title': None, 'name': 'Proposal_964_Logan_Ave_22panels.pdf', 'path': 'bots/44/Proposal_964_Logan_Ave_22panels.pdf'}]
2024-03-01 22:15:38 reindexing doc 859796f2-dce6-45b5-b9ef-3d371e27dadf
2024-03-01 22:15:38 2024-03-01 22:15:38,584:ERROR - /home/helloservicedev/.virtualenvs/helloenv/lib/python3.10/site-packages/vecs/collection.py:502: UserWarning: Query does not have a covering index for cosine_distance. See Collection.create_index
2024-03-01 22:15:38 2024-03-01 22:15:38,585:ERROR - warnings.warn(
2024-03-01 22:15:38 vector delete response = None

Logan M

maybe take a look at the source code above, or try querying the db manually to see why the source code might not work
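
Because the call printed `None` and raised nothing, the return value alone does not show whether any rows were removed. One way to check, sketched here, is to count the matching rows before and after the call. `delete_and_verify`, `count_fn`, and the in-memory `rows` list are all hypothetical; in practice `count_fn` would be a manual query against the database, as suggested above.

```python
def delete_and_verify(doc_id, delete_fn, count_fn):
    """Call delete_fn(doc_id) and report how many matching rows were removed.

    delete_fn: callable that deletes by document id (e.g. vector_store.delete).
    count_fn:  callable returning the number of rows stored for that id.
    """
    before = count_fn(doc_id)
    delete_fn(doc_id)
    after = count_fn(doc_id)
    return {"doc_id": doc_id, "before": before, "after": after, "deleted": before - after}


# In-memory stand-in for the vector table, only to make the sketch runnable.
rows = [
    {"id": "n1", "doc_id": "doc-a"},
    {"id": "n2", "doc_id": "doc-a"},
    {"id": "n3", "doc_id": "doc-b"},
]


def count_rows(doc_id):
    return sum(1 for row in rows if row["doc_id"] == doc_id)


def delete_rows(doc_id):
    rows[:] = [row for row in rows if row["doc_id"] != doc_id]


report = delete_and_verify("doc-a", delete_rows, count_rows)
print(report)
```

A report where `before` is zero would suggest the id being passed does not match what is stored; `before` above zero with `deleted` at zero would point at the delete itself.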

mardadi

sorry, back to basics question maybe...

parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

For the code above, there seem to be lots of different doc_id and ref_doc_id values even if there's only 1 file uploaded, depending on doc size. how do i tie them together... is there something built in? or am i doing something wrong?

Logan M

For pdfs, by default, they get split per page. Other data types may also have some other form of splitting into documents.

Then the node parser splits them further
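
To make the two levels of splitting concrete: one file can yield several documents (for example one per PDF page), and each of those yields several nodes, so a single upload ends up with many `ref_doc_id` values. A rough sketch of tying them back to the file, assuming each node carries a file-level key in its metadata. The `filename` key and the `group_doc_ids_by_file` helper are hypothetical, and the nodes are stand-ins rather than real LlamaIndex objects.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_doc_ids_by_file(nodes, key="filename"):
    """Map each file-level metadata value to the distinct ref_doc_ids of its nodes."""
    grouped = defaultdict(list)
    for node in nodes:
        file_key = node.metadata.get(key)
        if node.ref_doc_id not in grouped[file_key]:
            grouped[file_key].append(node.ref_doc_id)
    return dict(grouped)


# One PDF split into two page-level documents, each split further into nodes.
page_nodes = [
    SimpleNamespace(ref_doc_id="page-1", metadata={"filename": "proposal.pdf"}),
    SimpleNamespace(ref_doc_id="page-1", metadata={"filename": "proposal.pdf"}),
    SimpleNamespace(ref_doc_id="page-2", metadata={"filename": "proposal.pdf"}),
]
print(group_doc_ids_by_file(page_nodes))
```

Storing that list per file would let a later reindex delete every page-level document that belongs to it.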

mardadi

thank you for your help... here's what i learned...

doc_id is common across nodes and delete works when i use the "from_documents" syntax...

the commented-out nodes route, where i was saving nodes directly to the index instead of documents, resulted in different doc_ids, so deleting becomes challenging and requires more custom tracking in my own code or via metadata.

for document in documents:
    document.metadata = {"user_id": user_id, "bot_id": bot_id, "filename": filename, "bot_doc_id": document.doc_id}
    print(f"metadata = {document.metadata}")

# the parser is passed as a transformation
index = VectorStoreIndex.from_documents(documents, storage_context=vector_storage_context, transformations=[parser])

# nodes route (disabled): saving nodes directly to the index
# nodes = node_parser.get_nodes_from_documents(documents)
# doc_id = documents[0].doc_id
# for node in nodes:
#     node.metadata = {"user_id": user_id, "bot_id": bot_id, "filename": filename, "bot_doc_id": doc_id}
#     print(f"metadata = {node.metadata}")
# index = VectorStoreIndex(nodes, storage_context=vector_storage_context)

index.storage_context.persist()

return documents[0].doc_id

mardadi

btw i also found your videos on youtube on document management! watched that at 5 am this morning... thank you for sharing such knowledge on youtube 🙂
