
Updated 9 months ago

how does one get the document_id from a

How does one get the document_id from a node created by the sentence parser? I want to store it so I can delete the related nodes later when reindexing. Anyone have sample code?
node.ref_doc_id
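A minimal sketch of collecting those IDs for later deletion, assuming each node exposes `ref_doc_id` as mentioned above. `collect_ref_doc_ids` is a hypothetical helper, and `SimpleNamespace` objects stand in for real nodes so the snippet runs without LlamaIndex installed:

```python
from types import SimpleNamespace


def collect_ref_doc_ids(nodes):
    """Return the distinct ref_doc_id values found on nodes, in first-seen order."""
    seen = {}
    for node in nodes:
        ref_doc_id = getattr(node, "ref_doc_id", None)
        if ref_doc_id is not None:
            # dict keys keep insertion order, so duplicates are dropped in place
            seen.setdefault(ref_doc_id, None)
    return list(seen)


# Stand-ins for nodes returned by a node parser (hypothetical data).
nodes = [
    SimpleNamespace(ref_doc_id="doc-a", text="first chunk"),
    SimpleNamespace(ref_doc_id="doc-a", text="second chunk"),
    SimpleNamespace(ref_doc_id="doc-b", text="third chunk"),
]
ref_doc_ids = collect_ref_doc_ids(nodes)
print(ref_doc_ids)
```

The resulting list is what you would save in your own database next to the uploaded file record.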
thanks 🙏
Follow-up: when I call vector_store.data.delete(doc_id), it fails to delete the rows in vecs whose node ref_doc_id matches the value passed in. No error message.
what vector db are you using? Some have (unfortunately) not implemented the delete method properly, or aren't able to
vector_store.data.delete() ?
Isn't it vector_store.delete(ref_doc_id) ?
supabase with vecs
doc_id value im sending is essentially ref_doc_id saved in db as doc_id
.delete without data
it seems like it should work, at least judging by the source code
so it should work with supabase / vecs?
it didnt but will debug again
it should. I can point you towards the source code if that helps
here's what im doing
    doc_id = doc.get('doc_id')

    print(f"reindexing doc {doc_id}")

    try:
        # Attempt to delete records
        response = vector_store.delete(doc_id)
        print(f"vector delete response = {response}")

    except Exception as e:
        # catch any unexpected errors
        error_response = {
            'status': 'error',
            'message': str(e)
        }
        print(json.dumps(error_response))
doc_id prints correctly
but the response on delete is "None"
and deletion doesnt work
permissions are wide open on the database
🤷‍♂️
2024-03-01 22:15:38 documents = [{'id': 66, 'doc_id': '859796f2-dce6-45b5-b9ef-3d371e27dadf', 'doc_title': None, 'name': 'Proposal_964_Logan_Ave_22panels.pdf', 'path': 'bots/44/Proposal_964_Logan_Ave_22panels.pdf'}]
2024-03-01 22:15:38 reindexing doc 859796f2-dce6-45b5-b9ef-3d371e27dadf
2024-03-01 22:15:38 2024-03-01 22:15:38,584:ERROR - /home/helloservicedev/.virtualenvs/helloenv/lib/python3.10/site-packages/vecs/collection.py:502: UserWarning: Query does not have a covering index for cosine_distance. See Collection.create_index
2024-03-01 22:15:38 2024-03-01 22:15:38,585:ERROR - warnings.warn(
2024-03-01 22:15:38 vector delete response = None
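A note on the `None` response in the log above: as far as I know, the `delete` method on LlamaIndex vector stores is declared to return `None`, so that value by itself does not show whether any rows were removed. Below is a rough sketch of a reindex helper that does not rely on the return value, assuming only that the store has a `delete(ref_doc_id)` method as discussed above. `delete_ref_docs` and `FakeVectorStore` are hypothetical names used for illustration:

```python
def delete_ref_docs(vector_store, ref_doc_ids):
    """Call vector_store.delete once per ref_doc_id and report failures.

    Returns a dict mapping each id to "requested" or to an error message.
    "requested" only means the call did not raise, not that rows were removed.
    """
    results = {}
    for ref_doc_id in ref_doc_ids:
        try:
            vector_store.delete(ref_doc_id)
            results[ref_doc_id] = "requested"
        except Exception as exc:  # surface backend errors instead of hiding them
            results[ref_doc_id] = f"error: {exc}"
    return results


class FakeVectorStore:
    """In-memory stand-in so the sketch runs without a database."""

    def __init__(self, rows):
        # rows: list of (node_id, ref_doc_id) pairs
        self.rows = list(rows)

    def delete(self, ref_doc_id):
        self.rows = [row for row in self.rows if row[1] != ref_doc_id]


store = FakeVectorStore([("n1", "doc-a"), ("n2", "doc-a"), ("n3", "doc-b")])
print(delete_ref_docs(store, ["doc-a"]))
print(store.rows)
```

To confirm that rows are really gone, checking the table directly in the database is the more reliable test.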
maybe take a look at the source code above, or try querying the db manually to see why the source code might not work
sorry, back to basics question maybe...

parser = SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

For the code above, there seem to be lots of different doc_id and ref_doc_id values even if there's only 1 file uploaded, depending on doc size. How do I tie them together? Is there something built in, or am I doing something wrong?
For pdfs, by default, they get split per page. Other data types may also have some other form of splitting into documents.

Then the node parser splits them further
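One way to tie the pieces together is to keep your own mapping from the uploaded file to every doc_id produced for it. A minimal sketch, assuming each document has a `doc_id` attribute and a `metadata` dict carrying the file name. The `file_name` key and the `group_doc_ids_by_file` helper are assumptions for illustration, and stand-in objects replace real documents:

```python
from collections import defaultdict
from types import SimpleNamespace


def group_doc_ids_by_file(documents, key="file_name"):
    """Map each file name found in metadata[key] to the doc_ids loaded from it."""
    grouped = defaultdict(list)
    for document in documents:
        file_name = document.metadata.get(key)
        if file_name is not None:
            grouped[file_name].append(document.doc_id)
    return dict(grouped)


# Stand-ins for a two-page PDF loaded as two documents (hypothetical data).
documents = [
    SimpleNamespace(doc_id="page-1-id", metadata={"file_name": "proposal.pdf"}),
    SimpleNamespace(doc_id="page-2-id", metadata={"file_name": "proposal.pdf"}),
]
print(group_doc_ids_by_file(documents))
```

When a file is reindexed, every doc_id listed under its name would need to be deleted, not only the first one.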
thank you for your help... here's what i learned...

Doc_id is common for nodes and delete works when i use the "from_documents" syntax...

the commented out nodes route where i was saving nodes directly to index vs document resulted in different doc_ids so deleting becomes challenging and requires more custom tracking in my own code or via metadata.

    for document in documents:
        document.metadata = {"user_id": user_id, "bot_id": bot_id, "filename": filename, "bot_doc_id" : document.doc_id}
        print(f"metadata = {document.metadata}")

    index = VectorStoreIndex.from_documents(documents, storage_context=vector_storage_context, text_splitter=parser)

    # nodes = node_parser.get_nodes_from_documents(documents)
    # docid = documents[0].id
    # for node in nodes:
    #     node.metadata = {"user_id": user_id, "bot_id": bot_id, "filename": filename, "bot_doc_id" : doc_id }
    #     print(f"metadata = {node.metadata}")

    index = VectorStoreIndex(documents, storage_context=vector_storage_context)
    index.storage_context.persist()

    return documents[0].doc_id
btw i also found your videos on youtube on document management! watched that at 5 am this morning... thank you for sharing such knowledge on youtube 🙂