Hi, I'm trying to use LlamaIndex for insertions and updates to a Weaviate database, but I'm having trouble understanding a few things. When creating the Document object, I initialize it with document_object = Document(text="my_text", doc_id="my_doc_id", extra_info=extra_info). I then use index = GPTVectorStoreIndex.from_documents(all_docs, storage_context=storage_context) to initially insert the documents into the database. However, I'm noticing that the doc_id I pass here is stored in a ref_doc_id property in the Weaviate class, while the doc_id property stored in the class is auto-generated. This is a problem because I can't keep track of which chunks of the document I have inserted, and it also means I can't control whether there are duplicate inserts. Is there any way to override the doc_id that is generated?
Cc @disiok. Hmm, yeah, I'm not sure we have good enough checking around duplicate inserts right now.
But yeah, each document is split into many nodes. The mapping from nodes back to the original document is done using the ref_doc_id. I'm pretty sure each doc_id needs to be unique.
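To illustrate the relationship being described, here is a minimal stdlib-only sketch (not LlamaIndex's actual implementation; the Node class and chunk_document helper are hypothetical stand-ins): each chunk gets its own auto-generated doc_id, while ref_doc_id points back at the source document's id.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    ref_doc_id: str  # points back to the source Document's doc_id
    # Auto-generated per-node id, mirroring the UUID behavior described above
    doc_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def chunk_document(text: str, source_doc_id: str, chunk_size: int = 20) -> list[Node]:
    """Split a document into fixed-size chunks, tagging each with the source id."""
    return [
        Node(text=text[i:i + chunk_size], ref_doc_id=source_doc_id)
        for i in range(0, len(text), chunk_size)
    ]

nodes = chunk_document("some long document text that gets split into chunks", "my_doc_id")
assert all(n.ref_doc_id == "my_doc_id" for n in nodes)   # all chunks map back to the parent
assert len({n.doc_id for n in nodes}) == len(nodes)       # each node id is unique
```

This is exactly why the user-supplied id shows up as ref_doc_id in Weaviate: it identifies the parent document, not any individual chunk.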
Got it, but the issue with this design, imo, is that the user has no access to the final generated doc_ids for the nodes at insertion time. This means there's no way for me to track which doc_ids have been inserted.
What is happening here is that when we call GPTVectorStoreIndex.from_documents, we actually chunk up the Document and create Node objects out of it (assigning a new random UUID to each Node).
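One workaround for the tracking problem, sketched below with only the standard library (this is not a LlamaIndex API, just an idea): derive node ids deterministically from the parent doc_id plus the chunk index using uuid.uuid5. Re-inserting the same document then reproduces the same ids, so duplicates become detectable on the client side. The NAMESPACE value here is an arbitrary assumption for the sketch.

```python
import uuid

# Hypothetical application-specific namespace for node ids
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "my-app-nodes")

def deterministic_node_id(ref_doc_id: str, chunk_index: int) -> str:
    """Derive a stable node id from the parent doc id and the chunk's position."""
    return str(uuid.uuid5(NAMESPACE, f"{ref_doc_id}:{chunk_index}"))

# Inserting the same document twice yields identical ids, so duplicates
# can be detected (or upserted) instead of silently accumulating.
first = [deterministic_node_id("my_doc_id", i) for i in range(3)]
second = [deterministic_node_id("my_doc_id", i) for i in range(3)]
assert first == second
assert len(set(first)) == len(first)  # ids within one document are still distinct
```

If the library exposes a way to construct the Node objects yourself before building the index, you could assign ids like these instead of accepting the auto-generated UUIDs; whether that hook exists in your version is worth checking in the docs.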