Inserting documents

Hi, I'm trying to use LlamaIndex for insertions and updates to a Weaviate database, but I'm having trouble understanding a few things.
When creating the Document object, I initialize it with
document_object = Document(text="my_text", doc_id="my_doc_id", extra_info=extra_info). I then use index = GPTVectorStoreIndex.from_documents(all_docs, storage_context=storage_context) to insert the documents into the database for the first time.
However, I'm noticing that the doc_id I pass here is stored in a ref_doc_id property on the Weaviate class, while the doc_id property stored on the class is something auto-generated. This is a problem because I can't keep track of which chunks of the document I have inserted. It also means I can't prevent duplicate inserts. Is there any way to override the doc_id that is generated?
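As a side note on the duplicate-insert concern: one common workaround (independent of LlamaIndex, shown here as a plain-Python sketch; deterministic_doc_id and should_insert are hypothetical helper names, not library functions) is to derive the doc_id deterministically from the document's content, so re-inserting the same text always produces the same ID and duplicates become detectable before you ever hit the database.

```python
import hashlib

def deterministic_doc_id(text: str, namespace: str = "my-corpus") -> str:
    # Hash the document text (plus a namespace) so identical content
    # always maps to the same ID, making re-inserts detectable.
    digest = hashlib.sha256(f"{namespace}:{text}".encode("utf-8")).hexdigest()
    return digest[:32]

# IDs already inserted; in practice you might persist this or query the store.
seen: set[str] = set()

def should_insert(text: str) -> bool:
    # Returns True only the first time a given piece of content is seen.
    doc_id = deterministic_doc_id(text)
    if doc_id in seen:
        return False  # duplicate: same content was already inserted
    seen.add(doc_id)
    return True
```

You would then pass deterministic_doc_id(text) as the doc_id when constructing each Document, assuming you want content-addressed IDs rather than arbitrary ones.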
Cc @disiok. Hmm, yeah, I'm not sure we have good enough checking around duplicate inserts right now.

But yeah, each document is split into many nodes. The mapping from nodes back to the original document is done using the ref_doc_id. I'm pretty sure each doc_id needs to be unique.
Got it, but the issue with this design, imo, is that the user has no access to the final generated doc_ids for the nodes at insertion time. That means there's no way for me to track which doc_ids have been inserted.
Hey @susa, thanks for the feedback.

I think one way to fix this is to create the Node objects yourself and build the index with GPTVectorStoreIndex(nodes=nodes, service_context=service_context).
What happens under the hood when we call GPTVectorStoreIndex.from_documents is that we chunk up each Document and create Node objects from it (assigning a new UUID to each Node).
If you have already prepared your data, you can skip that step, which gives you full control over the ID of each Node.