Updated 2 years ago

Doc ID

I had the same problem when uploading Nodes to Pinecone. As far as I understand, converting a Document to Nodes can result in a 1:n relationship due to text chunking. In my situation the problem is that llama_index.node_parser.node_utils.get_nodes_from_document does not use Document.doc_id but auto-generates Node.doc_id. For now I have resolved the problem by overriding this function and defining a custom node parser that sets Node.doc_id equal to Document.doc_id. This works because my Documents are already split and I keep a 1:1 relationship between Document and Node.
10 comments
Just trying to follow a bit

You are right, there is a 1:n relationship between a document and its chunked nodes

However, all chunked nodes will inherit the doc ID of the parent, which is set as a ref doc ID

Does this not maintain the mapping?
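To make the mapping concrete, here is a rough sketch in plain Python. The `Document` and `Node` dataclasses and the `chunk_document` helper are hypothetical stand-ins made up for illustration, not the llama_index classes. The idea is only that each chunk gets its own ID while carrying the parent's ID as a reference:

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Document:
    # Hypothetical stand-in, not the llama_index Document class.
    doc_id: str
    text: str


@dataclass
class Node:
    # Hypothetical stand-in, not the llama_index Node class.
    text: str
    ref_doc_id: str  # ID of the parent document
    doc_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # the node's own ID


def chunk_document(document: Document, chunk_size: int) -> list[Node]:
    """Split a document into fixed-size chunks; each chunk keeps a reference to its parent."""
    return [
        Node(text=document.text[start:start + chunk_size], ref_doc_id=document.doc_id)
        for start in range(0, len(document.text), chunk_size)
    ]


document = Document(
    doc_id="annual-report-2022",
    text="Market risks include interest rate and currency exposure.",
)
nodes = chunk_document(document, chunk_size=20)
```

Every node in `nodes` has a unique `doc_id`, and all of them share `ref_doc_id == "annual-report-2022"`, so the 1:n mapping back to the parent is preserved without reusing the parent's ID.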

I thiiiink it might cause issues to have duplicate doc ids in an index, you might be overwriting nodes due to that 🤔
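A tiny illustration of the overwrite concern, assuming the store upserts by ID the way a key-value store does. The dict and the `upsert` helper below are made up for the example and are not a real vector store client:

```python
# Hypothetical in-memory store keyed by ID; a stand-in for an upsert-by-ID vector store.
store: dict[str, str] = {}


def upsert(target: dict[str, str], item_id: str, text: str) -> None:
    """Insert the item, or replace whatever is already stored under the same ID."""
    target[item_id] = text


# Two chunks of the same document, both written under the parent's ID.
upsert(store, "doc-1", "first chunk")
upsert(store, "doc-1", "second chunk")

# The same two chunks written under their own IDs.
upsert(store, "node-a", "first chunk")
upsert(store, "node-b", "second chunk")
```

After this, `store["doc-1"]` holds only the second chunk, while `node-a` and `node-b` both survive. With a strict 1:1 Document-to-Node setup the collision cannot happen, which would explain why forcing the IDs to match works in that case.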
I am not quite sure if I use the concepts as intended but this is what I am trying:
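# Imports assume the llama_index / langchain versions current at the time of this post.
from langchain.chat_models import ChatOpenAI
from llama_index import GPTVectorStoreIndex, LLMPredictor, ServiceContext, StorageContext
from llama_index.vector_stores import PineconeVectorStore

# DeepdoctectionDocumentStore is my own class (derived from KVDocumentStore), defined elsewhere.
target_path = "/home/j/Tests/llama_output"
openai_chat = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo"))
pinecone_vs = PineconeVectorStore(index_name="test-test", environment="us-central1-gcp")
dd_docstore = DeepdoctectionDocumentStore("/home/j/Tests/docstore/db.json", target_path)
storage_context = StorageContext.from_defaults(vector_store=pinecone_vs, docstore=dd_docstore)
service_context = ServiceContext.from_defaults(llm_predictor=openai_chat)
# The nodes are already in Pinecone, so the index is built with an empty node list.
gpt_index = GPTVectorStoreIndex(nodes=[], storage_context=storage_context, service_context=service_context)
query_engine = gpt_index.as_query_engine(similarity_top_k=5)
out = query_engine.query("What are the market risks?")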
DeepdoctectionDocumentStore is derived from KVDocumentStore because I cannot load my Documents into memory and only work with their metadata.
The Nodes were previously uploaded to Pinecone. There is a 1:1 correspondence between Document and Node in my case, but the doc_ids are not the same.
The problem here is that VectorIndexRetriever looks up each Node by its doc_id but cannot find a corresponding Document because the doc_ids do not match.
I am wondering what I am supposed to save in the DocumentStore: Node or Document?
Yea the doc store is confusingly named, but only nodes go in it 😅
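Roughly what that means for the lookup, as a sketch only. `KeyValueDocstore` below is a made-up minimal class, not the llama_index docstore, and the IDs are invented:

```python
from typing import Optional


class KeyValueDocstore:
    """Hypothetical minimal key-value docstore; not the llama_index class."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    def add(self, item_id: str, payload: str) -> None:
        self._items[item_id] = payload

    def get(self, item_id: str) -> Optional[str]:
        # Return None when nothing is stored under this ID.
        return self._items.get(item_id)


# IDs that come back from the vector store are node IDs.
retrieved_node_ids = ["node-a"]

# Docstore filled with whole documents, keyed by document ID.
docstore_with_documents = KeyValueDocstore()
docstore_with_documents.add("doc-1", "full document text")

# Docstore filled with nodes, keyed by the same IDs that were uploaded to the vector store.
docstore_with_nodes = KeyValueDocstore()
docstore_with_nodes.add("node-a", "chunk text that was embedded")

missing = [docstore_with_documents.get(node_id) for node_id in retrieved_node_ids]
found = [docstore_with_nodes.get(node_id) for node_id in retrieved_node_ids]
```

Here `missing` contains only `None`, because the retrieved node ID does not exist in a store keyed by document IDs, while `found` returns the chunk text. Storing the nodes under the IDs used in the vector store is what makes the lookup line up.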
Now it makes sense 😃
Great! :dotsCATJAM: