`Document` to `Node` can result in a 1:n relationship due to text chunking. In my situation the problem is that `llama_index.node_parser.node_utils.get_nodes_from_document` will not use `Document.doc_id` but auto-generates `Node.doc_id`. Right now I have resolved the problem by overwriting this function and defining a custom node parser that forces `Node.doc_id` to be equal to `Document.doc_id`. This works because my `Document` is already split, and I keep a 1:1 relationship between `Document` and `Node`.

```python
target_path = "/home/j/Tests/llama_output"
openai_chat = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo"))
pinecone_vs = PineconeVectorStore(index_name="test-test", environment="us-central1-gcp")
dd_docstore = DeepdoctectionDocumentStore("/home/j/Tests/docstore/db.json", target_path)
storage_context = StorageContext.from_defaults(vector_store=pinecone_vs, docstore=dd_docstore)
service_context = ServiceContext.from_defaults(llm_predictor=openai_chat)
gpt_index = GPTVectorStoreIndex(nodes=[], storage_context=storage_context, service_context=service_context)
query_engine = gpt_index.as_query_engine(similarity_top_k=5)
out = query_engine.query("What are the market risks?")
```
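For context, the workaround looks roughly like the following sketch. This is not the actual llama_index implementation: the `Document`/`Node` classes here are simplified stand-ins and `parse_documents` is a hypothetical name, just to illustrate enforcing `Node.doc_id == Document.doc_id` when the documents are already split 1:1.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    doc_id: str
    text: str

@dataclass
class Node:
    doc_id: str
    text: str

def parse_documents(documents: List[Document]) -> List[Node]:
    """Build exactly one Node per Document, reusing the Document's
    doc_id instead of auto-generating a fresh one (the behavior that
    get_nodes_from_document had to be overridden for)."""
    return [Node(doc_id=doc.doc_id, text=doc.text) for doc in documents]

docs = [Document(doc_id="report-2023-p1", text="Market risk section ...")]
nodes = parse_documents(docs)
assert nodes[0].doc_id == docs[0].doc_id  # 1:1 relationship preserved
```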
`DeepdoctectionDocumentStore` is derived from `KVDocumentStore` because I cannot load my `Document`s into memory and only work with their metadata. The `Node`s have been previously uploaded to Pinecone. There is a 1:1 correspondence between `Document` and `Node` in my case, but the `doc_id`s are not the same. `VectorIndexRetriever` looks for a `Node` by its `doc_id` but cannot find a corresponding `Document`, because the `doc_id`s do not match.