Hi LlamaIndex team,
We're experimenting with the docstore and the ingestion pipeline, and two questions came up. First, we noticed that when we pass both an IngestionCache and a vector store to the IngestionPipeline, the embeddings are also saved in the IngestionCache. Is this intended behaviour?
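Our rough mental model of why this might happen (a toy sketch of a hash-keyed cache, not the actual LlamaIndex implementation) is that the cache keys on the input plus the transformation and stores whatever the transformation returns, so if the embedding step is the last transformation, its output (nodes with embeddings attached) is what gets cached:

```python
import hashlib
import json
from typing import Any, Callable, Dict, List

# Toy model of an ingestion cache: keyed by a hash of (inputs, transform
# name), it stores whatever the transform returns. If the final transform
# attaches embeddings, those embeddings therefore end up in the cache.
class ToyIngestionCache:
    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def _key(self, inputs: List[str], transform_name: str) -> str:
        payload = json.dumps([inputs, transform_name]).encode()
        return hashlib.sha256(payload).hexdigest()

    def run_with_cache(
        self,
        inputs: List[str],
        transform_name: str,
        transform: Callable[[List[str]], Any],
    ) -> Any:
        key = self._key(inputs, transform_name)
        if key not in self._store:
            # Cache miss: run the transform and store its full output.
            self._store[key] = transform(inputs)
        return self._store[key]

cache = ToyIngestionCache()
# Fake "embedding" step: one vector per input text.
embed = lambda texts: [[float(len(t))] for t in texts]
out = cache.run_with_cache(["hello"], "embed", embed)  # out == [[5.0]]
```

If the real cache works anything like this, the behaviour we see would be by design rather than a bug, but we'd like to confirm.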
Second, when we use the simple snippet below to upload documents, add them to the doc_store, and run them through the pipeline, we end up with more documents/nodes than files uploaded. We can see that these are split per page. How can we identify the relationships between these nodes, so that we can return general file ids and later add all related nodes to an index?
import shutil
import tempfile
from typing import List

from fastapi import File, HTTPException, UploadFile
from llama_index.core import SimpleDirectoryReader


async def add_documents(
    files: List[UploadFile] = File(...),
) -> List[str]:
    try:
        print(len(files))
        with tempfile.TemporaryDirectory() as tempdir:
            # Write the uploads to disk so SimpleDirectoryReader can load them.
            for file in files:
                with open(f"{tempdir}/{file.filename}", "wb+") as buffer:
                    shutil.copyfileobj(file.file, buffer)
            reader = SimpleDirectoryReader(tempdir)
            documents = reader.load_data(show_progress=True)
        for document in documents:
            print(document.get_node_info())  # note: called as a method
            print(document.ref_doc_id)
        print(len(documents))
        mongodb_docstore.add_documents(documents)
        await pipeline.arun(documents=documents)
        return [document.doc_id for document in documents]
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
So len(files) matches the number of files uploaded to the route, and show_progress confirms that 2 files are loaded. But the for loop over documents executes once per page. ref_doc_id is deprecated and also prints None, and the relationships in the node info are an empty object.
How can we return two file ids with which we can later retrieve the related nodes from the doc_store and add them to a collection?