Filename

Hi Team LlamaIndex,
We're playing around with the docstore and the ingestion pipeline, and two questions have come up. First, we notice that when we pass both an IngestionCache and a vector store to the IngestionPipeline, the embeddings are saved in the IngestionCache. Is this intended behaviour?
Second, when we use the simple snippet below to upload documents, add them to the doc_store, and run them through the pipeline, it results in more documents/nodes than files uploaded. We see that these are split per page. However, how can we identify the relationships between these nodes, so that we can return general file IDs and later add all related nodes to an index?

Plain Text
 
import shutil
import tempfile
from typing import List

from fastapi import File, HTTPException, UploadFile
from llama_index import SimpleDirectoryReader

# Note: mongodb_docstore and pipeline are initialized elsewhere in the app.


async def add_documents(
    files: List[UploadFile] = File(...),
) -> List[str]:
    try:
        print(len(files))
        # Write the uploads to a temp dir so SimpleDirectoryReader can pick them up
        with tempfile.TemporaryDirectory() as tempdir:
            for file in files:
                with open(f"{tempdir}/{file.filename}", "wb+") as buffer:
                    shutil.copyfileobj(file.file, buffer)
            reader = SimpleDirectoryReader(tempdir)
            documents = reader.load_data(show_progress=True)
            for document in documents:
                # get_node_info is a method; call it rather than printing the bound method
                print(document.get_node_info())
                print(document.ref_doc_id)
            print(len(documents))
            # Persist the documents and run them through the ingestion pipeline
            mongodb_docstore.add_documents(documents)
            await pipeline.arun(documents=documents)
            return [document.doc_id for document in documents]
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


So we can see that len(files) matches the number of files uploaded to the route, and show_progress confirms that 2 files are indeed loaded. But the for loop executes as many times as there are pages. ref_doc_id is deprecated and also shows None, and relationships in the node info is an empty object as well.
How can we return two file IDs with which we can later retrieve the nodes from the doc_store and add them to a collection?
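For reference, a minimal sketch of pulling related nodes back out of the docstore later, assuming the pipeline's output nodes were also added to it (get_all_ref_doc_info, get_node, and RefDocInfo.node_ids come from the standard docstore API, not from the snippet above):

Plain Text
 
# Assuming mongodb_docstore also holds the nodes produced by the pipeline,
# i.e. the pipeline output was added via mongodb_docstore.add_documents(nodes)
ref_doc_infos = mongodb_docstore.get_all_ref_doc_info() or {}

for ref_doc_id, info in ref_doc_infos.items():
    # Each RefDocInfo lists the node IDs that were parsed from that source document
    nodes = [mongodb_docstore.get_node(node_id) for node_id in info.node_ids]
    print(ref_doc_id, len(nodes))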
Documents coming from the same file will get the same doc_id, with an extra iteration value appended for each part
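For example, a rough sketch assuming SimpleDirectoryReader's filename_as_id option, which derives the document IDs from the file path so the per-page parts can be grouped back to their source file (the ./data directory is just a placeholder):

Plain Text
 
from collections import defaultdict

from llama_index import SimpleDirectoryReader

# With filename_as_id=True, multi-page files produce IDs along the lines of
# "<path>/report.pdf_part_0", "<path>/report.pdf_part_1", ...
reader = SimpleDirectoryReader("./data", filename_as_id=True)
documents = reader.load_data()

# Group the per-page documents back to their source file
by_file = defaultdict(list)
for document in documents:
    file_name = document.metadata.get("file_name")  # set by the default file metadata
    by_file[file_name].append(document.doc_id)

print(dict(by_file))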
Ahh yes, that last part was not clear to me. I was struggling to see how using the file name as the ID would work, since the IDs would then not be unique.
Thanks for clearing up that part of the question!
Wait, I only read this now. There are more questions? 😅
First, we notice that when we pass both an IngestionCache and a vector store to the IngestionPipeline, the embeddings are saved in the IngestionCache. Is this intended behaviour? 😉
It is the intended behaviour -- the cache is simply caching the outputs of each transformation step, which is just a list of nodes (and whatever is attached to them). This way, if you re-run, it can use the cached results (if the input to the embeddings step is the same) rather than re-calculating
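A rough sketch of that behaviour; the import paths follow the pre-0.10 package layout, and the splitter/embedding choices here are assumptions rather than anything from this thread:

Plain Text
 
from llama_index import Document
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionCache, IngestionPipeline
from llama_index.text_splitter import SentenceSplitter

# In-memory cache by default; each transformation's output (a list of nodes,
# embeddings included) is stored under a hash of its input
cache = IngestionCache()

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding(),
    ],
    cache=cache,
)

documents = [Document(text="hello world")]

nodes = pipeline.run(documents=documents)        # computes splits + embeddings
nodes_again = pipeline.run(documents=documents)  # same inputs, served from the cache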
Should we then not also pass in a vector store for the ingestion pipeline? I assume it only becomes useful when creating a vector index, which it constructs from the nodes with embeddings coming out of the IngestionCache.
I mean, it's fine to pass in a vector store? Like, nodes will get added to the vector store once the pipeline finishes running

The cache stores embeddings merely so that it can skip calling the API, if it encounters the same nodes.

This means your vector store might get nodes added twice, if you didn't delete/re-create the store
Sorry if I misunderstand: does this mean that if we pass in the same vector store to both a pipeline and an index, the nodes are inserted twice?
So this would be the flow: initialize doc_store, vector_store and ingestion_cache -> upload files -> run through the pipeline (with vector_store and cache) and store in doc_store -> create an index (with the same vector_store) from the nodes that came out of the pipeline.
Does it now insert the nodes twice in the vector store?
Yup, you got it, that would insert twice

I have a WIP PR to improve this 🙂
https://github.com/run-llama/llama_index/pull/9135
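In the meantime, a sketch of one way to avoid the double insert, assuming the setup from earlier in the thread (a pipeline with vector_store attached, plus the loaded documents): attach the vector store only to the pipeline and build the index directly on top of the store, rather than passing the returned nodes into the index constructor.

Plain Text
 
from llama_index import VectorStoreIndex

# The pipeline already writes the embedded nodes into vector_store as it runs
nodes = pipeline.run(documents=documents)

# Build the index on top of the already-populated vector store; constructing
# VectorStoreIndex(nodes=nodes) here would insert the same nodes a second time
index = VectorStoreIndex.from_vector_store(vector_store)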
Ahh I see! Thanks for taking the time to explain it to me. I'll be looking forward to being able to pass in the doc_store :)!