
Updated 3 months ago

Documents

Given an ingestion pipeline with Vector+DocStore+IngestionCache and DocstoreStrategy=UPSERTS, run over Documents loaded from a recursive directory:

If I run this same ingestion pipeline with Documents set to one single file, what would happen?

Will the other docs be deleted? (I know UPSERTS usually is just UPDATE+INSERT, but just checking.)

If the single document file existed in the full processing run, will it be recognized so that only the update is performed?

---------------------

Similar question: if I wanted to run a completely different source of documents, like YouTube transcripts, into the same Vector Collection, would both ingestion pipelines be able to work without stepping on each other's embeddings?
The deduplication is done at the document level. So running again with everything stuffed into a single document file would cause it to be inserted again
Not sure what you mean by stepping on each other's embeddings
Sorry, not everything in one document. Just taking one of the documents from the directory and rerunning it by itself
I think based on your answer that situation would be fine
Oh, in this case it would be skipped if it is unchanged
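For anyone who wants to see the idea in code, here is a minimal toy sketch of upsert-style deduplication keyed by document id and content hash. It is not LlamaIndex's implementation: the `upsert` function, the `Action` enum, and the plain dict standing in for the docstore are all hypothetical names made up for illustration. It only shows why rerunning with a single unchanged document is skipped while the other documents are left alone.

```python
from __future__ import annotations

import hashlib
from enum import Enum


class Action(Enum):
    INSERTED = "inserted"
    UPDATED = "updated"
    SKIPPED = "skipped"


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert(docstore: dict[str, str], documents: dict[str, str]) -> dict[str, Action]:
    """Toy upsert: `docstore` maps doc id -> stored hash (hypothetical stand-in),
    `documents` maps doc id -> text for the current run only."""
    results: dict[str, Action] = {}
    for doc_id, text in documents.items():
        new_hash = content_hash(text)
        old_hash = docstore.get(doc_id)
        if old_hash == new_hash:
            # Same id, same content: nothing to do.
            results[doc_id] = Action.SKIPPED
            continue
        # Unknown id -> insert; known id with different content -> update.
        results[doc_id] = Action.INSERTED if old_hash is None else Action.UPDATED
        docstore[doc_id] = new_hash
    # Ids that are absent from this run are never touched.
    return results


docstore: dict[str, str] = {}
full_run = upsert(docstore, {"a.txt": "alpha", "b.txt": "beta"})
single_unchanged = upsert(docstore, {"a.txt": "alpha"})
single_changed = upsert(docstore, {"a.txt": "alpha v2"})
```

The sketch only works if the document id is the same on every run, for example one derived from the file path. With `SimpleDirectoryReader`, passing `filename_as_id=True` is one way to get that, but check it against the version you have installed.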
My scenario is that I have a main process that ingests the whole directory. It is time-consuming to rerun, simply because of the number of documents.

If someone uploads a document, I want to trigger it to run just for that one document as a secondary process
I think this will work well based on your message. Thanks Logan
A tip for anyone else building this in the future.

https://filebrowser.org/installation

It has hooks to run scripts on file upload. I believe it will work well for doc management.
@Logan M - This probably warrants a separate question, but the setup above gives context.

How would you manage the removal of a document from the DocStore/Vector/Cache?

I can do this manually in the VectorStore by using search, but that leaves the DocStore with a reference to a file that no longer exists.

Does LlamaIndex assist with this, or does one need to manually call into the docstore/vectorstore and do the deletion by hand when a file is deleted?
I think you'll need to call delete manually on the other services yea 🤔
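A rough sketch of what that manual cleanup could look like, for anyone landing here later. The helper `remove_document` is hypothetical. It assumes the vector store exposes `delete(ref_doc_id)` and the docstore exposes `delete_ref_doc(ref_doc_id, raise_error=...)`, which is my understanding of the LlamaIndex interfaces, so verify both against your installed version. The `Fake*` classes are stand-ins so the example runs without any services, and the ingestion cache is not handled here.

```python
class FakeVectorStore:
    """Stand-in for a vector store; maps ref_doc_id -> list of embeddings."""

    def __init__(self):
        self.embeddings = {}

    def delete(self, ref_doc_id, **kwargs):
        self.embeddings.pop(ref_doc_id, None)


class FakeDocstore:
    """Stand-in for a docstore; maps ref_doc_id -> list of node ids."""

    def __init__(self):
        self.ref_docs = {}

    def delete_ref_doc(self, ref_doc_id, raise_error=True):
        if ref_doc_id not in self.ref_docs:
            if raise_error:
                raise ValueError(f"ref_doc_id {ref_doc_id} not found")
            return
        del self.ref_docs[ref_doc_id]


def remove_document(ref_doc_id, vector_store, docstore):
    """Hypothetical helper: remove one source document from both stores.

    Deleting from both places avoids leaving the docstore pointing at a
    file whose embeddings are already gone.
    """
    vector_store.delete(ref_doc_id)
    # raise_error=False keeps the call safe to repeat for the same id.
    docstore.delete_ref_doc(ref_doc_id, raise_error=False)


vector_store = FakeVectorStore()
docstore = FakeDocstore()
vector_store.embeddings = {"a.txt": [[0.1, 0.2]], "b.txt": [[0.3, 0.4]]}
docstore.ref_docs = {"a.txt": ["node-1"], "b.txt": ["node-2"]}

remove_document("a.txt", vector_store, docstore)
remove_document("a.txt", vector_store, docstore)  # repeating is harmless
```

Something like this could be called from the same file-upload/delete hook mentioned above, using the same id that was used at ingestion time.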