Hi All, I'm using the ingestion pipeline with pgvector. I noticed that if I attach a docstore, it caches properly and doesn't try to insert duplicate entries until I restart the process. In other words, if I work in a REPL, I can re-run the pipeline as many times as I want and only see one entry in my DB. However, if I stop it and start a new REPL, it generates a new document ID. It looks like it knows the hash of the content and could do an UPSERT, but it doesn't seem to use it. Am I going about this incorrectly somehow? I'm happy to post example code.
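For context, here is a rough sketch of the kind of setup I mean; the connection details, table name, and embedding model are placeholders rather than my real code:

```python
from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.postgres import PGVectorStore

# Placeholder connection details -- swap in your own.
vector_store = PGVectorStore.from_params(
    database="vectordb",
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    table_name="demo_vectors",
    embed_dim=1536,
)

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),  # enables hash-based de-duplication
)

# Each new process gives this Document a fresh random doc_id, which is
# what seems to break the de-duplication across REPL restarts.
docs = [Document(text="hello world")]
pipeline.run(documents=docs)  # re-running in the same REPL inserts nothing new
```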
4 comments
For this to work well, the document ID needs to remain consistent. I would ensure you are loading your data with the same document IDs each time; otherwise the hash lookup will fail.
(also need to make sure you save/load the docstore)
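Something like this sketch, where "./pipeline_storage" is just an example path:

```python
# End of a session: persist the pipeline's docstore and cache to disk.
pipeline.persist("./pipeline_storage")

# Start of a new REPL: rebuild the pipeline, then restore its saved state
# so the hash lookup still recognizes previously ingested documents.
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    vector_store=vector_store,
    docstore=SimpleDocumentStore(),
)
pipeline.load("./pipeline_storage")
pipeline.run(documents=docs)  # unchanged documents are skipped, not re-inserted
```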
Is there a tool to help ensure I get the same document ID? It looks like Document just randomly assigns a UUID. Do you recommend I use something else?
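One option, assuming your documents come from files on disk, is to let the reader derive the ID from the file path instead of minting a UUID; otherwise you can set id_ yourself from any stable key you control. A sketch:

```python
from llama_index.core import Document, SimpleDirectoryReader

# Option 1: derive IDs from file paths, so the same file always maps to the
# same document ID across process restarts.
docs = SimpleDirectoryReader("./data", filename_as_id=True).load_data()

# Option 2: set the ID explicitly from a key you control (a made-up
# identifier is shown here).
doc = Document(text="hello world", id_="my-source-system/record-42")
```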
Also, if you're in the codebase, I made two pull requests fixing an incorrect comment and a broken example in the docs. I'm happy to add an example too, since I figured out how to get Postgres working as a docstore, which there isn't an example for yet.
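Here's roughly what that looks like, in case it's useful; the connection string is a placeholder, and the exact constructor arguments may vary by version:

```python
from llama_index.storage.docstore.postgres import PostgresDocumentStore

# Placeholder connection string; from_params() is an alternative constructor.
docstore = PostgresDocumentStore.from_uri(
    uri="postgresql://postgres:password@localhost:5432/vectordb",
    table_name="docstore",
)

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    vector_store=vector_store,
    docstore=docstore,  # de-duplication state now lives in Postgres, not in memory
)
```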