Hi all, I'm using the ingestion pipeline with pgvector. I noticed that if I use a docstore with it, it caches properly and doesn't insert duplicate entries, but only until I restart the process. In other words, if I work in a REPL, I can re-run the pipeline as many times as I want and I will only see one entry in my DB. However, if I stop it and start a new REPL, it seems to generate a new document ID. It looks like it knows the hash of the content and could do an UPSERT, but it doesn't seem to be using it. Am I going about this incorrectly somehow? I'm happy to post example code.
For this to work well, the document ID needs to remain consistent. I would make sure you are loading your data with the same document IDs each time; otherwise the hash lookup will fail.
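To make that concrete, here is a rough sketch of the idea using only the standard library. It is not the library's actual implementation: `stable_doc_id`, `plan_ingest`, the namespace URL, and the plain dict standing in for a persisted docstore are all hypothetical names made up for illustration. The point is just that an ID derived from something stable (like the source path) lets the hash lookup find the earlier entry after a restart, while a freshly generated random ID never will.

```python
import hashlib
import uuid

# Hypothetical fixed namespace (an assumption for this sketch); any constant
# UUID works as long as it never changes between runs.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/my-ingestion")


def stable_doc_id(source_path: str) -> str:
    """Derive the same document ID for the same source on every run."""
    return str(uuid.uuid5(NAMESPACE, source_path))


def content_hash(text: str) -> str:
    """Hash the document content so changes can be detected."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingest(docstore: dict, doc_id: str, text: str) -> str:
    """Toy stand-in for a docstore check keyed by document ID.

    Returns 'insert' for an unseen ID, 'skip' when the stored hash matches,
    and 'update' when the ID is known but the content changed.
    """
    new_hash = content_hash(text)
    old_hash = docstore.get(doc_id)
    docstore[doc_id] = new_hash
    if old_hash is None:
        return "insert"
    return "skip" if old_hash == new_hash else "update"


# The dict plays the role of a docstore that survives restarts (e.g. in Postgres).
persisted = {}

# Run 1 and run 2 use an ID derived from the path, so the second run is a no-op.
first = plan_ingest(persisted, stable_doc_id("notes/a.txt"), "hello")
second = plan_ingest(persisted, stable_doc_id("notes/a.txt"), "hello")

# A random ID (what you effectively get when IDs are regenerated per process)
# misses the lookup even though the content hash is identical.
random_run = plan_ingest(persisted, str(uuid.uuid4()), "hello")

print(first, second, random_run)
```

In the real pipeline you would set the derived value as the document's ID when you build your documents, so that each run hands the docstore the same ID for the same source.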
Also, if you're in the codebase, I made two pull requests that fix an incorrect comment and a broken example in the docs. I'm happy to add an example too, as I figured out how to get Postgres working as a docstore, which there isn't an example for yet.