I have prototyped a retrieval system

I have prototyped a retrieval system using the basic SimpleStore classes. I have persisted the data to disk. Now, I want to copy all that data into corresponding RedisStores. How would I do this?

For example, here are my two storage contexts:

Simple:
Plain Text
StorageContext(docstore=<llama_index.core.storage.docstore.simple_docstore.SimpleDocumentStore object at 0x1497fd570>, index_store=<llama_index.core.storage.index_store.simple_index_store.SimpleIndexStore object at 0x1497fde40>, vector_stores={'image': <llama_index.core.vector_stores.simple.SimpleVectorStore object at 0x1497fe890>, 'default': <llama_index.core.vector_stores.simple.SimpleVectorStore object at 0x1497fe9e0>}, graph_store=<llama_index.core.graph_stores.simple.SimpleGraphStore object at 0x1497fde70>)


Redis:
Plain Text
StorageContext(docstore=<llama_index.storage.docstore.redis.base.RedisDocumentStore object at 0x1497fe4d0>, index_store=<llama_index.storage.index_store.redis.base.RedisIndexStore object at 0x1497fe110>, vector_stores={'default': RedisVectorStore(stores_text=True, is_embedding_query=True, stores_node=True, flat_metadata=False), 'image': <llama_index.core.vector_stores.simple.SimpleVectorStore object at 0x1497fe5c0>}, graph_store=<llama_index.core.graph_stores.simple.SimpleGraphStore object at 0x1497fe4a0>)
It's pretty hard to do. It is possible, but imo I would just re-create it in Redis; it's not really worth the effort to dig into the low-level APIs here πŸ˜…
Yikes, I assumed that since the stores are abstracted, it would be pretty easy to move data from one implementation to another...

Why do all the document parsing & embedding work (some of which is non-trivial, for instance, in the case of videos & images) every time you want to move where your indexes are stored?!
Also, the fact that the SimpleStores are serializable to disk means they could be used as an easily transferable "backup" of the data for any implementation.
I think people moving where their indexes are stored is not that common πŸ˜…

I didn't say it was impossible though. Basically it would look like this (I think):

Plain Text
nodes = list(index.docstore.docs.values())  # .docs is a dict of node_id -> node
for node in nodes:
    # SimpleVectorStore.get() returns the stored embedding for a given node_id
    node.embedding = index.vector_store.get(node.node_id)

vector_store.add(nodes)
index = VectorStoreIndex.from_vector_store(vector_store)


The non-simple vector stores keep the nodes and embeddings in the vector store itself, so that's all you need
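For reference, a fuller version of that sketch might look like the following. It assumes the simple stores were persisted with storage_context.persist() to ./storage and that a Redis instance is reachable at the URL shown; the specific names and paths are illustrative, not from the thread.

Plain Text
from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.vector_stores.redis import RedisVectorStore

# Reload the index that was persisted with the Simple* stores.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Pull every node out of the simple docstore and re-attach its embedding,
# which the SimpleVectorStore keeps separately, keyed by node_id.
nodes = list(index.docstore.docs.values())
for node in nodes:
    node.embedding = index.vector_store.get(node.node_id)

# Write nodes + embeddings into Redis and rebuild the index on top of it.
redis_vector_store = RedisVectorStore(redis_url="redis://localhost:6379")
redis_vector_store.add(nodes)
index = VectorStoreIndex.from_vector_store(redis_vector_store)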
nice, and I assume that if you have a docstore too, you would call add_documents on that as well?
or should the add on the "new" vector_store add to the docstore as well (if it exists)?
So, with most vector db integrations, the docstore is optional, since all the nodes are stored in the vector store already
So it's not really used unless you have a need to look up your chunks easily by key/val
wouldn't the docstore be useful for upsert-style behaviors? Isn't that where the document hash <-> doc_id mapping is stored?
you got it, that's the other time where it's handy
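To make that concrete, the docstore's hash bookkeeping looks roughly like this; a hedged sketch, assuming doc is a Document and docstore is any of the key-value docstore implementations (e.g. RedisDocumentStore):

Plain Text
# The docstore keeps a doc_id -> hash mapping that upsert logic compares against.
stored_hash = docstore.get_document_hash(doc.doc_id)
if stored_hash != doc.hash:
    # New or changed content: (re-)store the doc and record its new hash.
    docstore.add_documents([doc])
    docstore.set_document_hash(doc.doc_id, doc.hash)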
In general I prefer the flow of attaching the docstore and vector store to an ingestion pipeline
Plain Text
pipeline = IngestionPipeline(
  transformations=[SentenceSplitter(), OpenAIEmbedding()], 
  docstore=docstore, 
  vector_store=vector_store
)

pipeline.run(documents=documents)
This handles the upserts for you
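If you want to pin down the upsert behavior rather than rely on the default, the pipeline also accepts an explicit strategy; a sketch, assuming the docstore, vector_store, and documents from earlier in the thread:

Plain Text
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# UPSERTS re-processes a document only when its hash differs from the
# one already recorded in the docstore.
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    docstore=docstore,
    vector_store=vector_store,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
pipeline.run(documents=documents)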
oh very cool, I'll read up on IngestionPipelines. thanks
It looks like perhaps I could build a "transfer documents" IngestionPipeline for transferring from the Simple stores to, for example, the Redis stores.
I built an ingestion pipeline with Postgres, and it seems the doc store properly updates when a file is modified, but a new entry is added to the vector store every time a file is updated... thoughts?

Alternatively, can I just get rid of the doc store and handle upserts another way?

Using PostgresDocumentStore fwiw
Do you have consistent doc_ids on the input? It works fine for me

You can definitely handle upserts another way though, if you had something in mind
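One common way to get stable doc_ids is to derive them from the file path; a sketch, with the directory name assumed:

Plain Text
from llama_index.core import SimpleDirectoryReader

# filename_as_id=True makes each Document's id its file path, so re-running
# the pipeline on a modified file updates the existing entry instead of
# inserting a new one.
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()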
Yeah, I thought I was using the file name as the id, let me double-check I didn't mess that up
So it's not consistent, but I'm not quite sure why... Working through it
So the end of the script is...
Plain Text
nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes, storage_context=storage_context)

Is that messing with it? Seems pipeline.run is all that should be required, as I define the doc and vector store in the pipeline.

When I try it like the docs, the vector_store doesn't get populated: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/#connecting-to-vector-databases
πŸ€” That should be what's needed there (assuming you didn't attach the vector store to the pipeline)
I'm probably missing the full picture here
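For comparison, the pattern from the linked docs page, where the vector store is attached to the pipeline, skips the storage_context step entirely; a sketch, assuming vector_store is the same instance the pipeline writes to:

Plain Text
from llama_index.core import VectorStoreIndex

# run() writes the processed nodes into the attached vector store directly.
nodes = pipeline.run(documents=documents)

# Build the query index on top of the already-populated vector store;
# passing nodes plus a storage_context here could insert them a second time.
index = VectorStoreIndex.from_vector_store(vector_store)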
Yeah sorry for not sending the full code, typing on phone. I'll keep playing with it. If I have more issues I'll give you the full picture. Thanks a bunch as always!
sounds good!