I have prototyped a retrieval system

I have prototyped a retrieval system using the basic SimpleStore classes. I have persisted the data to disk. Now, I want to copy all that data into corresponding RedisStores. How would I do this?

For example, here are my two storage contexts:

Simple:
Plain Text
StorageContext(docstore=<llama_index.core.storage.docstore.simple_docstore.SimpleDocumentStore object at 0x1497fd570>, index_store=<llama_index.core.storage.index_store.simple_index_store.SimpleIndexStore object at 0x1497fde40>, vector_stores={'image': <llama_index.core.vector_stores.simple.SimpleVectorStore object at 0x1497fe890>, 'default': <llama_index.core.vector_stores.simple.SimpleVectorStore object at 0x1497fe9e0>}, graph_store=<llama_index.core.graph_stores.simple.SimpleGraphStore object at 0x1497fde70>)


Redis:
Plain Text
StorageContext(docstore=<llama_index.storage.docstore.redis.base.RedisDocumentStore object at 0x1497fe4d0>, index_store=<llama_index.storage.index_store.redis.base.RedisIndexStore object at 0x1497fe110>, vector_stores={'default': RedisVectorStore(stores_text=True, is_embedding_query=True, stores_node=True, flat_metadata=False), 'image': <llama_index.core.vector_stores.simple.SimpleVectorStore object at 0x1497fe5c0>}, graph_store=<llama_index.core.graph_stores.simple.SimpleGraphStore object at 0x1497fe4a0>)
It's pretty hard to do. It is possible, but imo I would just re-create it in Redis; it's not really worth the effort to dig into the low-level APIs here πŸ˜…
Yikes, I assumed that since the stores are abstracted, it would be pretty easy to move data from one implementation to another...

Why do all the document parsing & embedding work (some of which is non-trivial, for instance, in the case of videos & images) every time you want to move where your indexes are stored?!
Also, the fact that the SimpleStores are serializable to disk means they could be used as an easily transferable "backup" of the data for any implementation.
I think people moving where their indexes are stored is not that common πŸ˜…

I didn't say it was impossible though. Basically it would look like this (I think):

Plain Text
nodes = list(index.docstore.docs.values())  # .docs is a dict of node_id -> node
for node in nodes:
    # SimpleVectorStore.get() returns the stored embedding for a given node_id
    node.embedding = index.vector_store.get(node.node_id)

vector_store.add(nodes)
index = VectorStoreIndex.from_vector_store(vector_store)


The non-simple vector stores keep the nodes and embeddings in the vector store itself, so that's all you need
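For reference, a fuller version of that sketch might look like the following. It assumes the simple stores were persisted with storage_context.persist() to ./storage and that a Redis instance is reachable at the URL shown; the specific names and paths are illustrative, not from the thread.

Plain Text
from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.vector_stores.redis import RedisVectorStore

# Reload the index that was persisted with the Simple* stores.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# Pull every node out of the simple docstore and re-attach its embedding,
# which the SimpleVectorStore keeps separately, keyed by node_id.
nodes = list(index.docstore.docs.values())
for node in nodes:
    node.embedding = index.vector_store.get(node.node_id)

# Write nodes + embeddings into Redis and rebuild the index on top of it.
redis_vector_store = RedisVectorStore(redis_url="redis://localhost:6379")
redis_vector_store.add(nodes)
index = VectorStoreIndex.from_vector_store(redis_vector_store)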
nice, and I assume that if you have a docstore too, you would call add_documents on that as well?
or should the add on the "new" vector_store add to the docstore as well (if it exists)?
So, with most vector db integrations, the docstore is optional, since all the nodes are stored in the vector store already
So it's not really used unless you have a need to look up your chunks easily by key/val
wouldn't the docstore be useful for upsert-style behaviors? Isn't that where the document hash <-> doc_id mapping is stored?
you got it, that's the other time where it's handy
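To make that concrete, the docstore's hash bookkeeping looks roughly like this; a hedged sketch, assuming doc is a Document and docstore is any of the key-value docstore implementations (e.g. RedisDocumentStore):

Plain Text
# The docstore keeps a doc_id -> hash mapping that upsert logic compares against.
stored_hash = docstore.get_document_hash(doc.doc_id)
if stored_hash != doc.hash:
    # New or changed content: (re-)store the doc and record its new hash.
    docstore.add_documents([doc])
    docstore.set_document_hash(doc.doc_id, doc.hash)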
In general I prefer the flow of attaching the docstore and vector store to an ingestion pipeline
Plain Text
pipeline = IngestionPipeline(
  transformations=[SentenceSplitter(), OpenAIEmbedding()], 
  docstore=docstore, 
  vector_store=vector_store
)

pipeline.run(documents=documents)
This handles the upserts for you
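If you want to pin down the upsert behavior rather than rely on the default, the pipeline also accepts an explicit strategy; a sketch, assuming the docstore, vector_store, and documents from earlier in the thread:

Plain Text
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# UPSERTS re-processes a document only when its hash differs from the
# one already recorded in the docstore.
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    docstore=docstore,
    vector_store=vector_store,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
pipeline.run(documents=documents)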
oh very cool, I'll read up on IngestionPipelines. thanks
It looks like perhaps I could build a "transfer documents" IngestionPipeline for transferring from the Simple stores to, for example, the Redis stores.
I built an ingestion pipeline with Postgres, and it seems the doc store properly updates when a file is modified, but a new entry is added to the vector store every time a file is updated... thoughts?

Alternatively, can I just get rid of the doc store and handle upserts another way?

Using PostgresDocumentStore fwiw
Do you have consistent doc_ids on the input? It works fine for me

You can definitely handle upserts another way though, if you had something in mind
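One common way to get stable doc_ids is to derive them from the file path; a sketch, with the directory name assumed:

Plain Text
from llama_index.core import SimpleDirectoryReader

# filename_as_id=True makes each Document's id its file path, so re-running
# the pipeline on a modified file updates the existing entry instead of
# inserting a new one.
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()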
Yeah, I thought I was using the file name as the id, let me double-check I didn't mess that up
So it's not consistent, but I'm not quite sure why... Working through it
So the end of the script is...
Plain Text
nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes, storage_context=storage_context)

Is that messing with it? Seems pipeline.run is all that should be required, as I define the doc and vector store in the pipeline.

When I try it like the docs, the vector_store doesn't get populated: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/#connecting-to-vector-databases
πŸ€” That should be what's needed there (assuming you didn't attach the vector store to the pipeline)
I'm probably missing the full picture here
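For comparison, the pattern from the linked docs page, where the vector store is attached to the pipeline, skips the storage_context step entirely; a sketch, assuming vector_store is the same instance the pipeline writes to:

Plain Text
from llama_index.core import VectorStoreIndex

# run() writes the processed nodes into the attached vector store directly.
nodes = pipeline.run(documents=documents)

# Build the query index on top of the already-populated vector store;
# passing nodes plus a storage_context here could insert them a second time.
index = VectorStoreIndex.from_vector_store(vector_store)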
Yeah sorry for not sending the full code, typing on phone. I'll keep playing with it. If I have more issues I'll give you the full picture. Thanks a bunch as always!
sounds good!