
Updated 9 months ago

Storage

I am trying a lot of things right now and I was wondering which kinds of indexes I can store in pg? I saw the documentation for the vector db, but what about the docstore, summary indexes, and so forth?
29 comments
Postgres currently only acts as a vector db.

So
  • no docstore
  • no index store
Okay, so from what I can see Redis or Mongo is the way to go?
Is there something I can't store in Redis or Mongo? I am looking for one storage to rule them all 🙂
Redis or Mongo would be the one to use then. IMO Redis probably has the easiest setup 🙂
That's what I am going for then, at least for now.
One more thing: is there anything, when it comes to persistence and indexing/storage, that won't work right now with a Redis solution (except saving streamed chat responses in the chat store, which we talked about in another thread)?
nope, it should all be working 🤔
I did test a lot, but I might have done it wrong; I wrote some details in the other thread (about the chat store in general).
@Logan M -- do you mind just expanding on what you mean by "no docstore" and "no index store"? My understanding is that there is no code in LlamaIndex to make those features work on top of pgvector, or am I missing something deeper?

@galogarciaiii 👀
right, there's no PGVectorDocstore or PGVectorIndexStore in llama-index. Not saying there couldn't be either, but imo throwing so much text into a sql db feels kind of... bad?

Open to contributions though; I imagine using SQLAlchemy it would be easy to create a generic DatabaseDocstore even
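(For illustration only: nothing like this exists in llama-index today. A minimal sketch of the SQLAlchemy-backed key-value layer that such a generic DatabaseDocstore could sit on top of; the DatabaseKVStore/KVRow names, the table name, and the put/get/delete shape are all assumptions, not an actual API.)

```python
# Hypothetical sketch only -- not an existing llama-index class.
# A generic SQL-backed key-value table that a docstore/index store
# could be built on, using SQLAlchemy so it works against Postgres
# (or any other supported database).
import json
from typing import Optional

from sqlalchemy import Column, String, Text, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class KVRow(Base):
    """One row per (collection, key); the value is a serialized JSON blob."""
    __tablename__ = "llama_kvstore"  # table name is made up
    collection = Column(String, primary_key=True)
    key = Column(String, primary_key=True)
    value = Column(Text, nullable=False)


class DatabaseKVStore:
    """Barebones put/get/delete, the shape a docstore backend would need."""

    def __init__(self, uri: str = "postgresql+psycopg2://user:pass@localhost/db"):
        self.engine = create_engine(uri)
        Base.metadata.create_all(self.engine)

    def put(self, key: str, val: dict, collection: str = "docstore") -> None:
        with Session(self.engine) as session:
            # merge() acts as an upsert on the composite primary key
            session.merge(KVRow(collection=collection, key=key, value=json.dumps(val)))
            session.commit()

    def get(self, key: str, collection: str = "docstore") -> Optional[dict]:
        with Session(self.engine) as session:
            row = session.get(KVRow, (collection, key))
            return json.loads(row.value) if row else None

    def delete(self, key: str, collection: str = "docstore") -> bool:
        with Session(self.engine) as session:
            row = session.get(KVRow, (collection, key))
            if row is None:
                return False
            session.delete(row)
            session.commit()
            return True
```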
Got it, that makes sense. Based on my (maybe) faulty memory Postgres has pretty efficient blob/JSON storage, and basically the inlined data type is just a pointer to an address on disk, so like if it's a case of just jamming JSON into Postgres I think that's fairly innocuous?
(i.e. like your rows don't end up having some huge memory footprint for these datatypes, they just point at the blob, the downside being that you can't efficiently do filter operations on them or like build clustered indices on those)
TI(re)L that Postgres does not in fact have clustered indexes, but the point stands since indexing the contents of a blob in a heap... doesn't make sense at all
Ah I see, I think that makes sense. Since it just has to act as a key-val store, maybe that will work fine?
https://www.postgresql.org/docs/current/datatype-json.html

woah, there is actually indexing support, although obviously it's a bit jank since there isn't any schema enforcement on the blob
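(Side note, for concreteness: a small sketch of the JSONB indexing that docs page describes, run here through SQLAlchemy. The table and column names are made up, and this is just an illustration of the Postgres feature, not anything llama-index does.)

```python
# Illustration of Postgres JSONB + GIN indexing (hypothetical table/columns).
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:pass@localhost/db")

with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE IF NOT EXISTS nodes (id TEXT PRIMARY KEY, data JSONB)"
    ))
    # GIN index so containment queries on the blob don't scan the whole heap
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_nodes_data ON nodes USING GIN (data)"
    ))
    # Containment query (@>) that can use the GIN index
    rows = conn.execute(
        text("SELECT id FROM nodes WHERE data @> CAST(:doc AS JSONB)"),
        {"doc": '{"doc_id": "abc"}'},
    ).fetchall()
```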
> Since it just has to act as a key-val store, maybe that will work fine?
My (very naive) impression is that the idea would be that the nodes would be stored as their JSON, indexed by the vector representation. I guess this gets a bit tricky because you need to run certain operations on the database to ensure the schema is set up correctly.
This is strongly preferable in my view for our application, because entries in our DB can be created in standard CRUD ways (in fact that might be much of the user interaction), but then we have this LLM-powered piece doing work on behalf of the user in the background that needs access to the same data, indexed by vector representation.
Maybe I'm missing what you are proposing, but let me take a step back and explain how it currently works in llama-index
  • PGVectorStore - stores the embedding, metadata, as well as a serialized JSON of the node. By default, since the vector store has stores_text=True as an attribute, the docstore and index store are not used when PGVectorStore is used (see the sketch after this list)
  • The docstore - stores serialized nodes, as well as metadata about data inserted that is useful for managing document upserts
  • The index store - basically just stores the node-ids available to an index
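(A minimal sketch of that default setup, assuming the documented PGVectorStore.from_params API; connection details are placeholders and import paths may differ between llama-index versions.)

```python
# Default behaviour: PGVectorStore has stores_text=True, so the serialized
# nodes live in the pgvector table and the docstore/index store go unused.
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores import PGVectorStore

vector_store = PGVectorStore.from_params(
    database="vectordb",          # placeholder connection details
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    table_name="llama_nodes",
    embed_dim=1536,               # must match your embedding model
)

documents = SimpleDirectoryReader("./data").load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```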
So if you wanted to use postgres for all of this, you would need to
  • stop serializing the node in the vector store class
  • implement a docstore and index store class for postgres
  • set store_nodes_override=True in the VectorStoreIndex constructor so that the docstore/index store are used (sketch below)
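(Hedged sketch of that last bullet, reusing the vector_store/documents from the sketch above; store_nodes_override is a documented VectorStoreIndex argument, but the postgres docstore/index store it would pair with doesn't exist yet, so the in-memory defaults stand in here.)

```python
# Force nodes to also be tracked in the docstore/index store, even though the
# vector store stores text. Today these default to the in-memory stores; a
# postgres-backed docstore/index store would be the missing piece.
index = VectorStoreIndex.from_documents(
    documents,                        # from the sketch above
    storage_context=storage_context,  # from the sketch above
    store_nodes_override=True,
)
```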
> but let me take a step back and explain how it currently works in llama-index
Very helpful, thank you for doing this!
> By default, since the vector store has stores_text=True as an attribute, the docstore and index store are not used when PGVectorStore is used
If I am interpreting this correctly we don't need to use the docstore and index store since this can just be materialized from the vector database, in this case PG?
Mostly correct. The real advantage of the docstore is it can act as a document management layer on top of a vector db, so that upserts can actually happen properly.

For example, in the ingestion pipeline, the docstore keeps track of IDs and hashes (but not the actual text content), so that you can properly upsert into a vector db
https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline.html

The reason an extra layer is needed is that documents are always chunked into nodes, so we need to keep track of when an input document's contents have changed (this is what the docstore is doing; see the sketch below)
If you don't need this functionality, then you don't need a docstore 👍
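(For reference, a hedged sketch of the upsert-tracking pipeline that docs page describes; module paths are from the 0.9.x-era docs and may have moved since, and vector_store is the PGVectorStore from the earlier sketch.)

```python
# The docstore records document IDs + hashes, so re-running the pipeline
# skips unchanged documents and upserts changed ones into the vector store.
from llama_index import Document
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import SentenceSplitter
from llama_index.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    docstore=SimpleDocumentStore(),   # swap in a persistent docstore in practice
    vector_store=vector_store,        # e.g. the PGVectorStore from above
)

pipeline.run(documents=[Document(text="hello", doc_id="doc_1")])
# Running again with the same doc_id but changed text triggers an upsert;
# unchanged documents are skipped.
pipeline.run(documents=[Document(text="hello, world", doc_id="doc_1")])
```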
Right, that makes sense, thanks again for taking the time to explain.
And the stuff around using Redis as the backing store for the index store is to add a persistence layer underneath an in-memory data structure that you construct from what's in your vector database (where the nodes are)?
uhh yes, I think that makes sense 👍
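(To make the Redis-as-backing-store idea concrete: a hedged sketch using the documented Redis docstore/index store constructors; hosts, ports, and namespaces are placeholders, and import paths may vary by version.)

```python
# Redis persists what would otherwise be in-memory docstore/index store data;
# the vector store (pgvector, etc.) still holds the embeddings.
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.docstore import RedisDocumentStore
from llama_index.storage.index_store import RedisIndexStore

storage_context = StorageContext.from_defaults(
    docstore=RedisDocumentStore.from_host_and_port(
        host="127.0.0.1", port=6379, namespace="llama_docstore"
    ),
    index_store=RedisIndexStore.from_host_and_port(
        host="127.0.0.1", port=6379, namespace="llama_index_store"
    ),
    vector_store=vector_store,  # e.g. the PGVectorStore from the earlier sketch
)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    store_nodes_override=True,  # so the docstore/index store actually get used
)
```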
I remain pretty confused (and consequently probably asking stupid questions) about the role the index store plays in LlamaIndex if you are using a vector database, and I didn't find this helpful:

https://docs.llamaindex.ai/en/stable/module_guides/storing/index_stores.html
The role of the docstore and the vector database, and the additional features the docstore provides if your data is updating, make perfect sense; thanks for explaining that.
The index store is essentially not needed if you are using a vector db that supports namespaces/collections

It's really only there for our default vector store, as well as for other index types. Since it's keeping track of node ids in the simple vector store, it allows different indexes to share the same vector store

It basically contains the "structure" -- but the structure isn't very interesting for a vector index.
Okay that helps clarify things, thanks @Logan M
https://discord.com/channels/1059199217496772688/1196534914674344036/1196949952606248970

Also this proposal makes perfect sense, and we will revisit it when we need that functionality. Currently we are showing prospective customers early iterations and so "online" functionality is not needed, but will be.