Hi everyone, I was wondering if someone can point me to tutorials or videos that walk through how to manage a vector store and index store once we've created it and then need to maintain it as new data comes in? The part I don't really get right now is: documents are chunked into nodes, so if data changes in the parent document, how do we propagate those changes to the nodes? In my case, only the metadata of the documents changes. When we add new documents, what are the steps required? I'm using ChromaDB to store the index and the embeddings, and I'm trying to build pipelines to handle the maintenance. Any tips or wisdom greatly appreciated!
You essentially need an extra layer to manage the documents in the vector db

There is some support for this in llama-index, but it requires overriding the index to also use the docstore/index_store with a kwarg. And then those are two extra files to manage on top of the vector store (although they also have remote options like MongoDB or Redis)

VectorStoreIndex.from_documents(documents, store_nodes_override=True)

Then, assuming you have a consistent way to set doc ids (e.g. filename_as_id=True in SimpleDirectoryReader, or some other manual process), you can use the refresh functionality detailed here
https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/index/document_management.html#refresh

A few more details and another thread link in this discord thread
https://discord.com/channels/1059199217496772688/1149373035153993728/1149375224769413241
Or, you can roll your own management layer πŸ™‚
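If you do roll your own, the heart of such a management layer is just a ledger mapping each doc id to a content hash, so each pipeline run can decide per document whether to insert, update, or skip. A minimal pure-Python sketch of that idea (the names here are illustrative, not LlamaIndex API):

```python
import hashlib
import json


def content_hash(text: str, metadata: dict) -> str:
    """Hash text + metadata together, so metadata-only changes are still detected."""
    payload = json.dumps({"text": text, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def plan_action(ledger: dict, doc_id: str, text: str, metadata: dict) -> str:
    """Decide what to do with an incoming document, given a doc_id -> hash ledger."""
    new_hash = content_hash(text, metadata)
    if doc_id not in ledger:
        ledger[doc_id] = new_hash
        return "insert"  # never seen before: embed and add nodes
    if ledger[doc_id] != new_hash:
        ledger[doc_id] = new_hash
        return "update"  # changed: delete the old nodes, re-insert
    return "skip"        # unchanged: nothing to do


ledger = {}
print(plan_action(ledger, "doc-1", "hello", {"topic": "a"}))  # insert
print(plan_action(ledger, "doc-1", "hello", {"topic": "a"}))  # skip
print(plan_action(ledger, "doc-1", "hello", {"topic": "b"}))  # update
```

Persist the ledger alongside the vector store and the pipeline becomes idempotent across daily runs.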
@Logan M awesome. OK, where I'm at is: I store the docstore, index store, and vector store through the storage context to a directory on the computer. Every day I scrape data from sources and write .json files that represent documents I need to incorporate into my LlamaIndex vector store, which is ChromaDB. a) Do I need to explicitly store documents into the docstore? b) I'm not manually setting the doc_id when I process documents; I wasn't sure what/where to do that. Do I need to manually create doc_ids? Each JSON I have to ingest has an id key, I just assumed LlamaIndex would use that.
a) If you are using chroma, you need to set that override above, otherwise it will be using the docstore already πŸ‘

b) Yea, you'll need to manually set it -- right now it's just randomly generated. You can set filename_as_id=True if the filenames are consistent, otherwise you'll have to parse out that ID you mentioned and set it
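Parsing that id out of the scraped files is plain Python. A sketch, assuming each .json file has "id" and "text" keys (those key names and the directory layout are assumptions from the description above):

```python
import json
from pathlib import Path


def load_records(json_dir: str):
    """Yield (doc_id, text, metadata) tuples from scraped .json files,
    using each file's own "id" key as the stable doc id."""
    for path in sorted(Path(json_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        doc_id = str(record["id"])  # assumed key in the scraped files
        text = record.get("text", "")
        # Everything else in the record becomes node metadata
        metadata = {k: v for k, v in record.items() if k not in ("id", "text")}
        yield doc_id, text, metadata


# Each tuple can then feed a LlamaIndex Document, e.g.:
#   Document(text=text, metadata=metadata, doc_id=doc_id)
```

With the doc_id set explicitly like this, refresh can match incoming documents against what's already in the index.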
roger dodger. thanks for the tips!
sorry Logan, I'm missing something.
I have the following, but I'm unclear how to connect the Chroma vector store to the llama index vectorstoreindex.
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
    persist_dir="./data/03_primary/",
)
vector_store = ChromaVectorStore(chroma_collection=storage_context.vector_store, persist_dir="./data/03_primary/")
ctx = ServiceContext.from_defaults(callback_manager=callback_manager, llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50), node_parser=sentence_window_parser)
sentence_index = VectorStoreIndex.from_documents(documents, service_context=ctx, storage_context=storage_context, store_nodes_override=True)
hmmm, just a small error in the code I think. It should probably look something like this

db = chromadb.PersistentClient(path="./data/03_primary/")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=vector_store,
    index_store=SimpleIndexStore(),
)

ctx = ServiceContext.from_defaults(callback_manager=callback_manager, llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50), node_parser=sentence_window_parser)

sentence_index = VectorStoreIndex.from_documents(documents, service_context=ctx, storage_context=storage_context, store_nodes_override=True)

# save
sentence_index.storage_context.persist(persist_dir="./data/03_primary/")

# load
db = chromadb.PersistentClient(path="./data/03_primary/")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore.from_persist_path("./data/03_primary/docstore.json"),
    vector_store=vector_store,
    index_store=SimpleIndexStore.from_persist_path("./data/03_primary/index_store.json"),
)

loaded_index = VectorStoreIndex([], storage_context=storage_context, store_nodes_override=True)
a little confusing -- the UX for this could be improved haha
thank you so much, i'm going to try it out asap
I wrote that without testing haha but hopefully it works
Hi @Logan M, sorry to bother you, I just have a few more questions. 1) Do we only call VectorStoreIndex.from_documents() the first time we create an index? In your load example, you just load the index with an empty list: VectorStoreIndex([], storage_context=storage_context, store_nodes_override=True)? 2) What is the code to add documents once the index has already been created? 3) If there is a document that's already been added and then I attempt to add another document with the same id, does the second document overwrite the first or does it create a duplicate? I have some documents that start in one collection, but after a time period they get moved to a different collection, so the text largely stays the same but the metadata changes. Thanks so much.
1) Yea, calling from_documents() is usually done once, to parse documents into nodes and store them in the index. You could technically do from_documents([], ..) if you really wanted

2) index.insert(document) for each new document -- this will always insert documents though, even if they are duplicates

3) It would create a duplicate. If you are setting consistent IDs, use index.refresh_ref_docs(documents) to either replace or insert new documents. It works by checking the hash of documents that have the same doc_id πŸ‘€ The hash is based on the text + metadata
can you confirm what results = loaded_index.refresh_ref_docs(data) should contain? I've run the code 3 times on the same 10 documents, and each time I print out results it's always: [True, True, True, True, True, True, True, True, True, True]. I would have thought on the second and third runs the bools would have changed to False. I asked the chatbot on LlamaIndex and it said that True means the doc was updated and False means the doc was inserted. I'm not sure if the bools mean the docs aren't being inserted after each run, or that .refresh_ref_docs() is attempting to update documents that already exist... sorry, it's confusing to me.
To me, that's telling me the doc was either inserted or updated each time πŸ˜… I can try debugging this locally
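The intended semantics (True = inserted or updated, False = unchanged and skipped) can be illustrated with a toy model of refresh -- this is a pure-Python sketch of the behavior, not the actual LlamaIndex implementation:

```python
def mock_refresh(store: dict, documents: list) -> list:
    """Toy model of refresh_ref_docs: for each (doc_id, content) pair,
    True means the doc was inserted or updated, False means it was
    unchanged and therefore skipped."""
    results = []
    for doc_id, content in documents:
        changed = store.get(doc_id) != content
        if changed:
            store[doc_id] = content
        results.append(changed)
    return results


store = {}
docs = [("a", "x"), ("b", "y")]
print(mock_refresh(store, docs))  # [True, True]   first run inserts everything
print(mock_refresh(store, docs))  # [False, False] second run: nothing changed
```

So a second run over identical documents should come back all False -- getting all True again suggests the comparison state (docstore) isn't actually being loaded between runs.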
I narrowed down the problem. I had put a try/except around my storage_context definition, thinking "Permission denied" meant the index didn't exist, so the except branch kept running, where I put empty constructors. That being said, I know the index physically exists at the location of persist_dir, yet when I try the code block
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore.from_persist_path(index_params["persist_dir"]),
    vector_store=vector_store,
    index_store=SimpleIndexStore.from_persist_path(index_params["persist_dir"]),
)
I always get
[Errno 13] Permission denied: 'C:/projects/kedro-workbench/data/06_models' which makes no sense to me cuz I'm looking at the chromadb files... I've been able to create and load files from all over the place in this project. Am I misunderstanding how this works?
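One possible culprit here (an assumption on my part, not confirmed in the thread): the snippet above passes a directory to from_persist_path, which takes the path of the file itself (e.g. docstore.json), whereas from_persist_dir takes the containing directory. Opening a directory as if it were a file produces exactly this kind of OSError -- PermissionError (Errno 13) on Windows, IsADirectoryError on POSIX:

```python
import tempfile

# Simulate the failure: open a directory as if it were a file.
persist_dir = tempfile.mkdtemp()
try:
    open(persist_dir).read()
except OSError as e:
    print(type(e).__name__)  # PermissionError on Windows, IsADirectoryError on POSIX

# So either point from_persist_path at the file itself, or use from_persist_dir:
#   SimpleDocumentStore.from_persist_path(persist_dir + "/docstore.json")
#   SimpleDocumentStore.from_persist_dir(persist_dir)
```

That would explain why the error mentions the directory path rather than any particular store file.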
Nah this seems like a windows issue actually πŸ˜… Are you able to use WSL instead? Debugging these windows permissions is a PITA
arg I can't atm, the VM we built was set on defaults and installed trusted launch which prevents WSL from running.
may I ask one more question as it pertains to the use of chroma? we use the ChromaVectorStore to store the documents and embeddings, but in the storage_context definition we keep pointing the docstore to a SimpleDocumentStore yet we point the vector_store to the ChromaVectorStore. Why aren't we pointing the docstore to the chromadb as well? Is there some magic going on here I don't see?
db = chromadb.PersistentClient(path=index_params["persist_dir"])
chroma_collection = db.get_or_create_collection(name=index_params["id"])
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

that all makes sense.

storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),   # <- why doesn't this point to Chroma? or does it?
    vector_store=vector_store,        # <- this points to chroma
    index_store=SimpleIndexStore(),   # <- is the index metadata stored in chroma or just to disk?
)
The docstore and index store are in a format that doesn't work with vector dbs -- they have no embeddings, just a bunch of metadata really
Hi Logan, for the life of me I can't get the chromadb vector_store to save to disk. I can create a vector_store from documents, but I can't figure out how to get the thing to actually save. When I try to load it, it just comes back empty. If there's anything else you can recommend I would be most grateful!
chroma_client = chromadb.PersistentClient(path=chroma_store_params["persist_dir"])
chroma_collection = chroma_client.get_or_create_collection(name=chroma_store_params["id"])
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore.from_persist_dir(storage_params["persist_dir"]),
    vector_store=vector_store,
    index_store=SimpleIndexStore.from_persist_dir(storage_params["persist_dir"]),
)
service_context = ServiceContext.from_defaults(
    callback_manager=callback_manager,
    llm=llm,
    embed_model=OpenAIEmbedding(embed_batch_size=50),
    node_parser=node_parser,
)

# create from scratch works fine -- I can print out the ref_doc_info and confirm documents are there
vector_index = VectorStoreIndex.from_documents(
    data,
    service_context=service_context,
    storage_context=storage_context,
    store_nodes_override=vector_index_params["store_nodes_override"],
)

# load vector from disk always loads empty, ref_doc_info is always empty
vector_index = VectorStoreIndex(
    [],
    storage_context=storage_context,
    service_context=service_context,
    store_nodes_override=vector_index_params["store_nodes_override"],
)
You'll need to explicitly save the docstore and index store

Double check this example
https://discord.com/channels/1059199217496772688/1149826247124320267/1150158078566731816
If that doesn't work, maybe I need to actually try running this lol
ugh, I'm so sorry, I added the persist() but loading a chroma db/collection doesn't work for me. Everything works when I create the vector_index fresh, but I can't seem to load it successfully. I can see a chroma.sqlite3 file with ~40MB and the docstore has ~5MB for 10 documents. I also updated my version of llama index to the latest just now and that had no effect. The only difference I saw was that you didn't include a service_context on your load step; I tried with and without. In the following code, I create a vector_index, persist() it, and then attempt to load it. ref_doc_info comes back empty when I check the loaded index.
vector_index = VectorStoreIndex.from_documents(
    data,
    service_context=service_context,
    storage_context=storage_context,
    store_nodes_override=vector_index_params["store_nodes_override"],
)
vector_index.set_index_id(vector_index_params["vector_id"])
vector_index.storage_context.persist(persist_dir=vector_index_params["persist_dir"])

vector_index_2 = VectorStoreIndex(
    [],
    storage_context=storage_context,
    store_nodes_override=vector_index_params["store_nodes_override"],
)
ref_doc_info = vector_index_2.ref_doc_info
print(ref_doc_info)
Hey Logan, were you able to look into what I might be doing wrong? I'm not sure what else to try, or whether I'm failing to save or failing to load. If there's something you can suggest I'd be most grateful! Thanks kindly,