Updated 2 years ago

Chroma

At a glance

The community member is having trouble integrating llama_index with Chroma. They created an index from a PDF document and persisted it in a Chroma collection, but they don't know how to read that index. The documentation suggests using ChromaReader, but the example loads data from Chroma and then creates the index again, which is not what the community member wants to do.

Another community member suggests that the reader isn't needed, and that the original poster should be able to use persist and load_index_from_storage with the same storage context configuration. They mention that the persist dir for the docstore/index store may need to be set in the storage context.

The community member confirms that they were able to get it working by specifying the custom persist folders for both Chroma and the storage context. However, they don't know why they would use this instead of the default llama_index storage, as they didn't see an improvement in query response times.

Another community member suggests that Chroma will help once the community member has a ton of documents, but if the vector store is under 5GB, they don't think the community member will see an improvement.

There is no explicitly marked answer.
Hey people. I'm having trouble understanding how to integrate llama_index with Chroma. I created an index from a PDF document and persisted it in a Chroma collection, but I don't know how to read that index back. The documentation suggests using ChromaReader, but in the example it loads data from Chroma and then creates the index again (in my case it should be a GPTVectorStoreIndex). Isn't the index already stored in the collection? Can't I retrieve it directly, the way "load_index_from_storage" does? Also, how do I create the query_vector?
4 comments
No need to use the reader I think.

I think once you set up the index initially using the vector store, like in this notebook:
https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html

You should be able to use persist and load_index_from_storage with the same storage context configuration 🤔

When loading, you probably have to set the persist dir for the docstore/index store in the storage context
https://gpt-index.readthedocs.io/en/latest/how_to/storage/save_load.html#loading-data

Tbh the docs don't make this clear. I want to explore this more in the next day or two to figure out the exact steps and make better examples
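For reference, the persist-and-reload flow described above can be sketched roughly like this. This is an unverified sketch assuming the llama_index and chromadb APIs from around that time (GPTVectorStoreIndex, the "duckdb+parquet" Chroma backend); the "./data" and persist_dir paths and the "test" collection name are placeholders, and building the index will call out to the configured embedding model:

```python
# Sketch (assumptions noted above): build the index once and persist everything
# to the same folder, so it can be reloaded later without re-indexing.
import chromadb
from chromadb.config import Settings
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores import ChromaVectorStore

persist_dir = "./storage"  # placeholder path

# Embedded Chroma, persisting to disk (old-style duckdb+parquet backend)
chroma_client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=persist_dir,
))
chroma_collection = chroma_client.get_or_create_collection("test")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index over the documents; embeddings land in Chroma
docs = SimpleDirectoryReader("./data").load_data()
index = GPTVectorStoreIndex.from_documents(docs, storage_context=storage_context)

# Persist the docstore/index store alongside the Chroma data
index.storage_context.persist(persist_dir=persist_dir)
```

The key detail is that the vector store and the docstore/index store are persisted to the same directory, which is what makes the later load step straightforward.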
Ah, you're right. I tried that before, but there was something wrong in my code. Since I'm using custom persist folders for both Chroma and the storage context, I had to specify that folder everywhere. Thanks! Here's the final code:
Python
import chromadb
from chromadb.config import Settings
from llama_index import StorageContext, load_index_from_storage
from llama_index.vector_stores import ChromaVectorStore

chroma_client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=persist_dir,
))
chroma_collection = chroma_client.get_collection("test")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store, persist_dir=persist_dir)
index = load_index_from_storage(storage_context, service_context=service_context)

Still, I don't know why I would use this instead of the default llama_index storage. I was checking whether it improved query response times, but it didn't.
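As an aside on the original question: once the index is loaded this way, there's no need for ChromaReader or a manually constructed query_vector; querying goes through the index, which embeds the query text internally. A minimal sketch, assuming the same-era llama_index API and the `index` variable from the code above (the query string is a placeholder):

```python
# Query the reloaded index directly; the query text is embedded internally,
# so no manual query_vector is needed (unlike the ChromaReader path).
query_engine = index.as_query_engine()
response = query_engine.query("What is the PDF about?")
print(response)
```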
I think chroma will help once you have a ton of documents.

But if your vector store is under 5GB or so, I don't think you'll get an improvement
Glad you got it working though!!