chroma quality is so bad how to improve

Chroma quality is so bad, how do I improve it? Hi folks, I am working on a document Q&A project. I was using GPTTreeIndex with a child branch factor of 2 before, and it was very good. Now I have switched to a Chroma vector store, and the results are so bad they look like they come from an idiot. Does anyone know how to improve it? E.g. change the embedding, or feed the collection to GPTTreeIndex? I did try top k from 1 to 5; the results are more or less like idiot number 1 to idiot number 5, with only minor improvements. For both indexes I use the same model settings with gpt-3.5.
11 comments
A tree index and a vector index work very differently

A tree index more or less ends up reading the entire index

You can kind of mimic a tree index with a vector index by setting the response_mode

Try something like this

Plain Text
response = index.as_query_engine(response_mode="tree_summarize", similarity_top_k=5).query("...")


Besides playing with top_k and response_mode, you can also try adjusting the chunk size when creating the index
@Logan M Thanks, I will try that. Do you happen to know where I can set the chunk size for Chroma? Also, Chroma refuses to load from disk, even though it saves to disk. I use this context:
Plain Text
chroma_client = chromadb.Client(
    Settings(
        chroma_db_impl="duckdb+parquet",
        persist_directory=CHROMA_STORAGE_PATH))
chroma automatically persists

In the latest versions loading is pretty easy. To "load" it, just setup the vector store and do index = VectorStoreIndex.from_vector_store(vector_store) -- this will load from an existing vector store you created (only used for vector store integrations)

you can set the chunk size in the service context.

Default is 1024
Plain Text
service_context = ServiceContext.from_defaults(chunk_size=1024)
index = VectorStoreIndex.from_documents(documents, service_context=service_context, storage_context=storage_context)
@Logan M Thanks, I will give them a try
@Logan M I tried the tree_summarize mode with k=4; it gives very limited improvement. I suspect I have to use a tree index. Do you know if any production-quality vector store supports a tree index? Or maybe I should create a GPTTreeIndex out of a vector store collection / documents?
The tree index does not use vectors actually (at least, not with default settings).

It builds a bottom-up tree over the entire index, where every parent node summarizes its child nodes.

At query time, it traverses the tree, selecting the path that best matches the query
Hmm, then it seems I have to live with a persisted tree index for now. They are not as easy to manage as something with a database interface, but other than that they work very well in my use case
You can persist them to S3 or a google bucket fairly easily if that helps
I don't know for sure, but it looks to me like you could also use Mongo to store the tree index
That too! I always forget about mongo lol