The community member is asking whether they need to specify the embedding function when loading persisted collections from ChromaDB. They encountered an issue where the default ChromaDB embedding was not the OpenAI embedding they expected, causing a mismatch in embedding dimensions and errors when querying the collection.
The community member resolved the issue by explicitly passing the OpenAI embedding function when getting the collection from ChromaDB. They note that normally the embedding function is set by the service context.
In the comments, another community member suggests that not specifying the embedding function at load time may be causing issues, even though it was set in the service context. Another comment indicates that LlamaIndex always uses the embedding model set in the service context, and that if using the Chroma API directly, the embedding model must be configured to match the one used to create the vector index.
There is no explicitly marked answer in the post or comments.
Do we need to specify the embedding function when loading persisted collections from ChromaDB? Based on the guidance from here and the docs, I was using the following to load Chroma collections for use with vector stores:

import chromadb
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

vectordb = chromadb.PersistentClient(path="some/path/here")
chroma_collection = vectordb.get_collection('collection_name')  # <-- can we/should we specify an embedding function here? I hadn't noticed it in the docs
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)
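For reference, here is a minimal sketch of how the loaded index then gets queried through LlamaIndex (assuming the legacy llama_index API that still has ServiceContext; the question string is only a placeholder):

query_engine = vector_store_index.as_query_engine(similarity_top_k=5)
# LlamaIndex embeds the query with the service_context's embed_model,
# not with whatever embedding function is attached to the Chroma collection.
response = query_engine.query('What is known about CVE-2023-4351?')  # placeholder question
print(response)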
The reason I'm asking is that I tried to query the Chroma collection directly to figure out why the query_engine is performing so poorly for me, and when I ran the query from the chroma_collection object it fell back to ChromaDB's default embedding function, which is not OpenAIEmbedding. For example, I tried:

data = chroma_collection.query(query_texts='some string', n_results=5, where_document={'$contains': 'CVE-2023-4351'}, include=['metadatas', 'distances'])

Running the above generated an error indicating that the embedding dimensions of the query and the collection didn't match (350 vs 1536). So I next loaded the Chroma collection again, this time passing an embedding function to Chroma's get_collection(). Once I did that, I was able to query the collection as expected:

import openai
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=openai.api_key, model_name="text-embedding-ada-002")
collection = vectordb.get_collection(name='msrc_security_update', embedding_function=openai_ef)
data = collection.query(query_texts='some string', n_results=5, where_document={'$contains': 'CVE-2023-4351'}, include=['metadatas', 'distances'])
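An alternative that sidesteps the collection-level embedding function entirely is to embed the query text directly and pass query_embeddings to Chroma; a minimal sketch, assuming the pre-1.0 openai client that the openai.api_key usage above implies:

import openai

# Embed the query with the same model that built the collection (1536 dims),
# then hand Chroma the raw vector so no embedding_function is needed on the collection.
emb = openai.Embedding.create(model="text-embedding-ada-002", input=['some string'])
query_embedding = emb['data'][0]['embedding']

data = chroma_collection.query(query_embeddings=[query_embedding], n_results=5, where_document={'$contains': 'CVE-2023-4351'}, include=['metadatas', 'distances'])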
Normally, the embedding function is set by the service_context. I'm just wondering if not specifying the embedding function on the Chroma collection at load time is causing issues, even though I set the embed model later in the service context:

service_context = ServiceContext.from_defaults(embed_model=embed_model,
                                               callback_manager=callback_manager,
                                               node_parser=node_parser)
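For completeness, a minimal sketch of one way the embed_model above could be constructed, assuming the legacy llama_index OpenAIEmbedding wrapper (its default model is text-embedding-ada-002):

from llama_index import ServiceContext
from llama_index.embeddings import OpenAIEmbedding

# This embed_model is what LlamaIndex uses to embed queries at retrieval time;
# it has to match the model that originally embedded the documents in Chroma.
embed_model = OpenAIEmbedding()  # defaults to text-embedding-ada-002
service_context = ServiceContext.from_defaults(embed_model=embed_model)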
If you are using the Chroma API directly rather than going through LlamaIndex, then you need to configure it to use the same embedding model that was used to create the vector index.
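A minimal sketch of that with the Chroma client alone, using placeholder names; the key point is that the same embedding_function is supplied both when the collection is created and when it is reopened later:

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="some/path/here")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",  # placeholder key
    model_name="text-embedding-ada-002",
)

# Reusing the same embedding function at create time and at load time keeps
# query_texts in the same 1536-dimensional space as the stored vectors.
collection = client.get_or_create_collection(name="collection_name", embedding_function=openai_ef)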