Do we need to specify the embedding function when loading persisted collections from chromadb? Based on the guidance from here and the docs, I was using the following to load chroma collections for use with vector_stores:

vectordb = chromadb.PersistentClient(path="some/path/here")
chroma_collection = vectordb.get_collection('collection_name') # <-- can we/should we specify an embedding function here? I hadn't noticed in docs
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

The reason I'm asking is that I tried to query the chroma collection directly to figure out why the query_engine is doing so poorly for me. When I ran the query from the 'chroma_collection' object, it defaulted to the chromadb default embedding function, which is not OpenAIEmbedding. For example, I tried:

data = chroma_collection.query(query_texts = 'some string', n_results=5, where_document={'$contains': 'CVE-2023-4351'}, include=['metadatas', 'distances'])

Running the above generated an error indicating that the embedding dimensions of the query and the collection didn't match (350 vs 1536). So I next loaded the chroma collection again, this time passing an embedding function to chroma's get_collection(). Once I did that, I was able to query the chroma collection as expected:

from chromadb.utils import embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=openai.api_key, model_name="text-embedding-ada-002")
collection = vectordb.get_collection(name='msrc_security_update', embedding_function=openai_ef)
data = collection.query(query_texts = 'some string', n_results=5, where_document={'$contains': 'CVE-2023-4351'}, include=['metadatas', 'distances'])

Normally, the embedding function is set by the service_context. I'm just wondering whether not specifying the embedding function of the chroma collection at load time is causing issues, even though I set the embedding model later in the service context:

service_context = ServiceContext.from_defaults(embed_model=embed_model,
callback_manager=callback_manager,
node_parser=node_parser)

If you are using the Chroma API directly rather than through LlamaIndex, then you need to configure it to use the same embedding model that was used to create the vector index.
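
To make the dimension mismatch concrete, here is a minimal sketch in plain Python. It is not Chroma's implementation. `ToyCollection` and `make_bucket_embedder` are hypothetical names invented for this illustration, and the 16- and 4-dimensional embedders are stand-ins for the model used at index time and a different default model. The only point is that a query has to be embedded by the same function that produced the stored vectors.

```python
import math
from typing import Callable, List

Vector = List[float]
Embedder = Callable[[str], Vector]


def make_bucket_embedder(dim: int) -> Embedder:
    """Return a toy embedder: character counts hashed into `dim` buckets."""
    def embed(text: str) -> Vector:
        vec = [0.0] * dim
        for ch in text.lower():
            vec[ord(ch) % dim] += 1.0
        # Normalise so that a dot product behaves like cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]
    return embed


class ToyCollection:
    """Hypothetical handle over stored vectors (NOT Chroma's implementation)."""

    def __init__(self, docs: List[str], vectors: List[Vector],
                 query_embedder: Embedder) -> None:
        self.docs = docs
        self.vectors = vectors
        self.query_embedder = query_embedder

    def query(self, text: str, n_results: int = 1) -> List[str]:
        q = self.query_embedder(text)
        dim = len(self.vectors[0])
        if len(q) != dim:
            raise ValueError(
                f"query dimension {len(q)} != collection dimension {dim}")
        # Rank stored documents by similarity to the query vector.
        ranked = sorted(
            zip(self.docs, self.vectors),
            key=lambda pair: -sum(a * b for a, b in zip(q, pair[1])),
        )
        return [doc for doc, _ in ranked[:n_results]]


# Assumed stand-ins: the "index-time" model and a different "default" model.
index_embedder = make_bucket_embedder(16)
default_embedder = make_bucket_embedder(4)

docs = ["CVE-2023-4351 affects the browser", "Monthly security update notes"]
vectors = [index_embedder(d) for d in docs]  # what was persisted at index time

# Same stored data, two handles: one queries with the wrong embedder.
mismatched = ToyCollection(docs, vectors, default_embedder)
matched = ToyCollection(docs, vectors, index_embedder)
```

Calling `mismatched.query(...)` raises a `ValueError` about the dimensions, while `matched.query(docs[0])` returns that same document as the top hit. The analogue in the question is passing `embedding_function=openai_ef` to `get_collection()` before using `query_texts` on the collection directly.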