

Do we need to specify the embedding function when loading persisted collections from chromadb?

Do we need to specify the embedding function when loading persisted collections from chromadb? Based on the guidance from here and the docs, I was using the following to load chroma collections for use with vector_stores.
```python
import chromadb
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

vectordb = chromadb.PersistentClient(path="some/path/here")
chroma_collection = vectordb.get_collection('collection_name')  # <-- can we/should we specify an embedding function here? I hadn't noticed it in the docs
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_store_index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)
```
The reason I'm asking is that I tried querying the chroma collection directly to figure out why the query_engine was performing so poorly for me, and when I ran the query from the 'chroma_collection' object, it defaulted to chromadb's default embedding function, which is not OpenAIEmbedding. For example, I tried:
```python
data = chroma_collection.query(
    query_texts='some string',
    n_results=5,
    where_document={'$contains': 'CVE-2023-4351'},
    include=['metadatas', 'distances'],
)
```
Running the above raised an error indicating that the embedding dimensions of the query and the collection didn't match (350 vs 1536). So I reloaded the chroma collection, this time passing an embedding function to chroma's get_collection(). Once I did that, I was able to query the chroma collection as expected:
```python
import openai
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=openai.api_key,
    model_name="text-embedding-ada-002",
)
chroma_collection = vectordb.get_collection(
    name='msrc_security_update',
    embedding_function=openai_ef,
)
data = chroma_collection.query(
    query_texts='some string',
    n_results=5,
    where_document={'$contains': 'CVE-2023-4351'},
    include=['metadatas', 'distances'],
)
```
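As an aside, one way to confirm the dimensionality a persisted collection was actually built with is to pull back a stored embedding and measure it (a minimal sketch against the chromadb Collection API; text-embedding-ada-002 produces 1536-dimensional vectors):

```python
# Fetch a single record with its stored embedding and check its length.
sample = chroma_collection.get(limit=1, include=['embeddings'])
dim = len(sample['embeddings'][0])
print(f"stored embedding dimension: {dim}")  # 1536 for ada-002
```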
Normally, the embedding function is set by the service_context.
3 comments
I'm just wondering if not specifying the embedding function on the chroma collection at load time is causing issues, even though I set the embedding function later in the service context:
```python
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    embed_model=embed_model,
    callback_manager=callback_manager,
    node_parser=node_parser,
)
```
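For context, the embed_model passed in above would be something like this (a minimal sketch, assuming the OpenAI ada-002 model the collection was embedded with; OpenAIEmbedding defaults to text-embedding-ada-002 in this era of LlamaIndex):

```python
from llama_index.embeddings import OpenAIEmbedding

# Defaults to text-embedding-ada-002, matching the 1536-dim collection.
embed_model = OpenAIEmbedding()
```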
LlamaIndex does not use the embeddings offered by any vector index; it will always use the embeddings in the service context.
If you are using the chroma API directly and not LlamaIndex, then you need to configure it to use the same embedding model that was used to create the vector index.
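In other words, the two query paths embed the query text differently. A minimal sketch of both, assuming the objects defined earlier in the thread (as_query_engine is standard LlamaIndex; query_embeddings is standard chromadb):

```python
# Path 1: through LlamaIndex. The query text is embedded with the service
# context's embed_model, so the collection's embedding_function is never called.
query_engine = vector_store_index.as_query_engine()
response = query_engine.query("some string")

# Path 2: through the chroma API directly. You can sidestep the collection's
# embedding_function entirely by embedding the query yourself.
query_vec = embed_model.get_query_embedding("some string")
data = chroma_collection.query(
    query_embeddings=[query_vec],
    n_results=5,
    include=['metadatas', 'distances'],
)
```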