Chroma

What are some best practices for persisting indexes? I am using ChromaDB and adding documents/embeddings in a separate process (outside of LlamaIndex). I am interested in building composable indexes that are groups of keywords relating to particular documents. Right now I have lots of separate documents (200k+) and I don't get very accurate results. My plan has been to separate these documents into different categories with metadata attached, so that retrieval can be more accurate. With that structure, how do we store these indexes?
Wouldn't each category be its own Chroma collection/index then?
I guess that would be one good way to structure it. In that case, if I had 50 categories, I'd need to construct that index for every chat session.
Is there a way to store that composed index?
Just storing one Chroma instance at a time is really the only way.

Other vector dbs have this idea of collection or index names, or namespaces, which makes storage a bit more straightforward
I think you'd only have to construct it at server startup though -- chat sessions are just dependent on the chat history.
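For example, here's a rough sketch of constructing one index per category at startup, assuming the pre-0.10 llama_index import paths (matching the gpt-index docs linked further down) and chromadb's PersistentClient; the category names and path are placeholders:

```python
import chromadb
from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Persistent Chroma client; the documents/embeddings were already added to
# these collections by the separate ingestion process.
chroma_client = chromadb.PersistentClient(path="./chroma_db")

CATEGORIES = ["finance", "legal", "support"]  # placeholder category names

category_indexes = {}
for category in CATEGORIES:
    collection = chroma_client.get_or_create_collection(category)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    # Wrap the existing collection in an index without re-embedding anything.
    category_indexes[category] = VectorStoreIndex.from_vector_store(vector_store)
```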
I see - thanks for the advice.
Generally, what is the serialization format of the index when using .persist() and load_index_from_storage()?
Specific to Chroma, do the metadata properties help improve the search for llama_index, or are just the embeddings used?
It's just JSON. But with Chroma, I think the persistence is different/automatic, right?
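For reference, a minimal sketch of the default (non-Chroma) persistence round trip, assuming `index` is an existing index and the pre-0.10 import paths; with Chroma, the embeddings stay in the Chroma collection itself:

```python
from llama_index import StorageContext, load_index_from_storage

# Default local persistence: writes the index store, docstore, etc.
# as JSON files under ./storage.
index.storage_context.persist(persist_dir="./storage")

# In a later session, rebuild the same index from those JSON files.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```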

Definitely, they do: both the embeddings and the LLM will leverage the metadata. Or you can configure metadata that is only used for one or the other.

https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/documents_and_nodes/usage_documents.html

https://gpt-index.readthedocs.io/en/stable/examples/metadata_extraction/MetadataExtraction_LLMSurvey.html
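
Following the first link above, a small sketch of attaching metadata to Documents at ingestion and controlling which keys the embedding model or the LLM sees; the key names and values are just placeholders:

```python
from llama_index import Document

doc = Document(
    text="Quarterly revenue grew 12%...",  # placeholder text
    metadata={"category": "finance", "source": "report_2023.pdf"},  # placeholders
)

# Optionally hide specific keys from the embedding model or the LLM,
# so a given key is used for one but not the other.
doc.excluded_embed_metadata_keys = ["source"]
doc.excluded_llm_metadata_keys = ["source"]
```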
Gotcha - considering adding metadata to each of the documents when ingesting to ChromaDB to see if it helps with finding more relevant nodes.
That worked great! Thanks @Logan M