Hi, I want to index a corpus of data and

Hoaz

Hi, I want to index a corpus of data and store it directly in a chromadb instance.
But this code only generates a storage folder and then writes the index to a file there, instead of to the chromadb vector store.
Can anyone help?

chromadb_vs = ChromaVectorStore(chroma_collection=chromdb_collection)
print("INFO: Initializing the Service Context")
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local"
)

print("INFO: Creating Vector Store index object")
index = VectorStoreIndex.from_documents(documents=documents,vector_store=chromadb_vs,service_context=service_context,show_progress=True)

print("INFO: Writing to disk as persistance")
index.vector_store.persist()


Logan M

chroma persists automatically, no need to call .persist()

Logan M

https://docs.llamaindex.ai/en/stable/examples/vector_stores/ChromaIndexDemo.html#basic-example-including-saving-to-disk

Hoaz

Thanks @Logan M, but the indices still aren't being stored in chromadb: its sqlite3 file is empty.
The code creates the chromadb file but doesn't store anything inside it.
This is my code:

import logging
from openai import OpenAI
from llama_index.embeddings import BaseEmbedding
from llama_index.callbacks import base_handler
from llama_index import SimpleDirectoryReader, VectorStoreIndex, StorageContext, load_index_from_storage , ServiceContext, callbacks
from llama_index.vector_stores import ChromaVectorStore
from llama_index.llms import HuggingFaceLLM
from llama_index.retrievers import VectorIndexRetriever
import os, sys
import chromadb
from chromadb.config import Settings

query_str = ""

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
llm = HuggingFaceLLM(
    model_name= model_name,
    device_map="cpu"
    )

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
print("INFO: Reading directory documents............")
documents = SimpleDirectoryReader("..\\datasets\\test").load_data(show_progress=True)

print("Initializing ChromadDB Collection")
chromdb = chromadb.PersistentClient (path="./test",settings = Settings(anonymized_telemetry=False))
chromdb_collection = chromdb.get_or_create_collection ("test")
chromadb_vs = ChromaVectorStore(chroma_collection=chromdb_collection)

print("INFO: Initializing the Service Context")
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local"
)

print("INFO: Creating Vector Store index object")
index = VectorStoreIndex.from_documents(documents=documents,vector_store=chromadb_vs,service_context=service_context,show_progress=True)

Thanks in advance

Logan M

@Hoaz double check the link I sent. You missed using the storage context

Hoaz

👍🏻 🫣

rahul

The from_documents method in the BaseIndex class does not take a vector_store argument. The arguments it accepts are:

cls: The class type.
documents: A sequence of documents to build the index from.
storage_context: An optional storage context. If not provided, it will use the default storage context.
service_context: An optional service context. If not provided, it will use the default service context.
show_progress: A boolean indicating whether to show progress or not.
**kwargs: Any additional keyword arguments.

So, the correct usage of the method would be:

index = VectorStoreIndex.from_documents(documents=documents, service_context=service_context, show_progress=True)

If you need to use a specific vector store, you should set it in the storage_context (for example with StorageContext.from_defaults(vector_store=...)) and pass that storage_context to from_documents.
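
To make the storage-context step concrete, here is a minimal sketch. It assumes the same pre-0.10 llama_index import paths used earlier in the thread and a local chromadb install. The function name build_chroma_index and its parameters are hypothetical, chosen only for illustration; the defaults mirror the path and collection name from the code above.

```python
def build_chroma_index(documents, service_context,
                       persist_path="./test", collection_name="test"):
    """Build a VectorStoreIndex whose embeddings are written to a Chroma collection.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Imported inside the function so this sketch can be loaded even where
    # llama_index / chromadb are not installed.
    import chromadb
    from llama_index import StorageContext, VectorStoreIndex
    from llama_index.vector_stores import ChromaVectorStore

    # Open (or create) the on-disk Chroma database and the target collection.
    client = chromadb.PersistentClient(path=persist_path)
    collection = client.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)

    # The vector store is handed to the index through the storage context,
    # not as a vector_store= keyword to from_documents.
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        service_context=service_context,
        show_progress=True,
    )
    return index, collection
```

Calling index, collection = build_chroma_index(documents, service_context) and then checking collection.count() is one way to see whether embeddings actually landed in Chroma; a count of zero would suggest the storage context is still not being used.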
