Or, you can roll your own management layer
@Logan M awesome. OK, here's where I'm at: I store the docstore, index store, and vector store through the storage context to a directory on the computer. Every day I scrape data from sources and write .json files that represent documents I need to incorporate into my LlamaIndex vector store, which is ChromaDB. a) Do I need to explicitly store documents in the docstore? b) I'm not manually setting the doc_id when I process documents; I wasn't sure what to set or where. Do I need to manually create doc_ids? Each JSON file I have to ingest has an id key, and I just assumed LlamaIndex would use that.
a) If you are using chroma, you need to set that override above, otherwise it will be using the docstore already
b) Yea, you'll need to manually set it -- right now it's just randomly generated. You can set filename_as_id=True if the filenames are consistent, otherwise you'll have to parse out that ID you mentioned and set it
roger dodger. thanks for the tips!
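For the "parse out that ID" route, a minimal stdlib-only sketch of collecting a stable ID per JSON file might look like the following. The `text` key, the sample file names, and the `load_records` helper are assumptions made for illustration; only the `id` key comes from the description above. The resulting ID would then be set on each document when it is built, instead of relying on the randomly generated one.

```python
import json
import tempfile
from pathlib import Path


def load_records(json_dir):
    """Return (doc_id, text, metadata) tuples, one per .json file in json_dir."""
    records = []
    for path in sorted(Path(json_dir).glob("*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        doc_id = str(payload["id"])      # stable ID taken from the file itself
        text = payload.get("text", "")   # "text" is an assumed key name
        # Everything else is carried along as metadata.
        metadata = {k: v for k, v in payload.items() if k not in ("id", "text")}
        records.append((doc_id, text, metadata))
    return records


# Tiny demonstration with two throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(
        json.dumps({"id": "doc-1", "text": "hello", "source": "feed"}), encoding="utf-8"
    )
    Path(tmp, "b.json").write_text(
        json.dumps({"id": "doc-2", "text": "world", "source": "feed"}), encoding="utf-8"
    )
    records = load_records(tmp)

for doc_id, text, metadata in records:
    print(doc_id, text, metadata)
```

Because the ID comes from the file content rather than the file name, it stays the same even if the daily scrape writes the file under a different name.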
sorry Logan, I'm missing something.
I have the following, but I'm unclear how to connect the Chroma vector store to the llama index vectorstoreindex.
storage_context = StorageContext.from_defaults(
docstore=SimpleDocumentStore(),
vector_store=SimpleVectorStore(),
index_store=SimpleIndexStore(),
persist_dir = "./data/03_primary/"
)
vector_store = ChromaVectorStore(chroma_collection=storage_context.vector_store, persist_dir="./data/03_primary/")
ctx = ServiceContext.from_defaults(callback_manager=callback_manager, llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50), node_parser=sentence_window_parser)
sentence_index = VectorStoreIndex.from_documents(documents, service_context=ctx, storage_context=storage_context, store_nodes_override=True)
hmmm, just a small error in the code I think. It should probably look something like this
db = chromadb.PersistentClient(path="./data/03_primary/")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
docstore=SimpleDocumentStore(),
vector_store=vector_store,
index_store=SimpleIndexStore(),
)
ctx = ServiceContext.from_defaults(callback_manager=callback_manager, llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50), node_parser=sentence_window_parser)
sentence_index = VectorStoreIndex.from_documents(documents, service_context=ctx, storage_context=storage_context, store_nodes_override=True)
# save
sentence_index.storage_context.persist(persist_dir="./data/03_primary/")
# load
db = chromadb.PersistentClient(path="./data/03_primary/")
chroma_collection = db.get_or_create_collection("quickstart")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
docstore=SimpleDocumentStore.from_persist_path("./data/03_primary/docstore.json"),
vector_store=vector_store,
index_store=SimpleIndexStore.from_persist_path("./data/03_primary/index_store.json"),
)
loaded_index = VectorStoreIndex([], storage_context=storage_context, store_nodes_override=True)
a little confusing -- the UX for this could be improved haha
thank you so much, i'm going to try it out asap
I wrote that without testing haha but hopefully it works
Hi @Logan M, sorry to bother you, I just have a few more questions. 1) Do we only call VectorStoreIndex.from_documents() the first time we create an index? In your load example, you just load the index with an empty list: VectorStoreIndex([], storage_context=storage_context, store_nodes_override=True). 2) What is the code to add documents once the index has already been created? 3) If a document has already been added and I then attempt to add another document with the same id, does the second document overwrite the first, or does it create a duplicate? I have some documents that start in one collection, but after a time period they get moved to a different collection, so the text largely stays the same but the metadata changes. Thanks so much.
1) Yea, calling from_documents() is usually done once, to parse documents into nodes and store them in the index. You could technically do from_documents([], ..) if you really wanted
2) index.insert(document) for each new document -- this will always insert documents, even if they are duplicates though
3) It would create a duplicate. If you are setting consistent IDs, use index.refresh_ref_docs(documents) to either replace or insert new documents. It works by checking the hash of documents that have the same doc_id. The hash is based on the text + metadata
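The hash-comparison idea described in 3) can be sketched in plain Python. This is only an illustration of the technique, not LlamaIndex's actual implementation; the `doc_hash` and `refresh` helpers, the dict used as a store, and the sample documents are all hypothetical.

```python
import hashlib
import json


def doc_hash(text, metadata):
    """Hash the text together with the metadata, so a change to either is detected."""
    payload = text + json.dumps(metadata, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def refresh(store, docs):
    """store maps doc_id -> hash; docs is a list of (doc_id, text, metadata).

    Returns one bool per document: True if it was inserted or updated,
    False if an identical copy was already stored.
    """
    changed = []
    for doc_id, text, metadata in docs:
        new_hash = doc_hash(text, metadata)
        if store.get(doc_id) == new_hash:
            changed.append(False)      # same ID, same content: nothing to do
        else:
            store[doc_id] = new_hash   # new ID, or same ID with new content
            changed.append(True)
    return changed


store = {}
docs = [
    ("doc-1", "hello", {"collection": "a"}),
    ("doc-2", "world", {"collection": "a"}),
]
first = refresh(store, docs)    # both documents are new
second = refresh(store, docs)   # nothing changed since the first call
# Only the metadata of doc-1 changes here, which is enough to alter its hash.
third = refresh(store, [("doc-1", "hello", {"collection": "b"}), docs[1]])
print(first, second, third)
```

In this sketch the comparison only works if the same `store` survives between calls. If it starts out empty on every run, every document looks new and every entry comes back True.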
Can you confirm what results = loaded_index.refresh_ref_docs(data) should contain? I've run the code 3 times on the same 10 documents, and each time I print out results, it's always [True, True, True, True, True, True, True, True, True, True]. I would have thought that on the second and third runs the bools would have changed to False. I asked the chatbot on LlamaIndex and it said that True means doc updated and False means doc inserted. I'm not sure if the bools mean the docs aren't being inserted after each run or that .refresh_ref_docs() is attempting to update documents that already exist... sorry, it's confusing to me.
To me, that's telling me the doc was either inserted or updated each time
I can try debugging this locally
I narrowed down the problem. I had put a try/except around my storage_context definition, thinking "Permission denied" meant the index didn't exist, so the except branch, where I put empty constructors, kept running. That being said, I know the index physically exists at the location of persist_dir, yet when I try the code block
storage_context = StorageContext.from_defaults(
docstore=SimpleDocumentStore.from_persist_path(index_params["persist_dir"]),
vector_store=vector_store,
index_store=SimpleIndexStore.from_persist_path(index_params["persist_dir"]),
)
I always get
[Errno 13] Permission denied: 'C:/projects/kedro-workbench/data/06_models'
which makes no sense to me cuz I'm looking at the chromadb files... I've been able to create and load files from all over the place in this project. Am I misunderstanding how this works?
Nah this seems like a windows issue actually
Are you able to use WSL instead? Debugging these windows permissions is a PITA
arg I can't atm, the VM we built was set on defaults and installed trusted launch which prevents WSL from running.
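One thing that may be worth ruling out before digging into Windows permissions: the failing snippet passes the persist directory itself to from_persist_path, whereas the earlier load example passed the path of a file inside it (docstore.json). Opening a directory as though it were a file fails at the OS level even when the directory is perfectly readable; on Windows this typically surfaces as [Errno 13] Permission denied, and on Linux as IsADirectoryError. The sketch below reproduces that with the standard library only. It makes no LlamaIndex calls, and the temporary directory simply stands in for the real persist_dir.

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as persist_dir:
    # Stand-in for a persisted store file inside the directory.
    file_path = os.path.join(persist_dir, "docstore.json")
    with open(file_path, "w", encoding="utf-8") as fh:
        json.dump({}, fh)

    # Opening the directory itself fails, even though it exists.
    try:
        with open(persist_dir, "r") as fh:
            fh.read()
        dir_error = None
    except OSError as exc:  # PermissionError on Windows, IsADirectoryError on Linux
        dir_error = exc

    # Opening the file inside the directory works.
    with open(file_path, "r", encoding="utf-8") as fh:
        loaded = json.load(fh)

print(type(dir_error).__name__, loaded)
```

If this is the cause, pointing each store at its own file inside the directory, as in the earlier load example, would be the thing to try.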
May I ask one more question as it pertains to the use of Chroma? We use the ChromaVectorStore to store the documents and embeddings, but in the storage_context definition we keep pointing the docstore to a SimpleDocumentStore, yet we point the vector_store to the ChromaVectorStore. Why aren't we pointing the docstore to the chromadb as well? Is there some magic going on here I don't see?
db = chromadb.PersistentClient(path=index_params["persist_dir"])
chroma_collection = db.get_or_create_collection(name=index_params['id'])
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
that all makes sense.
storage_context = StorageContext.from_defaults(
docstore=SimpleDocumentStore(),  # <- why doesn't this point to Chroma? or does it?
vector_store=vector_store,  # <- this points to chroma
index_store=SimpleIndexStore(),  # <- is the index metadata stored in chroma or just to disk?
)
The docstore and index store are in a format that doesn't work with vector dbs -- they have no embeddings, just a bunch of metadata really
Hi Logan, for the life of me I can't get the chromadb vector_store to save to disk. I can create a vector_store from documents, but I can't figure out how to get the thing to actually save. When I try to load it, it just comes back empty. If there's anything else you can recommend I would be most grateful!
chroma_client = chromadb.PersistentClient(path=chroma_store_params["persist_dir"])
chroma_collection = chroma_client.get_or_create_collection(name=chroma_store_params['id'])
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
docstore=SimpleDocumentStore.from_persist_dir(storage_params["persist_dir"]),
vector_store=vector_store,
index_store=SimpleIndexStore.from_persist_dir(storage_params["persist_dir"]),
)
service_context = ServiceContext.from_defaults(callback_manager=callback_manager, llm=llm, embed_model=OpenAIEmbedding(embed_batch_size=50), node_parser=node_parser)
# creating from scratch works fine; I can print out the ref_doc_info and confirm the documents are there
vector_index = VectorStoreIndex.from_documents(data, service_context=service_context, storage_context=storage_context,store_nodes_override=vector_index_params["store_nodes_override"])
# loading the vector index from disk always comes back empty; ref_doc_info is always empty
vector_index = VectorStoreIndex([], storage_context=storage_context, service_context=service_context, store_nodes_override=vector_index_params["store_nodes_override"])
If that doesn't work, maybe I need to actually try running this lol
Ugh, I'm so sorry. I added the persist(), but loading a chroma db/collection doesn't work for me. Everything works when I create the vector_index fresh, but I can't seem to load it successfully. I can see a chroma.sqlite3 file of ~40MB, and the docstore is ~5MB for 10 documents. I also updated my version of LlamaIndex to the latest just now, and that had no effect. The only difference I saw was that you didn't include a service_context on your load step; I tried with and without. In the following code, I create a vector_index, persist() it, and then attempt to load it. ref_doc_info comes back empty when I check the loaded index.
vector_index = VectorStoreIndex.from_documents(data, service_context=service_context, storage_context=storage_context,store_nodes_override=vector_index_params["store_nodes_override"])
vector_index.set_index_id(vector_index_params['vector_id'])
vector_index.storage_context.persist(persist_dir=vector_index_params["persist_dir"])
vector_index_2 = VectorStoreIndex([], storage_context=storage_context, store_nodes_override=vector_index_params["store_nodes_override"])
ref_doc_info = vector_index_2.ref_doc_info
print(ref_doc_info)
Hey Logan, were you able to look into what I might be doing wrong? I'm not sure what else to try. I'm not sure if I'm failing to save or failing to load. If there's something you can suggest I'd be most grateful! Thanks kindly,