For context. This is my first index
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.vector_stores import SimpleVectorStore
from llama_index.storage import StorageContext
from llama_index.graph_stores import SimpleGraphStore
# create storage context using default stores
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
    graph_store=SimpleGraphStore(),
)
# Load in sample docs
documents = SimpleDirectoryReader("test-docs").load_data()
# Set up index
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
# Querying
query_engine = index.as_query_engine()
response = query_engine.query("What was the conclusion of the market research?")
response.response # Normal response as expected
Index 2
test_index = VectorStoreIndex.from_documents([], storage_context=storage_context)
query_engine = test_index.as_query_engine()
response = query_engine.query("What are the follow-up steps after the conducted market research?")  # Errors on a KeyError
response.response
I mean... I'm not sure what the use case is for doing this
I think the preferred way would be to do test_index = load_index_from_storage(storage_context)
Right now I have a setup where indices can be created without directly having documents assigned to them. Afterwards I have a function that can populate the created indices with the corresponding documents. Mostly just for separation of concerns. I'm just kind of confused why my second index, which I created without specifying any documents, errors when I attempt to query it and why it does not return None.
I had another case where I created a new empty index (did assign storagecontext) and when I attempted to query it, it synthesized an answer based on documents that were never assigned to it. I understand that it's not the most ideal way of working with LlamaIndex, but I'm just confused how my storagecontext enables it to work like that
tbh I'm a little confused too. Maybe try the code sample I gave above though. I can try and look into the code a bit more later. I would be curious what all the store objects look like before/after creating the second index though
Hey @Logan M , sorry to bother on a friday afternoon but wanted to share some more 'insights'
When I create a plain VectorStore without my StorageContext the docstore is empty as expected. The moment I assign the storagecontext, the docstore is populated. Is this expected behaviour? I feel like I'm just misunderstanding how the storagecontext and the docstore work but honestly I'm not too sure without taking a deeper look into the code first.
I would think that the docstore would remain empty for the newly created index, even though they are using the same docstore in the background. Are docstores shared throughout all indices that are persisted in that storagecontext? I can't really find anything on the readthedocs but maybe I misread something
yes! docstores are shared, as are vector indexes. It's the index store that keeps track of which node ids are available to that index
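To make that split concrete, here is a toy, pure-Python model of the idea: one docstore shared by every index, plus an index store that records which node ids belong to which index. This is not LlamaIndex code; `ToyStorageContext`, `ToyIndex` and their methods are hypothetical names made up for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyStorageContext:
    """Hypothetical stand-in for a storage context shared by several indices."""

    docstore: Dict[str, str] = field(default_factory=dict)  # node_id -> text, shared
    index_store: Dict[str, List[str]] = field(default_factory=dict)  # index_id -> node ids


@dataclass
class ToyIndex:
    """Hypothetical index that owns no data itself, only an id and a context."""

    index_id: str
    storage: ToyStorageContext

    def __post_init__(self) -> None:
        # Register the index without touching entries that already exist.
        self.storage.index_store.setdefault(self.index_id, [])

    def insert(self, node_id: str, text: str) -> None:
        self.storage.docstore[node_id] = text  # lands in the shared docstore
        self.storage.index_store[self.index_id].append(node_id)

    def node_ids(self) -> List[str]:
        return list(self.storage.index_store[self.index_id])


storage = ToyStorageContext()
first = ToyIndex("first", storage)
first.insert("n1", "Market research conclusion")
second = ToyIndex("second", storage)  # created empty, same storage context

print(len(second.storage.docstore))  # the shared docstore is visible from here
print(second.node_ids())             # node ids registered for this index only
```

In this toy model the second index sees the populated docstore the moment it is attached to the shared context, while its own node-id list stays empty.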
Ah! So it makes sense then that the docstore is indeed visible once storagecontext is assigned to it
I still, however, run into the problem that newly created indices somehow have access to documents from other indices when querying
I genuinely can't seem to figure it out but I'm willing to put in some work. Do you have any clue where I could start looking / debugging?
hmm, I have a second to figure this out right now actually. Can you share the code again that was causing issues?
The code is kind of bloated, I'll try to give you all the relevant code
I'll tag you when I have something presentable
Thanks! Appreciate it!
For context: it's a FastAPI backend where we want to manage indices and documents with LlamaIndex. Indices are named 'collections', so whenever that pops up, it's just another name for an Index
This is the entire flow for creating Indices. Below is the method which is called right after a POST call is made to /collection
- This method gets called right after the POST call is made and the request is parsed
def create_collection(
    self,
    collection: CollectionModel,
) -> CollectionModel:
    collection.index = self._initialize_index(collection.id)
    self.persist_collection(collection)
    return collection
- Below we create the empty index. The service_context and the storage_context are lifetime variables in our API; these get set during initialization and do not get modified afterwards
def _initialize_index(self, id: str) -> BaseIndex:
    index = VectorStoreIndex(
        [],  # TODO: Discuss empty initialization, research using from_documents with empty array
        service_context=self.service_context,
        storage_context=self.storage_context,
    )
    index.set_index_id(id)
    return index
- Afterwards I immediately persist it, which is nothing more than
def persist_collection(self, collection_model: CollectionModel) -> None:
    collection_model.index.storage_context.persist()  # type: ignore
-----------------------------
If a request is done to ask a question / query an index we end up in this method.
- We retrieve an index
- We query said index
- Parse response and source nodes for use in application (irrelevant so will be left out)
def ask_question(
    self,
    question: QuestionModel,
) -> QuestionModel:
    collection = self.collection_service.get_collection_by_id(
        question.collection_id,
    )
    response = self.question_repository.ask_question(
        question.question,
        collection,
    )
    if response is None:
        return question
    question.answer = response.response  # type: ignore
    question.sources = self.parse_sources(response)  # type: ignore
    return question
- The retrieval of an index (collection) is done as follows:
def get_collection_by_id(self, id: str) -> CollectionModel:
    index = load_index_from_storage(
        storage_context=self.storage_context,
        index_id=id,
    )
    return CollectionModel(id=id, index=index)
- The actual querying of the index is done in the question_repository, which is this function
class QuestionRepository:
    def ask_question(
        self,
        question: str,
        collection: CollectionModel,
    ) -> Response:
        query_engine = collection.index.as_query_engine()  # type: ignore
        response = query_engine.query(question)
        return response  # type: ignore
The only other thing that might be of interest is how we add documents to our indices. This method is a bit scuffed right now, but what it does is as follows:
- retrieve an index / 'collection' from the storagecontext
- Create a temporarydirectory, this is so that we can read the document with a SimpleDirectoryReader (mostly convenience)
- Load the doc with the directoryreader
- We have one "Document" object, which contains multiple "document chunks" , as the directoryreader splits them up
- We parse the document_chunks (mostly to keep track of the IDs)
- Loop through all chunks and add to the previously retrieved collection
- persist after closing the temporary directory
def add_document(
    self,
    document: DocumentModel,
) -> DocumentModel:
    collection: CollectionModel = (
        self.collection_service.get_collection_by_id(  # noqa E501
            document.collection_id,
        )
    )
    with tempfile.TemporaryDirectory() as tempdir:
        document.file_path = self.create_tmp_document(
            tempdir,
            collection_id=document.collection_id,
            file=document.file,  # type: ignore
        )
        document_loaded: List[Document] = SimpleDirectoryReader(
            document.file_path,
        ).load_data()
        document.document_chunks = self.parse_document_chunks(
            document_loaded,
        )
        for loaded_doc in document_loaded:
            self.document_repository.add_document(loaded_doc, collection)
    self.collection_service.persist_collection(collection)
    return document
The adding of a document isn't that shocking, we just do this.
def add_document(
    self,
    document: Document,
    collection: CollectionModel,
) -> None:
    collection.index.insert(document)  # type: ignore
@Logan M I think this might be all
I removed all error handling, docstrings and some irrelevant fluff
Cool thanks!
And so like, what was the root problem then?
The root problem is that I can create an empty index using the 1st flow
and then query it with the 2nd flow
It starts to use documents from other indices
So I create an empty index, I ask it a question, I expect it to fail and say something like "cannot be answered with the given context" etc, but instead it synthesizes an answer using documents from other indices
Obviously for this to occur I first have to have created an index, and added documents to it with this method. But when I then create a new index AFTER creating the first and then query it, I start receiving Responses which use sources from the first created index
Interestingly enough, this persists even when I reboot my API. I could reboot it right now, create an empty index, query over it, and all of a sudden it returns a Response with sources from an index I created before
Might be of use to know that we are using a WeaviateVectorStore, MongoDocumentStore and MongoIndexStore in the API
Wow, this is pretty complicated to parse hahaha
So, you use a storage context that is set up to point to weaviate/mongo
Then, when you instantiate an empty index with this storage context, it is still picking up documents from other indexes
So I think the root of the problem is you should probably be using index names/prefixes for mongo and weaviate
In order to keep the indexes separated
oh my lord that might be something
I am using one generic prefix in weaviate atm
Let me double check the class APIs for these
I do think I remember one generic prefix not being a problem, as weaviate still adds its own ID to the classname if I recall correctly
Yea so for weaviate, you can specify an index_name
For mongo, you can specify a namespace
-- I think this needs to be different for both the index and docstore though
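A rough sketch of what per-collection naming could look like. The `index_name` and `namespace` parameters are the ones mentioned above; everything else is an assumption: the helper names, the `Collection` prefix, giving each collection its own Mongo namespaces (one possible reading of the suggestion), and the rule that a Weaviate class name should start with a capital letter and contain only letters, digits and underscores. `build_storage_context` assumes the legacy `llama_index` import paths and is not executed here, so check it against the versions you run.

```python
import re


def weaviate_index_name(collection_id: str, prefix: str = "Collection") -> str:
    """Derive a per-collection class name (hypothetical helper)."""
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", collection_id)
    return f"{prefix}_{cleaned}"


def mongo_namespaces(collection_id: str) -> dict:
    """Hypothetical helper: distinct namespaces for docstore and index store."""
    return {
        "docstore": f"docstore_{collection_id}",
        "index_store": f"index_store_{collection_id}",
    }


def build_storage_context(weaviate_client, mongo_uri: str, collection_id: str):
    """Sketch only: wires the names above into the stores discussed here."""
    # Imported lazily so the helpers above work without llama_index installed.
    from llama_index.storage import StorageContext
    from llama_index.storage.docstore import MongoDocumentStore
    from llama_index.storage.index_store import MongoIndexStore
    from llama_index.vector_stores import WeaviateVectorStore

    names = mongo_namespaces(collection_id)
    return StorageContext.from_defaults(
        docstore=MongoDocumentStore.from_uri(uri=mongo_uri, namespace=names["docstore"]),
        index_store=MongoIndexStore.from_uri(uri=mongo_uri, namespace=names["index_store"]),
        vector_store=WeaviateVectorStore(
            weaviate_client=weaviate_client,
            index_name=weaviate_index_name(collection_id),
        ),
    )


print(weaviate_index_name("3f2a-91bc"))
print(mongo_namespaces("3f2a-91bc"))
```

The naming helpers are plain string functions, so they can be unit-tested without Weaviate or Mongo running.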
ATM I set up weaviate once
def _setup_weaviate_vectorstore(app: FastAPI):
    """Setup Weaviate connection."""
    auth_config = weaviate.AuthApiKey(settings.weaviate_pass)
    weaviate_client = weaviate.Client(str(settings.weaviate_uri), auth_config)
    vector_store = WeaviateVectorStore(
        weaviate_client=weaviate_client,
        class_prefix=settings.weaviate_class_prefix,
    )
    app.state.weaviate_client = vector_store
Where did you find the index_name variable?
It's the same as class_prefix actually haha just that class_prefix is deprecated
Ah! That might be a part of my problem then!
Because I specify the classname it gets set once and it just stores everything under the same name
Okookokokok I am so excited, I've been stuck on this for about two days now
dang haha wish I could have helped hash this out sooner!
No worries, I should've asked for help sooner too :p
I think there's a path forward now at least
You don't have a buy me a coffee link, do you?
One more thing, if I'm reading the weaviate API correctly, that does mean I need to re-instantiate the WeaviateVectorStore every time I am working with a new Index?
nah haha don't worry about it! I get paid for this already
But I think that's a fast process?
Yeah that's not the end of the world, I might even be able to just set the index_name on the fly
More so for cleanliness purposes, as it is a lifetime variable right now I'd like to keep it that way if possible
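One way to keep a single lifetime object while still getting a separate vector store per index is a small cache keyed by collection id. A minimal sketch, assuming nothing about LlamaIndex itself: `VectorStoreRegistry` and the `build` callable are hypothetical, and in the real app `build` would be whatever constructs a `WeaviateVectorStore` with the right `index_name`.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class VectorStoreRegistry(Generic[T]):
    """Hypothetical per-collection cache; created once at app startup."""

    def __init__(self, build: Callable[[str], T]) -> None:
        # e.g. lambda cid: WeaviateVectorStore(weaviate_client=client, index_name=...)
        self._build = build
        self._stores: Dict[str, T] = {}

    def get(self, collection_id: str) -> T:
        # Build the store the first time a collection is seen, then reuse it.
        if collection_id not in self._stores:
            self._stores[collection_id] = self._build(collection_id)
        return self._stores[collection_id]


# Stand-in builder so the sketch runs without Weaviate: it records the name only.
built = []


def fake_build(collection_id: str) -> dict:
    built.append(collection_id)
    return {"index_name": f"Collection_{collection_id}"}


registry = VectorStoreRegistry(fake_build)
a1 = registry.get("a")
a2 = registry.get("a")
b = registry.get("b")
print(a1 is a2, built)
```

The registry itself can live on `app.state`, so only the cheap per-collection wrapper is created on demand. The sketch does no locking, so concurrent first requests for the same collection could build the store twice.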
I'll just play around with it, once again thanks a lot
Appreciate the fast responses too!
sounds good! Let me know if anything else comes up!