
Updated last year

Hi I ran into a weird issue today and I

Hi :) I ran into a weird issue today and I'm not sure how to handle it. I created a storage_context with all 'simple' stores: the Simple document, vector, index and graph stores. I then created a VectorStoreIndex.from_documents() with some sample documents from my SimpleDirectoryReader and assigned the storage_context. I was then able to query it as expected and retrieved normal answers. However, I then created another VectorStoreIndex, this time not providing any documents, just an empty array [] and a reference to the StorageContext (same as used in the 1st index). When I want to query the second VectorStoreIndex, instead of getting None as a response, I get a KeyError on one of the DocIDs of my original VectorStoreIndex. In another instance of playing around with it I created an empty VectorStoreIndex, queried it, and all of a sudden I was actually getting results from the documents assigned to the other VectorStoreIndex.
56 comments
For context. This is my first index
Plain Text
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.vector_stores import SimpleVectorStore
from llama_index.storage import StorageContext
from llama_index.graph_stores import SimpleGraphStore

# create storage context using default stores
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
    graph_store=SimpleGraphStore()
)

# Load in sample docs
documents = SimpleDirectoryReader("test-docs").load_data()

# Set up index
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Querying
query_engine = index.as_query_engine()
response = query_engine.query("What was the conclusion of the market research?")
response.response # Normal response as expected


Index 2

Plain Text
test_index = VectorStoreIndex.from_documents([], storage_context=storage_context)
query_engine = test_index.as_query_engine()
response = query_engine.query("What are the follow-up steps after the conducted market research?")  # Raises a KeyError
response.response 
I mean... I'm not sure what the use case is for doing this 😅 I think the preferred way would be to do test_index = load_index_from_storage(storage_context)
Right now I have a setup where indices can be created without directly having documents assigned to them. Afterwards I have a function that can populate the created indices with the corresponding documents. Mostly just for separation of concerns. I'm just kind of confused why my second index, which I created without specifying any documents, errors when I attempt to query it and why it does not return None.

I had another case where I created a new empty index (did assign storagecontext) and when I attempted to query it, it synthesized an answer based on documents that were never assigned to it. I understand that it's not the most ideal way of working with LlamaIndex, but I'm just confused how my storagecontext enables it to work like that
tbh I'm a little confused too. Maybe try the code sample I gave above though. I can try and look into the code a bit more later. I would be curious what all the store objects look like before/after creating the second index though
Hey @Logan M , sorry to bother on a friday afternoon but wanted to share some more 'insights'

When I create a plain VectorStoreIndex without my StorageContext the docstore is empty as expected. The moment I assign the storagecontext, the docstore is populated. Is this expected behaviour? I feel like I'm just misunderstanding how the storagecontext and the docstore work but honestly I'm not too sure without taking a deeper look into the code first.

I would think that the docstore would remain empty for the newly created index, even though they are using the same docstore in the background. Are docstores shared throughout all indices that are persisted in that storagecontext? I can't really find anything on the readthedocs but maybe I misread something
yes! docstores are shared, as are vector indexes. It's the index store that keeps track of which node ids are available to that index
Ah! So it makes sense then that the docstore is indeed visible once storagecontext is assigned to it
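To make the sharing concrete, here is a toy model in plain Python. It is only a sketch of the idea described above, not LlamaIndex's actual classes: `ToyStorageContext` and `create_index` are made-up names, and the real stores hold richer objects than strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyStorageContext:
    """Toy stand-in for a storage context (hypothetical, not the LlamaIndex class)."""

    # node_id -> text; one dict shared by every index built on this context
    docstore: Dict[str, str] = field(default_factory=dict)
    # index_id -> node ids that belong to that index
    index_store: Dict[str, List[str]] = field(default_factory=dict)


def create_index(ctx: ToyStorageContext, index_id: str, texts: List[str]) -> None:
    """Write nodes into the shared docstore and record which ids the index owns."""
    node_ids = []
    for i, text in enumerate(texts):
        node_id = f"{index_id}-node-{i}"
        ctx.docstore[node_id] = text
        node_ids.append(node_id)
    ctx.index_store[index_id] = node_ids


ctx = ToyStorageContext()
create_index(ctx, "first", ["market research summary", "follow-up steps"])
create_index(ctx, "second", [])  # empty index on the same context

docstore_size = len(ctx.docstore)        # both indices see the same docstore
second_nodes = ctx.index_store["second"]  # but the second index owns no nodes
print(docstore_size, second_nodes)
```

In this model the docstore looks "populated" from the second index even though that index owns nothing, which matches what was observed above.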
I still, however, run into the problem that newly created indices somehow have access to documents from other indices when querying
I genuinely can't seem to figure it out but I'm willing to put in some work. Do you have any clue where I could start looking / debugging?
hmm, I have a second to figure this out right now actually. Can you share the code again that was causing issues?
The code is kind of bloated, I'll try to give you all the relevant code
I'll tag you when I have something presentable
Thanks! Appreciate it! 🙏
For context: it's a FastAPI backend where we want to manage indices and documents with LlamaIndex. Indices are named collections, so whenever that pops up, it's just another name for an Index

This is the entire flow for creating Indices. Below is the method which is called right after a POST call is done to /collection

  1. This method gets called right after the POST call is made and the request is parsed
Plain Text
    def create_collection(
        self,
        collection: CollectionModel,
    ) -> CollectionModel:
        collection.index = self._initialize_index(collection.id)
        self.persist_collection(collection)
        return collection


  2. Below we create the empty index. The service_context and the storage_context are lifetime variables in our API, these get set during initialization and do not get modified afterwards
Plain Text
    def _initialize_index(self, id: str) -> BaseIndex:
        index = VectorStoreIndex(
            [],  # TODO: Discuss empty initialization, research using from_documents with empty array
            service_context=self.service_context,
            storage_context=self.storage_context,
        )
        index.set_index_id(id)
        return index


  3. Afterwards I immediately persist it, which is nothing more than
Plain Text
    def persist_collection(self, collection_model: CollectionModel) -> None:
        collection_model.index.storage_context.persist()  # type: ignore
-----------------------------

If a request is done to ask a question / query an index we end up in this method.

  1. We retrieve an index
  2. We query said index
  3. Parse response and source nodes for use in application (irrelevant so will be left out)
Plain Text
    def ask_question(
        self,
        question: QuestionModel,
    ) -> QuestionModel:
        collection = self.collection_service.get_collection_by_id(
            question.collection_id,
        )
        response = self.question_repository.ask_question(
            question.question,
            collection,
        )
        if response is None:
            return question
        question.answer = response.response  # type: ignore
        question.sources = self.parse_sources(response)  # type: ignore
        return question



  1. The retrieval of an index (collection) is done as follows:
Plain Text
    def get_collection_by_id(self, id: str) -> CollectionModel:
        index = load_index_from_storage(
            storage_context=self.storage_context,
            index_id=id,
        )
        return CollectionModel(id=id, index=index)


  2. The actual querying of the index is done in the question_repository, which is this function
Plain Text
class QuestionRepository:
    def ask_question(
        self,
        question: str,
        collection: CollectionModel,
    ) -> Response:
        query_engine = collection.index.as_query_engine()  # type: ignore
        response = query_engine.query(question)
        return response  # type: ignore
The only other thing that might be of interest is how we add documents to our indices. This method is a bit scuffed right now, but what it does is as follows:
  1. retrieve an index / 'collection' from the storagecontext
  2. Create a temporarydirectory, this is so that we can read the document with a SimpleDirectoryReader (mostly convenience)
  3. Load the doc with the directoryreader
  4. We have one "Document" object, which contains multiple "document chunks" , as the directoryreader splits them up
  5. We parse the document_chunks (mostly to keep track of the IDs)
  6. Loop through all chunks and add to the previously retrieved collection
  7. persist after closing the temporary directory
Plain Text
    def add_document(
        self,
        document: DocumentModel,
    ) -> DocumentModel:
        collection: CollectionModel = (
            self.collection_service.get_collection_by_id(  # noqa E501
                document.collection_id,
            )
        )
        with tempfile.TemporaryDirectory() as tempdir:
            document.file_path = self.create_tmp_document(
                tempdir,
                collection_id=document.collection_id,
                file=document.file,  # type: ignore
            )
            document_loaded: List[Document] = SimpleDirectoryReader(
                document.file_path,
            ).load_data()
            document.document_chunks = self.parse_document_chunks(
                document_loaded,
            )
            for loaded_doc in document_loaded:
                self.document_repository.add_document(loaded_doc, collection)
        self.collection_service.persist_collection(collection)
        return document
The adding of a document isn't that shocking, we just do this.
Plain Text
    def add_document(
        self,
        document: Document,
        collection: CollectionModel,
    ) -> None:
        collection.index.insert(document)  # type: ignore
@Logan M I think this might be all
I removed all error handling, docstrings and some irrelevant fluff
Cool thanks!

And so like, what was the root problem then? 😅
The root problem is that I can create an empty index using the 1st flow
and then query it with the 2nd flow
It starts to use documents from other indices
So I create an empty index, I ask it a question, I expect it to fail and say something like "cannot be answered with the given context" etc, but instead it synthesizes an answer using documents from other indices
Obviously for this to occur I first have to have created an index, and added documents to it with this method. But when I then create a new index AFTER creating the first and then query it, I start receiving Responses which use sources from the first created index
Interestingly enough, this persists even when I reboot my API. I could reboot it right now, create an empty index, query over it, and all of a sudden it returns a Response with sources from an index I created before
Might be of use to know that we are using a WeaviateVectorStore, MongoDocumentStore and MongoIndexStore in the API
Wow, this is pretty complicated to parse hahaha

So, you use a storage context that is setup to point to weaviate/mongo

Then, when you instantiate an empty index with this storage context, it is still picking up documents from other indexes
So I think the root of the problem is you should probably be using index names/prefixes for mongo and weaviate
In order to keep the indexes separated
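A toy illustration of why one generic name leaks results between indexes. This is a sketch only: `ToyVectorDB` is a made-up in-memory stand-in, not the Weaviate client, and the keyword match stands in for a real similarity search.

```python
from typing import Dict, List, Tuple


class ToyVectorDB:
    """Toy stand-in for an external vector database (hypothetical)."""

    def __init__(self) -> None:
        # class name -> list of (node_id, text)
        self._classes: Dict[str, List[Tuple[str, str]]] = {}

    def add(self, class_name: str, node_id: str, text: str) -> None:
        self._classes.setdefault(class_name, []).append((node_id, text))

    def query(self, class_name: str, keyword: str) -> List[str]:
        """Return ids of nodes stored under `class_name` whose text contains `keyword`."""
        return [nid for nid, text in self._classes.get(class_name, []) if keyword in text]


db = ToyVectorDB()

# One generic name for every index: a query made on behalf of an empty
# index still hits the nodes another index wrote under that name.
db.add("Generic", "a-1", "market research conclusion")
shared_hits = db.query("Generic", "market")

# One name per index: the empty index has nothing to retrieve.
db.add("Index_A", "a-1", "market research conclusion")
isolated_hits = db.query("Index_B", "market")

print(shared_hits, isolated_hits)
```

The retrieval step only knows the name it was told to search under, so separation has to come from the name itself.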
oh my lord that might be something
I am using one generic prefix in weaviate atm
Let me double check the class APIs for these
I do think I remember one generic prefix not being a problem, as weaviate still adds its own ID to the classname if I recall correctly
Yea so for weaviate, you can specify an index_name

For mongo, you can specify a namespace -- I think this needs to be different for both the index and docstore though
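One way to get distinct names is to derive them from the collection id. The helpers below are hypothetical (`weaviate_index_name` and `mongo_namespaces` are not part of any library), and the naming rule is an assumption: as far as I know Weaviate expects class names to start with a capital letter and contain only letters, digits and underscores, so check that against your version.

```python
import re
from typing import Dict


def weaviate_index_name(collection_id: str, prefix: str = "Collection") -> str:
    """Build a per-collection name: capitalised prefix, then only [0-9A-Za-z_]."""
    if not prefix[:1].isupper():
        raise ValueError("prefix should start with a capital letter")
    # Replace anything outside the assumed allowed character set with "_"
    cleaned = re.sub(r"[^0-9A-Za-z_]", "_", collection_id)
    return f"{prefix}_{cleaned}"


def mongo_namespaces(collection_id: str) -> Dict[str, str]:
    """Give the docstore and the index store different namespaces per collection."""
    return {
        "docstore": f"{collection_id}_docstore",
        "index_store": f"{collection_id}_index_store",
    }


print(weaviate_index_name("3f2a-77"))
print(mongo_namespaces("3f2a-77"))
```

The resulting strings would then be passed as the `index_name` and `namespace` arguments mentioned above when the stores are constructed.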
ATM I setup weaviate once

Plain Text
def _setup_weaviate_vectorstore(app: FastAPI):
    """Setup Weaviate connection."""
    auth_config = weaviate.AuthApiKey(settings.weaviate_pass)
    weaviate_client = weaviate.Client(str(settings.weaviate_uri), auth_config)
    vector_store = WeaviateVectorStore(
        weaviate_client=weaviate_client,
        class_prefix=settings.weaviate_class_prefix,
    )
    app.state.weaviate_client = vector_store
Where did you find the index_name variable?
It's the same as class_prefix actually haha just that class_prefix is deprecated
Ah! That might be a part of my problem then!
Because I specify the classname it gets set once and it just stores everything under the same name
Okookokokok I am so excited, I've been stuck on this for about two days now 😹
dang haha wish I could have helped hash this out sooner!
No worries, I should've asked for help sooner too :p
I think there's a path forward now at least 🙂
Thank you so much 🙏
You don't have a buy me a coffee link, do you 👀
One more thing, if I'm reading the weaviate API correctly, that does mean I need to re-instantiate the WeaviateVectorStore every time I am working with a new Index?
nah haha don't worry about it! I get paid for this already 😆
hmmm, i think so 🤔
But i think that's a fast process?
Yeah that's not the end of the world, I might even be able to just set the index_name on the fly
More so for cleanliness purposes, as it is a lifetime variable right now I'd like to keep it that way if possible
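If the goal is to keep a single lifetime object while still getting one vector store per index, a small cache keyed by collection id might do. This is a sketch under assumptions: `PerCollectionRegistry` is a made-up helper, and the factory shown here returns a plain string so the example runs on its own. In the API the factory would presumably build something like `WeaviateVectorStore(weaviate_client=..., index_name=...)`.

```python
from typing import Callable, Dict, Generic, List, TypeVar

T = TypeVar("T")


class PerCollectionRegistry(Generic[T]):
    """Lazily builds one object per collection id and reuses it afterwards."""

    def __init__(self, factory: Callable[[str], T]) -> None:
        self._factory = factory
        self._cache: Dict[str, T] = {}

    def get(self, collection_id: str) -> T:
        # Build on first request only; later calls return the cached object
        if collection_id not in self._cache:
            self._cache[collection_id] = self._factory(collection_id)
        return self._cache[collection_id]


calls: List[str] = []


def fake_factory(collection_id: str) -> str:
    """Stand-in for constructing a real per-collection vector store."""
    calls.append(collection_id)
    return f"store-for-{collection_id}"


registry = PerCollectionRegistry(fake_factory)
a1 = registry.get("a")
a2 = registry.get("a")  # served from the cache, factory not called again
b = registry.get("b")
print(calls)
```

The registry itself can live on `app.state` as the lifetime variable. Note that this simple version has no locking, so concurrent first requests for the same id could build the object twice.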
I'll just play around with it, once again thanks a lot
Appreciate the fast responses too!
sounds good! Let me know if anything else comes up! 💪