LlamaIndex

Updated last year

Hi I ran into a weird issue today and I

OOverclockedClock

Hi :) I ran into a weird issue today and I'm not sure how to handle it. I created a storage_context with all 'simple' stores: SimpleDocumentStore, SimpleVectorStore, SimpleIndexStore and SimpleGraphStore. I then created a VectorStoreIndex.from_documents() with some sample documents from my SimpleDirectoryReader and assigned the storage_context. I was able to query it as expected and got normal answers. However, I then created another VectorStoreIndex, this time not providing any documents, just an empty list [] and a reference to the same StorageContext used for the first index. When I query the second index, instead of getting None as a response, I get a KeyError on one of the doc IDs of my original index. In another instance of playing around with it, I created an empty VectorStoreIndex, queried it, and all of a sudden I was getting results from the documents assigned to the other VectorStoreIndex.

56 comments

OOverclockedClock

For context. This is my first index

Plain Text

from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.vector_stores import SimpleVectorStore
from llama_index.storage import StorageContext
from llama_index.graph_stores import SimpleGraphStore

# create storage context using default stores
storage_context = StorageContext.from_defaults(
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
    index_store=SimpleIndexStore(),
    graph_store=SimpleGraphStore()
)

# Load in sample docs
documents = SimpleDirectoryReader("test-docs").load_data()

# Set up index
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Querying
query_engine = index.as_query_engine()
response = query_engine.query("What was the conclusion of the market research?")
response.response # Normal response as expected

Index 2

Plain Text

test_index = VectorStoreIndex.from_documents([], storage_context=storage_context)
query_engine = test_index.as_query_engine()
response = query_engine.query("What are the follow-up steps after the conducted market research?")  # Raises a KeyError
response.response

I mean... I'm not sure what the use case is for doing this 😅 I think the preferred way would be to do test_index = load_index_from_storage(storage_context)

OOverclockedClock

Right now I have a setup where indices can be created without directly having documents assigned to them. Afterwards I have a function that can populate the created indices with the corresponding documents. Mostly just for separation of concerns. I'm just kind of confused why my second index, which I created without specifying any documents, errors when I attempt to query it and why it does not return None.

I had another case where I created a new empty index (did assign storagecontext) and when I attempted to query it, it synthesized an answer based on documents that were never assigned to it. I understand that it's not the most ideal way of working with LlamaIndex, but I'm just confused how my storagecontext enables it to work like that

tbh I'm a little confused too. Maybe try the code sample I gave above though. I can try and look into the code a bit more later. I would be curious what all the store objects look like before/after creating the second index though
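A before/after comparison like the one suggested here could be done with a small helper. This is only a sketch: `summarize_stores` is a hypothetical name, and it assumes that `docstore.docs` is a dict keyed by node ID and that `index_store.index_structs()` returns objects with `index_id` and `nodes_dict` attributes. The demo uses stand-in objects so it runs without LlamaIndex installed.

```python
from types import SimpleNamespace


def summarize_stores(storage_context) -> dict:
    """Summarize what a storage context currently holds.

    Assumptions: `docstore.docs` is a dict of node_id -> node, and
    `index_store.index_structs()` returns objects that expose
    `index_id` and `nodes_dict` (vector id -> node id).
    """
    structs = storage_context.index_store.index_structs()
    return {
        "docstore_node_ids": sorted(storage_context.docstore.docs),
        "indices": {s.index_id: sorted(s.nodes_dict.values()) for s in structs},
    }


# Stand-in storage context: two indices sharing one docstore.
fake = SimpleNamespace(
    docstore=SimpleNamespace(docs={"n1": "chunk one", "n2": "chunk two"}),
    index_store=SimpleNamespace(
        index_structs=lambda: [
            SimpleNamespace(index_id="first", nodes_dict={"v1": "n1", "v2": "n2"}),
            SimpleNamespace(index_id="second", nodes_dict={}),
        ]
    ),
)
summary = summarize_stores(fake)
print(summary)
```

Calling the helper once before and once after creating the second index would show whether the new index struct really lists no nodes while the docstore stays populated.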

OOverclockedClock

Hey @Logan M , sorry to bother on a friday afternoon but wanted to share some more 'insights'

When I create a plain VectorStore without my StorageContext the docstore is empty as expected. The moment I assign the storagecontext, the docstore is populated. Is this expected behaviour? I feel like I'm just misunderstanding how the storagecontext and the docstore work but honestly I'm not too sure without taking a deeper look into the code first.

I would think that the docstore would remain empty for the newly created index, even though they are using the same docstore in the background. Are docstores shared throughout all indices that are persisted in that storagecontext? I can't really find anything on the readthedocs but maybe I misread something

yes! docstores are shared, as are vector indexes. It's the index store that keeps track of which node ids are available to each index
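A toy model in plain Python may help picture that split. It does not use LlamaIndex's real classes; `ToyStorage`, `create_index` and `nodes_for` are hypothetical names that only illustrate a shared docstore alongside a per-index list of node IDs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStorage:
    """Stand-in for a shared storage context (hypothetical, not the real API)."""
    docstore: dict = field(default_factory=dict)     # node_id -> text, shared by all indices
    index_store: dict = field(default_factory=dict)  # index_id -> list of node_ids


def create_index(storage: ToyStorage, index_id: str, texts: list) -> None:
    """Register an index and add its nodes to the shared docstore."""
    node_ids = []
    for i, text in enumerate(texts):
        node_id = f"{index_id}-node-{i}"
        storage.docstore[node_id] = text
        node_ids.append(node_id)
    storage.index_store[index_id] = node_ids


def nodes_for(storage: ToyStorage, index_id: str) -> list:
    """Return only the texts that the index store lists for this index."""
    return [storage.docstore[n] for n in storage.index_store[index_id]]


storage = ToyStorage()
create_index(storage, "first", ["market research summary", "follow-up steps"])
create_index(storage, "second", [])

print(len(storage.docstore))         # the shared docstore holds every node
print(nodes_for(storage, "second"))  # the second index itself lists no nodes
```

In this model the second index "sees" a populated docstore, yet its own entry in the index store is empty, which is the distinction described above.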

OOverclockedClock

Ah! So it makes sense then that the docstore is indeed visible once storagecontext is assigned to it

OOverclockedClock

I still, however, run into the problem that newly created indices somehow have access to documents from other indices when querying

OOverclockedClock

I genuinely can't seem to figure it out but I'm willing to put in some work. Do you have any clue where I could start looking / debugging?

hmm, I have a second to figure this out right now actually. Can you share the code again that was causing issues?

OOverclockedClock

The code is kind of bloated, I'll try to give you all the relevant code

OOverclockedClock

I'll tag you when I have something presentable

Thanks! Appreciate it! 🙏

OOverclockedClock

For context: it's a FastAPI backend where we want to manage indices and documents with LlamaIndex. Indices are named collections, so whenever that pops up, it's just another name for an index.

This is the entire flow for creating indices. The method below is called right after a POST request to /collection is made and parsed.

Plain Text

    def create_collection(
        self,
        collection: CollectionModel,
    ) -> CollectionModel:
        collection.index = self._initialize_index(collection.id)
        self.persist_collection(collection)
        return collection

Below we create the empty index. The service_context and the storage_context are lifetime variables in our API, these get set during initialization and do not get modified afterwards

Plain Text

    def _initialize_index(self, id: str) -> BaseIndex:
        index = VectorStoreIndex(
            [],  # TODO: Discuss empty initialization, research using from_documents with empty array
            service_context=self.service_context,
            storage_context=self.storage_context,
        )
        index.set_index_id(id)
        return index

Afterwards I immediately persist it, which is nothing more than

Plain Text

    def persist_collection(self, collection_model: CollectionModel) -> None:
        collection_model.index.storage_context.persist()  # type: ignore

OOverclockedClock

-----------------------------

If a request is made to ask a question / query an index, we end up in this method.

1. We retrieve an index

2. We query said index

3. Parse response and source nodes for use in application (irrelevant so will be left out)

Plain Text

    def ask_question(
        self,
        question: QuestionModel,
    ) -> QuestionModel:
        collection = self.collection_service.get_collection_by_id(
            question.collection_id,
        )
        response = self.question_repository.ask_question(
            question.question,
            collection,
        )
        if response is None:
            return question
        question.answer = response.response  # type: ignore
        question.sources = self.parse_sources(response)  # type: ignore
        return question

The retrieval of an index (collection) is done as follows:

Plain Text

    def get_collection_by_id(self, id: str) -> CollectionModel:
        index = load_index_from_storage(
            storage_context=self.storage_context,
            index_id=id,
        )
        return CollectionModel(id=id, index=index)

The actual querying of the index is done in the question_repository, which is this function

Plain Text

class QuestionRepository:
    def ask_question(
        self,
        question: str,
        collection: CollectionModel,
    ) -> Response:
        query_engine = collection.index.as_query_engine()  # type: ignore
        response = query_engine.query(question)
        return response  # type: ignore

OOverclockedClock

The only other thing that might be of interest is how we add documents to our indices. This method is a bit scuffed right now, but what it does is as follows:

1. Retrieve an index / 'collection' from the storagecontext
2. Create a temporary directory, so that we can read the document with a SimpleDirectoryReader (mostly convenience)
3. Load the doc with the directory reader
4. We have one "Document" object, which contains multiple "document chunks", as the directory reader splits them up
5. Parse the document_chunks (mostly to keep track of the IDs)
6. Loop through all chunks and add them to the previously retrieved collection
7. Persist after closing the temporary directory

Plain Text

    def add_document(
        self,
        document: DocumentModel,
    ) -> DocumentModel:
        collection: CollectionModel = (
            self.collection_service.get_collection_by_id(  # noqa E501
                document.collection_id,
            )
        )
        with tempfile.TemporaryDirectory() as tempdir:
            document.file_path = self.create_tmp_document(
                tempdir,
                collection_id=document.collection_id,
                file=document.file,  # type: ignore
            )
            document_loaded: List[Document] = SimpleDirectoryReader(
                document.file_path,
            ).load_data()
            document.document_chunks = self.parse_document_chunks(
                document_loaded,
            )
            for loaded_doc in document_loaded:
                self.document_repository.add_document(loaded_doc, collection)
        self.collection_service.persist_collection(collection)
        return document

OOverclockedClock

The adding of a document isn't that shocking, we just do this.

Plain Text

    def add_document(
        self,
        document: Document,
        collection: CollectionModel,
    ) -> None:
        collection.index.insert(document)  # type: ignore

OOverclockedClock

@Logan M I think this might be all

OOverclockedClock

I removed all error handling, docstrings and some irrelevant fluff

Cool thanks!

And so like, what was the root problem then? 😅

OOverclockedClock

The root problem is that I can create an empty index using the 1st flow

OOverclockedClock

and then query it with the 2nd flow

OOverclockedClock

It starts to use documents from other indices

OOverclockedClock

So I create an empty index, I ask it a question, I expect it to fail and say something like "cannot be answered with the given context" etc, but instead it synthesizes an answer using documents from other indices

OOverclockedClock

Obviously for this to occur I first have to have created an index, and added documents to it with this method. But when I then create a new index AFTER creating the first and then query it, I start receiving Responses which use sources from the first created index

OOverclockedClock

Interestingly enough, this persists even when I reboot my API. I could reboot it right now, create an empty index, query over it, and all of a sudden it returns a Response with sources from an index I created before

OOverclockedClock

Might be of use to know that we are using a WeaviateVectorStore, MongoDocumentStore and MongoIndexStore in the API

Wow, this is pretty complicated to parse hahaha

So, you use a storage context that is setup to point to weaviate/mongo

Then, when you instantiate an empty index with this storage context, it is still picking up documents from other indexes

So I think the root of the problem is you should probably be using index names/prefixes for mongo and weaviate

In order to keep the indexes separated

OOverclockedClock

oh my lord that might be something

OOverclockedClock

I am using one generic prefix in weaviate atm

Let me double check the class APIs for these

OOverclockedClock

I do think I remember one generic prefix not being a problem, as weaviate still adds its own ID to the classname if I recall correctly

Yea so for weaviate, you can specify an index_name

For mongo, you can specify a namespace -- I think this needs to be different for both the index and docstore though
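One way to act on this is to derive the names from the collection ID. The helpers below are a hypothetical sketch, not part of LlamaIndex: `weaviate_index_name` and `mongo_namespaces` are made-up names, the prefix is assumed to start with a letter, and the naming rules (capitalized first letter, only letters, digits and underscores) are an assumption about what Weaviate class names accept.

```python
import re


def weaviate_index_name(prefix: str, collection_id: str) -> str:
    """Build a per-collection Weaviate class name (hypothetical helper).

    Assumption: the name should start with a capital letter and contain
    only letters, digits and underscores.
    """
    cleaned = re.sub(r"[^0-9a-zA-Z_]", "_", f"{prefix}_{collection_id}")
    return cleaned[0].upper() + cleaned[1:]


def mongo_namespaces(collection_id: str) -> dict:
    """Return distinct namespaces for the docstore and the index store."""
    return {
        "docstore": f"{collection_id}_docstore",
        "index_store": f"{collection_id}_index_store",
    }


name = weaviate_index_name("collections", "3f2a-77")
namespaces = mongo_namespaces("3f2a-77")
print(name)
print(namespaces)
```

The resulting strings would then be passed as the `index_name` and `namespace` arguments mentioned above when the stores are constructed.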

OOverclockedClock

ATM I setup weaviate once

Plain Text

def _setup_weaviate_vectorstore(app: FastAPI):
    """Setup Weaviate connection."""
    auth_config = weaviate.AuthApiKey(settings.weaviate_pass)
    weaviate_client = weaviate.Client(str(settings.weaviate_uri), auth_config)
    vector_store = WeaviateVectorStore(
        weaviate_client=weaviate_client,
        class_prefix=settings.weaviate_class_prefix,
    )
    app.state.weaviate_client = vector_store

OOverclockedClock

Where did you find the index_name variable?

It's the same as class_prefix actually haha just that class_prefix is deprecated

OOverclockedClock

Ah! That might be a part of my problem then!

OOverclockedClock

Because I specify the classname it gets set once and it just stores everything under the same name

OOverclockedClock

Okookokokok I am so excited, I've been stuck on this for about two days now 😹

dang haha wish I could have helped hash this out sooner!

OOverclockedClock

No worries, I should've asked for help sooner too :p

I think there's a path forward now at least 🙂

OOverclockedClock

Thank you so much 🙏

OOverclockedClock

You don't have a buy me a coffee link, do you 👀

OOverclockedClock

One more thing: if I'm reading the weaviate API correctly, does that mean I need to re-instantiate the WeaviateVectorStore every time I am working with a new index?

nah haha don't worry about it! I get paid for this already 😆

hmmm, i think so 🤔

But i think that's a fast process?

OOverclockedClock

Yeah that's not the end of the world, I might even be able to just set the index_name on the fly

OOverclockedClock

More so for cleanliness purposes, as it is a lifetime variable right now I'd like to keep it that way if possible
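If re-instantiating per index is needed, a small registry could still live as a single lifetime object. This is a sketch under assumptions: `VectorStoreRegistry` is a hypothetical helper, and `factory` stands for any callable that builds a vector store for a given index name, for example one wrapping `WeaviateVectorStore(weaviate_client=..., index_name=name)`. The demo uses a stand-in factory so it runs without Weaviate.

```python
from typing import Any, Callable, Dict


class VectorStoreRegistry:
    """Hands out one vector store per collection, building each only once.

    Hypothetical helper: `factory` takes an index name and returns a
    vector store instance.
    """

    def __init__(self, factory: Callable[[str], Any]) -> None:
        self._factory = factory
        self._stores: Dict[str, Any] = {}

    def get(self, index_name: str) -> Any:
        # Build the store on first use, then reuse the cached instance.
        if index_name not in self._stores:
            self._stores[index_name] = self._factory(index_name)
        return self._stores[index_name]


# Stand-in factory that records which names were built.
built = []


def fake_factory(name: str) -> dict:
    built.append(name)
    return {"index_name": name}


registry = VectorStoreRegistry(fake_factory)
store_a = registry.get("Collection_a")
store_a_again = registry.get("Collection_a")
store_b = registry.get("Collection_b")
```

The registry itself can be set once at startup, while each collection still gets its own store.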

OOverclockedClock

I'll just play around with it, once again thanks a lot

OOverclockedClock

Appreciate the fast responses too!

sounds good! Let me know if anything else comes up! 💪
