Index

Hey guys,

I'm using Google Cloud Run to deploy my RAG app. I'm finding the app quite slow; sometimes it takes 10 seconds to execute the code.
It seems to be related to index storage. Locally, I'm saving the index storage in a folder, along with my .txt files.

What are people doing out there in a real production app?

Here is my Dockerfile. I define VOLUMEs, but I don't think this is the best approach.

Plain Text
# Use the official Python 3.11 image as the base image
FROM --platform=linux/amd64 python:3.11

# Set the working directory in the container
WORKDIR /code
VOLUME /code/data
VOLUME /code/storage

# Copy the requirements.txt file into the container at /code
COPY requirements.txt .

# Install any needed dependencies specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the container at /code
COPY . .

# Specify the command to run your application
# CMD [ "python", "app.py" ]
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "9090"]
# CMD ["uvicorn", "app.main:app", "--reload"]
25 comments
Normally you'd have your data stored in some hosted vector db, rather than saving locally (at least when you have more than a handful of data)
when I say index data I mean files like:
  • default__vector_store.json,
  • docstore.json,
  • graph_store.json,
  • image__vector_store and
  • index_store.json
this is under the /storage folder
under /data I have the .txt files
I'm using VectorStoreIndex btw
Plain Text
import os.path
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

PERSIST_DIR = "/tmp/storage"


def get_docs_index():
    # check if storage already exists
    if not os.path.exists(PERSIST_DIR):
        # load the documents and create the index
        documents = SimpleDirectoryReader("data").load_data()
        index = VectorStoreIndex.from_documents(documents)
        # store it for later
        index.storage_context.persist(persist_dir=PERSIST_DIR)
    else:
        # load the existing index
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        index = load_index_from_storage(storage_context)

    return index
Yea -- you don't need any of that if you use a hosted vector db integration (Qdrant, Weaviate, Pinecone, etc.)

Load times are essentially a no-op in this setup
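
For example, something along these lines (a minimal sketch, assuming the collection has already been populated; the URL, API key, and collection name are placeholders):

Plain Text
import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# connect to a hosted, already-populated collection (URL, key, and name are placeholders)
client = qdrant_client.QdrantClient(
    url="https://YOUR-CLUSTER.qdrant.io",
    api_key="YOUR_API_KEY",
)
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")

# no persist() / load_index_from_storage() step -- the index just wraps the remote store
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()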
OK, I will look into that.
That's more meant for reading data and then putting it into an index
I think I see what you mean. What's making things slow and more complex for me is that I have to save/load the index to disk. Using a cloud vector solution should improve it a lot.

I saw Qdrant has a free cloud option
but Chroma seems to be way more popular than Qdrant
though it has no cloud option
Qdrant is very nice tbh (it's what I would recommend trying anyways)

They have their own cloud option, and also stuff for deploying/hosting yourself too
Hey @Logan M, sorry to keep bugging you. I have implemented cloud Qdrant. It's working, but the performance is worse than before.

Plain Text
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore


def get_qdrant_index():
    client = get_qdrant_client()

    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()

    vector_store = QdrantVectorStore(client=client, collection_name="serraventura_cv")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
    )

    return index


To give some context: I'm trying to build an API using uvicorn and FastAPI.

My docker CMD
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "9090"]

Is there any problem with uvicorn/FastAPI working with LlamaIndex?

Locally, without uvicorn/FastAPI, just executing the script takes 5 seconds. When I use uvicorn/FastAPI it takes 5 minutes or more.

The logs from my Docker container:

Plain Text
ort_config.json: 100%|██████████| 1.27k/1.27k [00:00<00:00, 1.62MB/s]
config.json: 100%|██████████| 740/740 [00:00<00:00, 1.62MB/s]
special_tokens_map.json: 100%|██████████| 695/695 [00:00<00:00, 1.71MB/s]
tokenizer_config.json: 100%|██████████| 1.24k/1.24k [00:00<00:00, 2.63MB/s]
.gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 2.77MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 2.77MB/s]
tokenizer.json: 100%|██████████| 711k/711k [00:00<00:00, 2.22MB/s]
model_optimized.onnx: 100%|██████████| 218M/218M [00:10<00:00, 20.9MB/s]
Fetching 8 files: 100%|██████████| 8/8 [00:11<00:00,  1.41s/it]
2024-04-02 17:56:39 INFO:     Started server process [1]
2024-04-02 17:56:39 INFO:     Waiting for application startup.
2024-04-02 17:56:39 INFO:     Application startup complete.
2024-04-02 17:56:39 INFO:     Uvicorn running on http://0.0.0.0:9090 (Press CTRL+C to quit)
Seems like you are downloading a lot of files on startup -- this is unrelated to qdrant or the vector db
it seems to be coming from Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")
I will test other models
Yes, this line is definitely the problem. Even when choosing a smaller model (BAAI/bge-small-en-v1.5), it takes forever. I will need to review my approach with Qdrant.
thanks anyway 🙂
You probably want to have the model cached inside your docker file, otherwise it will always download on startup
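
For example, something like this could go in the Dockerfile after the pip install step (just a sketch: instantiating the embedding model once at build time bakes the weights into the image instead of fetching them on every container start):

Plain Text
# pre-download the FastEmbed model at build time so it isn't fetched on every startup
RUN python -c "from llama_index.embeddings.fastembed import FastEmbedEmbedding; FastEmbedEmbedding(model_name='BAAI/bge-base-en-v1.5')"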
That's a good idea, but I think it will only help with a serverless cold start. Locally my container keeps running, so the model is downloaded just once. New requests to the API don't download it again, and it's still super slow. My code is based on their documentation. The only difference is that I'm using cloud Qdrant.
I know it might be a skill issue, but the rest of the docs from Qdrant are not helping either. I will give up on Qdrant for now.
Are you running embeddings on GPU? The only other thing slowing it down (in my opinion) is running models locally
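
If local inference turns out to be the bottleneck, one option is a hosted embedding API instead of FastEmbed, e.g. (a sketch; it assumes an OpenAI API key is configured, and the model name is only an example):

Plain Text
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# remote embedding API instead of a locally-run model -- no model download or local inference
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")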
Hi, thanks for the help. I ended up moving to the TypeScript lib and things are running more smoothly.