Index

Hey guys,

I'm using Google Cloud Run to deploy my RAG app. I'm finding the app quite slow; sometimes it takes 10 seconds to execute the code.
It seems to be related to index storage. Locally, I'm persisting the index to a folder, alongside my .txt files.

What are people doing out there in a real production app?

Here is my Dockerfile. I define my VOLUMEs, but I don't think this is the best approach.

Plain Text
# Use the official Python 3.11 image as the base image
FROM --platform=linux/amd64 python:3.11

# Set the working directory in the container
WORKDIR /code
VOLUME /code/data
VOLUME /code/storage

# Copy the requirements.txt file into the container at /code
COPY requirements.txt .

# Install any needed dependencies specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code into the container at /code
COPY . .

# Specify the command to run your application
# CMD [ "python", "app.py" ]
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "9090"]
# CMD ["uvicorn", "app.main:app", "--reload"]
25 comments
Normally you'd have your data stored in some hosted vector db, rather than saving locally (at least when you have more than a handful of data)
When I say index data, I mean files like:
  • default__vector_store.json,
  • docstore.json,
  • graph_store.json,
  • image__vector_store and
  • index_store.json
These are under the /storage folder.
Under /data I have the .txt files.
I'm using VectorStoreIndex, btw.
Plain Text
import os.path
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)

PERSIST_DIR = "/tmp/storage"


def get_docs_index():
    # check if storage already exists
    if not os.path.exists(PERSIST_DIR):
        # load the documents and create the index
        documents = SimpleDirectoryReader("data").load_data()
        index = VectorStoreIndex.from_documents(documents)
        # store it for later
        index.storage_context.persist(persist_dir=PERSIST_DIR)
    else:
        # load the existing index
        storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
        index = load_index_from_storage(storage_context)

    return index
Yea -- you don't need any of that if you use a hosted vector db integration (Qdrant, Weaviate, Pinecone, etc.)

Load times are essentially a no-op in this setup
OK, I will research that.
That's more meant for reading data and then putting it into an index.
I think I see what you mean. What's making things slow and more complex for me is that I have to save/load the index to disk. Using a cloud vector solution should improve it a lot.

I saw Qdrant has a free cloud option
but Chroma seems to be way more popular than Qdrant
though it has no cloud option
Qdrant is very nice tbh (it's what I would recommend trying anyways)

They have their own cloud option, and also stuff for deploying/hosting yourself too
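For reference, a rough sketch of what the hosted setup can look like once a collection has already been populated. The environment variable names and the collection name here are just placeholders, not anything from this thread:

Plain Text
import os

from qdrant_client import QdrantClient
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Connect to the hosted Qdrant Cloud cluster (URL and API key come from the dashboard)
client = QdrantClient(url=os.environ["QDRANT_URL"], api_key=os.environ["QDRANT_API_KEY"])

# Attach to a collection that was already populated -- nothing is re-read or
# re-embedded here, which is why load times are essentially a no-op
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
index = VectorStoreIndex.from_vector_store(vector_store)

# Query as usual
query_engine = index.as_query_engine()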
Hey @Logan M, sorry to keep bugging you. I have implemented cloud Qdrant. It's working, but the performance is worse than before.

Plain Text
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.vector_stores.qdrant import QdrantVectorStore


def get_qdrant_index():
    # get_qdrant_client() is my own helper that returns a QdrantClient for the cloud cluster
    client = get_qdrant_client()

    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()

    # embed the documents and write them into the Qdrant collection
    vector_store = QdrantVectorStore(client=client, collection_name="serraventura_cv")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
    )

    return index


To give some context: I'm building an API using uvicorn and FastAPI.

My Docker CMD:
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "9090"]

Is there any problem with uvicorn/FastAPI working with LlamaIndex?

Locally, without uvicorn/FastAPI, just executing the script takes 5 seconds. When I'm using uvicorn/FastAPI it takes 5 minutes or more.

The logs from my Docker container:

Plain Text
ort_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.27k/1.27k [00:00<00:00, 1.62MB/s]
config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 740/740 [00:00<00:00, 1.62MB/s]
special_tokens_map.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 695/695 [00:00<00:00, 1.71MB/s]
tokenizer_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.24k/1.24k [00:00<00:00, 2.63MB/s]
.gitattributes: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.52k/1.52k [00:00<00:00, 2.77MB/s]
vocab.txt: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 232k/232k [00:00<00:00, 2.77MB/s]
tokenizer.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 711k/711k [00:00<00:00, 2.22MB/s]
model_optimized.onnx: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 218M/218M [00:10<00:00, 20.9MB/s]
Fetching 8 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 8/8 [00:11<00:00,  1.41s/it]
2024-04-02 17:56:39 INFO:     Started server process [1]
2024-04-02 17:56:39 INFO:     Waiting for application startup.
2024-04-02 17:56:39 INFO:     Application startup complete.
2024-04-02 17:56:39 INFO:     Uvicorn running on http://0.0.0.0:9090 (Press CTRL+C to quit)
Seems like you are downloading a lot of files on startup -- this is unrelated to qdrant or the vector db
it seems to be coming from Settings.embed_model = FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")
I will test other models
Yes, this line is definitely the problem. Even choosing a smaller model (BAAI/bge-small-en-v1.5), it takes forever. I will need to review my approach with Qdrant.
thanks anyway πŸ™‚
You probably want to have the model cached inside your Docker image, otherwise it will always download on startup.
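One way to do that (a sketch, assuming the FastEmbed model is what's being pulled at startup) is to add a small warm-up script to the image and run it at build time, e.g. with RUN python warmup_model.py after the pip install step. The filename is just illustrative:

Plain Text
# warmup_model.py -- run once at image build time so the embedding model
# is baked into the image instead of downloaded on every container start
from llama_index.embeddings.fastembed import FastEmbedEmbedding

# Instantiating the embedding triggers the model download into the local cache
FastEmbedEmbedding(model_name="BAAI/bge-base-en-v1.5")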
That's a good idea, but I think it will only help with a serverless cold start. Locally my container keeps running, so the model is downloaded just once. New requests to the API don't download it again, and it's still super slow. My code is based on their documentation. The only difference is that I'm using cloud Qdrant.
I know it might be a skill issue, but the rest of the Qdrant docs are not helping either. I will give up on Qdrant for now.
Are you running embeddings on GPU? The only other thing slowing it down (in my opinion) is running models locally.
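If a GPU isn't an option on Cloud Run, a rough alternative sketch is to point LlamaIndex at a hosted embedding API instead of a local model (this assumes the OpenAI embeddings integration is installed; the model name is just an example):

Plain Text
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Embed via a hosted API instead of running a local ONNX model in the container:
# no model download at startup and no CPU-bound inference per request
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")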
Hi, thanks for the help. I ended up moving to the TypeScript lib and things are moving more smoothly.