Is there any way I can improve the performance of the retrieve call LlamaIndex makes? It is very slow in our chat application in production:

CBEventType.RETRIEVE -> 7.412136 seconds
You might have to give some more details lol
Are you using a vector db? Do you have a retrieval step beyond just grabbing top-k chunks? Are you using a custom retriever?
I think we have quite a basic setup: MongoDB as the document and index store, and pgvector on Azure as the vector store.
How we initialise the index:

Plain Text
embed_model = OpenAIEmbedding(
    api_key=OPENAI_API_KEY, temperature=0, model=EMBEDDING_MODEL,
)

callback_manager_basic = CallbackManager([
    LlamaDebugHandler(print_trace_on_end=True),
    get_token_counter(MODEL_BASIC),
])

callback_manager_premium = CallbackManager([
    LlamaDebugHandler(print_trace_on_end=True),
    get_token_counter(MODEL_PREMIUM),
])

service_context_basic = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model=MODEL_BASIC, timeout=180),
    callback_manager=callback_manager_basic,
    embed_model=embed_model,
    context_window=16385,
    chunk_size_limit=16385,
)

service_context_premium = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model=MODEL_PREMIUM, timeout=180),
    callback_manager=callback_manager_premium,
    embed_model=embed_model,
    context_window=128000,
    chunk_size_limit=128000,
)


def initialize_index(model_name: str = MODEL_BASIC) -> VectorStoreIndex:
    """Initialize the index.

    Args:
    ----
        model_name (str, optional): The model name. Defaults to MODEL_BASIC.

    Returns:
    -------
        VectorStoreIndex: The initialized index.

    """
    service_context = service_context_basic if model_name == MODEL_BASIC else service_context_premium

    vector_store = PGVectorStore.from_params(
        async_connection_string=f"postgresql+asyncpg://{user}:{password}@{host}:{port}/{database}",
        connection_string=f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require",
        table_name=PG_VECTOR_DATABASE_DOC_TABLE_NAME,
        embed_dim=1536,
        hybrid_search=True,
    )

    storage_context = StorageContext.from_defaults(
        docstore=document_store,
        index_store=index_store,
        vector_store=vector_store,
    )

    return VectorStoreIndex(
        nodes=[],
        storage_context=storage_context,
        service_context=service_context,
        use_async=True,
    )
How we init chat engine:

Plain Text
def initialize_chat_engine(index: VectorStoreIndex, document_uuid: str) -> BaseChatEngine:
    """Initialize chat engine with chat history."""
    chat_history = get_chat_history(document_uuid)

    filters = MetadataFilters(
        filters=[ExactMatchFilter(key="doc_id", value=document_uuid)],
    )

    return index.as_chat_engine(
        chat_mode=ChatMode.CONTEXT,
        condense_question_prompt=PromptTemplate(CHAT_PROMPT_TEMPLATE),
        chat_history=chat_history,
        agent_chat_response_mode="StreamingAgentChatResponse",
        similarity_top_k=10,
        filters=filters,
    )
And then we just query it.
Is there a way to debug what is happening under the hood?
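For reference, one low-effort option is to keep a handle on the LlamaDebugHandler that is already in the CallbackManager above and pull per-event timings from it after a chat turn. A minimal sketch, assuming the legacy llama_index.callbacks API:

Plain Text
from llama_index.callbacks import CBEventType, LlamaDebugHandler

# Keep a handle on the debug handler (instead of constructing it inline in the
# CallbackManager) so its recorded events can be inspected after a chat turn.
llama_debug = LlamaDebugHandler(print_trace_on_end=True)

# ... after running a chat turn through the engine:
retrieve_stats = llama_debug.get_event_time_info(CBEventType.RETRIEVE)
llm_stats = llama_debug.get_event_time_info(CBEventType.LLM)
print(f"retrieve: {retrieve_stats.total_secs:.2f}s across {retrieve_stats.total_count} event(s)")
print(f"llm:      {llm_stats.total_secs:.2f}s across {llm_stats.total_count} event(s)")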
Hmmm. I wonder if the hybrid mode on pgvector just really sucks (that wouldn't surprise me lol)

You can test this by attempting to query the vector store object directly

Plain Text
from llama_index.vector_stores.types import VectorStoreQuery

query = VectorStoreQuery(
    query_embedding=embed_model.get_query_embedding("my query"),
    similarity_top_k=10,
    filters=filters,
)

res = vector_store.query(query)
Hey @Logan M, thanks so much for the tip! I just checked and querying directly also takes 3 seconds (which is way better than 8 already luckily).

With this code:

Plain Text
def query_directly(uuid: str):
    database = PG_VECTOR_DATABASE_NAME
    host = PG_VECTOR_DATABASE_HOST
    password = PG_VECTOR_DATABASE_PASSWORD
    port = PG_VECTOR_DATABASE_PORT
    user = PG_VECTOR_DATABASE_USER
    vector_store = PGVectorStore.from_params(
        async_connection_string=f"postgresql+asyncpg://{user}:{password}@{host}:{port}/{database}",
        connection_string=f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require",
        table_name=PG_VECTOR_DATABASE_DOC_TABLE_NAME,
        embed_dim=1536,
    )
    query = VectorStoreQuery(
        query_embedding=embed_model.get_query_embedding("Summarize this document for me"),
        similarity_top_k=10,
        filters=MetadataFilters(
            filters=[ExactMatchFilter(key="doc_id", value=uuid)],
        )
    )

    # calculate time for query
    start_time = time.time()
    res = vector_store.query(query)
    end_time = time.time()
    duration = end_time - start_time
    print(f"Query time: {duration} seconds.")
    return res


Plain Text
web-1          | Query time: 3.0898852348327637 seconds.


What would you suggest in order to improve the performance? Ideally the OpenAI stream would start as soon as possible.
With the top k that high, it is likely making multiple LLM calls to refine an answer. The stream can't start until the last LLM call 👀
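For illustration, a minimal sketch of the two knobs that matter here: a smaller similarity_top_k and a streamed chat call, so tokens surface as soon as the final LLM call starts producing them. It reuses index, chat_history, and filters from the snippets above; the top_k value and the question string are placeholders:

Plain Text
# Fewer retrieved chunks -> less context to assemble before the LLM call;
# stream_chat yields tokens as the final LLM call generates them.
chat_engine = index.as_chat_engine(
    chat_mode=ChatMode.CONTEXT,
    chat_history=chat_history,
    similarity_top_k=3,  # placeholder: tune per use case
    filters=filters,
)

streaming_response = chat_engine.stream_chat("Summarize this document for me")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)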
Tbh more than one person has pointed out that our PGVector implementation isn't using HNSW -- if you have a lot of data, the retrieval time could likely be improved if that was fixed
3s is quite long tbh. Using qdrant for example, with a huge amount of data, my retrieval time is about 1sec or less
With top_k=1 it is also 3sec...
Hmm okay. I do think we have a lot of data in our DB so that could be the case.
yea HNSW would probably help here then -- I'm just a pgvector noob in terms of syntax lol
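For reference, a hedged sketch of adding the HNSW index by hand on the Postgres side. It assumes pgvector >= 0.5.0 and the llama_index defaults of a data_-prefixed table with an embedding column; verify the real table and column names in your database before running anything like this:

Plain Text
from sqlalchemy import create_engine, text

# Assumption: llama_index created the table as data_<table_name> with an
# "embedding" vector column; double-check with \d in psql first.
engine = create_engine(
    f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require"
)
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_doc_embedding_hnsw "
        f"ON data_{PG_VECTOR_DATABASE_DOC_TABLE_NAME} "
        "USING hnsw (embedding vector_cosine_ops)"
    ))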
in any case, 3s it is for now I guess 😅
I would be curious if you had any suggestions on which vector store to use in terms of performance and pricing for cloud-based solutions. We also tried Pinecone which ended up being SUPER expensive.
pinecone is super pricey yea
We are looking for a managed solution ideally on Azure to cut our costs (free creds :P)
qdrant is pretty nice tbh -- hosted version, and you can also self deploy to k8s etc.
Thanks a lot! Will try that. Hopefully it is a drop in replacement (it never is lol).
Well, you'd have to reindex your data, but beyond that, it shouuuuld be drop in? 🙏
So difficult with no real vector db background lol. Started a cloud instance and don't even know how to make a collection lol
llama-index should create the collection for you when you insert something 👀
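For reference, a rough sketch of pointing the existing setup at a hosted Qdrant cluster instead; the cluster URL, API key constant, and collection name are placeholders, and the import paths assume the same 0.9.x-era llama_index used in the snippets above:

Plain Text
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Placeholders: point these at your Qdrant Cloud cluster.
client = qdrant_client.QdrantClient(
    url="https://YOUR-CLUSTER.qdrant.io",
    api_key=QDRANT_API_KEY,
)

# The collection is created on the first insert if it does not exist yet.
vector_store = QdrantVectorStore(client=client, collection_name="documents")

storage_context = StorageContext.from_defaults(
    docstore=document_store,
    index_store=index_store,
    vector_store=vector_store,
)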
hey @Logan M quick question on this one from my end too. As @Niels said we switched over to PGVector on Azure Postgres. Last night shit hit the fan. The database grew to over 300k rows and simple queries were taking forever (timing out after 5 mins). I believe it's because of the metadata filter, which sifts through the metadata_ JSON column. I think I mentioned to you before that previously we had a table per document (sort of as a different namespace). You actually recommended to use the metadata filter instead and store it all in one table. I'm wondering if it's maybe better to use the "one table per doc" approach. Let me know your thoughts.
I don't know a ton about postgres optimizations -- I'm guessing there is a few things that could be done to optimize this
a) not using a JSON blob for metadata, instead having actual columns (this means the table schema has to be defined upfront though -- not sure at all how to do this with sqlalchemy)
b) as you said, one table per doc could be fine, I'm unsure how postgres scales with number of tables though. At least for other vector dbs, they advocate against this approach (which is where the intuition came from), but other vector dbs are likely more optimized than postgres I guess
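One middle ground worth noting here (a sketch, not something llama_index sets up for you): keep the single table but add a B-tree expression index on the metadata key the filter matches against, so the doc_id lookup no longer scans the whole metadata_ JSON column. The table name again assumes the llama_index data_ prefix:

Plain Text
from sqlalchemy import create_engine, text

engine = create_engine(
    f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require"
)
with engine.begin() as conn:
    # Expression index on the doc_id key inside the metadata_ JSON column,
    # which is what the ExactMatchFilter above compares against.
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_doc_metadata_doc_id "
        f"ON data_{PG_VECTOR_DATABASE_DOC_TABLE_NAME} ((metadata_->>'doc_id'))"
    ))
Whether Postgres actually uses the index depends on the exact expression llama_index generates for the filter, so it is worth confirming with EXPLAIN before relying on it.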