Is there any way I can improve the performance of the retrieve call LlamaIndex makes? It is very slow in our chat application in production:

CBEventType.RETRIEVE -> 7.412136 seconds
You might have to give some more details lol
Are you using a vector db? Do you have a retrieval step beyond just grabbing top-k chunks? Are you using a custom retriever?
I think we have quite a basic setup: MongoDB as the document and index store, and pgvector on Azure as the vector store.
How we initialise the index:

Plain Text
embed_model = OpenAIEmbedding(
    api_key=OPENAI_API_KEY, temperature=0, model=EMBEDDING_MODEL,
)

callback_manager_basic = CallbackManager([
    LlamaDebugHandler(print_trace_on_end=True),
    get_token_counter(MODEL_BASIC),
])

callback_manager_premium = CallbackManager([
    LlamaDebugHandler(print_trace_on_end=True),
    get_token_counter(MODEL_PREMIUM),
])

service_context_basic = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model=MODEL_BASIC, timeout=180),
    callback_manager=callback_manager_basic,
    embed_model=embed_model,
    context_window=16385,
    chunk_size_limit=16385,
)

service_context_premium = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model=MODEL_PREMIUM, timeout=180),
    callback_manager=callback_manager_premium,
    embed_model=embed_model,
    context_window=128000,
    chunk_size_limit=128000,
)


def initialize_index(model_name: str = MODEL_BASIC) -> VectorStoreIndex:
    """Initialize the index.

    Args:
    ----
        model_name (str, optional): The model name. Defaults to MODEL_BASIC.

    Returns:
    -------
        VectorStoreIndex: The initialized index.

    """
    service_context = service_context_basic if model_name == MODEL_BASIC else service_context_premium

    vector_store = PGVectorStore.from_params(
        async_connection_string=f"postgresql+asyncpg://{user}:{password}@{host}:{port}/{database}",
        connection_string=f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require",
        table_name=PG_VECTOR_DATABASE_DOC_TABLE_NAME,
        embed_dim=1536,
        hybrid_search=True,
    )

    storage_context = StorageContext.from_defaults(
        docstore=document_store,
        index_store=index_store,
        vector_store=vector_store,
    )

    return VectorStoreIndex(
        nodes=[],
        storage_context=storage_context,
        service_context=service_context,
        use_async=True,
    )
How we init chat engine:

Plain Text
def initialize_chat_engine(index: VectorStoreIndex, document_uuid: str) -> BaseChatEngine:
    """Initialize chat engine with chat history."""
    chat_history = get_chat_history(document_uuid)

    filters = MetadataFilters(
        filters=[ExactMatchFilter(key="doc_id", value=document_uuid)],
    )

    return index.as_chat_engine(
        chat_mode=ChatMode.CONTEXT,
        condense_question_prompt=PromptTemplate(CHAT_PROMPT_TEMPLATE),
        chat_history=chat_history,
        agent_chat_response_mode="StreamingAgentChatResponse",
        similarity_top_k=10,
        filters=filters,
    )
And then we just query it.
Is there a way to debug what is happening under the hood?
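For reference, one low-effort option is to keep a handle on the LlamaDebugHandler that is already in the CallbackManager above and pull per-event timings from it after a chat turn. A minimal sketch, assuming the legacy llama_index.callbacks API:

Plain Text
from llama_index.callbacks import CBEventType, LlamaDebugHandler

# Keep a handle on the debug handler (instead of constructing it inline in the
# CallbackManager) so its recorded events can be inspected after a chat turn.
llama_debug = LlamaDebugHandler(print_trace_on_end=True)

# ... after running a chat turn through the engine:
retrieve_stats = llama_debug.get_event_time_info(CBEventType.RETRIEVE)
llm_stats = llama_debug.get_event_time_info(CBEventType.LLM)
print(f"retrieve: {retrieve_stats.total_secs:.2f}s across {retrieve_stats.total_count} event(s)")
print(f"llm:      {llm_stats.total_secs:.2f}s across {llm_stats.total_count} event(s)")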
Hmmm. I wonder if the hybrid mode on pgvector just really sucks (that wouldn't surprise me lol)

You can test this by attempting to query the vector store object directly

Plain Text
from llama_index.vector_stores.types import VectorStoreQuery

query = VectorStoreQuery(
    query_embedding=embed_model.get_query_embedding("my query"),
    similarity_top_k=10,
    filters=filters,
)

res = vector_store.query(query)
Hey @Logan M, thanks so much for the tip! I just checked and querying directly also takes 3 seconds (which is way better than 8 already luckily).

With this code:

Plain Text
def query_directly(uuid: str):
    database = PG_VECTOR_DATABASE_NAME
    host = PG_VECTOR_DATABASE_HOST
    password = PG_VECTOR_DATABASE_PASSWORD
    port = PG_VECTOR_DATABASE_PORT
    user = PG_VECTOR_DATABASE_USER
    vector_store = PGVectorStore.from_params(
        async_connection_string=f"postgresql+asyncpg://{user}:{password}@{host}:{port}/{database}",
        connection_string=f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require",
        table_name=PG_VECTOR_DATABASE_DOC_TABLE_NAME,
        embed_dim=1536,
    )
    query = VectorStoreQuery(
        query_embedding=embed_model.get_query_embedding("Summarize this document for me"),
        similarity_top_k=10,
        filters=MetadataFilters(
            filters=[ExactMatchFilter(key="doc_id", value=uuid)],
        )
    )

    # calculate time for query
    start_time = time.time()
    res = vector_store.query(query)
    end_time = time.time()
    duration = end_time - start_time
    print(f"Query time: {duration} seconds.")
    return res


Plain Text
web-1          | Query time: 3.0898852348327637 seconds.


What would you suggest in order to improve the performance? Ideally the OpenAI stream would start as soon as possible.
With the top k that high, it is likely making multiple LLM calls to refine an answer. The stream can't start until the last LLM call 👀
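For illustration, a minimal sketch of the two knobs that matter here: a smaller similarity_top_k and a streamed chat call, so tokens surface as soon as the final LLM call starts producing them. It reuses index, chat_history, and filters from the snippets above; the top_k value and the question string are placeholders:

Plain Text
# Fewer retrieved chunks -> less context to assemble before the LLM call;
# stream_chat yields tokens as the final LLM call generates them.
chat_engine = index.as_chat_engine(
    chat_mode=ChatMode.CONTEXT,
    chat_history=chat_history,
    similarity_top_k=3,  # placeholder: tune per use case
    filters=filters,
)

streaming_response = chat_engine.stream_chat("Summarize this document for me")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)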
Tbh more than one person has pointed out that our PGVector implementation isn't using HNSW -- if you have a lot of data, the retrieval time could likely be improved if that was fixed
3s is quite long tbh. Using qdrant for example, with a huge amount of data, my retrieval time is about 1sec or less
With top_k=1 it is also 3sec...
Hmm okay. I do think we have a lot of data in our DB so that could be the case.
yea HNSW would probably help here then -- I'm just a pgvector noob in terms of syntax lol
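For reference, a hedged sketch of adding the HNSW index by hand on the Postgres side. It assumes pgvector >= 0.5.0 and the llama_index defaults of a data_-prefixed table with an embedding column; verify the real table and column names in your database before running anything like this:

Plain Text
from sqlalchemy import create_engine, text

# Assumption: llama_index created the table as data_<table_name> with an
# "embedding" vector column; double-check with \d in psql first.
engine = create_engine(
    f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require"
)
with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_doc_embedding_hnsw "
        f"ON data_{PG_VECTOR_DATABASE_DOC_TABLE_NAME} "
        "USING hnsw (embedding vector_cosine_ops)"
    ))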
in any case, 3s it is for now I guess 😅
I would be curious if you had any suggestions on which vector store to use in terms of performance and pricing for cloud-based solutions. We also tried Pinecone which ended up being SUPER expensive.
pinecone is super pricey yea
We are looking for a managed solution ideally on Azure to cut our costs (free creds :P)
qdrant is pretty nice tbh -- hosted version, and you can also self deploy to k8s etc.
Thanks a lot! Will try that. Hopefully it is a drop in replacement (it never is lol).
Well, you'd have to reindex your data, but beyond that, it shouuuuld be drop in? 🙏
So difficult with no real vector db background lol. Started a cloud instance and don't even know how to make a collection lol
llama-index should create the collection for you when you insert something 👀
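For reference, a rough sketch of pointing the existing setup at a hosted Qdrant cluster instead; the cluster URL, API key constant, and collection name are placeholders, and the import paths assume the same 0.9.x-era llama_index used in the snippets above:

Plain Text
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Placeholders: point these at your Qdrant Cloud cluster.
client = qdrant_client.QdrantClient(
    url="https://YOUR-CLUSTER.qdrant.io",
    api_key=QDRANT_API_KEY,
)

# The collection is created on the first insert if it does not exist yet.
vector_store = QdrantVectorStore(client=client, collection_name="documents")

storage_context = StorageContext.from_defaults(
    docstore=document_store,
    index_store=index_store,
    vector_store=vector_store,
)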
hey @Logan M quick question on this one from my end too. As @Niels said we switched over to PGVector on Azure Postgres. Last night shit hit the fan. The database grew to over 300k rows and simple queries were taking forever (timing out after 5 mins). I believe it's because of the metadata filter, which sifts through the metadata_ JSON column. I think I mentioned to you before that previously we had a table per document (sort of as a different namespace). You actually recommended to use the metadata filter instead and store it all in one table. I'm wondering if it's maybe better to use the "one table per doc" approach. Let me know your thoughts.
I don't know a ton about postgres optimizations -- I'm guessing there is a few things that could be done to optimize this
a) not using a JSON blob for metadata, instead having actual columns (this means the table schema has to be defined upfront though -- not sure at all how to do this with sqlalchemy)
b) as you said, one table per doc could be fine, I'm unsure how postgres scales with number of tables though. At least for other vector dbs, they advocate against this approach (which is where the intuition came from), but other vector dbs are likely more optimized than postgres I guess
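One middle ground worth noting here (a sketch, not something llama_index sets up for you): keep the single table but add a B-tree expression index on the metadata key the filter matches against, so the doc_id lookup no longer scans the whole metadata_ JSON column. The table name again assumes the llama_index data_ prefix:

Plain Text
from sqlalchemy import create_engine, text

engine = create_engine(
    f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}?sslmode=require"
)
with engine.begin() as conn:
    # Expression index on the doc_id key inside the metadata_ JSON column,
    # which is what the ExactMatchFilter above compares against.
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS idx_doc_metadata_doc_id "
        f"ON data_{PG_VECTOR_DATABASE_DOC_TABLE_NAME} ((metadata_->>'doc_id'))"
    ))
Whether Postgres actually uses the index depends on the exact expression llama_index generates for the filter, so it is worth confirming with EXPLAIN before relying on it.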