Hi there, can someone help me debug why my app is so slow?

For a basic "Summarize this document" query, it takes 10 seconds before the streaming response even starts (for a doc that has only a few words of content and a single node):

Plain Text
@app.post("/document/query")
def query_stream(
    query: str = Body(...),
    uuid_filename: str = Body(...),
    email: str = Body(...),
) -> StreamingResponse:
    subscription = get_user_subscription(email)
    model = MODEL_BASIC if subscription == "FREE" else MODEL_PREMIUM
    with token_counter(model, query_stream.__name__):
        filename_without_ext = uuid_filename.split(".")[0]

        # Create index
        index = initialize_index(model)

        document_is_indexed = does_document_exist_in_index(filename_without_ext)

        if document_is_indexed is False:
            logging.info("Re-adding to index...")
            reindex_document(filename_without_ext)

        if is_summary_request(query):
            query = modify_query_for_summary(query, filename_without_ext, model)

        chat_engine = initialize_chat_engine(index, filename_without_ext)
        streaming_response = chat_engine.stream_chat(query) # takes 10 seconds!!

        def generate() -> Generator[str, Any, None]:
            yield from streaming_response.response_gen

        return StreamingResponse(generate(), media_type="text/plain")
For more info, this is how we init the chat engine:

Plain Text
def initialize_chat_engine(index: VectorStoreIndex, document_uuid: str) -> BaseChatEngine:
    """Initialize chat engine with chat history."""
    chat_history = get_chat_history(document_uuid)

    filters = MetadataFilters(
        filters=[ExactMatchFilter(key="doc_id", value=document_uuid)],
    )

    return index.as_chat_engine(
        chat_mode=ChatMode.CONTEXT,
        condense_question_prompt=PromptTemplate(CHAT_PROMPT_TEMPLATE),
        chat_history=chat_history,
        agent_chat_response_mode="StreamingAgentChatResponse",
        similarity_top_k=10,
        filters=filters,
    )


I think the slow part has something to do with the filters on large-ish databases, since our database contains more than 1M rows
Although 1M rows is a lot of data, I feel like it should still be quicker
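A quick way to test that hypothesis is to time the retrieval step (query embedding plus the filtered pgvector lookup) on its own, separate from the LLM call. A minimal sketch, reusing the index and filters objects from the snippets in this thread and assuming the legacy (pre-0.10) LlamaIndex API:

Plain Text
import time

# Sketch: time just the retrieval (embedding + filtered pgvector search),
# using the same top-k and metadata filter as the chat engine.
retriever = index.as_retriever(similarity_top_k=10, filters=filters)

start = time.perf_counter()
nodes = retriever.retrieve("Summarize this document")
print(f"retrieval took {time.perf_counter() - start:.2f}s, returned {len(nodes)} nodes")

If that call alone accounts for most of the 10 seconds, the filtered vector search is the bottleneck; if not, the delay is more likely in the chat-history/LLM side of stream_chat.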
You are indexing on every API call; that could be one reason the time increases.

You can also use the observability tools in LlamaIndex to identify exactly where the time is being spent.

That will give you a better idea of the reason.
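For reference, a minimal sketch of what that could look like with the built-in LlamaDebugHandler, assuming the legacy (pre-0.10) import paths that match the ServiceContext usage in this thread; the handler slots into the callback_manager already being passed in:

Plain Text
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

# Sketch: a debug handler that prints a trace of each query's events
# (embedding, retrieval, LLM calls) along with their timings.
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

# Pass this callback_manager into ServiceContext.from_defaults(...) as before;
# each query will then print a trace showing where the time goes.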
@WhiteFang_Jr Thanks! So is it bad practice to do this on every call?

Plain Text
service_context_basic = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model=MODEL_BASIC, timeout=180),
    callback_manager=callback_manager_basic,
    embed_model=embed_model,
    context_window=16385,
    chunk_size_limit=16385,
)

service_context_premium = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model=MODEL_PREMIUM, timeout=180),
    callback_manager=callback_manager_premium,
    embed_model=embed_model,
    context_window=128000,
    chunk_size_limit=128000,
)


def initialize_index(model_name: str = MODEL_BASIC) -> VectorStoreIndex:
    """Initialize the index.

    Args:
    ----
        model_name (str, optional): The model name. Defaults to MODEL_BASIC.

    Returns:
    -------
        VectorStoreIndex: The initialized index.

    """
    service_context = service_context_basic if model_name == MODEL_BASIC else service_context_premium

    return VectorStoreIndex(
        nodes=[],
        storage_context=storage_context,
        service_context=service_context,
        use_async=True,
    )
@Logan M Maybe you can help and check if we're making a big mistake here? 😦
premium πŸ’Έ

This seems fine to me. You aren't actually indexing anything, so it's mostly a no-op
I guess the storage context is a global var?
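Presumably something like the following at module level, so the pgvector connection and storage context are built once per process rather than per request (a hedged sketch with placeholder connection parameters, assuming PGVectorStore since pgvector comes up later in the thread):

Plain Text
from llama_index import StorageContext
from llama_index.vector_stores import PGVectorStore

# Sketch: module-level (global) vector store + storage context, created once.
vector_store = PGVectorStore.from_params(
    database="appdb",          # placeholder connection params
    host="localhost",
    password="...",
    port="5432",
    user="postgres",
    table_name="documents",
    embed_dim=1536,            # assumes OpenAI ada-002 embeddings
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)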
Summary queries could be slow, because it seems like you are using the LLM to modify the query?
We're using different LLMs based on the user
The reason we need different service contexts is that we need to switch out the context windows and chunk size limits
Is there a way to actually capture all of the queries that LlamaIndex makes to our pgvector under the hood?
yea that makes sense, I just had a chuckle at the naming :p
probably by watching network requests, or using some observability tool like arize?
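Another option that avoids sniffing network traffic: the pgvector store talks to Postgres through SQLAlchemy (an assumption worth double-checking for your setup), so enabling SQLAlchemy's statement logging prints every SQL query LlamaIndex issues:

Plain Text
import logging

# Sketch: echo every SQL statement SQLAlchemy sends to Postgres/pgvector.
logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)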
This tool is actually very handy https://httptoolkit.com
Ahh thanks, that's a good one
Now I hate my life because we use the Docker Compose network :bee_sad:
Ah okay! Since you are not indexing anything, it's fine.

Actually, in the code you initially shared, only the method reference was present, which made me think you were initializing the index with documents on every call. That's why I said it might be the reason the time usage was increasing
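To make that distinction concrete, a hedged sketch of the two patterns being discussed (reusing the names from the snippets above):

Plain Text
# Pattern 1: "indexing on every call" - embeds and inserts documents each time (slow).
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)

# Pattern 2: what initialize_index effectively does - attach to the already
# populated vector store without re-embedding anything (cheap, mostly a no-op).
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)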