Hi there, can someone help me debug why my app is so slow?

For a basic "Summarize this document" query, it takes 10 seconds before the streaming response even starts (for a doc that has only a few words of content and a single node):

Plain Text
@app.post("/document/query")
def query_stream(
    query: str = Body(...),
    uuid_filename: str = Body(...),
    email: str = Body(...),
) -> StreamingResponse:
    subscription = get_user_subscription(email)
    model = MODEL_BASIC if subscription == "FREE" else MODEL_PREMIUM
    with token_counter(model, query_stream.__name__):
        filename_without_ext = uuid_filename.split(".")[0]

        # Create index
        index = initialize_index(model)

        document_is_indexed = does_document_exist_in_index(filename_without_ext)

        if document_is_indexed is False:
            logging.info("Re-adding to index...")
            reindex_document(filename_without_ext)

        if is_summary_request(query):
            query = modify_query_for_summary(query, filename_without_ext, model)

        chat_engine = initialize_chat_engine(index, filename_without_ext)
        streaming_response = chat_engine.stream_chat(query) # takes 10 seconds!!

        def generate() -> Generator[str, Any, None]:
            yield from streaming_response.response_gen

        return StreamingResponse(generate(), media_type="text/plain")
For more info, this is how we init the chat engine:

Plain Text
def initialize_chat_engine(index: VectorStoreIndex, document_uuid: str) -> BaseChatEngine:
    """Initialize chat engine with chat history."""
    chat_history = get_chat_history(document_uuid)

    filters = MetadataFilters(
        filters=[ExactMatchFilter(key="doc_id", value=document_uuid)],
    )

    return index.as_chat_engine(
        chat_mode=ChatMode.CONTEXT,
        condense_question_prompt=PromptTemplate(CHAT_PROMPT_TEMPLATE),
        chat_history=chat_history,
        agent_chat_response_mode="StreamingAgentChatResponse",
        similarity_top_k=10,
        filters=filters,
    )


I think the slow part has something to do with the filters on large-ish databases, since our database contains more than 1M rows
Although 1M rows is a lot of data, I feel like it should still be quicker
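A quick way to test that hypothesis is to time the retrieval step (query embedding plus the filtered pgvector lookup) on its own, separate from the LLM call. A minimal sketch, reusing the index and filters objects from the snippets in this thread and assuming the legacy (pre-0.10) LlamaIndex API:

Plain Text
import time

# Sketch: time just the retrieval (embedding + filtered pgvector search),
# using the same top-k and metadata filter as the chat engine.
retriever = index.as_retriever(similarity_top_k=10, filters=filters)

start = time.perf_counter()
nodes = retriever.retrieve("Summarize this document")
print(f"retrieval took {time.perf_counter() - start:.2f}s, returned {len(nodes)} nodes")

If that call alone accounts for most of the 10 seconds, the filtered vector search is the bottleneck; if not, the delay is more likely in the chat-history/LLM side of stream_chat.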
You are indexing on every API call; that could be one reason the time increases.

You can also use the observability tools in LlamaIndex to identify exactly where the time is being spent.

That will give you a better idea of the reason.
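For reference, a minimal sketch of what that could look like with the built-in LlamaDebugHandler, assuming the legacy (pre-0.10) import paths that match the ServiceContext usage in this thread; the handler slots into the callback_manager already being passed in:

Plain Text
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

# Sketch: a debug handler that prints a trace of each query's events
# (embedding, retrieval, LLM calls) along with their timings.
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

# Pass this callback_manager into ServiceContext.from_defaults(...) as before;
# each query will then print a trace showing where the time goes.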
@WhiteFang_Jr Thanks! So is it bad practice to do this on every call?

Plain Text
service_context_basic = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model=MODEL_BASIC, timeout=180),
    callback_manager=callback_manager_basic,
    embed_model=embed_model,
    context_window=16385,
    chunk_size_limit=16385,
)

service_context_premium = ServiceContext.from_defaults(
    llm=OpenAI(temperature=0, model=MODEL_PREMIUM, timeout=180),
    callback_manager=callback_manager_premium,
    embed_model=embed_model,
    context_window=128000,
    chunk_size_limit=128000,
)


def initialize_index(model_name: str = MODEL_BASIC) -> VectorStoreIndex:
    """Initialize the index.

    Args:
    ----
        model_name (str, optional): The model name. Defaults to MODEL_BASIC.

    Returns:
    -------
        VectorStoreIndex: The initialized index.

    """
    service_context = service_context_basic if model_name == MODEL_BASIC else service_context_premium

    return VectorStoreIndex(
        nodes=[],
        storage_context=storage_context,
        service_context=service_context,
        use_async=True,
    )
@Logan M Maybe you can help and check if we're making a big mistake here? 😦
premium πŸ’Έ

This seems fine to me. You aren't actually indexing anything, so it's mostly a no-op
I guess the storage context is a global var?
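Presumably something like the following at module level, so the pgvector connection and storage context are built once per process rather than per request (a hedged sketch with placeholder connection parameters, assuming PGVectorStore since pgvector comes up later in the thread):

Plain Text
from llama_index import StorageContext
from llama_index.vector_stores import PGVectorStore

# Sketch: module-level (global) vector store + storage context, created once.
vector_store = PGVectorStore.from_params(
    database="appdb",          # placeholder connection params
    host="localhost",
    password="...",
    port="5432",
    user="postgres",
    table_name="documents",
    embed_dim=1536,            # assumes OpenAI ada-002 embeddings
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)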
Summary queries could be slow, because it seems like you are using the LLM to modify the query?
We're using different LLMs based on the user
The reason we need different service contexts is that we need to switch out the context windows and chunk size limits
Is there a way to actually capture all of the queries that LlamaIndex makes to our pgvector under the hood?
yea that makes sense, I just had a chuckle at the naming :p
probably by watching network requests, or using some observability tool like arize?
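Another option that avoids sniffing network traffic: the pgvector store talks to Postgres through SQLAlchemy (an assumption worth double-checking for your setup), so enabling SQLAlchemy's statement logging prints every SQL query LlamaIndex issues:

Plain Text
import logging

# Sketch: echo every SQL statement SQLAlchemy sends to Postgres/pgvector.
logging.basicConfig()
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)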
This tool is actually very handy https://httptoolkit.com
Ahh thanks, that's a good one
Now I hate my life because we use the Docker Compose network :bee_sad:
Ah okay! Since you are not indexing anything, it's fine.

Actually, in the code you initially shared, only the method reference was present, which made me think you were initializing the index with documents on every call. That's why I said it might be the reason the time usage was increasing
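To make that distinction concrete, a hedged sketch of the two patterns being discussed (reusing the names from the snippets above):

Plain Text
# Pattern 1: "indexing on every call" - embeds and inserts documents each time (slow).
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)

# Pattern 2: what initialize_index effectively does - attach to the already
# populated vector store without re-embedding anything (cheap, mostly a no-op).
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)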