Hey,

I built a small chatbot application for a university project with LlamaIndex and it went pretty well. Thanks for contributing to and building LlamaIndex ❤️
I have one question, though: what can I, as a developer, do about the response times of the API? I use the Vercel chat API for the frontend, and the loading times can be 20-30 seconds, which a lot of users have complained about.
Anything you can do to give users a sense that things are still working behind the scenes helps.

Streaming the final response is a good option. If you build an event handler, you can also stream some select internal events to give the user a sense of progress as well.
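For illustration, a rough sketch of both ideas, assuming an already built index and an OpenAI key; the handler, the engine type, and the question are placeholders rather than what the template actually does:
Python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation.events import BaseEvent


class ProgressHandler(BaseEventHandler):
    """Surface selected internal events (retrieval, LLM calls) as progress hints."""

    @classmethod
    def class_name(cls) -> str:
        return "ProgressHandler"

    def handle(self, event: BaseEvent, **kwargs) -> None:
        # In a real app, push event.class_name() to the frontend instead of printing.
        print(f"[progress] {event.class_name()}")


get_dispatcher().add_event_handler(ProgressHandler())

# Tiny placeholder index; the real app would already have one.
index = VectorStoreIndex.from_documents([Document(text="Lecture 3 covers sorting algorithms.")])

# Stream the final answer token by token so the UI can start rendering immediately.
chat_engine = index.as_chat_engine()
streaming_response = chat_engine.stream_chat("What does lecture 3 cover?")
for token in streaming_response.response_gen:
    print(token, end="", flush=True)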
I'm using the default template from create-llama (or rather building on that; I've only modified it minimally). I already use the streaming response, but it still takes a long time until the first characters arrive. I don't know how the internals of useChat from Vercel AI work and whether there are performance gains to be had there.
The default template shows a loading icon before the first characters are printed. But according to the students, the problem is that the loading itself takes too long. They don't want to wait 30+ seconds for a response because it's quicker to search with Ctrl+F.
I just wanted to know what kind of optimizations can be made here.
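One way to see where the 20-30 seconds actually go is to measure time to first token separately from total time. A rough sketch, assuming `chat_engine` is whatever engine the template builds (the question string is made up):
Python
import time

start = time.perf_counter()
streaming_response = chat_engine.stream_chat("Example question a student would ask")

first_token_time = None
answer_parts = []
for token in streaming_response.response_gen:
    if first_token_time is None:
        # Everything before this point is retrieval / agent tool selection / prompt building.
        first_token_time = time.perf_counter() - start
    answer_parts.append(token)

total_time = time.perf_counter() - start
print(f"time to first token: {first_token_time:.2f}s, total: {total_time:.2f}s")
# A large time to first token points at retrieval or agent overhead;
# a large gap between the two numbers points at slow generation by the LLM itself.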
Are you using OpenAI? Some other model?

I can't remember either what the default chat engine is. If it's an agent, that certainly makes sense. If it's just a context chat engine, it should likely be faster (but could be limited by the LLM being used).
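For example, a hedged sketch of swapping the agent for a context chat engine; `index` is assumed to be the project's existing index, and the model and parameters are placeholders:
Python
from llama_index.core.chat_engine import ContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

# One retrieval + one LLM call per turn, instead of an agent that first has to
# decide on tool calls, so the first streamed token usually arrives sooner.
chat_engine = ContextChatEngine.from_defaults(
    retriever=index.as_retriever(similarity_top_k=3),
    llm=OpenAI(model="gpt-3.5-turbo"),
    memory=ChatMemoryBuffer.from_defaults(token_limit=3000),
)

# Equivalent shortcut:
# chat_engine = index.as_chat_engine(chat_mode="context")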
I use the OpenAIAgent; maybe I should swap to the context chat engine. I think I already tried to swap, but I had a different problem with the context chat engine; I don't remember what was wrong there.
And what do you mean by limited?
Maybe you remember it, since you commented on it: my friend and I wrote the custom synthesizer that I'm using.
https://github.com/run-llama/llama_index/pull/14439#issuecomment-2195513666

I extended this by filtering the nodes to fit within a configurable context window size.
Python
import os
import logging
from typing import Any, AsyncGenerator, List, Optional, Sequence

import tiktoken

from llama_index.core.prompts.mixin import PromptDictType
from llama_index.core.response_synthesizers.base import BaseSynthesizer
from llama_index.core.types import RESPONSE_TEXT_TYPE
from llama_index.core.base.response.schema import (
    RESPONSE_TYPE,
)
from llama_index.core.schema import (
    MetadataMode,
    NodeWithScore,
    QueryType,
)
import llama_index.core.instrumentation as instrument

dispatcher = instrument.get_dispatcher(__name__)

logger = logging.getLogger(__name__)

QueryTextType = QueryType

# Default value for SOURCE_CONTEXT_WINDOW
DEFAULT_SOURCE_CONTEXT_WINDOW = int(os.getenv("SOURCE_CONTEXT_WINDOW", "4096"))

async def empty_response_agenerator() -> AsyncGenerator[str, None]:
    yield "Empty Response"

class NoLLM(BaseSynthesizer):
    """Synthesizer that skips the LLM and returns the retrieved chunks directly."""

    def __init__(self, separator: str = "\n\n---\n\n", source_context_window: int = DEFAULT_SOURCE_CONTEXT_WINDOW):
        super().__init__()
        self.separator = separator
        self.source_context_window = source_context_window
        # Tokenizer used only to count tokens when filtering source nodes.
        self.encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")
    
    def _get_prompts(self) -> PromptDictType:
        """Get prompts (none are used)."""
        return {}

    def _update_prompts(self, prompts_dict: PromptDictType) -> None:
        """Update prompts (no-op)."""

    def get_response(
        self,
        query_str: str,
        text_chunks: Sequence[str],
        **response_kwargs: Any,
    ) -> RESPONSE_TEXT_TYPE:
        # No LLM call: just join the retrieved chunks with the separator.
        return self.separator.join(text_chunks)

    async def aget_response(
        self,
        query_str: str,
        text_chunks: Sequence[str],
        **response_kwargs: Any,
    ) -> RESPONSE_TEXT_TYPE:
        return self.separator.join(text_chunks)
Python
    def _filter_response_and_nodes(self, source_nodes: List[NodeWithScore]) -> List[NodeWithScore]:
        """Keep the highest-scoring nodes that fit within the source context window."""
        if not source_nodes:
            return []

        sorted_nodes = sorted(source_nodes, key=lambda node: node.score or 0.0, reverse=True)

        separator_tokens = self.encoder.encode(self.separator)
        separator_token_count = len(separator_tokens)

        # Always keep the best node, then add more nodes while the token budget allows.
        highest_score_node = sorted_nodes[0]
        total_tokens = self.encoder.encode(highest_score_node.node.get_content(metadata_mode=MetadataMode.LLM))

        filtered_nodes = [highest_score_node]
        token_count = len(total_tokens)

        for node in sorted_nodes[1:]:
            node_tokens = self.encoder.encode(node.node.get_content(metadata_mode=MetadataMode.LLM))
            if token_count + len(node_tokens) + separator_token_count <= self.source_context_window:
                filtered_nodes.append(node)
                token_count += len(node_tokens) + separator_token_count

        return filtered_nodes
    
    # TODO: For now we just want to support the asynchronous call
    @dispatcher.span
    async def asynthesize(
        self,
        query: QueryTextType,
        nodes: List[NodeWithScore],
        additional_source_nodes: Optional[Sequence[NodeWithScore]] = None,
        **response_kwargs: Any,
    ) -> RESPONSE_TYPE:
        filtered_nodes = self._filter_response_and_nodes(nodes)
        return await super().asynthesize(query=query, nodes=filtered_nodes, additional_source_nodes=additional_source_nodes, **response_kwargs)
    
    @dispatcher.span
    def synthesize(
        self,
        query: QueryTextType,
        nodes: List[NodeWithScore],
        additional_source_nodes: Optional[Sequence[NodeWithScore]] = None,
        **response_kwargs: Any,
    ) -> RESPONSE_TYPE:
        filtered_nodes = self._filter_response_and_nodes(nodes)
        return super().synthesize(query=query, nodes=filtered_nodes, additional_source_nodes=additional_source_nodes, **response_kwargs)
That's what I'm doing currently.
But I don't think this is the reason for the slow loading times?
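For reference, a rough sketch of how a synthesizer like this could be plugged into a query engine; `index` and the question are assumptions, not part of the original setup:
Python
from llama_index.core.query_engine import RetrieverQueryEngine

# No LLM call happens at synthesis time: the response is just the filtered,
# joined source chunks, so any remaining latency comes from retrieval (and, in
# the chat setup, from the agent wrapping it).
query_engine = RetrieverQueryEngine(
    retriever=index.as_retriever(similarity_top_k=10),
    response_synthesizer=NoLLM(source_context_window=4096),
)
response = query_engine.query("What does lecture 3 cover?")
print(response.response)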
What about extracting the context using:
Python
# Pull the raw retrieved text out of the response's source nodes.
context = " ".join([node.dict()['node']['text'] for node in response.source_nodes])
print(context)