Speed

Despite saving my index locally (small file size, <1MB), it takes 30-45 seconds to receive a response for each of my queries. Does anyone know how to speed this up? I'm wondering how some apps manage to retrieve answers in under 5-10 seconds
8 comments
The biggest bottleneck will be LLM calls. 5-10 seconds is achievable if only one LLM call is made

If you are on 0.6.x, the defaults recently changed to use the compact response mode, which helps a lot with speed
The other option is streaming. Usually the answer can be streamed faster than you can read it, which helps with UX a ton
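A minimal sketch of how those two suggestions might be combined, assuming llama_index 0.6.x and an index object that has already been built or loaded (the query string is just a placeholder):

Python
# Sketch: compact response mode packs the retrieved chunks into as few LLM
# calls as possible, and streaming shows tokens as soon as they arrive.
query_engine = index.as_query_engine(
    response_mode="compact",
    streaming=True,
)
response = query_engine.query("What is this meeting about?")
response.print_response_stream()

Lowering similarity_top_k in the same call is another knob, since less retrieved context means fewer tokens for the LLM to read.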
Yeah, the retrieval takes under 1 sec for me; it's the LLM generation that takes time, but I'm streaming that, so there is only about a 1 sec delay from entering the query to visual feedback
Can I please see how you did it? Here's my code and despite streaming, it still takes a while:


Python
# imports assume llama_index 0.6.x with langchain installed
from llama_index import (
    StorageContext,
    ServiceContext,
    PromptHelper,
    LLMPredictor,
    load_index_from_storage,
)
from langchain.chat_models import ChatOpenAI

# reload the persisted index from disk
storage_context = StorageContext.from_defaults(persist_dir="./storage")

# define prompt helper
max_input_size = 4096
num_output = 1024
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

# gpt-3.5-turbo with streaming enabled
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", streaming=True))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, chunk_size_limit=2000)

index = load_index_from_storage(storage_context)

# streaming query engine over the 5 most similar nodes
query_engine = index.as_query_engine(service_context=service_context, similarity_top_k=5, streaming=True)

print("loaded")

response = query_engine.query("Answer the question, despite what you answered before. What is this meeting about?")

response.print_response_stream()
Your code looks fine. I'm currently using top_k=3, but besides that I'm not sure. For me, after querying it prints out the source nodes in about 1 sec, then it starts generating the LLM response, but it's GPT-4 so it's not crazy fast at generating the whole response
Ah I see. Are the source nodes the ones returned by retrieve? If so, then mine is decently fast too. However, get_response takes a while; it doesn't seem like it's streaming to my terminal. Am I calling print_response_stream() the same way you are?
Well, I'm using it in a web app; getting the source nodes is basically instant. print_response_stream is a stream like ChatGPT responses, so it's not crazy fast overall, but it usually starts producing visual feedback within 1 sec
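For anyone trying to see where the time actually goes, here is a rough sketch (assuming the same llama_index 0.6.x setup and ./storage directory from the code above; the query string is a placeholder) that times the retrieval step separately from the streamed LLM generation:

Python
import time

from llama_index import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# 1) time retrieval on its own -- this should be well under a second
retriever = index.as_retriever(similarity_top_k=3)
t0 = time.time()
nodes = retriever.retrieve("What is this meeting about?")
print(f"retrieval: {time.time() - t0:.2f}s for {len(nodes)} source nodes")

# 2) time the streamed answer: first token vs. full response
query_engine = index.as_query_engine(similarity_top_k=3, streaming=True)
t0 = time.time()
response = query_engine.query("What is this meeting about?")
first = True
for token in response.response_gen:
    if first:
        print(f"first token after {time.time() - t0:.2f}s")
        first = False
print(f"full response after {time.time() - t0:.2f}s")

If retrieval is fast but the first token takes a long time to arrive, the delay is in the LLM call rather than in the local index.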
I'm also building a web app, but I'm using Next.js. Are you streaming it through an API, or writing it directly in your frontend project?