Speed

Despite saving my index locally (small file size, <1MB), it takes 30-45 seconds to receive a response for each of my queries. Does anyone know how to speed this up? I'm wondering how some apps manage to retrieve answers in under 5-10 seconds
8 comments
The biggest bottleneck will be LLM calls. 5-10 seconds is achievable if only one LLM call is made

If you are on 0.6.x, the defaults recently changed to use the compact response mode, which helps a lot with speed
The other option is streaming. Usually the answer can be streamed faster than you can read it, which helps with UX a ton
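A minimal sketch of how those two suggestions might be combined, assuming llama_index 0.6.x and an index object that has already been built or loaded (the query string is just a placeholder):

Python
# Sketch: compact response mode packs the retrieved chunks into as few LLM
# calls as possible, and streaming shows tokens as soon as they arrive.
query_engine = index.as_query_engine(
    response_mode="compact",
    streaming=True,
)
response = query_engine.query("What is this meeting about?")
response.print_response_stream()

Lowering similarity_top_k in the same call is another knob, since less retrieved context means fewer tokens for the LLM to read.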
Yeah, the retrieval takes under 1 sec for me; it's the LLM generation that takes time, but I'm streaming that, so there is only about a 1 sec delay from entering the query to visual feedback
Can I please see how you did it? Here's my code and despite streaming, it still takes a while:


Python
# imports assume llama_index 0.6.x with langchain installed
from llama_index import (
    StorageContext,
    ServiceContext,
    PromptHelper,
    LLMPredictor,
    load_index_from_storage,
)
from langchain.chat_models import ChatOpenAI

# reload the persisted index from disk
storage_context = StorageContext.from_defaults(persist_dir="./storage")

# define prompt helper
max_input_size = 4096
num_output = 1024
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

# gpt-3.5-turbo with streaming enabled
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", streaming=True))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, chunk_size_limit=2000)

index = load_index_from_storage(storage_context)

# streaming query engine over the 5 most similar nodes
query_engine = index.as_query_engine(service_context=service_context, similarity_top_k=5, streaming=True)

print("loaded")

response = query_engine.query("Answer the question, despite what you answered before. What is this meeting about?")

response.print_response_stream()
Your code looks fine. I'm currently using top_k=3, but besides that I'm not sure. For me, after querying it prints out the source nodes in about 1 sec, then it starts generating the LLM response, but it's GPT-4 so it's not crazy fast at generating the whole response
Ah I see. Are the source nodes the ones returned by retrieve? If so, then mine is decently fast too. However, get_response takes a while; it doesn't seem like it's streaming to my terminal. Am I calling print_response_stream() the same way you are?
Well, I'm using it in a web app; getting the source nodes is basically instant. print_response_stream is a stream like ChatGPT responses, so it's not crazy fast overall, but it usually starts producing visual feedback within 1 sec
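For anyone trying to see where the time actually goes, here is a rough sketch (assuming the same llama_index 0.6.x setup and ./storage directory from the code above; the query string is a placeholder) that times the retrieval step separately from the streamed LLM generation:

Python
import time

from llama_index import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

# 1) time retrieval on its own -- this should be well under a second
retriever = index.as_retriever(similarity_top_k=3)
t0 = time.time()
nodes = retriever.retrieve("What is this meeting about?")
print(f"retrieval: {time.time() - t0:.2f}s for {len(nodes)} source nodes")

# 2) time the streamed answer: first token vs. full response
query_engine = index.as_query_engine(similarity_top_k=3, streaming=True)
t0 = time.time()
response = query_engine.query("What is this meeting about?")
first = True
for token in response.response_gen:
    if first:
        print(f"first token after {time.time() - t0:.2f}s")
        first = False
print(f"full response after {time.time() - t0:.2f}s")

If retrieval is fast but the first token takes a long time to arrive, the delay is in the LLM call rather than in the local index.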
I'm also building a web app, but I'm using Next.js. Are you streaming it through an API, or writing it directly in your frontend project?