Another question - in the query engine, we are using the default COMPACT response synthesizer, but we noticed that it does a significant amount of chunking, which seems to significantly increase costs and latency. According to the documentation, this seems to be normal ( https://docs.llamaindex.ai/en/stable/module_guides/querying/response_synthesizers/ ).
Is there any way of disabling the chunking in any shape or form?
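For context, here is roughly our setup (a minimal sketch; the "data" directory and the query string are placeholders):

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Build an index and use the default query engine. The default response
# mode is "compact": retrieved chunks are packed into as few prompts as
# fit the context window, then refined across successive LLM calls.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(response_mode="compact")  # the default
response = query_engine.query("What does the report conclude?")
print(response)
```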
I can almost guarantee chunking is not causing any latency lol

It's making compact chunks to reduce LLM calls

99% of all latency comes from how long the LLM takes to respond, and how many LLM calls are being made
(the chunking here is actually doing you a favor by reducing LLM calls πŸ˜… )
yeah, my bad, wrong choice of words 😄 - basically we saw that some responses took 5-7 LLM calls, which does add some extra latency.

we would like to see how it behaves if we completely disable the chunking (basically remove the refine step and keep it fully compact)

is this possible?
set the response mode to simple_summarize (the name is maybe misleading) -- this will truncate the retrieved text so that a single LLM call is made
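Something like this (a minimal sketch, assuming an existing index built with llama_index.core):

```python
from llama_index.core.response_synthesizers import ResponseMode

# simple_summarize truncates the combined retrieved text to fit a single
# prompt, so exactly one LLM call is made -- no compact/refine loop, at
# the cost of dropping any retrieved context that doesn't fit.
query_engine = index.as_query_engine(
    response_mode=ResponseMode.SIMPLE_SUMMARIZE,
)
response = query_engine.query("What does the report conclude?")
```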
got it - thank you as always sir!