Reduce LLM calls

Hi. Is it possible to get 3 related source_texts when querying without using similarity_top_k?
If I set similarity_top_k, the response time is almost k times longer, so I would like to call the LLM API only once.
The main issue at hand is context length. The max input to OpenAI models (aside from the largest GPT-4 model) is 4097 tokens.

So if you set similarity_top_k to 3 and the text from each node is 3900 tokens (the default chunk size), the node texts plus the prompt and query can't fit in one context window; 3 × 3900 already exceeds 4097, so the answer has to be built up across 3 LLM calls.

You can try to reduce this time by setting something like chunk_size_limit=512 in the service context when constructing the index.
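
For example, a minimal sketch assuming the older LlamaIndex API from when index.query was current (the "data" directory is a placeholder):

```python
from llama_index import GPTSimpleVectorIndex, ServiceContext, SimpleDirectoryReader

# Smaller chunks mean each retrieved node costs fewer tokens,
# so several of them can fit into a single LLM call.
documents = SimpleDirectoryReader("data").load_data()  # "data" is a placeholder path
service_context = ServiceContext.from_defaults(chunk_size_limit=512)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
```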

Then in your query you can set index.query(..., response_mode="compact"), which packs as many node texts as possible into each prompt, so in theory everything fits into one LLM call.
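
Roughly like this (same assumed API as above; the question string is a placeholder):

```python
response = index.query(
    "What does the document say about X?",  # placeholder question
    similarity_top_k=3,
    response_mode="compact",  # pack all three 512-token chunks into one prompt
)
print(response)
```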

The downside with this is that small chunks make the real answer a little harder to find. But it can still work well.
Another option is setting response_mode="no_text" in the query, which skips calling the LLM entirely and only returns the source nodes in the response object.
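
Something like this (again assuming the older API; the source_text attribute on the returned nodes is my recollection of that version, so treat it as an assumption):

```python
# No LLM call at all: retrieval only.
response = index.query(
    "What does the document say about X?",  # placeholder question
    similarity_top_k=3,
    response_mode="no_text",
)
for source_node in response.source_nodes:
    print(source_node.source_text)  # assumed attribute on the old SourceNode type
```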
thank you!! It's what I was looking for!!!