
Updated 6 months ago

Hey guys so I'm building an app using Llama-Index and Ollama models

At a glance

The community member is building an app using Llama-Index and Ollama models and is trying to figure out how to limit the number of tokens the model outputs in response to a query. They have tried max_tokens, new_max_tokens, num_output, and token_limit, but none of these have worked, and they are asking how to properly limit the model's response.

The comments suggest trying num_predict, but the community member says it didn't work. They also note that num_output is in the LLMMetadata function, but it's not in the init and not a Field, so they guess it's not implemented yet. Another community member suggests that it's likely a keyword argument that "just sails through".

The community members discuss the possibility of using additional_kwargs={"num_predict": 256} for Ollama, but note that this would just cut off the response if it's longer than the limit, rather than tailoring the response to the limit. The final comment suggests that "prompt engineering" might be a way to get the model to keep its response within the limit.

Hey guys so I'm building an app using Llama-Index and Ollama models. I'm trying to figure out the argument to limit the number of tokens the model outputs when responding to a query. I'm currently using max_tokens. I have also tried new_max_tokens, num_output, and token_limit. None of these have been able to limit the model's response. Just wanted to see if anyone has figured out the proper argument.

Again, this is using Llama-Index with the import: from llama_index.llms.ollama import Ollama
Here's my current setup: return Ollama(model=llm_config["model"], request_timeout=30.0, device=llm_config["device"], temperature=temperature, max_tokens=100)

Any solutions/guidance/links to repos or docs/help would be phenomenal. I was told this was a question for Llama-Index by the guys at Ollama.
Thanks!
10 comments
I tried that and it didn't work. I was looking through the base.py file and see that it's in the LLMMetadata function as num_output. But it's not in the init and it's not a Field like the temperature and context window arguments are. I'm guessing it's just not implemented yet?
Attachment: metadata.png
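For illustration, a minimal sketch of that observation from the caller's side. The model name is a placeholder, and this assumes the llama-index version discussed in the thread, where num_output is only reported through LLMMetadata rather than accepted by the Ollama constructor:

```python
from llama_index.llms.ollama import Ollama

# "llama3" is a placeholder model name, not the one from the original setup.
llm = Ollama(model="llama3", request_timeout=30.0)

# num_output shows up in the metadata, but it is informational here:
# unlike context_window and temperature, it is not a constructor Field,
# so there is nothing in __init__ to set it through.
print(llm.metadata.num_output)
print(llm.metadata.context_window)
```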
It's likely a kwarg that just sails through.
So there's no way to explicitly set it to a value?
I think it would be additional_kwargs={"num_predict": 256} for ollama?
In the llm constructor?
Note that this would cut off the LLM response even if it's not done
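A minimal sketch of what that suggestion might look like in the constructor. The model name and temperature here are placeholders; only the additional_kwargs={"num_predict": 256} part comes from the suggestion above, with num_predict being Ollama's own option for capping generated tokens:

```python
from llama_index.llms.ollama import Ollama

# Placeholder values except for additional_kwargs. num_predict is forwarded
# to the Ollama API as a generation option and hard-caps the number of
# tokens produced, which is why a long answer may simply be truncated.
llm = Ollama(
    model="llama3",
    request_timeout=30.0,
    temperature=0.7,
    additional_kwargs={"num_predict": 256},
)

response = llm.complete("Explain what LlamaIndex does in two sentences.")
print(response.text)
```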
Thanks for your help. Any idea if there's a way to make it cater its response to the num_predict limit versus just cutting it off?
Prompt engineering
Okay yeah that’s kind of what I figured. Thanks again for your help @Logan M