Hey guys, so I'm building an app using Llama-Index and Ollama models. I'm trying to figure out the argument to limit the number of tokens the model outputs when responding to a query. I'm currently using max_tokens, and I've also tried new_max_tokens, num_output, and token_limit. None of these have been able to limit the model's response. Just wanted to see if anyone has figured out the proper argument.

Again, this is using Llama-Index with the import: from llama_index.llms.ollama import Ollama
Here's my current setup: return Ollama(model=llm_config["model"], request_timeout=30.0, device=llm_config["device"], temperature=temperature, max_tokens=100)

Any solutions, guidance, or links to repos or docs would be phenomenal. I was told this was a question for Llama-Index by the guys at Ollama.
Thanks!
10 comments
I tried that and it didn't work. I was looking through the base.py file and see that it's in the LLMMetadata class as num_output. But it's not in the init, and it's not a Field like the temperature and context window arguments are. I'm guessing it's just not implemented yet?
Attachment: metadata.png
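A quick way to see what the class reports (a minimal sketch; "llama3" is just an illustrative model name, and the exact fields depend on the installed llama-index version):

```python
# Sketch: inspect the metadata the Ollama LLM class exposes.
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3", request_timeout=30.0)

# LLMMetadata includes fields like context_window and num_output,
# but per the discussion above, num_output here is only reported metadata,
# not an enforced generation limit.
print(llm.metadata)
```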
It's likely just a kwarg that sails through without being used.
So there's no way to explicitly set it to a value?
I think it would be additional_kwargs={"num_predict": 256} for ollama?
In the LLM constructor?
Note that this would cut off the LLM's response even if it's not done.
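Putting that together, a minimal sketch of passing num_predict through additional_kwargs (assuming the installed llama-index Ollama integration forwards additional_kwargs to the Ollama API; "llama3" and the prompt are just placeholders):

```python
# Sketch: cap generated tokens via Ollama's num_predict option.
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama3",
    request_timeout=30.0,
    temperature=0.7,
    additional_kwargs={"num_predict": 256},  # hard cap on output tokens
)

response = llm.complete("Explain what num_predict does in one sentence.")
print(response.text)
```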
Thanks for your help. Any idea if there's a way to make it tailor its response to the num_predict limit rather than just cutting it off?
Prompt engineering
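For example, a rough sketch of that approach: ask for brevity in a system prompt and keep num_predict only as a hard backstop (model name and message text are placeholders):

```python
# Sketch: request a short answer via the prompt; num_predict is just a backstop.
from llama_index.core.llms import ChatMessage
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama3",  # illustrative model name
    request_timeout=30.0,
    additional_kwargs={"num_predict": 256},  # cutoff if the model ignores the instruction
)

messages = [
    ChatMessage(
        role="system",
        content="Answer in at most two sentences. Be concise but complete.",
    ),
    ChatMessage(role="user", content="How do I limit output length with Ollama?"),
]

print(llm.chat(messages).message.content)
```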
Okay yeah that’s kind of what I figured. Thanks again for your help @Logan M