Hey all, I'm building an app using Llama-Index with Ollama models. I'm trying to figure out the argument that limits the number of tokens the model outputs when responding to a query. I'm currently using max_tokens, and I've also tried new_max_tokens, num_output, and token_limit. None of these have been able to limit the model's response. Just wanted to see if anyone has figured out the proper argument.
Again, this is Llama-Index with the import `from llama_index.llms.ollama import Ollama`. Here's my current setup:

```python
return Ollama(
    model=llm_config["model"],
    request_timeout=30.0,
    device=llm_config["device"],
    temperature=temperature,
    max_tokens=100,
)
```
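One thing I'm about to try, in case it helps anyone reading: Ollama itself caps generated tokens with its num_predict option, and the Llama-Index Ollama class has an additional_kwargs dict that, as far as I can tell, gets forwarded to the Ollama request options. So a sketch like this (num_predict=100 being my guess at the equivalent of max_tokens=100):

```python
from llama_index.llms.ollama import Ollama

# Sketch: pass Ollama's own generation cap through additional_kwargs.
# num_predict is Ollama's option for max generated tokens; whether
# additional_kwargs actually reaches the request options is my assumption.
llm = Ollama(
    model=llm_config["model"],
    request_timeout=30.0,
    temperature=temperature,
    additional_kwargs={"num_predict": 100},
)
```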
Any solutions, guidance, or links to repos or docs would be phenomenal. I was told this was a question for Llama-Index by the folks at Ollama. Thanks!
I tried that and it didn't work. I was looking through the base.py file and I see num_output in the LLMMetadata the class returns, but it's not in the __init__ and it's not a Field like the temperature and context_window arguments are. I'm guessing it's just not implemented yet?
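In the meantime, here's the quick check I've been using to see whether any cap actually takes effect: ask for something deliberately long and eyeball the response length (the prompt and the word count are just rough stand-ins for a real token count):

```python
# Rough check: if a cap is applied, a deliberately long request
# should come back short. Word count only approximates tokens.
resp = llm.complete("Write a 1000-word essay about the ocean.")
print(len(resp.text.split()))
```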