Hi, I just had a question about using the LlamaCPP bindings to .complete() a variable number of new tokens per request.
llama_cpp.py/LlamaCPP
Plain Text
    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        self.generate_kwargs.update({"stream": False})

        is_formatted = kwargs.pop("formatted", False)
        if not is_formatted:
            prompt = self.completion_to_prompt(prompt)

        response = self._model(prompt=prompt, **self.generate_kwargs)

        return CompletionResponse(text=response["choices"][0]["text"], raw=response)


Am I missing something, or does this function not actually use any of the provided kwargs to update self.generate_kwargs?
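To be concrete, this is what I mean: passing max_tokens per call has no effect with the method as-is. (Rough sketch; the model path is a placeholder and the import path depends on the installed llama_index version.)
Plain Text
# Sketch against the unmodified complete() above; model path is a placeholder.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(model_path="./models/model.gguf", max_new_tokens=256)

# max_tokens lands in **kwargs but is never forwarded to self._model,
# so the output is still capped by max_new_tokens from init, not by 16.
response = llm.complete("Explain quantization in llama.cpp.", max_tokens=16)
print(response.text)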
I changed my function locally to test supplying max_tokens=NUMBER at call time, and it now works as I'd expect:
Plain Text
    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        self.generate_kwargs.update({"stream": False})

        is_formatted = kwargs.pop("formatted", False)
        if not is_formatted:
            prompt = self.completion_to_prompt(prompt)

        # Want to at inference time decide how many tokens to generate
        # Create a local copy by merging self.generate_kwargs and kwargs
        local_generate_kwargs = {**self.generate_kwargs, **kwargs}
        print(local_generate_kwargs)
        response = self._model(prompt=prompt, **local_generate_kwargs)

        return CompletionResponse(text=response["choices"][0]["text"], raw=response)


I'm just wondering if this is a bug or if I am missing something obvious.
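For reference, here is roughly how I'm calling it with the patch applied (model path is a placeholder; the import path depends on the installed llama_index version):
Plain Text
# Sketch assuming the patched complete() above and a local GGUF model.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/model.gguf",  # placeholder path
    max_new_tokens=256,                # default cap set at init
)

# With the patch, per-call kwargs are merged over generate_kwargs,
# so this request is capped at 32 new tokens instead of 256.
response = llm.complete("Summarize llama.cpp in one sentence.", max_tokens=32)
print(response.text)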
I think max tokens is expected to be set at init, as max_new_tokens (see the sketch below)

Passing kwargs to complete is quite difficult if you aren't using the LLM directly (usually some higher level abstraction is calling the LLM, not you directly)
you could update this attribute before using the LLM in a query engine:

Plain Text
llm.generate_kwargs['max_tokens'] = 123
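Roughly like this, assuming the newer Settings-style API (older releases used ServiceContext) and placeholder data/model paths:
Plain Text
# Sketch: set a cap at init, then tweak generate_kwargs before a query-engine call.
# Assumes an embedding model is also configured in Settings.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(model_path="./models/model.gguf", max_new_tokens=256)
Settings.llm = llm

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
query_engine = index.as_query_engine()

# Workaround: the query engine calls the LLM for you, so mutate the
# underlying llama-cpp-python kwargs right before the query.
llm.generate_kwargs["max_tokens"] = 123
print(query_engine.query("Give a short answer based on the docs."))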
But there's still no way of actually passing the kwarg through to the underlying llama-cpp-python call, which does take max_tokens as a parameter? Interesting, I would have thought that's what the kwargs were for.
Good to know. I'll probably just set a higher upper limit at init and not worry about it for the moment; I just wasn't sure if this was missing functionality or intended behaviour.
Thanks!