Hi, I just had a question about using the LlamaCPP bindings to .complete() a variable number of new tokens per request.
llama_cpp.py/LlamaCPP
Plain Text
    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        self.generate_kwargs.update({"stream": False})

        is_formatted = kwargs.pop("formatted", False)
        if not is_formatted:
            prompt = self.completion_to_prompt(prompt)

        response = self._model(prompt=prompt, **self.generate_kwargs)

        return CompletionResponse(text=response["choices"][0]["text"], raw=response)


Am I missing something, or does this function not actually use any of the provided kwargs to update self.generate_kwargs?
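To be concrete, this is what I mean: passing max_tokens per call has no effect with the method as-is. (Rough sketch; the model path is a placeholder and the import path depends on the installed llama_index version.)
Plain Text
# Sketch against the unmodified complete() above; model path is a placeholder.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(model_path="./models/model.gguf", max_new_tokens=256)

# max_tokens lands in **kwargs but is never forwarded to self._model,
# so the output is still capped by max_new_tokens from init, not by 16.
response = llm.complete("Explain quantization in llama.cpp.", max_tokens=16)
print(response.text)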
I changed my function locally to test supplying max_tokens=NUMBER at call time, and it now works as I'd expect:
Plain Text
    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        self.generate_kwargs.update({"stream": False})

        is_formatted = kwargs.pop("formatted", False)
        if not is_formatted:
            prompt = self.completion_to_prompt(prompt)

        # Want to at inference time decide how many tokens to generate
        # Create a local copy by merging self.generate_kwargs and kwargs
        local_generate_kwargs = {**self.generate_kwargs, **kwargs}
        print(local_generate_kwargs)
        response = self._model(prompt=prompt, **local_generate_kwargs)

        return CompletionResponse(text=response["choices"][0]["text"], raw=response)


I'm just wondering if this is a bug or if I am missing something obvious.
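For reference, here is roughly how I'm calling it with the patch applied (model path is a placeholder; the import path depends on the installed llama_index version):
Plain Text
# Sketch assuming the patched complete() above and a local GGUF model.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/model.gguf",  # placeholder path
    max_new_tokens=256,                # default cap set at init
)

# With the patch, per-call kwargs are merged over generate_kwargs,
# so this request is capped at 32 new tokens instead of 256.
response = llm.complete("Summarize llama.cpp in one sentence.", max_tokens=32)
print(response.text)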
I think max tokens is expected to be set at init, as max_new_tokens (see the sketch below)

Passing kwargs to complete is quite difficult if you aren't using the LLM directly (usually some higher level abstraction is calling the LLM, not you directly)
you could update this attribute before using the LLM in a query engine:

Plain Text
llm.generate_kwargs['max_tokens'] = 123
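Roughly like this, assuming the newer Settings-style API (older releases used ServiceContext) and placeholder data/model paths:
Plain Text
# Sketch: set a cap at init, then tweak generate_kwargs before a query-engine call.
# Assumes an embedding model is also configured in Settings.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(model_path="./models/model.gguf", max_new_tokens=256)
Settings.llm = llm

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
query_engine = index.as_query_engine()

# Workaround: the query engine calls the LLM for you, so mutate the
# underlying llama-cpp-python kwargs right before the query.
llm.generate_kwargs["max_tokens"] = 123
print(query_engine.query("Give a short answer based on the docs."))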
But there's still no way of actually passing the kwarg through to the underlying llama-cpp-python call, which does take max_tokens as a parameter? Interesting, I would have thought that's what the kwargs were for.
Good to know. I'll probably just set a higher upper limit at init and not worry about it for the moment; I just wasn't sure if this was missing functionality or intended behaviour.
Thanks!