In addition to alpaca, you'll also need an embed_model

In addition to alpaca, you'll also need an embed_model. By default it uses OpenAI's text-embedding-ada-002 (which is pretty cheap, thankfully).

You can use any model from huggingface locally, using this guide: https://gpt-index.readthedocs.io/en/latest/how_to/customization/embeddings.html#custom-embeddings
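For reference, that guide boils down to something like this sketch — wrapping a local Hugging Face model as the embed_model. The exact class names depend on your llama_index version, so treat it as a rough outline rather than a drop-in snippet:

Plain Text
# Sketch: a local Hugging Face embed_model instead of OpenAI's text-embedding-ada-002
# (assumes an older llama_index release where LangchainEmbedding is a top-level export)
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding

# HuggingFaceEmbeddings defaults to sentence-transformers/all-mpnet-base-v2 and runs locally
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
# embed_model then gets passed to the index / service context in place of the OpenAI default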
OMG it works! Thanks!
But it's strange, I get more than one response
Nice!

Sounds like you just need to debug where you are calling llama then 😅
Yep, it looks like the _call method or something is wrong
I'm not using a pipeline, could that be it?

Plain Text
from typing import List, Optional

import torch
from langchain.llms.base import LLM
from transformers import GenerationConfig

# model, tokenizer, prompter and num_output are defined elsewhere (hardcoded for testing)

class CustomLLM(LLM):
    model_name = "bertin"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        input = ""
        temperature = 0.1
        device = "cuda"
        top_p = 0.75
        top_k = 40
        num_beams = 4  # set but not passed to GenerationConfig here
        max_new_tokens = num_output

        # build the full instruction prompt and tokenize it
        prompt = prompter.generate_prompt(prompt, input)
        inputs = tokenizer(prompt, return_tensors="pt")
        input_ids = inputs["input_ids"].to(device)
        generation_config = GenerationConfig(
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
        )

        # generate without tracking gradients
        with torch.no_grad():
            generation_output = model.generate(
                input_ids=input_ids,
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens,
            )
        s = generation_output.sequences[0]
        output = tokenizer.decode(s)
        response = prompter.get_response(output)

        print(prompt)
        print(response)

        prompt_length = len(prompt)
        # response = self.pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]

        # only return newly generated tokens
        return response[prompt_length:]
(obviously everything is hardcoded because of testing)
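For context, the globals it uses (model, tokenizer, prompter, num_output) are set up roughly like this — the model id is a placeholder and the Prompter import assumes an alpaca-lora-style helper, so adjust to your setup:

Plain Text
# Hypothetical setup for the globals used in CustomLLM._call
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from utils.prompter import Prompter  # assumption: alpaca-lora-style prompt template helper

base_model = "your-bertin-checkpoint"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()

prompter = Prompter()
num_output = 256  # matches the PromptHelper config further down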
I think it should be fine with or without a pipeline, assuming you are using the model properly 🤔

That code you pasted looks fine to me though. Are your print statements logging duplicates? Or do you have an example of the problem?
Sometimes I see both the instruction and the response; the first "Response" looks pretty good
But maybe it's because of the token size?
So, llama index will iterate over chunks when all the text does not fit in a single prompt.

The first pass gets an initial answer, then we ask the model to refine that original answer using new context.

So, it's giving multiple responses yes. But at the end, it should only return a single answer.
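Conceptually the loop looks something like this (a simplified sketch of the create-and-refine pattern, not the actual llama_index internals):

Plain Text
# Simplified sketch of the create-and-refine pattern described above
from typing import Callable, List

def answer_with_refine(llm: Callable[[str], str], query: str, chunks: List[str]) -> str:
    # first chunk: get an initial answer
    answer = llm(f"Context: {chunks[0]}\nQuestion: {query}\nAnswer:")
    # remaining chunks: ask the model to refine the existing answer with new context
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context: {chunk}\n"
            f"Refine the existing answer for the question: {query}"
        )
    return answer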

Yea, it looks like maybe the input is getting too big? Are you using a prompt helper object? We might need to set up some extra config to limit the prompt sizes.
Yes, I'm using PromptHelper
Hmm, wait a minute, I have a print(prompt) and print(response) per _call()
The prompt_helper

Plain Text
from llama_index import PromptHelper

max_input_size = 2048
num_output = 256
max_chunk_overlap = 20
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
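For completeness, the wiring looks roughly like this — simplified, and assuming the older ServiceContext-style API, so exact names may differ in your llama_index version:

Plain Text
# Sketch: wiring the CustomLLM, prompt_helper and embed_model into an index
from llama_index import GPTSimpleVectorIndex, LLMPredictor, ServiceContext, SimpleDirectoryReader

llm_predictor = LLMPredictor(llm=CustomLLM())
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
    embed_model=embed_model,  # e.g. the local Hugging Face embed_model from earlier
)

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data folder
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
response = index.query("your question here")
print(response)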
If I remove the prints, I get "Empty response", haha. Time to debug
Yea your prompt helper looks good to me!

Maybe the return in _call isn't returning what it should be 🤔

Otherwise, you might have the best luck stepping through code with a debugger lol