In addition to alpaca, you'll also need an embed_model

In addition to alpaca, you'll also need an embed_model. By default it uses OpenAI's text-embedding-ada-002 (which is pretty cheap, thankfully).

You can use any model from huggingface locally, using this guide: https://gpt-index.readthedocs.io/en/latest/how_to/customization/embeddings.html#custom-embeddings
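Something along these lines should work (a sketch based on that guide; the exact imports depend on your llama_index version, and the sentence-transformers model name is just an example):

Plain Text
from llama_index import LangchainEmbedding, ServiceContext
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# Wrap a local Hugging Face sentence-transformers model as the embed_model
# (the model name here is only an example; swap in whatever you prefer)
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)
service_context = ServiceContext.from_defaults(embed_model=embed_model)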
OMG it works! Thanks!
But it's strange, I get more than one response
Nice!

Sounds like you just need to debug where you are calling llama then 😅
Yep, it looks like the "_call" method or something is wrong
I'm not using a pipeline, could that be it?

Plain Text
class CustomLLM(LLM):
    model_name = "bertin"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # prompter, tokenizer, model and num_output are assumed to be defined
        # at module level (everything is hardcoded for testing)
        input = ""
        temperature = 0.1
        device = "cuda"
        top_p = 0.75
        top_k = 40
        num_beams = 4  # currently unused (not passed to GenerationConfig)
        max_new_tokens = num_output

        # Wrap the incoming prompt in the instruction template and tokenize it
        prompt = prompter.generate_prompt(prompt, input)
        inputs = tokenizer(prompt, return_tensors="pt")
        input_ids = inputs["input_ids"].to(device)
        generation_config = GenerationConfig(
            temperature=temperature,
            top_p=top_p,
            top_k=top_k,
        )

        # Generate, then decode and extract the answer from the template
        with torch.no_grad():
            generation_output = model.generate(
                input_ids=input_ids,
                generation_config=generation_config,
                return_dict_in_generate=True,
                output_scores=True,
                max_new_tokens=max_new_tokens,
            )
        s = generation_output.sequences[0]
        output = tokenizer.decode(s)
        response = prompter.get_response(output)

        print(prompt)
        print(response)

        prompt_length = len(prompt)
        # response = self.pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]

        # only return newly generated tokens
        return response[prompt_length:]
(obviously everything is hardcoded because of testing)
I think with or without a pipeline it should be fine, assuming you are using the model properly 🤔

That code you pasted looks fine to me though. Are your print statements logging duplicates? Or do you have an example of the problem?
Sometimes I see both the instruction and the response; the first "Response" looks pretty good
But maybe it's because of the token size?
So, llama index will iterate over chunks when all the text does not fit in a single prompt.

The first pass gets an initial answer, then we ask the model to refine that original answer using new context.

So, it's giving multiple responses yes. But at the end, it should only return a single answer.
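Roughly, the create-and-refine flow looks like this (a conceptual sketch only, not the actual library code; qa_prompt, refine_prompt, llm, text_chunks and query are placeholder names):

Plain Text
# Conceptual sketch of the create-and-refine pattern (placeholder names)
answer = None
for chunk in text_chunks:
    if answer is None:
        # first pass: answer the query from the first chunk of context
        answer = llm(qa_prompt.format(context=chunk, query=query))
    else:
        # later passes: refine the previous answer using new context
        answer = llm(refine_prompt.format(existing_answer=answer, context=chunk, query=query))
# only the final refined answer is returned to the caller
Each of those passes goes through your _call, which is why the prints show several prompt/response pairs.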

Yeah, it looks like the input might be getting too big? Are you using a prompt helper object? We might need to set up some extra config to limit the prompt sizes.
Yes, I'm using PromptHelper
Hmm, wait a minute, I have a print(prompt) and print(response) per _call()
The prompt_helper:

Plain Text
max_input_size = 2048    # maximum tokens the LLM can take as input
num_output = 256         # tokens reserved for the generated answer
max_chunk_overlap = 20   # token overlap between consecutive chunks
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)
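For reference, the prompt_helper and the custom LLM usually get wired in together, something like this (a sketch only; the exact API, e.g. whether you go through ServiceContext, depends on your llama_index version, and the "data" directory is just a placeholder):

Plain Text
from llama_index import (
    GPTSimpleVectorIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
)

# Plug the custom LLM and the prompt helper into the index
llm_predictor = LLMPredictor(llm=CustomLLM())
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
)

documents = SimpleDirectoryReader("data").load_data()  # "data" is a placeholder path
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
response = index.query("your question here")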
If I remove the prints, I get "Empty Response" haha. Time to debug
Yea your prompt helper looks good to me!

Maybe the return in _call isn't returning what it should be 🤔

Otherwise, you might have the best luck stepping through code with a debugger lol
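One thing worth checking (an assumption based on the code above, not something confirmed in the thread): if prompter.get_response already strips the prompt from the decoded output, then slicing response[prompt_length:] afterwards can leave an empty string, which would explain the "Empty Response". In that case the end of _call would just be:

Plain Text
        response = prompter.get_response(output)
        # Assumption: get_response already returns only the generated answer,
        # so return it directly instead of slicing off prompt_length again
        return response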