ah gotcha. Yeah, that example is specific to using Replicate
If you have the model files locally, you're probably using huggingface, right?
You can load the model and tokenizer, and pass them into our huggingface wrapper. It might look something like this (of course, move the model to cuda if you have the GPU for it, otherwise it will be slow af haha)
I realize this is more complex, but that's how local llms go right now haha
from transformers import LlamaForCausalLM, LlamaTokenizer

# load the tokenizer + weights from wherever you have them on disk
tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")
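for the cuda part I mentioned, it's basically just this (assuming a single GPU with enough VRAM -- the half precision cast is optional but helps it fit):

import torch

# move the model to the GPU if one is available (fp16 to save VRAM)
if torch.cuda.is_available():
    model = model.half().to("cuda")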
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import Prompt
# llama-2 chat formatting tokens
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
system_prompt_str = """\
You are a helpful, respectful and honest assistant. \
Always answer as helpfully as possible, while being safe. \
Your answers should not include any harmful, unethical, racist, sexist, toxic, \
dangerous, or illegal content. Please ensure that your responses are socially \
unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, \
explain why instead of answering something not correct. \
If you don't know the answer to a question, please don't share false information.\
"""
# wrap the user query in the llama-2 chat format -- llama-index fills in {query_str} at query time
query_wrapper_prompt = Prompt(
    f"{BOS}{B_INST} {B_SYS}{system_prompt_str.strip()}{E_SYS}"
    "{query_str} "
    f"{E_INST}"
)
llm = HuggingFaceLLM(
    tokenizer=tokenizer,
    model=model,
    context_window=4096,
    max_new_tokens=256,
    # do_sample=False means greedy decoding, so temperature only matters if you flip do_sample to True
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_kwargs={"max_length": 4096},
)
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm)
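then you just use that service_context like normal when you build your index. a quick sketch, assuming you have some files sitting in ./data (swap in whatever loader/index you're actually using):

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# load your docs and build an index backed by the local llama model
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
print(query_engine.query("What do my documents say about X?"))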