Updated 2 years ago

Is there yet an example of running LLaMa

At a glance
Is there an example yet of running LLaMa-2 (-7B-chat) through an interface like llama.cpp, vs. the Replicate API or the HuggingFace local interface (which seems slow)?
48 comments
We have a demo here using a llama2 instance from Replicate

https://github.com/jerryjliu/llama_index/blob/main/docs/examples/vector_stores/SimpleIndexDemoLlama2.ipynb

Note the extra llama_utils functions imported to transform the prompts

We found it's really important that llama-2 prompts are formatted properly
https://github.com/jerryjliu/llama_index/blob/main/llama_index/llms/llama_utils.py
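For example, completion_to_prompt just wraps your text in the llama-2 chat template (the printed output in the comment below is approximate, with the default system prompt elided):

Plain Text
from llama_index.llms.llama_utils import completion_to_prompt

# prints roughly:
# "<s> [INST] <<SYS>> ...default system prompt... <</SYS>> What is a vector store? [/INST]"
print(completion_to_prompt("What is a vector store?"))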
I found this and it's super helpful! I'm hoping to run llama2 locally though, not using the Replicate API, and I'm not sure of the best route to pursue
llama.cpp installed with GPU support is probably the best bet. Although due to the prompt formatting, you'll probably have to wrap it in a CustomLLM class in order to format the prompts using those util functions
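Something like this might work as a starting point for that (a totally untested sketch, the model path is a placeholder, and it assumes llama-cpp-python is installed):

Plain Text
from typing import Any

from llama_cpp import Llama  # pip install llama-cpp-python

from llama_index.llms import CompletionResponse, CustomLLM, LLMMetadata
from llama_index.llms.base import llm_completion_callback
from llama_index.llms.llama_utils import completion_to_prompt

# placeholder path to a quantized llama-2-7b-chat file
llama_model = Llama(model_path="/path/to/llama-2-7b-chat.q4_0.bin", n_ctx=4096)


class LlamaCppLLM(CustomLLM):
    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(context_window=4096, num_output=256)

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # wrap the query in the llama-2 [INST] / <<SYS>> template before generating
        formatted = completion_to_prompt(prompt)
        output = llama_model(formatted, max_tokens=256, stop=["</s>"])
        return CompletionResponse(text=output["choices"][0]["text"])

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any):
        raise NotImplementedError("streaming not wired up in this sketch")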
Got it! I'll check that out, though I might resign myself to using the Replicate API in that case until the prompts don't require so much custom work. Relatedly, that example requires OpenAI API keys, why is that?
There are two models in llama index, the LLM and the embedding model

The example only changes the LLM, so the embedding model still defaults to OpenAI
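If you want to skip the OpenAI key entirely, you can also swap in a local embedding model. I think something like this works on recent versions (a sketch; the "local" option downloads a small huggingface embedding model):

Plain Text
from llama_index import ServiceContext

# "local" uses a small huggingface embedding model instead of calling openai
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")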
Hi, I just started using Llama instead of PaLM and saw there is no llama_utils in my llama_index install. I found messages_to_prompt in generic_utils but have not been able to find completion_to_prompt
What version of llama-index do you have? Maybe just install from source for now (it's definitely on github at least)

pip install --upgrade git+https://github.com/jerryjliu/llama_index
I will do that, thank you!
That worked great. One last question for the night, I think, which is the model endpoint. I tried using the file path where it is stored but am getting an error. I see that the formatting in the documentation is not a filepath, but I am confused how to access the model I have downloaded
Sorry, not sure I know what you mean, what example are you following?
ah gotcha. Yea that example is specific to using replicate

If you have the model files locally, you probably are using huggingface right?

You can load the model and tokenizer, and pass them into our huggingface wrapper. It might look something like this (of course, move the model to cuda if you have the GPU for it, otherwise it will be slow af haha)

I realize this is more complex, but that's how local llms go right now haha

Plain Text
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")

from llama_index import ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import Prompt

BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system_prompt_str = """\
You are a helpful, respectful and honest assistant. \
Always answer as helpfully as possible, while being safe.  \
Your answers should not include any harmful, unethical, racist, sexist, toxic, \
dangerous, or illegal content. Please ensure that your responses are socially \
unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, \
explain why instead of answering something not correct. \
If you don't know the answer to a question, please don't share false information.\
"""

query_wrapper_prompt=(
    f"{BOS}{B_INST} {B_SYS}{system_prompt_str.strip()}{E_SYS}"
    f"{completion.strip()} {E_INST}"
)

llm = HuggingFaceLLM(
    tokenizer=tokenizer,
    model=model,
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_kwargs={"max_length": 4096},
)

service_context = ServiceContext.from_defaults(llm=llm)
hopefully that works haha haven't actually done this yet
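Then you'd just pass that service_context in when building the index, roughly like this (assuming your files are in a ./data folder):

Plain Text
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
print(query_engine.query("What is this document about?"))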
Ah okay, I will try this right now. Thank you so much for all your help, and I will let you know how this works
And also, I was not using Hugging Face, I downloaded the models from Meta's git
Ah right, that's a typo
Ah okay, it's running now, so I will wait and see the output lol, it's taking a while to download
Plain Text
query_wrapper_prompt = (
    f"{BOS}{B_INST} {B_SYS}{system_prompt_str.strip()}{E_SYS}"
    f"{{query_str}} {E_INST}"
)


If you still get an error, I thiiiink it should look like that. Basically you want to leave the query_str variable in the string (it gets filled in later by llama index)
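e.g. under the hood it ends up doing roughly this with your actual question:

Plain Text
# illustration only: llama index substitutes your real question for {query_str} at query time
final_prompt = query_wrapper_prompt.format(query_str="your question here")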
Okay, thank you so much, you do not understand how much I appreciate all your help
Haha no worries, happy to help
lol sorry to be back, but would it be any faster to implement llama 2 through this https://github.com/jerryjliu/llama_index/blob/main/llama_index/llms/huggingface.py or is it the same thing as the code listed above? Also, is the llama_api still accessible through llama_index.llms?
Sorry, one more thing: the reason I'm trying to switch is that it took five hours to generate one answer to a question, and that seems like a long time even though I did not use a GPU
tbh that sounds about right haha

If you want to run on CPU at a meaningful speed, I would check out llama.cpp
We are working on our own integration, but for now the langchain version also works
Plain Text
from langchain.llms import LlamaCpp
from llama_index import ServiceContext
from llama_index.llms import LangChainLLM

llm = LangChainLLM(LlamaCpp(...))
service_context = ServiceContext.from_defaults(llm=llm)
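Filling in the LlamaCpp(...) part, it might look roughly like this (the model path and numbers are just placeholders, and n_gpu_layers only matters if you built llama-cpp-python with GPU support):

Plain Text
from langchain.llms import LlamaCpp
from llama_index import ServiceContext
from llama_index.llms import LangChainLLM

llm = LangChainLLM(
    LlamaCpp(
        model_path="/path/to/llama-2-7b-chat.q4_0.bin",  # placeholder path to your quantized model
        n_ctx=4096,        # context window
        max_tokens=256,
        temperature=0.1,
        n_gpu_layers=32,   # 0 for a CPU-only build
        verbose=True,
    )
)
service_context = ServiceContext.from_defaults(llm=llm)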
lol okay, I just wanted to double check, thank you. And sorry, is the llama_api I asked about earlier accessible through your llms module?
Sorry, I think I have the latest version (7.11.post1) and it is not coming up
Maybe it's only in the source right now? pip install --upgrade git+https://github.com/jerryjliu/llama_index
I'm not sure, sorry, I tried that and it did not work
really? Did it give an error installing?
maybe you need to uninstall first? pip uninstall llama-index and then run that command?
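So the full sequence would be:

Plain Text
pip uninstall llama-index
pip install --upgrade git+https://github.com/jerryjliu/llama_index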
No, I didn't, but I will try that
Never mind, it worked, I tried it again
Hmm, I don't think there's any need to re-implement the entire HuggingFaceLLM class, right?

Double check dependencies:
Plain Text
pip install "bitsandbytes>=0.39.0"
pip install --upgrade accelerate
pip install --upgrade transformers


Then this might work well I think?
Plain Text
from transformers import LlamaForCausalLM, LlamaTokenizer

from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import Prompt

BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

system_prompt_str = """\
You are a helpful, respectful and honest assistant. \
Always answer as helpfully as possible, while being safe.  \
Your answers should not include any harmful, unethical, racist, sexist, toxic, \
dangerous, or illegal content. Please ensure that your responses are socially \
unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, \
explain why instead of answering something not correct. \
If you don't know the answer to a question, please don't share false information.\
"""

query_wrapper_prompt = Prompt(
    f"{BOS}{B_INST} {B_SYS}{system_prompt_str.strip()}{E_SYS}"
    f"{{query_str}} {E_INST}"
)

# model_type should be the local path (or HF hub name) of your llama-2 checkpoint
tokenizer = LlamaTokenizer.from_pretrained(model_type)

# you can also try replacing `load_in_4bit` with `load_in_8bit`
model = LlamaForCausalLM.from_pretrained(model_type, device_map="auto", load_in_4bit=True)

llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    context_window=4096,
    max_new_tokens=256,
    query_wrapper_prompt=query_wrapper_prompt,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    tokenizer_kwargs={"max_length": 4096},
)
Ah okay, yeah I don't really know what I had in mind, but it makes sense not to have to recreate the class. Should I specify the model and tokenizer when initializing the llm, though?
oh right yea I forgot that part lol
Updated the code above to include that
awesome thank you!
Sorry, another question as well: I am having trouble connecting the model from my HuggingFaceLLM to a GPU and would love some advice on how to do it.
Would it be something like this?

model = LlamaForCausalLM.from_pretrained(model_type)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
Hmm that doesn't move it to cuda?

You could also try

model = model.cuda()
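Or, since it's a transformers model anyway, you can let accelerate place it on the GPU for you at load time, same idea as the device_map bit in the snippet above:

Plain Text
from transformers import LlamaForCausalLM

# needs `pip install accelerate`; maps the weights onto your GPU automatically at load time
model = LlamaForCausalLM.from_pretrained(model_type, device_map="auto")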
Sorry, first time working with a GPU