ah gotcha. Yeah, that example is specific to using Replicate
If you have the model files locally, you're probably using huggingface, right?
You can load the model and tokenizer, and pass them into our huggingface wrapper. It might look something like this (of course, move the model to cuda if you have the GPU for it, otherwise it will be slow af haha)
I realize this is more complex, but that's how local llms go right now haha
from transformers import LlamaForCausalLM, LlamaTokenizer

# load the tokenizer + weights from wherever you have them on disk
tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")
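for the cuda part I mentioned, it's basically just this (assuming a single GPU with enough VRAM -- the half precision cast is optional but helps it fit):

import torch

# move the model to the GPU if one is available (fp16 to save VRAM)
if torch.cuda.is_available():
    model = model.half().to("cuda")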
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import Prompt
# llama-2 chat formatting tokens
BOS, EOS = "<s>", "</s>"
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
system_prompt_str = """\
You are a helpful, respectful and honest assistant. \
Always answer as helpfully as possible, while being safe. \
Your answers should not include any harmful, unethical, racist, sexist, toxic, \
dangerous, or illegal content. Please ensure that your responses are socially \
unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, \
explain why instead of answering something not correct. \
If you don't know the answer to a question, please don't share false information.\
"""
# wrap the user query in the llama-2 chat format -- llama-index fills in {query_str} at query time
query_wrapper_prompt = Prompt(
    f"{BOS}{B_INST} {B_SYS}{system_prompt_str.strip()}{E_SYS}"
    "{query_str} "
    f"{E_INST}"
)
llm = HuggingFaceLLM(
    tokenizer=tokenizer,
    model=model,
    context_window=4096,
    max_new_tokens=256,
    # do_sample=False means greedy decoding, so temperature only matters if you flip do_sample to True
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_kwargs={"max_length": 4096},
)
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(llm=llm)
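then you just use that service_context like normal when you build your index. a quick sketch, assuming you have some files sitting in ./data (swap in whatever loader/index you're actually using):

from llama_index import VectorStoreIndex, SimpleDirectoryReader

# load your docs and build an index backed by the local llama model
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
print(query_engine.query("What do my documents say about X?"))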