Hello everyone,

I intend to use openchat_3.5 as my Large Language Model (LLM) instead of ChatGPT for Retrieval-Augmented Generation. To achieve this, I've downloaded the openchat_3.5.Q8_0.gguf model onto my computer. I'm using the llama_cpp library to load the model, as illustrated below:
Plain Text
from llama_cpp import Llama

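# load the local GGUF weights; n_gpu_layers offloads layers to the GPU, n_ctx sets the context window size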
llm = Llama(model_path="/Users/developer/ai/models/openchat_3.5.Q8_0.gguf", n_gpu_layers=1, n_ctx=2048)

Now, I'm seeking guidance on how to link LlamaIndex to the local LLM, such as openchat_3.5.Q8_0.gguf.

Thank you.
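For reference, LlamaIndex also ships a direct llama.cpp wrapper that can point at a local GGUF file, which avoids the Hugging Face route entirely. Below is a minimal sketch of that approach, assuming the installed llama-index version provides llama_index.llms.LlamaCPP; the parameter values mirror the llama_cpp call above, and the data directory is purely illustrative:
Plain Text
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import LlamaCPP

# point the wrapper at the local GGUF file; it uses llama-cpp-python under the hood
llm = LlamaCPP(
    model_path="/Users/developer/ai/models/openchat_3.5.Q8_0.gguf",
    temperature=0.25,
    max_new_tokens=256,
    context_window=2048,
    model_kwargs={"n_gpu_layers": 1},
    verbose=True,
)

# note: embeddings still default to OpenAI unless you also pass embed_model (e.g. embed_model="local")
service_context = ServiceContext.from_defaults(llm=llm)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("What is this document about?"))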
Hi,
I used Hugging Face to load other models; it makes things much easier. (https://gpt-index.readthedocs.io/en/latest/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.html)
IDK if this answers your question, I'm just a newbie here, but I hope it helps 🙂
However, I used the AutoModelForCausalLM.from_pretrained method; you can find some samples for that, too.
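For context, that route looks roughly like the sketch below. It assumes the HuggingFaceLLM constructor accepts pre-loaded model and tokenizer objects, and "openchat/openchat_3.5" is used purely as the example Hub repo:
Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM

# download the weights from the Hugging Face Hub (or reuse the local cache)
tokenizer = AutoTokenizer.from_pretrained("openchat/openchat_3.5")
model = AutoModelForCausalLM.from_pretrained("openchat/openchat_3.5")

# hand the already-loaded objects to LlamaIndex instead of passing repo names
llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    context_window=2048,
    max_new_tokens=256,
)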
Hi @Semirke,
Thank you for the response. I have reviewed the information at https://docs.llamaindex.ai/en/stable/examples/customization/llms/SimpleIndexDemo-Huggingface_camel.html, and it seems to align with my requirements. I have a question: does the HuggingFaceLLM class connect directly to Hugging Face? If so, could you explain the significance of the following parameters:
Plain Text
tokenizer_name="Writer/camel-5b-hf",
model_name="Writer/camel-5b-hf",

As I'm relatively new to the AI field, my goal is to develop Large Language Model applications.
Best regards
look for "huggingface models", you'll see they have a model repository
it downloads automagically (the public ones)
The following code doesn't run:
Plain Text
import logging
import sys
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = PromptTemplate(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)


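# model_name / tokenizer_name refer to a Hugging Face Hub repository; the weights are downloaded to the local HF cache on first run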
llm = HuggingFaceLLM(
    context_window=2048,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.25, "do_sample": False},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="openchat/openchat_3.5",
    model_name="openchat/openchat_3.5",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # uncomment this if using CUDA to reduce memory usage (requires an `import torch` at the top)
    # model_kwargs={"torch_dtype": torch.float16}
)
service_context = ServiceContext.from_defaults(chunk_size=512, llm=llm)
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
print(response)
Could you help me?
The interpreter is complaining:
Plain Text
  File "/Users/developer/Library/Caches/pypoetry/virtualenvs/playground-2AP3SaSf-py3.11/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2674, in from_pretrained
    raise ImportError(
ImportError: Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate: `pip install accelerate`
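As the traceback itself says, device_map="auto" relies on the accelerate package, so installing it into the same poetry virtualenv should clear this up:
Plain Text
pip install accelerate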
It seems to work now, after installing accelerate.
Yeah, it's always worth actually reading the output messages 😄