Could anyone familiar with getting LlamaIndex working with llama.cpp on macOS/Apple Silicon help me get the GPU to work?

At a glance

A community member asks for help getting LlamaIndex working with llama.cpp on macOS/Apple Silicon, specifically getting the GPU to be used. Another member walks through the fix: install llama-cpp-python with Metal support, install the LlamaIndex packages, and configure the LlamaCPP object with the n_gpu_layers parameter set to -1 so all layers are offloaded to the GPU. The original poster confirms the solution worked and the GPU is now in use.

Could anyone familiar with getting LlamaIndex working with llama.cpp on macOS/Apple Silicon please message me to help me with something? It has to do with getting the GPU to work.
Just for you, I spun this up on my mac πŸ˜‰

Here are the steps

In a fresh terminal
Plain Text
python -m venv venv
source venv/bin/activate
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dir --force-reinstall
pip install llama-index llama-index-llms-llama-cpp
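
If the Metal build went through, you can sanity-check the install before touching LlamaIndex. A minimal sketch; note that llama_supports_gpu_offload is a low-level binding whose availability can vary across llama-cpp-python versions, and that newer releases may expect CMAKE_ARGS="-DGGML_METAL=on" instead of the LLAMA_METAL flag:

Plain Text
import llama_cpp

# Version of the installed binding
print(llama_cpp.__version__)

# Should print True on a Metal-enabled build (this low-level binding
# may not exist on older llama-cpp-python versions)
print(llama_cpp.llama_supports_gpu_offload())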


Then, I ran this code

Plain Text
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

# For illustration: the Llama 2 13B chat GGUF used in the LlamaIndex docs
# example; any GGUF model URL works here
model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # -1 offloads all layers to the GPU; a value >= 1 offloads that many layers
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
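
For a quick end-to-end check that the model loads and generates, here is a single completion call in the style of the LlamaIndex docs (the prompt is just an example):

Plain Text
# Smoke test: one completion through the Metal-backed model
response = llm.complete("Hello! Can you tell me a poem about cats and dogs?")
print(response.text)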


And in the terminal, I see
Plain Text
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU


I get about 20 tokens/sec after testing with a few prompts
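
The 41/41 line is the tell: all 40 transformer layers plus the non-repeating output layer were offloaded, which is exactly what n_gpu_layers=-1 requests. If you want to put a rough number on throughput yourself, here is a minimal timing sketch (the prompt is arbitrary, and each streamed chunk is counted as roughly one token):

Plain Text
import time

start = time.perf_counter()
n_chunks = 0
# stream_complete yields incremental responses; count the chunks
# as a rough proxy for generated tokens
for chunk in llm.stream_complete("Explain what Metal offloading does."):
    n_chunks += 1
elapsed = time.perf_counter() - start
print(f"~{n_chunks / elapsed:.1f} tokens/sec")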
Wow ok thanks πŸ˜…
Lemme run it and try
And yeah, that's what I'm looking for in my terminal as well
It's working, thank you so much!! While reinstalling I discovered that I had installed llama-index at some point in the past, so I had to go to my original Python location and delete the stale package files
But yeah, that plus setting n_gpu_layers to -1 got it working
I finally got the GPU to be used
Thank you for putting in the effort
As usual it was user error πŸ˜…
haha no worries! Glad to get it sorted
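
For anyone who hits the same stale-install problem: instead of hunting for leftover files by hand, you can ask Python where it is importing the packages from. A small diagnostic sketch using only the package names from this thread:

Plain Text
import importlib.metadata
import llama_index.core

# Where is llama-index actually imported from? If this path points
# outside your venv, an old install is shadowing the fresh one.
print(llama_index.core.__file__)

# Versions of the packages installed in the active environment
for pkg in ("llama-index", "llama-cpp-python"):
    print(pkg, importlib.metadata.version(pkg))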