Could anyone familiar with getting

At a glance

A community member asks for help getting LlamaIndex working with llama.cpp on macOS/Apple Silicon, specifically getting the GPU to be used. Another community member provides step-by-step instructions: installing the necessary packages with Metal support and configuring the LlamaCPP object with the n_gpu_layers parameter set to -1 so that all layers are offloaded to the GPU. The original poster confirms that the solution worked, after also cleaning out an older llama-index install, and that the GPU is now being used.

digital_dream64

Could anyone familiar with getting LlamaIndex working with llama.cpp on macOS/Apple Silicon please message me to help me with something? It has to do with getting the GPU to work.

10 comments

Logan M

Just for you, I spun this up on my Mac 😉

Here are the steps

In a fresh terminal

python -m venv venv
source venv/bin/activate
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --no-cache-dir --force-reinstall
pip install llama-index llama-index-llms-llama-cpp

Then, I ran this code

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

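# model_url is not defined in the original post; set it to the download URL of your model file
model_url = "..."  # placeholder, replace before running

llm = LlamaCPP(
    # You can pass in the URL to a model file (GGUF for current llama.cpp builds) to download it automatically
    model_url=model_url,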
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # -1 offloads all layers to the GPU; 0 keeps everything on the CPU
    model_kwargs={"n_gpu_layers": -1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
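
To make the n_gpu_layers setting a bit more concrete, here is a minimal sketch of how the value maps to the number of layers that end up on the GPU. The helper name effective_gpu_layers is made up for illustration and is not part of llama-cpp-python or LlamaIndex; treating other negative values as 0 is an assumption of this sketch, not documented behaviour.

```python
def effective_gpu_layers(n_gpu_layers: int, total_layers: int) -> int:
    """Hypothetical helper: how many layers land on the GPU for a given setting.

    -1 means "offload everything"; a positive value is capped at the
    number of layers the model actually has.
    """
    if n_gpu_layers == -1:
        return total_layers
    # Assumption of this sketch: any other negative value counts as "no offload".
    return max(0, min(n_gpu_layers, total_layers))


if __name__ == "__main__":
    # 41 = 40 repeating layers + the non-repeating layers, as in the log in this thread
    print(effective_gpu_layers(-1, 41))  # 41
    print(effective_gpu_layers(10, 41))  # 10
    print(effective_gpu_layers(0, 41))   # 0
```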

And in the terminal, I see

llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
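
If you would rather check this programmatically than eyeball the terminal, a rough sketch along these lines can pull the offload count out of the log text. The function name parse_offload is hypothetical, and the regex assumes the log line looks like the one above; the exact wording may differ between llama.cpp versions.

```python
import re
from typing import Optional, Tuple

# Matches lines such as "llm_load_tensors: offloaded 41/41 layers to GPU"
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def parse_offload(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (offloaded, total) from llama.cpp load output, or None if absent."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = """llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU"""

result = parse_offload(sample)
print(result)  # (41, 41)
if result is not None and result[0] == 0:
    print("No layers were offloaded; the GPU is not being used.")
```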

I get about 20 tokens/sec after testing with a few prompts
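
For anyone who wants to measure throughput themselves, here is one way it could be done. This is only a sketch: tokens_per_second and fake_generate are made-up names, and counting whitespace-separated words is a crude stand-in for real token counts, so the result is approximate. With the llm object from above you could presumably pass something like lambda p: llm.complete(p).text as the generate function.

```python
import time
from typing import Callable


def tokens_per_second(
    generate: Callable[[str], str],
    prompt: str,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> float:
    """Time one generation call and return an approximate tokens/sec figure."""
    start = time.perf_counter()
    text = generate(prompt)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return count_tokens(text) / elapsed


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without a model."""
    time.sleep(0.05)
    return "one two three four five six seven eight nine ten"


if __name__ == "__main__":
    rate = tokens_per_second(fake_generate, "Hello")
    print(f"approx. {rate:.1f} tokens/sec")
```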

digital_dream64

Wow ok thanks 😅

digital_dream64

Lemme run it and try

digital_dream64

And yeah, that's what I'm looking for in my terminal as well

digital_dream64

It's working, thank you so much!! While reinstalling, I discovered that I had installed llama-index at some point in the past, so I had to go to my original Python location and delete the files in its packages directory
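
A quick way to spot this kind of stale install is to check which interpreter is running and where the packages are actually being imported from. The sketch below is just one way to do that with the standard library; the helper names in_virtualenv and module_location are made up for illustration.

```python
import importlib.util
import sys
from typing import Optional


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv."""
    return sys.prefix != sys.base_prefix


def module_location(name: str) -> Optional[str]:
    """Return where a module would be imported from, or None if it is not installed."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no single origin file, only search locations
    locations = spec.submodule_search_locations
    return next(iter(locations), None) if locations else None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside venv:", in_virtualenv())
    for name in ("llama_index", "llama_cpp"):
        print(name, "->", module_location(name))
```

If a reported location is outside the venv directory, an older copy from another Python install is probably being picked up.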

digital_dream64

But yeah, that and setting n_gpu_layers to -1 got it working

digital_dream64

I finally got the GPU to be used

digital_dream64

Thank you for putting in the effort

digital_dream64

As usual it was user error 😅

Logan M

haha no worries! Glad to get it sorted
