A community member is trying to use GPU acceleration for inference with the llama-cpp-python library. They have CUDA and cuBLAS set up, but are not seeing the expected "load 1/X layer to GPU" message in the logs. The community members discuss various troubleshooting steps, including passing a specific CMAKE_ARGS value at install time, using PowerShell on Windows, and checking for a missing CUDA toolset. While some community members report success, the original poster is still experiencing issues: inference takes a long time and the response is not displayed properly. There is no explicitly marked answer in the comments.
Hi guys! Got a question about using the GPU to accelerate inference. The environment should be all set; I have CUDA and cuBLAS set up for llama-cpp-python. Then I run the following code for the LLM.
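(The original code snippet is not preserved in the thread. Below is a minimal sketch of the kind of llama-cpp-python setup being described, using a hypothetical model path; the key detail for GPU offload is the n_gpu_layers argument, which must be nonzero for any layers to be placed on the GPU.)

    from llama_cpp import Llama

    # Hypothetical model path; replace with your own GGUF file.
    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
        n_gpu_layers=-1,   # offload all layers to the GPU (0 = CPU only)
        n_ctx=2048,
        verbose=True,      # print load info, including how many layers were offloaded
    )

    output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(output["choices"][0]["text"])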
Hi! I have done that too. The only difference is that CMAKE_ARGS gave a "not recognized as an internal or external command" error, so I had to use set CMAKE_ARGS instead; maybe that is the problem. I'll probe more into this.
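(The CMAKE_ARGS=... prefix syntax only works in Unix-style shells, which is why cmd.exe rejects it. A rough sketch of the Windows equivalents, assuming the older cuBLAS build flag in use around the time of this thread; set the variable first, then reinstall from source so the flag actually takes effect:)

    :: cmd.exe
    set CMAKE_ARGS=-DLLAMA_CUBLAS=on
    pip install llama-cpp-python --force-reinstall --no-cache-dir

    # PowerShell
    $env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
    pip install llama-cpp-python --force-reinstall --no-cache-dir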
It takes 4 minutes to answer a query and there is no message about GPU usage, so I guess it's still not being accelerated. The response is also not displayed properly.
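(One way to confirm whether the GPU is actually being used: load the model with verbose=True, check the startup log for offloaded layers, and time a short generation; watching nvidia-smi in another terminal during generation is another quick check. A rough sketch, reusing the hypothetical model path from above:)

    import time
    from llama_cpp import Llama

    # verbose=True makes llama.cpp print its load and system info, which shows
    # whether layers were offloaded and whether the CUDA/BLAS backend is active.
    llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
                n_gpu_layers=-1, verbose=True)

    start = time.time()
    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    elapsed = time.time() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_tokens} tokens in {elapsed:.1f}s ({n_tokens / elapsed:.2f} tok/s)")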