GPU acceleration problem (unsolved)

Hi guys! Got a question on using the GPU to accelerate inference. The environment should be all set: I have CUDA and cuBLAS set up for llama-cpp-python. Then I run the following code to load the LLM:
llm = LlamaCPP(
    model_url='https://huggingface.co/TheBloke/zephyr-7B-alpha-GGUF/resolve/main/zephyr-7b-alpha.Q4_K_M.gguf',
    temperature=0.3,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 1},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
However, I did not get the "load 1/X layers to GPU" message I expected.
This is the output, in case it is of any help:
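For reference, when offloading works, llama.cpp's verbose log contains a line of the form "offloaded X/Y layers to GPU". A small stdlib sketch (the sample log line below is illustrative) that checks a captured log for it:

```python
import re

def gpu_layers_offloaded(log_text: str) -> int:
    """Return the number of layers llama.cpp reports offloading to the GPU,
    or 0 if no offload line appears in the verbose log."""
    match = re.search(r"offloaded (\d+)/\d+ layers to GPU", log_text)
    return int(match.group(1)) if match else 0

# Illustrative verbose output when cuBLAS offload is active:
log = "llm_load_tensors: offloaded 1/33 layers to GPU"
print(gpu_layers_offloaded(log))  # → 1
```

If this returns 0 on your captured output, the wheel was almost certainly built without GPU support and is falling back to CPU.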
Check out this thread: https://discord.com/channels/1059199217496772688/1059200010622873741/1183550467855356004
TL;DR: try this:
Plain Text
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

https://github.com/abetlen/llama-cpp-python#cublas
Hi! I have done that too. The only difference is that CMAKE_ARGS gives a "not recognized as an internal or external command" error, so I had to use set CMAKE_ARGS instead. Maybe that is the problem; I'll probe more into this.
If you use Windows, try PowerShell. Read the thread I sent; there might be some extra info you could use.
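The env-var syntax depends on the shell, which is why the plain CMAKE_ARGS=... prefix from the Linux/macOS command fails on Windows. A sketch of the equivalents (same pip command either way):

```shell
# cmd.exe: `set` makes the variable visible to the pip subprocess
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

# PowerShell: use the $env: prefix instead
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```

Note that with `set`, the variable only persists for the current cmd session, so the install must run in the same window.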
Thanks! I managed to make the command run. The first time I installed, nothing happened; now, running the command you gave, I got an error.
I guess it's just a problem of a missing CUDA toolset.
It actually works when I use VS Code but not when I do it through the Anaconda prompt.
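When it works in one shell but not another, the two are often resolving different Python environments (and hence different llama-cpp-python builds). A quick stdlib check to run in both VS Code and the Anaconda prompt and compare:

```python
import sys
import importlib.util

# Which interpreter is this shell actually running?
print(sys.executable)

# Where (if anywhere) is llama-cpp-python installed for this interpreter?
spec = importlib.util.find_spec("llama_cpp")
print(spec.origin if spec else "llama_cpp not installed in this environment")
```

If the two shells print different paths, the cuBLAS-enabled build landed in only one of the environments.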
It takes 4 minutes to answer a query, and there is no message about GPU usage, so I guess it's still not being accelerated. The response is also not displayed properly.