A community member is trying to use GPU acceleration for inference with the llama-cpp-python library. They have CUDA and cuBLAS set up, but are not seeing the expected "load 1/X layer to GPU" message in the logs. The community members discuss various troubleshooting steps, including passing a specific CMAKE_ARGS value at install time, using PowerShell on Windows, and checking for a missing CUDA toolset. While some community members report success, the original poster is still experiencing issues: inference takes a long time and the response is not displayed properly. There is no explicitly marked answer in the comments.
Hi guys! Got a question about using the GPU to accelerate inference. The environment should be all set; I have CUDA and cuBLAS set up for llama-cpp-python. Then I run the following code for the LLM.
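(The original code snippet is not preserved in the thread. Below is a minimal sketch of the kind of llama-cpp-python setup being described, using a hypothetical model path; the key detail for GPU offload is the n_gpu_layers argument, which must be nonzero for any layers to be placed on the GPU.)

    from llama_cpp import Llama

    # Hypothetical model path; replace with your own GGUF file.
    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
        n_gpu_layers=-1,   # offload all layers to the GPU (0 = CPU only)
        n_ctx=2048,
        verbose=True,      # print load info, including how many layers were offloaded
    )

    output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(output["choices"][0]["text"])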
Hi! I have done that too. The only difference is that CMAKE_ARGS gave a "not recognized as an internal or external command" error, so I had to use set CMAKE_ARGS instead; maybe that is the problem. I'll probe more into this.
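(The CMAKE_ARGS=... prefix syntax only works in Unix-style shells, which is why cmd.exe rejects it. A rough sketch of the Windows equivalents, assuming the older cuBLAS build flag in use around the time of this thread; set the variable first, then reinstall from source so the flag actually takes effect:)

    :: cmd.exe
    set CMAKE_ARGS=-DLLAMA_CUBLAS=on
    pip install llama-cpp-python --force-reinstall --no-cache-dir

    # PowerShell
    $env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
    pip install llama-cpp-python --force-reinstall --no-cache-dir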
It takes 4 minutes to answer a query and there is no message about GPU usage, so I guess it's still not being accelerated. The response is also not displayed properly.
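(One way to confirm whether the GPU is actually being used: load the model with verbose=True, check the startup log for offloaded layers, and time a short generation; watching nvidia-smi in another terminal during generation is another quick check. A rough sketch, reusing the hypothetical model path from above:)

    import time
    from llama_cpp import Llama

    # verbose=True makes llama.cpp print its load and system info, which shows
    # whether layers were offloaded and whether the CUDA/BLAS backend is active.
    llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
                n_gpu_layers=-1, verbose=True)

    start = time.time()
    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    elapsed = time.time() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"{n_tokens} tokens in {elapsed:.1f}s ({n_tokens / elapsed:.2f} tok/s)")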