Hi guys! I have a question about using the GPU to accelerate inference. The environment should be all set: I have CUDA and cuBLAS set up for llama-cpp-python. Then I run the following code to load the LLM
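For reference, a minimal sketch of the settings usually needed for GPU offload with llama-cpp-python. The model path is a placeholder, and the helper function below is just my own illustration of the keyword arguments; the key parameter is `n_gpu_layers` (the default of 0 keeps everything on the CPU):

```python
def gpu_llm_kwargs(model_path, n_gpu_layers=-1, n_ctx=2048):
    """Build keyword arguments for llama_cpp.Llama with GPU offload.

    n_gpu_layers=-1 asks llama.cpp to offload all layers to the GPU;
    pass a smaller number if you run out of VRAM.
    """
    return {
        "model_path": model_path,
        "n_gpu_layers": n_gpu_layers,
        "n_ctx": n_ctx,
        "verbose": True,  # keep logs on so the BLAS/offload lines are visible
    }

# Usage (requires llama-cpp-python built with cuBLAS and a GGUF model on disk):
# from llama_cpp import Llama
# llm = Llama(**gpu_llm_kwargs("model.gguf"))
```

If `n_gpu_layers` is left at its default, the model runs entirely on the CPU even when the wheel was built with CUDA support.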
Hi! I have done that too. The only difference is that CMAKE_ARGS gave a "not recognized as an internal or external command" error, so I had to use set CMAKE_ARGS instead. Maybe that is the problem; I'll probe more into this.
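That error is what Windows cmd prints when you use the bash-style VAR=value prefix. A sketch of the Windows cmd equivalent (assuming the common cuBLAS build flags; --force-reinstall and --no-cache-dir make sure pip rebuilds rather than reusing a cached CPU-only wheel):

```shell
:: Windows cmd: set the build variables first, then reinstall from source
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --no-cache-dir
```

If the variables are set after the package was already installed, the previously built CPU-only binary stays in place, which would explain GPU support not kicking in.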
It takes 4 minutes to answer a query, and there is no message in the logs about GPU usage, so I guess it's still not being accelerated. The response is also not displayed properly.
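One way to check is to run with verbose=True and look at the startup log. In my experience a CUDA-enabled build prints markers along the lines of "BLAS = 1" in the system-info line and an "offloaded N/N layers to GPU" message; a small helper to scan captured output (the exact log wording may vary between llama.cpp versions, so treat the strings below as an approximation):

```python
def looks_gpu_accelerated(log_text):
    """Heuristic check of llama.cpp verbose output for signs of GPU offload.

    Looks for the markers a CUDA build typically prints: a 'BLAS = 1'
    flag in the system-info line, or an 'offloaded ... GPU' message.
    """
    has_blas = "BLAS = 1" in log_text
    has_offload = "offloaded" in log_text and "GPU" in log_text
    return has_blas or has_offload

# Illustrative CPU-only fragment (shape approximated from llama.cpp output):
cpu_log = "system_info: n_threads = 8 | AVX = 1 | BLAS = 0 |"
print(looks_gpu_accelerated(cpu_log))
```

Watching nvidia-smi while a query runs is another quick sanity check: if GPU memory and utilization stay flat, the layers were never offloaded.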