I think Logan mentioned somewhere that if a GPU is present, the embedding model will use it directly.
Hmm, weird... I'm getting high CPU usage
What am I supposed to set the n_gpu_layers kwarg to?
Ok, I can confirm that even during inference, it's not using the GPU
It was working before, not sure what I changed. Steps to troubleshoot? I'm gonna play with n_gpu_layers and also double-check the LLM settings
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
That's from the verbose output
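(For context, I got those lines by constructing the LlamaCPP LLM with verbose=True — rough sketch of what I'm running, assuming the newer modular import path; the model path is just a placeholder:)

from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    verbose=True,  # prints the ggml_metal_init lines above when the model loads
)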
I work with Windows and Ubuntu machines, lol, so I can't be of any help with Mac 😅
I'll pose the question on the main thread then
I think you can wait for Logan
Just as an update, I just tested my regular install of llama.cpp and the GPU usage goes up to 75%
so it's definitely something I've done wrong
I suspect it occurred when I tried my fresh install of llama-index
Looks like I need to clean up and try again later
Ah great! That's a good sign
Ok so I've fully removed llama-index and llama-cpp-python from my virtual environment
I suppose I need to ask if there are specific instructions to be followed for metal to work
The documentation on the page is outdated: it talks about llama-cpp-python version 1.6, whereas when I compiled and installed it, it was on 2.6
Yeah, I'm unable to get this to work
(I would just use Ollama low-key, llama-cpp is a nightmare)
Interesting ok lemme try that
It’s just the implementation of llama-cpp through LlamaIndex that’s not working though
Regular llama-cpp works fine
I’ll try using Ollama, but its documentation for use with LlamaIndex seems to assume you're just pointing it at a server that's already running. I wanted to run a single instance in each script without running the server
Should I use a different llm integration?
@Logan M is there any way for me to get llama-cpp working in llamaindex?
Please let me know if tagging you is not allowed
Ollama is a server, yes; it's just way easier to configure compared to llama.cpp. Tbh I much prefer using it, but that's just me
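Rough sketch of the Ollama integration, assuming the server is already running locally on the default port and you've pulled a model (the model name here is just an example):

from llama_index.llms.ollama import Ollama

# Assumes `ollama serve` is running and `ollama pull llama2` has been done already
llm = Ollama(model="llama2", request_timeout=120.0)
print(llm.complete("Say hello in one short sentence."))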
I'm not a llama cpp expert. I just know there's super specific installation instructions, plus you have to set n_gpu_layers to -1
or some other non-zero value
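If you want to sanity-check GPU offload outside of llama-index first, something like this with plain llama-cpp-python should do it (the model path is just a placeholder):

from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 offloads every layer; any positive number offloads that many
    verbose=True,     # watch for the ggml_metal_init lines to confirm Metal is active
)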
Yeah the gpu layers is specific and weird for sure
I think you have to set it to 1
Also, I may have somewhat figured out where the problem is coming from. Where/who should I speak to in case it's a bug? I'm not used to submitting bugs and requests; it will probably be my first time.
I think the llama-index llama-cpp utils are not updated to use the GPU-specific version of llama-cpp
Like, I know llama-cpp-python has specific instructions for installing on Metal
CMAKE_ARGS="-DLLAMA_METAL=on" pip install -U llama-cpp-python --no-cache --force-reinstall
I tried it multiple times
Finally, I went backwards
I did the llama cpp install first
And then the code didn’t recognize llm=LlamaCPP
Until I manually installed llama-index-llms-llama-cpp
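(For reference, this is the import that only started resolving after that install — just showing what I mean:)

# Fails with ImportError until llama-index-llms-llama-cpp is installed
from llama_index.llms.llama_cpp import LlamaCPP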
So clearly the LlamaCPP parameters are being picked up by something in LlamaIndex, but for whatever reason it isn’t hooking into regular llama-cpp-python in a way that actually runs it on the GPU
I really would just use ollama and figure out the server thing. This isn't worth the headache to run an LLM at 20 Tokens/Second lol
Ok I'll take a look at the link you've sent as well then.
I hope you don't mind that I'm fixated on llama-cpp... as far as I know, Ollama doesn't work with my use case, which is that each script will run a different LLM each time. I'll attempt to ask about it on the main group again if that's alright. Thanks for your help thus far.
I took a look at the link you sent, and interestingly there's no field for n_gpu_layers in it
I think that parameter isn't getting passed through to llama-cpp
n_gpu_layers is passed in with model_kwargs
llm = LlamaCPP(
...
model_kwargs={"n_gpu_layers": -1},
)
I think it's a positive 1 if I'm not wrong... I'll try the negative 1 just to be sure
I'm going off the documentation
-1 will offload all layers to GPU
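Putting it all together, something like this is what I'd expect to work — just a sketch, the model path and generation settings are examples:

from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # example path
    temperature=0.1,
    max_new_tokens=256,
    context_window=4096,
    model_kwargs={"n_gpu_layers": -1},  # offload all layers to the GPU
    verbose=True,  # the ggml_metal_init lines should show up during load if Metal is used
)
print(llm.complete("Hello"))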