The post describes an error when trying to load a model file, and the comments discuss issues with using the llama.cpp library. Community members suggest using ollama instead, since it has less configuration overhead and can run any gguf model. They also mention performance problems with llama.cpp, such as high CPU usage, and share links to alternative resources for working with LLMs from Python.
Eventually I got it to run, but two issues remain. First, I get this warning:

.conda/lib/python3.12/site-packages/llama_cpp/llama.py:1138: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it... warnings.warn(
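For what it's worth, llama-cpp-python prepends the BOS token to the prompt itself, so this warning usually means the prompt string (or the tutorial's prompt template) already starts with a literal "<s>". A minimal sketch of calling the library without a hand-written "<s>" (the model path is a placeholder):

```python
from llama_cpp import Llama

# placeholder path; point this at your actual gguf file
llm = Llama(model_path="./model.gguf", verbose=False)

# llama.cpp adds "<s>" (BOS) on its own, so the prompt string should
# not start with a literal "<s>" -- that is what triggers the warning
prompt = "Q: What is the capital of France? A:"
out = llm(prompt, max_tokens=32, stop=["\n"])
print(out["choices"][0]["text"])
```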
Second, it was killing my CPU: usage went up to 99%. I rarely had issues like that with ollama, even with much larger LLMs, which is very strange, since I thought llama-index was much more optimized. I am following the tutorial code very closely, so I am not sure what I am missing here.
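One common cause of a pegged CPU is that llama.cpp falls back to CPU-only inference with many threads when no layers are offloaded to the GPU. A hedged sketch of constructor options that cap CPU usage, assuming the same placeholder model path and a GPU-enabled build of llama-cpp-python:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model.gguf",  # placeholder path
    n_threads=4,       # cap CPU threads instead of using (close to) all cores
    n_gpu_layers=-1,   # offload all layers to the GPU; needs a CUDA/Metal build
    n_ctx=2048,        # context window size
)
```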
My goal is to get much more granular control over my hardware, so that I can customize GPU layers and run benchmarks or other applications from Python.
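As a starting point for that kind of benchmarking, here is a rough sketch (model path, prompt, and layer counts are all placeholders) that reloads the model with different n_gpu_layers values and compares generation throughput:

```python
import time
from llama_cpp import Llama

# try a few offload settings; -1 means offload every layer to the GPU
for n_layers in (0, 16, 32, -1):
    llm = Llama(model_path="./model.gguf", n_gpu_layers=n_layers, verbose=False)
    start = time.perf_counter()
    out = llm("Write one sentence about benchmarks.", max_tokens=128)
    elapsed = time.perf_counter() - start
    tokens = out["usage"]["completion_tokens"]
    print(f"n_gpu_layers={n_layers}: {tokens / elapsed:.1f} tok/s")
    del llm  # free the model before loading the next configuration
```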