
Updated last month

Value Error: Failed to Load Model from File

At a glance

The post reports a ValueError when trying to load a model file, and the comments point out that the GGML format it uses is no longer supported by llama.cpp. Community members suggest using ollama instead, as it has less configuration overhead and can run any GGUF model. They also mention performance issues with llama.cpp, such as CPU usage climbing to 99%, and share links to other resources for working with Python-based LLMs.

Useful resources
ValueError: Failed to load model from file: /tmp/llama_index/models/llama-2-13b-chat.ggmlv3.q4_0.bin
13 comments
GGML is not supported by llama.cpp anymore; it's a very old format.
But tbh, I wouldn't use llama.cpp.
Use ollama; there is way too much config to worry about with llama.cpp.
ollama can run any GGUF model.
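For anyone going the ollama route, here is a minimal sketch using the official ollama Python client (pip install ollama). The model tag below is only an example, not something from this thread; a local GGUF file would first be registered with `ollama create <name> -f Modelfile`, where the Modelfile's FROM line points at the .gguf file, and then queried the same way.

```python
# Minimal sketch: chatting with a model that is already available in ollama.
# "llama2:13b-chat" is an example tag, not taken from this thread.
import ollama

response = ollama.chat(
    model="llama2:13b-chat",
    messages=[{"role": "user", "content": "What is the GGUF format?"}],
)
print(response["message"]["content"])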
Eventually I got it to run, but for one, I get this warning: .conda/lib/python3.12/site-packages/llama_cpp/llama.py:1138: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it...
warnings.warn(
And for two, it was killing my CPU; usage went up to 99%. I rarely had any such issues with ollama, even with much larger LLMs, which is very strange, since I thought llama-index was much more optimized. But I am very closely following whatever code was given in the tutorial, so I am not sure what I am missing here.
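On the first point, a rough sketch of how that duplicate-"<s>" warning typically arises with llama-cpp-python: the library prepends the BOS token itself when tokenizing, so a prompt string that already starts with "<s>" ends up with two of them. The model path and prompts below are assumptions for illustration.

```python
# Sketch of the duplicate-BOS situation with llama-cpp-python.
# Llama() adds the "<s>" BOS token during tokenization, so a prompt that
# already begins with "<s>" triggers the RuntimeWarning quoted above.
from llama_cpp import Llama

llm = Llama(model_path="./llama-2-13b-chat.Q4_0.gguf")  # example path

bad_prompt = "<s>[INST] What is GGUF? [/INST]"   # duplicate leading <s> -> warning
good_prompt = "[INST] What is GGUF? [/INST]"     # let the library add <s> itself

out = llm(good_prompt, max_tokens=64)
print(out["choices"][0]["text"])
```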
My idea is to have much more granular control over my hardware, so that I can customize GPU layers and do benchmarks or other applications with Python.
That was my intention.
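If it helps, a small sketch of that kind of control with llama-cpp-python: choosing how many layers to offload to the GPU and timing a short generation as a crude benchmark. All values below are assumptions, not settings from the thread.

```python
# Sketch: offloading a chosen number of layers to the GPU and timing a
# short generation with llama-cpp-python. All values are illustrative.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-13b-chat.Q4_0.gguf",
    n_gpu_layers=35,   # layers to offload to the GPU (-1 = offload all)
    n_threads=8,       # CPU threads for the layers that stay on the CPU
    n_ctx=4096,
)

start = time.perf_counter()
out = llm("Summarize what a quantized model is.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f}s ({n_tokens / elapsed:.1f} tok/s)")
```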
I am also considering following this guide to learn more about python-llama-cpp.
It does not seem to be using llama-index, though.
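To keep llama-index in the picture, here is a sketch of pointing its LlamaCPP wrapper (from the llama-index-llms-llama-cpp package) at a local GGUF file; the path and keyword arguments are assumptions, not values from the thread.

```python
# Sketch: using a local GGUF model through llama-index's LlamaCPP wrapper,
# which drives llama-cpp-python under the hood. Path and values are examples.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="/tmp/llama_index/models/llama-2-13b-chat.Q4_0.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=4096,
    model_kwargs={"n_gpu_layers": 35},  # forwarded to llama_cpp.Llama
)

print(llm.complete("What does the GGUF format replace?").text)
```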