
Updated last year

I see `Using default LlamaCPP llama2 13b

I see Using default LlamaCPP=llama2-13b-chat when following the tutorial. What if I want to use TheBloke/Platypus2-70B-Instruct-GPTQ instead? Having a hard time finding any info on llama-index + GPTQ.
20 comments
And default is a GGML CPU model...
But if Platypus2 70B is based on llama.cpp, then you can use LlamaCPP and just point the model path to that? Not sure what GPTQ is, though.
4-bit Quantized
so that I can fit a 70B model on a 48GB GPU
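A rough back-of-the-envelope check of that claim, as a sketch: it counts only the weights and ignores the KV cache, activations, and the per-group scales that GPTQ stores, so real usage will be somewhat higher. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# 70B parameters at 16-bit vs. 4-bit precision.
fp16_gb = weight_memory_gb(70e9, 16)
int4_gb = weight_memory_gb(70e9, 4)

print(f"fp16 weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights: ~{int4_gb:.0f} GB")
```

At 4 bits the weights come to roughly a quarter of the fp16 size, which is what leaves headroom on a 48GB card.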
nice, you can either do a custom LLM or use hugging face
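A minimal sketch of the Hugging Face route, under several assumptions: that `transformers` can load the GPTQ checkpoint (which typically needs `optimum` and `auto-gptq` installed alongside it), that a GPU with enough memory is available, and that your llama-index version exposes `HuggingFaceLLM` at one of the two import paths tried below. The function name `build_gptq_llm` and the `max_new_tokens` value are illustrative, not from the thread.

```python
def build_gptq_llm(model_id: str = "TheBloke/Platypus2-70B-Instruct-GPTQ"):
    """Wrap a GPTQ checkpoint from the Hugging Face Hub as a llama-index LLM."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    try:
        from llama_index.llms.huggingface import HuggingFaceLLM
    except ImportError:
        # Fallback for versions that only export it from the parent package.
        from llama_index.llms import HuggingFaceLLM

    return HuggingFaceLLM(
        model_name=model_id,
        tokenizer_name=model_id,
        context_window=4096,   # Llama 2 context length
        max_new_tokens=256,    # arbitrary illustrative limit
        device_map="auto",     # let accelerate place layers on the available GPU(s)
    )
```

Calling `build_gptq_llm()` downloads the model, so it is left uncalled here; the returned object would then be passed wherever llama-index expects an `llm`.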
LOL yea, not many GGUF models out there yet. Been meaning to update this
Right, GGUF is the new GGML...for CPU folks.
any advice for keeping up with all of this stuff?!
The only time I've run transformers on CPU is when I've quantized a google/t5-efficient-mini model using CTranslate2.
TheBloke's Discord server is a great place for GGML, GGUF, GPTQ stuff.
He also uploads several quantized models to the HF hub every day. Just gotta check https://huggingface.co/models?sort=modified&search=thebloke
And the elephant in the room is that Meta still hasn't released the 34B version of Llama2, which (when quantized) would fit on a 24GB GPU. That is going to be a game changer.
@bmax @Logan M I tried to use a GPTQ model with LlamaCPP, and this is the error I got:
```text
llama.cpp: loading model from /opt/gptq/models/TheBloke_OpenOrca-Platypus2-13B-GPTQ/gptq_model-4bit-128g.safetensors
error loading model: unknown (magic, version) combination: 000288b0, 00000000; is this really a GGML file?
llama_init_from_file: failed to load model
```
Sounds like maybe it's hard-coded to prefer GGML.
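The `(magic, version)` pair in that error looks consistent with a safetensors file rather than a hard-coded preference: safetensors begins with an 8-byte little-endian header length followed by JSON, so a loader reading two 32-bit integers would see the low and high halves of that length. A small sketch for checking a file before handing it to a loader; the function name `sniff_model_format` and the 100 MB sanity bound are assumptions for illustration.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Legacy llama.cpp magics ('ggml', 'ggmf', 'ggjt'), stored as little-endian uint32.
LEGACY_GGML_MAGICS = {0x67676D6C, 0x67676D66, 0x67676A74}


def sniff_model_format(path: str) -> str:
    """Guess a model file's container format from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(9)
    if len(head) < 8:
        return "unknown"
    if head[:4] == GGUF_MAGIC:
        return "gguf"
    (magic,) = struct.unpack("<I", head[:4])
    if magic in LEGACY_GGML_MAGICS:
        return "ggml"
    # safetensors: uint64 header length, then a JSON object starting with '{'.
    (header_len,) = struct.unpack("<Q", head[:8])
    if head[8:9] == b"{" and 0 < header_len < 100_000_000:
        return "safetensors"
    return "unknown"
```

A result of `"safetensors"` for a `.safetensors` GPTQ checkpoint would mean the file is simply not in a format LlamaCPP reads, whatever version is installed.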
I'm not even sure if gptq works with llamacpp 🤔

If your llama-cpp-python version is 0.1.78 or older, it can use GGML (quantized down to 4 bits, maybe even less tbh)

Newer versions expect gguf files
ohhhh. I had 0.1.53 installed. oops.
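A quick way to check which side of that cutoff an install is on, as a sketch: it takes the 0.1.78 / newer split mentioned above at face value, assumes the package is installed under the distribution name `llama-cpp-python`, and only looks at the first three numeric components of the version string. The function names are illustrative.

```python
import re
from importlib import metadata

# First release after 0.1.78, i.e. the first one expected to want GGUF files.
FIRST_GGUF_RELEASE = (0, 1, 79)


def expected_model_format(version: str) -> str:
    """Map a llama-cpp-python version string to the file format it expects."""
    parts = tuple(int(n) for n in re.findall(r"\d+", version)[:3])
    return "gguf" if parts >= FIRST_GGUF_RELEASE else "ggml"


def installed_llama_cpp_python():
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version("llama-cpp-python")
    except metadata.PackageNotFoundError:
        return None


installed = installed_llama_cpp_python()
if installed is not None:
    print(f"llama-cpp-python {installed} expects {expected_model_format(installed)} files")
else:
    print("llama-cpp-python is not installed")
```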