I see `Using default LlamaCPP=llama2-13b-chat`

I see `Using default LlamaCPP=llama2-13b-chat` when following the tutorial. What if I want to use TheBloke/Platypus2-70B-Instruct-GPTQ instead? Having a hard time finding any info on llama-index + GPTQ.
20 comments
And the default is a GGML CPU model...
but if platypus2 70b is compatible with llama.cpp then you can use LlamaCPP and just point the model path to it? but not sure what GPTQ is
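(For reference, pointing llama-index's LlamaCPP wrapper at a local GGML/GGUF file looks roughly like this. A minimal sketch: the import path and defaults depend on your llama-index version, and the model path is just a placeholder.)

Python
from llama_index.llms import LlamaCPP

# Sketch: load a local GGML/GGUF checkpoint via llama.cpp.
# The path below is a placeholder -- point it at wherever you downloaded the model.
llm = LlamaCPP(
    model_path="/models/llama-2-13b-chat.Q4_K_M.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
)
print(llm.complete("Say hello"))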
4-bit Quantized
so that I can fit a 70B model on a 48GB GPU
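(Rough math: 70B parameters at 4 bits is about 70e9 × 0.5 bytes ≈ 35 GB of weights, versus roughly 140 GB at fp16, so a 48 GB card has room left over for activations and the KV cache.)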
nice, you can either do a custom LLM or use the Hugging Face integration
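A minimal sketch of the Hugging Face route, assuming a transformers build with auto-gptq installed so the GPTQ weights load through the normal AutoModelForCausalLM path; parameter names may differ slightly across llama-index versions:

Python
from llama_index.llms import HuggingFaceLLM

# Sketch: load a GPTQ checkpoint with llama-index's Hugging Face wrapper.
# Requires auto-gptq (and optimum) so transformers can deserialize the 4-bit weights.
llm = HuggingFaceLLM(
    model_name="TheBloke/Platypus2-70B-Instruct-GPTQ",
    tokenizer_name="TheBloke/Platypus2-70B-Instruct-GPTQ",
    context_window=4096,
    max_new_tokens=256,
    device_map="auto",  # spread layers across available GPU memory
)
print(llm.complete("What is GPTQ?"))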
LOL yea, not many GGUF models out there yet. Been meaning to update this
Right, GGUF is the new GGML...for CPU folks.
any advice for keeping up with all of this stuff?!
The only time I've run transformers on CPU is when I've quantized a google/t5-efficient-mini model using CTranslate2.
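(For the curious, that CTranslate2 conversion is roughly the following sketch; the output directory name is arbitrary and transformers must be installed.)

Python
from ctranslate2.converters import TransformersConverter

# Sketch: convert and int8-quantize a small T5 model for CPU inference.
# The ct2-transformers-converter CLI does the same thing.
converter = TransformersConverter("google/t5-efficient-mini")
converter.convert("t5-efficient-mini-ct2", quantization="int8")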
TheBloke's Discord server is a great place for GGML, GGUF, GPTQ stuff.
He also uploads several quantized models to the HF hub every day. Just gotta check https://huggingface.co/models?sort=modified&search=thebloke
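If you'd rather script that check than refresh the page, huggingface_hub can pull the same listing. A sketch: the exact sort key and ModelInfo attribute names vary a bit between huggingface_hub releases.

Python
from huggingface_hub import HfApi

# Sketch: list TheBloke's most recently updated models on the Hub,
# mirroring the ?sort=modified&search=thebloke query above.
api = HfApi()
for model in api.list_models(author="TheBloke", sort="lastModified", direction=-1, limit=20):
    print(model.id)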
And the elephant in the room is that Meta still hasn't released the 34B version of Llama2, which (when quantized) would fit on a 24GB GPU. That is going to be a game changer.
@bmax @Logan M I tried to use a GPTQ model with LlamaCPP, and this is the error I got:
Plain Text
llama.cpp: loading model from /opt/gptq/models/TheBloke_OpenOrca-Platypus2-13B-GPTQ/gptq_model-4bit-128g.safetensors
error loading model: unknown (magic, version) combination: 000288b0, 00000000; is this really a GGML file?
llama_init_from_file: failed to load model
Sounds like maybe it's hard-coded to prefer GGML.
I'm not even sure if gptq works with llamacpp 🤔

If your llama-cpp-python version is 0.1.78 or older, it can use GGML (quantized down to 4 bits, maybe even less tbh)

Newer versions expect gguf files
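Quick way to check which build you have installed (a sketch using the stdlib importlib.metadata):

Python
# 0.1.78 and older load GGML files; newer releases expect GGUF.
from importlib.metadata import version
print(version("llama-cpp-python"))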
ohhhh. I had 0.1.53 installed. oops.