Hey guys I feel stupid posting this but

Hey guys, I feel stupid posting this, but after quantizing the model like this:

Plain Text
llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"quantization_config": quantization_config},
    # tokenizer_kwargs={},
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    device_map="auto",
)

How are we supposed to store the quantized model locally?
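(The quantization_config referenced above isn't shown in the question; assuming it's a 4-bit bitsandbytes config, it would typically be built roughly like this:)

Plain Text
import torch
from transformers import BitsAndBytesConfig

# example 4-bit NF4 config; the exact config used in the question isn't shown
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)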
model.save_pretrained("path/to/save") I think?
that won't work, since you can't access the model from the HuggingFaceLLM class
that was the problem I ran into
you'd have to quantize outside of llama-index
then save/load from there
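A minimal sketch of that approach with plain transformers (the local path is just an example, and saving serialized 4-bit weights needs a fairly recent transformers/bitsandbytes):

Plain Text
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# quantize with plain transformers, outside of llama-index
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-alpha",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

# write the quantized weights to disk (example path)
model.save_pretrained("./zephyr-7b-alpha-4bit")
tokenizer.save_pretrained("./zephyr-7b-alpha-4bit")

Reloading later is just AutoModelForCausalLM.from_pretrained("./zephyr-7b-alpha-4bit", device_map="auto"), and the result can be handed to HuggingFaceLLM directly, as in the snippet further down.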
can I get TheBloke's model?
or does it make any difference if I do the quantization myself?
you probably could, assuming it's not a gguf/ggml model

Just change the model name and remove the quantization config
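For an already-quantized repo on the Hub, that would look something like the sketch below; the repo id is an assumption (check the exact name on the Hub), and GPTQ repos additionally need optimum and auto-gptq installed:

Plain Text
from llama_index.llms import HuggingFaceLLM  # newer versions: llama_index.llms.huggingface

# no quantization_config needed; the repo's weights are already quantized
llm = HuggingFaceLLM(
    model_name="TheBloke/zephyr-7B-alpha-GPTQ",    # example repo id, not verified here
    tokenizer_name="TheBloke/zephyr-7B-alpha-GPTQ",
    context_window=3900,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    device_map="auto",
)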
he actually has GGUF and a GPTQ format
would it be loadable with llama-index?
Probably some way to load it then, and then pass it in directly

Plain Text
model = AutoModelForCausalLM.from_pretrained(...)  # load the HuggingFace model yourself
llm = HuggingFaceLLM(model=model, ...)
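Filled in a bit more, that pattern could look like the following; the GPTQ repo id and the import path are assumptions (GPTQ repos need optimum and auto-gptq installed), and the same pattern works for a locally saved quantized model by swapping in its path:

Plain Text
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms import HuggingFaceLLM  # newer versions: llama_index.llms.huggingface

# load the pre-quantized model with transformers, then hand it to llama-index
model_id = "TheBloke/zephyr-7B-alpha-GPTQ"   # example repo id; a local path works too
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    context_window=3900,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    # plus messages_to_prompt / query_wrapper_prompt as in the original snippet
)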
thank you bro! 🥲
I'm running on a 4070 laptop actually, can I get away with not quantizing Zephyr-7B? My supervisor asked me to keep it lightweight, but if my PC can run it, I'll call it lightweight 🤣
Mmmm, how many GB is that GPU? Non-quantized you probably need at least 16GB of VRAM, I think...
16GB RAM with 8GB VRAM... I tried to make it work, I'm getting no errors but it's slow af. I'm working on getting a quantized version. Many thanks!
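For reference, the rough weights-only arithmetic behind those numbers (ignoring activations and the KV cache):

Plain Text
# back-of-the-envelope weight memory for a 7B-parameter model
params = 7e9
print(f"fp16 : {params * 2 / 1e9:.1f} GB")    # ~14 GB -> too big for 8 GB of VRAM, so it spills and crawls
print(f"4-bit: {params * 0.5 / 1e9:.1f} GB")  # ~3.5 GB -> fits comfortably on an 8 GB GPU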