Cuda

Hello,

I fine-tuned Llama3.1-8B for the Text2SQL task, and now I have two questions:

1) How can I load the locally saved fine-tuned model in LlamaIndex?
2) How can I use quantization to load the model on my GPU?

I tried pushing the model to Hugging Face and loading it with the HuggingFaceLLM class in LlamaIndex (the same way I use other LLMs); however, the model is not loaded onto the GPU.
6 comments
Do you have CUDA installed? It should automatically try to load the model onto the GPU.
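A quick way to check that from Python (a minimal sketch):

import torch
print(torch.cuda.is_available())          # True if PyTorch can see a CUDA device
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the GPU that will be used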
Yes, I have. It works with other HF LLMs.
Try loading the model and tokenizer directly with Hugging Face and putting them on the GPU yourself, then pass them in:

HuggingFaceLLM(model=model, tokenizer=tokenizer, ...)

Then you know it's loaded the way you want πŸ€·β€β™‚οΈ
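A rough sketch of that approach (the repo id is the one from the message below; a local checkpoint directory saved with save_pretrained works the same way, and the generation settings are just placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

model_path = "Alwiin/Llama-3.1-8B-Instruct-FT-Text2SQL"  # HF repo id or local directory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",        # bitsandbytes/accelerate place the 4-bit weights on the GPU here
)
print(model.hf_device_map)    # confirm the layers actually landed on cuda:0

llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    context_window=4096,
    max_new_tokens=512,
)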
I use this:

import torch
from transformers import AutoTokenizer, BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

My_FT_Model = "Alwiin/Llama-3.1-8B-Instruct-FT-Text2SQL"
selected_model = My_FT_Model

tokenizer = AutoTokenizer.from_pretrained(selected_model)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=3048,
    model_name=selected_model,
    model_kwargs={
        # "torch_dtype": torch.bfloat16,  # use this (and drop quantization_config) to load in bf16 instead of 4-bit
        "quantization_config": quantization_config,
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.1,
        "top_p": 0.9,
    },
    tokenizer_name=selected_model,
    tokenizer_kwargs={"max_length": 4096},
    stopping_ids=stopping_ids,
    system_prompt=system_prompt,                # defined elsewhere in my script
    query_wrapper_prompt=query_wrapper_prompt,  # defined elsewhere in my script
    device_map="auto",
)
I'll give it a try