Cuda

At a glance

The community member has fine-tuned the Llama3.1-8B model for the Text2SQL task and has two questions:

1) How can they load the locally saved fine-tuned model in LlamaIndex?

2) How can they use quantization to load the model on their GPU?

The community member tried pushing the model to HuggingFace and loading it through the HuggingFaceLLM class in LlamaIndex, but the model was not loaded onto the GPU.

In the comments, other community members suggest:

  • Checking if CUDA is installed, as it should automatically try to load the model onto the GPU.
  • Loading the model and tokenizer directly with HuggingFace and putting them on the GPU themselves, then passing them to the HuggingFaceLLM class.
  • Providing an example code snippet for loading the model with quantization.

There is no explicitly marked answer in the comments.

Hello,

I fine-tuned Llama3.1-8B for the Text2SQL task, and now I have two questions:

1) How can I load the locally saved fine-tuned model in LlamaIndex?
2) How can I use quantization to load the model on my GPU?

I tried pushing the model to HuggingFace and downloading it with the HuggingFaceLLM class in LlamaIndex (as with the other LLMs); however, the model is not loaded onto the GPU.
Do you have CUDA installed? It should automatically try to load the model onto the GPU.
Yes, I have. It works with other HF LLMs.
Try loading the model and tokenizer directly with HuggingFace, putting them on the GPU yourself, and then passing them in:

HuggingFaceLLM(model=model, tokenizer=tokenizer, ...)
Then you know it's loaded the way you want πŸ€·β€β™‚οΈ
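A minimal sketch of that approach, assuming a locally saved checkpoint directory and a recent llama-index import path (both are assumptions, not from the thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM

# Hypothetical path to the locally saved fine-tuned checkpoint
local_path = "./llama31-8b-text2sql-finetuned"

tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForCausalLM.from_pretrained(
    local_path,
    torch_dtype=torch.float16,
    device_map="cuda",  # place the weights on the GPU explicitly
)

# Hand the already-loaded objects to LlamaIndex so it uses them as-is
llm = HuggingFaceLLM(model=model, tokenizer=tokenizer)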
I use this:


"My_FT_Model = "Alwiin/Llama-3.1-8B-Instruct-FT-Text2SQL"

selected_model = My_FT_Model

tokenizer = AutoTokenizer.from_pretrained(selected_model)

stopping_ids = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)

llm = HuggingFaceLLM(
context_window=4096,
max_new_tokens=3048,
model_name = selected_model,
model_kwargs={
# "torch_dtype": torch.bfloat16, # comment this line and uncomment below to use 4bit
"quantization_config": quantization_config
},
generate_kwargs={
"do_sample": True,
"temperature": 0.1,
"top_p": 0.9,
},
tokenizer_name = selected_model,
tokenizer_kwargs = {"max_length": 4096},
stopping_ids = stopping_ids,
system_prompt = system_prompt,
query_wrapper_prompt = query_wrapper_prompt,
device_map = "auto",
)"
I'll give it a try
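For question 1) in particular, the same HuggingFaceLLM call from the snippet above should also work with a locally saved checkpoint by pointing model_name and tokenizer_name at the local directory instead of a Hub repo ID (the path below is hypothetical):

local_path = "./llama31-8b-text2sql-finetuned"  # hypothetical local checkpoint directory

llm = HuggingFaceLLM(
    model_name=local_path,       # a local directory is accepted the same way as a Hub repo ID
    tokenizer_name=local_path,
    model_kwargs={"quantization_config": quantization_config},
    device_map="auto",
)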