The community member has fine-tuned the Llama3.1-8B model for the Text2SQL task and has two questions:
1) How can they load the locally saved fine-tuned model in LlamaIndex?
2) How can they use quantization to load the model on their GPU?
The community member tried pushing the model to HuggingFace and downloading it with the HuggingFaceLLM class in LlamaIndex, but the model was not loaded onto the GPU.
In the comments, other community members suggest:
- Checking that CUDA (and a CUDA-enabled PyTorch build) is installed, since LlamaIndex should automatically try to place the model on the GPU.
- Loading the model and tokenizer directly with HuggingFace transformers, placing them on the GPU themselves, and then passing the loaded objects to the HuggingFaceLLM class (see the sketch after this list).
- Providing an example code snippet for loading the model with quantization.
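Below is a minimal sketch of those two suggestions combined: load the fine-tuned checkpoint directly with transformers under 4-bit quantization, then hand the pre-loaded model and tokenizer to HuggingFaceLLM. It assumes the `llama-index-llms-huggingface`, `bitsandbytes`, and `accelerate` packages are installed; the local path is hypothetical. This is one plausible way to wire it up, not a verified answer from the thread.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

# Hypothetical local path to the fine-tuned checkpoint.
MODEL_PATH = "./llama3.1-8b-text2sql"

# 4-bit quantization so the 8B model fits on a single consumer GPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the model and tokenizer directly with transformers;
# device_map="auto" (via accelerate) places the weights on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Pass the pre-loaded objects to LlamaIndex's HuggingFaceLLM wrapper.
llm = HuggingFaceLLM(
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
)

print(llm.complete("Write SQL to count the total number of users:"))
```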
There is no explicitly marked answer in the comments.
I fine-tuned Llama3.1-8B for the Text2SQL task, and now I have two questions:
1) How can I load the locally saved fine-tuned model in LlamaIndex?
2) How can I use quantization to load the model on my GPU?
I tried pushing the model to HuggingFace and downloading it with the HuggingFaceLLM class in LlamaIndex (as with the other LLMs); however, the model is not loaded onto the GPU.
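For reference, a minimal sketch of the route the question describes, assuming a hypothetical Hub repo id: HuggingFaceLLM also accepts a local directory as `model_name`, and it forwards `device_map` and `model_kwargs` to transformers, so GPU placement and quantization can be requested without pre-loading the model yourself.

```python
import torch
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

# Hypothetical Hub repo id; a local checkpoint directory works here too.
MODEL_NAME = "your-username/llama3.1-8b-text2sql"

llm = HuggingFaceLLM(
    model_name=MODEL_NAME,
    tokenizer_name=MODEL_NAME,
    device_map="auto",  # requires accelerate; places weights on the GPU
    model_kwargs={
        # 4-bit quantization via bitsandbytes
        "quantization_config": BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        ),
    },
    max_new_tokens=256,
)
```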