hf_token = "your_huggingface_token" from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained( "meta-llama/Meta-Llama-3-8B-Instruct", token=hf_token, )
"your_huggingface_token"
with your actual HuggingFace token. This will authorize your HuggingFace account and allow you to use the Meta-Llama-3-8B model.from llama_index.llms.huggingface import HuggingFaceLLM llm = HuggingFaceLLM( model_name="meta-llama/Meta-Llama-3-8B-Instruct", model_kwargs={ "token": hf_token, "torch_dtype": torch.bfloat16, # comment this line and uncomment below to use 4bit # "quantization_config": quantization_config }, generate_kwargs={ "do_sample": True, "temperature": 0.6, "top_p": 0.9, }, tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct", tokenizer_kwargs={"token": hf_token}, stopping_ids=stopping_ids, )
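Once `stopping_ids` is defined (explained next), you can sanity-check the configured LLM with a plain completion call. This is just an illustrative sketch; the prompt text is a placeholder:

```python
# Quick sanity check; `complete` is the standard LlamaIndex LLM call.
response = llm.complete("What is retrieval-augmented generation?")
print(response)
```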
`stopping_ids` is a list of token IDs at which the model should stop generating further tokens. You can define it as per your requirements.
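For example, for Llama 3 the usual choice is to stop at the end-of-sequence token and at the `<|eot_id|>` end-of-turn marker (the same values used in the full example further down):

```python
# Stop at EOS or at Llama 3's end-of-turn token.
stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
```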
Before running the full example, install the required packages:

```
!pip install llama-index
!pip install llama-index-llms-huggingface
!pip install llama-index-embeddings-huggingface
!pip install llama-index-embeddings-huggingface-api
```
You also need to authenticate with HuggingFace, either by running `huggingface-cli login` or by passing `token=<your_token>` when loading the model and tokenizer.
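If you prefer to authenticate programmatically (for example in a notebook), the `huggingface_hub` library's `login` helper does the same thing; the token value below is a placeholder:

```python
from huggingface_hub import login

# Equivalent to running `huggingface-cli login`; replace with your real token.
login(token="your_huggingface_token")
```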
```python
import torch
from transformers import AutoTokenizer
from llama_index.core import Settings
from llama_index.llms.huggingface import HuggingFaceLLM

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
hf_token = my_token

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)

# Stop at EOS or at Llama 3's end-of-turn token.
stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

llm = HuggingFaceLLM(
    context_window=8192,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.1, "do_sample": False},
    system_prompt=system_prompt,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name=model_name,
    model_name=model_name,
    device_map="auto",
    stopping_ids=stopping_ids,
    tokenizer_kwargs={"max_length": 4096},
    # torch.float16 reduces memory usage when running on CUDA
    model_kwargs={"token": hf_token, "torch_dtype": torch.float16},
)

Settings.llm = llm
Settings.chunk_size = 512
```
Replace `my_token` with your actual HuggingFace token, and `system_prompt` and `query_wrapper_prompt` with the actual prompts you want to use.
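As an illustration only, not part of the original walkthrough: the prompts could be defined as below (before constructing the `HuggingFaceLLM`), and the configured `Settings` can then drive a minimal RAG pipeline. The embedding model, the local `data` folder, and the query text are assumptions of this sketch:

```python
from llama_index.core import PromptTemplate, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Example prompts; define these before constructing the HuggingFaceLLM above.
system_prompt = "You are a helpful assistant. Answer using only the provided context."
query_wrapper_prompt = PromptTemplate("{query_str}")

# Any HuggingFace embedding model works here; bge-small-en-v1.5 is just an example.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Build a simple in-memory vector index over a local "data" folder and query it.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What do these documents say about X?")
print(response)
```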