
Connecting a Hugging Face transformers model to HuggingFaceLLM offline

Every tutorial about HuggingFaceLLM has two parts: download the LLM and then run inference on it. What if I am building a product that will be used offline and I only want to do the second part? I need to ship the LLM with the software package. How do I connect this LLM to HuggingFaceLLM?
6 comments
It covers both how to use a Hugging Face LLM locally and via the Hugging Face inference APIs.
As I said, that link provides an example of how to use an LLM that was just downloaded through the HuggingFaceLLM class. But it doesn't talk about how to use a model already saved on the local machine.
I think it's the same code, unless the LLM is saved in a different local folder. When downloaded, Hugging Face LLMs are all saved in one cache folder. So when you call the LLM (by the model ID on its Hugging Face model card), it first checks this folder to see if the model is there, and only pulls it down and loads it if it isn't.
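If you want to be explicit about it, something like this sketch should also work for a fully offline load (the local path here is a placeholder, and I'm assuming the model files are already on disk, e.g. shipped with your package):

Plain Text
import os

# Keep transformers / huggingface_hub from reaching out to the Hub at all
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_index.llms.huggingface import HuggingFaceLLM

# Placeholder: directory containing the model files you ship with the product
local_model_path = "path to downloaded model"

# local_files_only makes from_pretrained fail fast instead of trying to download
model = AutoModelForCausalLM.from_pretrained(local_model_path, local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(local_model_path, local_files_only=True)

# HuggingFaceLLM also accepts pre-built model/tokenizer objects
llm = HuggingFaceLLM(model=model, tokenizer=tokenizer)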
But just in case, this was how I did it last time using a very old version of LlamaIndex

Plain Text
import os
import torch
import transformers
from transformers import (
    StoppingCriteria,
    StoppingCriteriaList,
    TextStreamer,
)
from llama_index.llms.huggingface import HuggingFaceLLM
from langchain_community.llms import HuggingFacePipeline
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
hf_token = os.getenv("HUGGINGFACE_API_TOKEN")

# model_id, device and temperature are defined elsewhere in the script
model_config = transformers.AutoConfig.from_pretrained(model_id, use_auth_token=hf_token)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=<your bnb config>,  # e.g. a BitsAndBytesConfig
    device_map=device,
    use_auth_token=hf_token,
    cache_dir="path to downloaded model",
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_token,
    cache_dir="path to downloaded model",
)
llm = HuggingFaceLLM(context_window=..., max_new_tokens=..., system_prompt=..., model=model, tokenizer=tokenizer)

## Wrap to langchain pipeline
stop_list = ['\n`\n']
stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the generated sequence ends with any of the stop token sequences
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
streamer = TextStreamer(tokenizer, skip_prompt=True)
pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    task="text-generation",
    stopping_criteria=stopping_criteria,
    streamer=streamer,
    temperature=temperature,
    repetition_penalty=1.1,
)
langchain_pipeline = HuggingFacePipeline(pipeline=pipeline)
If you read the source code of the current version of LlamaIndex (https://github.com/run-llama/llama_index/blob/main/llama-index-integrations/llms/llama-index-llms-huggingface/llama_index/llms/huggingface/base.py), you'll notice that the implementation is pretty similar (they're calling AutoTokenizer and AutoModelForCausalLM too), and they added the same StoppingCriteria that I did when I wrapped it in my langchain pipeline, so it's already more advanced. There's also a catch-all kwargs input, so you can just pass your "path to downloaded model" under "cache_dir" if the model is saved somewhere other than the usual Hugging Face cache directory.
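So with a recent version, passing the local path through those kwargs would look roughly like this (just a sketch; I'm assuming HuggingFaceLLM still exposes model_kwargs/tokenizer_kwargs that get forwarded to from_pretrained, and the model id, path and generation settings are placeholders):

Plain Text
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="your model id",
    tokenizer_name="your model id",
    context_window=4096,
    max_new_tokens=256,
    # Forwarded to AutoModelForCausalLM / AutoTokenizer .from_pretrained()
    model_kwargs={"cache_dir": "path to downloaded model", "local_files_only": True},
    tokenizer_kwargs={"cache_dir": "path to downloaded model", "local_files_only": True},
)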

You can ignore the 2nd part - I was also experimenting with langchain and decided to only load the model from cache once, so I wrapped a langchain pipeline object over it.