
Updated 2 years ago

Llm pipeline

At a glance
Plain Text
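from typing import Any, List, Mapping, Optional

import torch
from langchain.llms.base import LLM
from transformers import pipeline

# max new tokens per call (example value; the original snippet does not define num_output)
num_output = 256


class CustomLLM(LLM):
    model_name = "facebook/opt-iml-max-30b"
    # the pipeline is built once, when the class body is executed
    pipeline = pipeline("text-generation", model=model_name, device="cuda:0", model_kwargs={"torch_dtype": torch.bfloat16})

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        prompt_length = len(prompt)
        response = self.pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]

        # only return newly generated tokens
        return response[prompt_length:]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"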
89 comments
This is all running locally
The first time you run, the model weights are downloaded. That's the only part that needs an internet connection
where are they referenced?
All the models come from huggingface, they just get downloaded and then run locally using that pipeline abstraction
ok I see - they have to be hosted by huggingface?
is that correct? can i host my own on HF if I want to?
No, only the weights are hosted
When you run the code for the first time, it downloads the weights automatically
Then everything runs locally
You can pre-download the weights if you want. But no data is leaving your computer
Suggest u move your "pipeline" line outside of class to save some GPU memory 😁
yeah that line seems to be problematic
at least, it's taking prohibitively long to do any inference on a prompt with two small .pdfs as ./data
(using device="cpu")
any idea why that might be @Logan M ?
CPU is extremely slow for most models (like, a snail's pace haha). Really only LlamaCPP will give any reasonable response time since it's being optimized for CPU
even for inference?
I must be missing something - does everyone running these locally use a GPU? I don't often see that listed as a requirement ...
Yea most people use a GPU. A common method is to use colab, since their GPUs are free (but time limited).

You won't get a reasonable response speed from most CPU models, especially when the input starts to get big.

This is just my personal experience though lol
that's weird though because I watched my co worker run github.com/nomic-ai/gpt4all on his mac locally, without a GPU
it ran in reasonable time
what is the difference?
Bigger inputs?
not sure what you mean by that. you mean any documents?
I have two .pdfs in ./data for example
(and then I have a one sentence question I am asking with .query)
Like, the bigger the prompt to the LLM gets, the slower it's going to get.

That one sentence query gets put into a prompt template, along with text that was indexed, and it can easily be 1000 tokens in the input (also depends on some other settings like chunk size limit)
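
To make the point about prompt size concrete, here is a rough sketch of how a one-sentence question grows once it is placed in a template together with retrieved text. The template wording, the chunk contents, and the `build_prompt` and `rough_size` helpers are made up for illustration; this is not LlamaIndex's actual template, and a whitespace word count is only a crude stand-in for a real tokenizer.

```python
from typing import List

# Hypothetical QA template; LlamaIndex's real template text differs.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context information, answer the question: {question}\n"
)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Join the retrieved chunks and place them in the template with the question."""
    return QA_TEMPLATE.format(context="\n\n".join(chunks), question=question)


def rough_size(text: str) -> int:
    """Crude size estimate: whitespace-separated words, not real tokens."""
    return len(text.split())


question = "What is the warranty period?"
# Stand-ins for two retrieved chunks of indexed PDF text (400 words each).
chunks = ["lorem " * 400, "ipsum " * 400]

prompt = build_prompt(question, chunks)
print(rough_size(question), "words in the question")
print(rough_size(prompt), "words in the full prompt")
```

The question itself is only a handful of words, but the model has to read the whole prompt, so inference time follows the second number, not the first.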

In any case, did you see this notebook for gpt4all? https://github.com/autratec/GPT4ALL_Llamaindex
yes I see it - not sure how it helps me though I'm sure it can
Yea it should help since it shows exactly how you can use gpt4all with llama index 💪 I think you just need to change device_map='auto' to device_map='cpu' to run on cpu?
If you don't have cuda or anything installed, auto should use cpu as well
still having some trouble here ...
the repo above assumes a GPU, and I can't quite get the right configuration to use CPU
I don't know how to replace this line with correct, non cuda, code.
Plain Text
input_ids = inputs["input_ids"].cuda()
@autratec any idea?
You should use a GPU. If you're using Colab, you need to use Pro.
why though? If I can run inference on my computer using these instructions for chatgpt4all ...


Plain Text
Try it yourself
Here's how to get started with the CPU quantized GPT4All model checkpoint:

Download the gpt4all-lora-quantized.bin file from Direct Link or [Torrent-Magnet].
Clone this repository, navigate to chat, and place the downloaded file there.
Run the appropriate command for your OS:
M1 Mac/OSX: cd chat;./gpt4all-lora-quantized-OSX-m1
Linux: cd chat;./gpt4all-lora-quantized-linux-x86
Windows (PowerShell): cd chat;./gpt4all-lora-quantized-win64.exe
Intel Mac/OSX: cd chat;./gpt4all-lora-quantized-OSX-intel
For custom hardware compilation, see our llama.cpp fork.
I can get that up and running on my CPU. Runs inference fine
Just need to remove the .cuda() from this line
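
A slightly more general option than deleting `.cuda()` is to pick the device once and move tensors with `.to(device)`. This is a minimal sketch, assuming PyTorch is installed; the `inputs` dict below is a made-up stand-in for the tokenizer output in the notebook.

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the tokenizer output (made-up token ids).
inputs = {"input_ids": torch.tensor([[1, 2, 3]])}

# .to(device) replaces the hard-coded .cuda() call; on a CPU-only machine
# the tensor simply stays where it is.
input_ids = inputs["input_ids"].to(device)
print(input_ids.device)
```

The model has to live on the same device as its inputs, so the same `device` value would also be used when loading or moving the model.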
is there a way to save the model to disk at any point? Every time I call these three methods it takes forever...

Plain Text
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
gpt4all_model = PeftModel.from_pretrained(model, peft_model_id)
should I try cache_dir?
(I'm going to, just wondering if there's anything else you recommend)
like, maybe I don't need my own tokenizer? Maybe I do? Idk this stuff yet
also what is PeftModel doing here?
It's probably just slow to load into memory, since it's a big model. It should be cached already though, so you don't have to download it again. Also, you do need the tokenizer.

Basically pfet loads the base model (first line, which should be cached) and then configures it with an adapter (3rd line) that adds extra trained parameters

Usually in an app, you'll load the model once and keep it initialized globally.

You can read more about pfet here:
https://huggingface.co/blog/peft

https://github.com/huggingface/peft/tree/main
I see so it loads once, at start up, then just sits and waits for requests
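
The "load once, keep it around" pattern can be sketched roughly like this. `expensive_load` is a placeholder for the slow `from_pretrained(...)` calls above, and `get_model` and `handle_request` are hypothetical names, not part of any library.

```python
from functools import lru_cache

load_count = 0  # counts how many times the expensive load actually runs


def expensive_load(name: str) -> dict:
    """Placeholder for the slow from_pretrained(...) calls."""
    global load_count
    load_count += 1
    return {"name": name}


@lru_cache(maxsize=1)
def get_model() -> dict:
    """Load the model on first use, then hand back the same object."""
    return expensive_load("base-model-plus-adapter")


def handle_request(prompt: str) -> str:
    model = get_model()  # served from the cache after the first request
    return f"[{model['name']}] {prompt}"


for question in ["first question", "second question", "third question"]:
    handle_request(question)

print(load_count)  # 1
```

Three requests, one load. In a real server the cached object would be the model and tokenizer instead of a dict.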
what are the extra trained parameters you speak of?
It's related to how Pfet works. You take a single LLM (config.base_model_name_or_path), and load it. Then you add a few more parameters to it as an "adapter" and fine-tune ONLY those new parameters. This makes it possible to fine-tune an LLM without using a ton of resources. This is already done ahead of time, we are just loading the weights from that process.

So now, when you use the model, you load the base model again, and the pfet library is loading the trained "adapter" weights.
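
For a feel of how small "a few more parameters" can be, here is a back-of-the-envelope count for a LoRA-style adapter on one square weight matrix. The width 4096 is the hidden size commonly reported for LLaMA-7B; the rank of 8 is an assumption for illustration only, not the rank actually used for gpt4all-lora.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix of the base model."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two small matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


d = 4096   # hidden size commonly reported for LLaMA-7B
rank = 8   # assumed adapter rank, for illustration only

full = full_params(d, d)
extra = lora_params(d, d, rank)
print(full, extra, f"{extra / full:.2%}")  # 16777216 65536 0.39%
```

Only the small `extra` part is trained and shipped as the adapter; the `full` part stays frozen and comes from the base model download.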
Well, if you design your app/server that way, yes 👌
ah ok so I am seeing that the config.base_model_name_or_path resolves to zpn/llama-7b and then the peft_model_id is gpt4all-lora
what is actually being fine tuned?
this is done ahead of time you say, does it require this exact combination (permutation?) of llama-7b then gpt4all-lora ?
I am just downloading weights, like you say
Yea it requires that exact combination.

If you were training your own model instead though, you could do any type of combination you want
When they trained the model, just the pfet adapter was fine tuned. The llama model was just used as the base (which was also already trained)
sorry don't mean to be pedantic but is there a difference between PEFT and pfet?
nah I'm just misspelling it haha
haha ok cool
is there an easy way to inspect what is sent in the request? I really want to know what gets sent exactly when I use a GPTSimpleVectorIndex with, say, similarity_top_k = 3
Are you still using GPT4ALL? I would just add to the _call function -> print(prompt) to print the input to the model each time
Otherwise, there is a llama_logger abstraction that also records the inputs/outputs to the models (bottom of this notebook)
https://github.com/jerryjliu/llama_index/blob/main/examples/vector_indices/SimpleIndexDemo.ipynb
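
If you'd rather not edit `_call` directly, the same print-the-prompt idea can be wrapped around any prompt-to-text function. A minimal sketch, where `with_logging` and `fake_generate` are hypothetical names and `fake_generate` stands in for the real model call:

```python
from typing import Callable


def with_logging(generate: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a prompt -> text function so every call prints its input and output."""
    def wrapped(prompt: str) -> str:
        print("=== prompt ===")
        print(prompt)
        print("words:", len(prompt.split(" ")))
        response = generate(prompt)
        print("=== response ===")
        print(response)
        return response
    return wrapped


def fake_generate(prompt: str) -> str:
    """Stand-in for the real model: echoes the prompt and appends an answer."""
    return prompt + " 42"


logged = with_logging(fake_generate)
result = logged("What is six times seven?")
```

The wrapper returns the response unchanged, so it can be dropped in and removed again without affecting the rest of the code.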
what does the GPT4ALL Lora add to the zpn/llama7b
is there a paper that covers that?
what does the fine tuning attempt to fix/augment/change?
GPT4ALL was fine tuned on a bunch of prompts generated using gpt4, to try and make it a little better at following instructions and answering questions. I don't think there's a paper for this specific model though

Rather than training the entire llama model further (expensive) you can add a few more parameters to the original model and fine tune only those. The Hugging Face blog I linked earlier goes over how it works generally.
Where does the zpn/llama7b come from?
Well, kinda. I'm not sure if Facebook ever officially released the model. Originally you had to sign a form to get access to the weights, but then the weights leaked and the internet ran wild with it.

Tbh it's pretty annoying, since most of these models are technically non-commercial
But people still continue to use llama or its dataset, perpetuating that non-commercial component, super lame
Even the dataset gpt4all is trained on goes against openAI TOS. So it's kind of a gray area
thanks for that important knowledge
Finally got around to getting a new big gpu to run this locally, but the response I'm getting is always "Empty Response". This is using the full gpt4all, basically copy-pasted from the link above
Whoops, discord search put me in the wrong thread chain, but yes, that one
I'm not sure what's causing the empty response. In the custom LLM definition, try adding print(len(prompt.split(' '))), and also print what is returned by the llm pipeline.

The first print should show a length smaller than ~1500 at most. If it's larger, double check your prompt helper settings I think?
wdym the model pipeline? the _call function in the custom model class?
Yea that's the one (sorry, forgot this one wasn't using a huggingface pipeline)
the split prompt length is 22 in this case, and the full response from the model is just the prompt echoed back and then nada :\
(since the actual return from that is response[len(prompt):] , we get an empty response)
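
Here is a small sketch of why an echo-only generation turns into an empty string after the slice. `new_text_only` is a hypothetical helper, not part of the notebook; it does the same slicing as `response[len(prompt):]` but only when the output really starts with the prompt.

```python
def new_text_only(prompt: str, generated: str) -> str:
    """Return the part of `generated` that follows the echoed prompt."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    # The output did not echo the prompt verbatim, so keep all of it.
    return generated


prompt = "Question: what colour is the sky?\nAnswer:"

echo_only = prompt               # model echoed the prompt and stopped
with_answer = prompt + " Blue."  # model echoed the prompt, then answered

print(repr(new_text_only(prompt, echo_only)))    # ''
print(repr(new_text_only(prompt, with_answer)))  # ' Blue.'
```

So an empty result here means the model generated nothing new after the prompt; the slicing itself is behaving as intended.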
Those open source LLMs aren't strong enough to handle the refine process. After a couple of rounds back and forth, they get lost in the conversation. Please set k=1 and just use the first answer as the response, you might get something.