Llm pipeline

This is all running locally

The first time you run, the model weights are downloaded. That's the only part that needs an internet connection

where are they referenced?

All the models come from huggingface, they just get downloaded and then ran locally using that pipeline abstraction

https://huggingface.co/docs/transformers/main_classes/pipelines

Many text generation models you can use from huggingface https://huggingface.co/models?pipeline_tag=text-generation&sort=downloads

ok I see - they have to be hosted by huggingface?

is that correct? can i host my own on HF if I want to?

No, only the weights are hosted

When you run the code foe the first time, it downloads the weights automatically

Then everything runs locally

You can pre-download the weights if you want. But no data is leaving your computer

aautratec

Suggest u move your "pipeline" line outside of class to save some GPU memory 😁

yeah that line seems to be problematic

at least, it's taking prohibitively long to do any inference on a prompt with two small .pdfs as ./data

(using device="cpu")

any idea why that might be @Logan M ?

Cpu is extremely slow for most models (like, a snails pace haha). Really only LlamaCPP will give any reasonable response time since it's being optimized for cpu

even for inference?

I must be missing something - does everyone running these locally use a GPU? I don't often see that listed a requirement ...

Yea most people use a GPU. A common method is to use colab, since their GPUs are free (but time limited).

You wont get a reasonable response speed from most CPU models, especially when the input starts to get big.

This is just my personal experience though lol

that's weird though because I watched my co worker run github.com/nomic-ai/gpt4all on his mac locally, without a GPU

it ran in reasonable time

what is the difference?

Bigger inputs?

not sure what you mean by that. you mean any documents?

I have two .pdfs in ./data for example

(and then I have a one sentence question I am asking with .query)

index.query

Like, the bigger the prompt gets to the LLM, the slower its going to get.

That one sentence query gets put into a prompt template, along with text that indexed, and it can easily be 1000 tokens in the input (also depends on some other settings like chunk size limit)

In any case, did you see this notebook for gpt4all? https://github.com/autratec/GPT4ALL_Llamaindex

yes I see it - not sure how it helps me though I'm sure it can

Yea it should help since it shows exactly how you can use gpt4all with llama index 💪 I think you just need to change device_map='auto' to device_map='cpu' to run on cpu?

If you don't have cuda or anything installed, auto should use cpu as well

still having some trouble here ...

the repo above assumes a GPU, and I can't quite get the right configuration to use CPU

I don't know how to replace this line with correct, non cuda, code.

Plain Text

input_ids = inputs["input_ids"].cuda()

@autratec any idea?

aautratec

You should use GPU. If u r using Colab , need to use pro.

why though? If I can run inference on my computer using these instructions for chatgpt4all ...

Plain Text

Try it yourself
Here's how to get started with the CPU quantized GPT4All model checkpoint:

Download the gpt4all-lora-quantized.bin file from Direct Link or [Torrent-Magnet].
Clone this repository, navigate to chat, and place the downloaded file there.
Run the appropriate command for your OS:
M1 Mac/OSX: cd chat;./gpt4all-lora-quantized-OSX-m1
Linux: cd chat;./gpt4all-lora-quantized-linux-x86
Windows (PowerShell): cd chat;./gpt4all-lora-quantized-win64.exe
Intel Mac/OSX: cd chat;./gpt4all-lora-quantized-OSX-intel
For custom hardware compilation, see our llama.cpp fork.

I can get that up and running on my CPU. Runs inference fine

Just need to remove the .cuda() from this line

thanks

is there a way to save the model to disk at any point? Every time I call these three methods it takes forever...

Plain Text

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
gpt4all_model = PeftModel.from_pretrained(model, peft_model_id)

Attachment

Screen_Shot_2023-04-21_at_11.24.56_AM.png

should I try cache_dir?

(I'm going to, just wondering if there's anything else you recommend)

like, maybe I don't need my own tokenizer? Maybe I do? Idk this stuff yet

also what is PeftModel doing here?

It's probably just slow to load into memory, its a big model, it should be cached already though so you don't have to download it again. Also, you do need the tokenizer.

Basically pfet loads the base model (first line, which should be cached) and then configures it with an adapter (3rd line) that adds extra trained parameters

Usually in an app, you'll load the model once and keep it initialized globally.

You can read more about pfet here:
https://huggingface.co/blog/peft

https://github.com/huggingface/peft/tree/main

I see so it loads once, at start up, then just sits and waits for requests

what are the extra trained parameters you speak of?

It's related to how Pfet works. You take a single LLM (config.base_model_name_or_path), and load it. Then you add a few more parameters to it as an "adapter" and fine-tune ONLY those new parameters. This makes it possible to fine-tune an LLM without using a ton of resources. This already done ahead of time, we are just loading the weights from that process.

So now, when you use the model, you load the base model again, and the pfet library is loading the trained "adapter" weights.

Well, if you design your app/server that way, yes 👌

ah ok so I am seeing that the config.base_model_name_or_path resolves to zpn/llama-7b and then the peft_model_id is gpt4all-lora

what is actually being fine tuned?

this is done ahead of time you say, does it require this exact combination (permutation?) of llama-7b then gpt4all-lora ?

I am just downloading weights, like you say

Yea it requires that exact combination.

If you were training your own model instead though, you could do any type of combination you want

When they trained the model, just the pfet adapter was fine tuned. The llama model was just used as the base (which was also already trained)

sorry don't mean to be pedantic but is there a difference betwen PEFT and pfet?

nah I'm just mispelling it haha

haha ok cool

is there an easy way to inspect what is sent in the request? i really want to know when I use a GPTSimpleVectorIndex with say, k_simliarity = 3 what gets sent exactly

Are you still using GPT4ALL? I would just add to the _call function -> print(prompt) to print the input to the model each time

Otherwise, there is a llama_logger abstraction that also records the inputs/outputs to the models (bottom of this notebook)
https://github.com/jerryjliu/llama_index/blob/main/examples/vector_indices/SimpleIndexDemo.ipynb

oh cool

yeah I am

what does the GPT4ALL Lora add to the zpn/llama7b

is there a paper that covers that?

what does the fine tuning attempt to fix/augment/change?

GPT4ALL was fine tuned on a bunch of prompts generated using gpt4, to try and make it a little better at following instructions and answering questions. I don't think there's a paper for this specific model though

Rather than training the entire llama model further (expensive) you can add a few more parameters to the original model and fine tune only those. The hughingface blog I linked earlier goes over how it works generally.

Where does the zpn/llama7b come from?

Facebook?

Well, kinda. I'm not sure if Facebook ever officially released the model. Original you had to sign a form to get access to the weights, but then the weights leaked and the internet ran wild with it.

Tbh it's pretty annoying, since most of these models are technically non-commercial

But people still continue to use llama or its dataset, perpetuating that non-commercial component, super lame

Even the dataset gpt4all is trained on goes against openAI TOS. So it's kind of a gray area

oh wow

thanks for that important knowledge

Finally got around to getting a new big gpu to run this locally, but the response I'm getting is always "Empty Response". This is using the full gpt4all, basically copy-pasted from the link above

Which link did you copy? This one? https://github.com/autratec/GPT4ALL_Llamaindex

Whoops, discord search put me in the wrong thread chain, but yes, that one

I'm not sure what's causing the empty response. In the custom LLM definition, try printing print(len(prompt.split(' '))), and also print what is returned by the llm pipeline.

The first print should show a length smaller than ~1500 at most. If it's larger, double check your prompt helper settings I think?

wdym the model pipeline? the _call function in the custom model class?

Yea that's the one (sorry, forgot this one wasn't using a huggingface pipeline)