CustomLLM

Hello! I'm working on getting Vicuna 13B to operate on an A100, but I keep encountering a CUDA memory error.

Could there be an issue with the code following the tutorial? Or is it more likely that I need to change this Vicuna model because it's not well-optimized? I appreciate any help you can provide! πŸ™‚
Attachments
Capture_decran_2023-05-27_a_11.02.57.png
Capture_decran_2023-05-27_a_10.07.19.png
I've noticed that putting the pipeline/model inside the custom LLM class GREATLY increases memory usage... I think it's related to pydantic in langchain

One solution is to just move the pipeline out of the class, and use it as a global variable

Another option is using the new huggingface LLM predictor
An A100 should be more than enough for that model
Thanks for the heads up @Logan M!

Is the new huggingface LLM predictor fixed?

When I tried this code a couple of days ago, I got the following error:
Attachment
image.png
Oh crap no hahaha
I need to fix that
Huggingface added that silly check. No reason for that to raise an error πŸ˜…
haha! So I'm stuck between this error with the huggingface LLM and the custom LLM that uses all the memory of my A100 when running Vicuna πŸ˜…
well, actually moving the pipeline out of the llm class should resolve the memory issue!

something like this roughly

Plain Text
vicuna_pipe = pipeline(...)

class CustomLLM(LLM):
  ...
  def _call(...):
    res = vicuna_pipe(...)
  ...
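(For reference, a fuller version of that rough sketch might look like the code below. It's a hedged sketch, assuming the langchain LLM base class that the LlamaIndex custom-LLM tutorial used at the time; the model path and token count are placeholders, not the exact setup from this thread.)

Plain Text
# Sketch: keep the transformers pipeline at module level so pydantic never
# holds it as a field of the LLM class (which is what blows up memory usage).
from typing import List, Optional

import torch
from langchain.llms.base import LLM
from transformers import pipeline

# Global pipeline, created once outside the class.
vicuna_pipe = pipeline(
    "text-generation",
    model="path/to/vicuna-13b",  # placeholder: point this at your merged Vicuna weights
    device="cuda:0",
    model_kwargs={"torch_dtype": torch.bfloat16},
)

class CustomLLM(LLM):
    num_output: int = 256  # illustrative

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        res = vicuna_pipe(prompt, max_new_tokens=self.num_output)[0]["generated_text"]
        # Return only the newly generated text, not the echoed prompt.
        return res[len(prompt):]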
It works! Great, thank you @Logan M.
I hope your tutorial examples will be updated πŸ™‚
@Logan M I'm getting a CUDA memory error with huggyllama/llama-30b on an A100 40GB. Specs should be OK in theory. Any tips to make it run?
Thanks in advance.
I think 30b will definitely be cutting it close. Are you using load_in_8bit?
device_map="auto" should also help (but if it's slow, that means it's offloading some weights off the gpu)
Thanks @Logan M for the tips!

I replaced this line
pipe = pipeline("text-generation", model=MODEL_NAME, device="cuda:0", model_kwargs={"torch_dtype": torch.bfloat16})

with the following lines:
Plain Text
tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)

base_model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    load_in_8bit=True,
    device_map='auto',
)

pipe = pipeline("text-generation", tokenizer=tokenizer, model=base_model)


I don't know if there was an easier solution. But this one works!
yea that works! Glad it loads now :dotsCATJAM:
Hey @Logan M. It worked on an A100 40GB in Google Colab notebooks. But a couple of days ago, it failed on an A100 40GB in a cluster at work (CUDA memory error). Any other tips to make it work? πŸ™‚
That's actually pretty weird. The exact same code works in a notebook, but not on your cluster? πŸ‘€
Yeah it is weird. Got this error on the cluster. I don't know if it is related to the CUDA version installed.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 456.00 MiB (GPU 0; 39.41 GiB total capacity; 36.78 GiB already allocated; 268.50 MiB free; 38.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
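(As an aside, the fragmentation workaround that the error message itself suggests looks roughly like the sketch below. The 128 MiB split size is just an illustrative value, and per the message it only helps when reserved memory is much larger than allocated memory.)

Plain Text
# Set the allocator config before torch touches CUDA, so the setting is picked up.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value

import torch  # import torch only after the environment variable is set

print(torch.cuda.is_available())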
Mmm nah, that's a standard error


Does the error happen when you initialize the model? Or only after you use it for a bit?
This error happens after this line
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 242/242 [01:39<00:00, 2.44it/s]
I couldn't use the model
Hmmm... do you have ssh access to this server?
Yes I have ssh access to the server (cluster nodes of my University)
What happens if you run nvidia-smi right now, on the server?
I'm curious if anything else is using gpu memory before you launch the model (or even how much memory you have access to)
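(A quick way to check this from Python on the node, complementing nvidia-smi; a minimal sketch assuming a reasonably recent torch that has torch.cuda.mem_get_info.)

Plain Text
# Print how much GPU memory is actually free/visible before loading the model.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()  # (free, total) for the current device
print(f"free:  {free_bytes / 1024**3:.1f} GiB")
print(f"total: {total_bytes / 1024**3:.1f} GiB")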
I will try tomorrow and I will let you know!
Sounds good! I have a feeling you might not have access to the full 40GB for whatever reason πŸ‘€πŸ‘
@Logan M
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:E2:00.0 Off |                    0 |
| N/A   27C    P0    37W / 250W |  40117MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                 GPU Memory |
|        ID   ID                                                  Usage      |
|=============================================================================|
|    0   N/A  N/A    159457      C   python3                        40115MiB |
+-----------------------------------------------------------------------------+
Today the model is running, but I get this error
Attachment
Capture_decran_2023-06-09_a_17.02.39.png
I have a few warnings as well.

2023-06-09 16:42:29.102941: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-09 16:42:30.085561: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
bin /users/tluong2/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
CUDA SETUP: CUDA runtime path found: /dcsrsoft/spack/arolle/v1.0/spack/opt/spack/linux-rhel8-zen2/gcc-10.4.0/cuda-11.6.2-hjqfaeelfbajionp4uptpb6grp2uheb6/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /users/tluong2/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 242/242 [01:27<00:00, 2.76it/s]
Running on local URL: http://127.0.0.1:7860
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
  • Avoid using tokenizers before the fork if possible
  • Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Is this before you launch the model? I see it's using nearly all the memory, which is a little scary lol
These warnings seem benign
after launching the model!
for ausboss/llama-30b-supercot
hmmm and this is with load_in_8bit=True right?
The only reason I can think of that it works on colab but not on your server is the torch version πŸ‘€
exactly!
Plain Text
tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)

base_model = LlamaForCausalLM.from_pretrained(
    MODEL_NAME,
    load_in_8bit=True,
    device_map='auto',
)

pipe = pipeline("text-generation", tokenizer=tokenizer, model=base_model)
Yea, checking the torch versions is my guess

pip show torch should tell you what's up between Colab and your server
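(Something like the snippet below prints the relevant bits to compare between Colab and the cluster; a minimal sketch that just reports what the installed torch was built against and which GPU it sees.)

Plain Text
import torch

print("torch version:      ", torch.__version__)
print("built for CUDA:     ", torch.version.cuda)
print("device:             ", torch.cuda.get_device_name(0))
print("compute capability: ", torch.cuda.get_device_capability(0))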
Thanks for your advice
@Logan M Another issue I have is the output produced by some models. The problem only happens with certain models.

In general, as parameters, I often use this:
max_input_size = 2048
num_outputs = 256
max_chunk_overlap = 20
chunk_size_limit=512
Attachment
Capture_decran_2023-06-09_a_18.30.14.png
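(Those parameters map onto LlamaIndex's PromptHelper/ServiceContext; wired together they would look roughly like the sketch below, assuming the 0.6-era llama_index API used in this thread and the CustomLLM class defined earlier.)

Plain Text
from llama_index import LLMPredictor, PromptHelper, ServiceContext

max_input_size = 2048
num_outputs = 256
max_chunk_overlap = 20
chunk_size_limit = 512

# Constrain how prompts are packed to fit the model's context window.
prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap,
                             chunk_size_limit=chunk_size_limit)

llm_predictor = LLMPredictor(llm=CustomLLM())  # CustomLLM as defined above
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
    chunk_size_limit=chunk_size_limit,
)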
wait, the output is just underscores? lol
Some models might need some additional tuning, like top-p and top-k, as well as repetition penalty
Yea, depending on the model, you can pass some extra kwargs to the pipeline

text = pipe(text, top_p=0.5, top_k=50, temperature=0.0, repetition_penalty=1.5)

Might take a lot of experimenting depending on the model lol those are just guesses at stuff to change

Full explanation/list of params are here: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig
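(Put together with the earlier 8-bit setup, an experiment with those knobs might look like the sketch below. Every value and the prompt format are illustrative guesses, not recommendations; note that do_sample must be enabled for top_p/top_k/temperature to have any effect.)

Plain Text
from transformers import LlamaForCausalLM, LlamaTokenizer, pipeline

MODEL_NAME = "ausboss/llama-30b-supercot"  # model mentioned above

tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME)
base_model = LlamaForCausalLM.from_pretrained(MODEL_NAME, load_in_8bit=True, device_map="auto")
pipe = pipeline("text-generation", tokenizer=tokenizer, model=base_model)

prompt = "### Instruction:\nDescribe what LlamaIndex does.\n\n### Response:\n"  # prompt format is a guess
out = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,          # required for the sampling parameters below to apply
    top_p=0.9,               # illustrative
    top_k=50,                # must be an integer
    temperature=0.7,         # illustrative
    repetition_penalty=1.2,  # often helps with degenerate, repetitive output
)[0]["generated_text"]
print(out[len(prompt):])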
Thanks again for the resources!

So these parameters are always experimental? No settings shared by the people who publish the model?
Sometimes in the model card you might find settings, or in the community section as well

It's a pretty community driven effort though haha
haha, thanks for the insight.
It would be great for LlamaIndex users to know which models your team has tested, along with recommended settings πŸ™‚
I agree! We actually want to build a database of models/settings so that users can automatically use our huggingface LLM Predictor by just providing a model name. Just takes a lot of time to do this, but maybe this can also be community driven haha

Currently, there's a ton of setup as you can see haha
It would be really nice! Anyway, a big thanks for your work, guys!