GPU

GPU out-of-memory error when creating an index from documents: the GPU RAM usage suddenly jumps from 5.5 GB to 14.7 GB. How do I fix it?
Attachments: image.png, image.png, image.png
13 comments
Probably need more memory 🤷‍♂️
I just want to figure out why the memory suddenly grows when creating an index on a small (25 KB) document.
Memory grows as the model sees longer and longer input sequences. It stops growing once the model has seen its longest possible input (i.e. 3900 tokens in this case).
This is common for any Hugging Face LLM running on a GPU.
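If you want to watch that happen, here's a rough sketch using plain PyTorch (report_gpu_memory is just a helper name for this example; call it before and after each completion):

```python
import torch

def report_gpu_memory(tag: str) -> None:
    # Snapshot of GPU memory on the current CUDA device, in GB.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GB, reserved: {reserved:.2f} GB")

report_gpu_memory("after loading the model")
# ... run a completion here ...
report_gpu_memory("after the first completion")
```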
@NicholasYC , I noticed your RAM usage suddenly grew from almost zero. That's probably because the code is loading the model at that point.

The model seems to be read into memory only when you start using it, a trick called "lazy loading".

The memory usage (~14 GB) matches the size of the model you were trying to load (a 7B model in fp16 is roughly 7B parameters × 2 bytes ≈ 14 GB), which confirms my guess above.

Would you be OK with using a smaller model instead, such as hfl/chinese-alpaca-2-1.3b? It's only ~3 GB.
Attachment: image.png
(If you speak Chinese and need help, feel free to send me a direct message. Best wishes!)
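If you go that route, swapping the model in should just be a matter of changing the model and tokenizer names. A rough sketch, assuming the HuggingFaceLLM import path of the LlamaIndex version you seem to be on (adjust context_window and max_new_tokens to your setup):

```python
from llama_index.llms import HuggingFaceLLM  # import path may differ between versions

# Smaller model: ~1.3B parameters instead of 7B, so far less GPU memory.
llm = HuggingFaceLLM(
    model_name="hfl/chinese-alpaca-2-1.3b",
    tokenizer_name="hfl/chinese-alpaca-2-1.3b",
    context_window=3900,
    max_new_tokens=256,
    device_map="auto",
)
```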
Thanks for your help
I don't think it's because the 7B model is too big. With the same model, creating an index on a smaller file (8 KB) works fine. I think the OOM error comes from creating the index on the documents. (Thank you very much for your help!)
Attachments: image.png, image.png, image.png (the first, second, and third pictures referenced below)
You're right. That's indeed a giveaway that it's about the size of the file that you're trying to index.

My guess is that your file contains more tokens than the context window you specified. An intuitive solution would be to increase the context_window value, but that's limited by the LLM you're using, so it does seem to be a dead end.
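One quick sanity check (a sketch; the model name and file path here are placeholders for whatever you're actually using) is to count the tokens in the file yourself:

```python
from transformers import AutoTokenizer

# Placeholder names: use the model you actually loaded and your real file path.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-alpaca-2-7b")

with open("your_document.txt", encoding="utf-8") as f:
    text = f.read()

num_tokens = len(tokenizer.encode(text))
context_window = 3900  # whatever you passed to HuggingFaceLLM
print(f"{num_tokens} tokens in the file vs. a context window of {context_window}")
```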
My guess: when I initialize the HuggingFaceLLM for the first time, the LLM is loaded onto the GPU and takes 4.3 GB of RAM (the first picture). Then, when I try to create the index on the documents, the LLM is loaded again and takes another 4.3 GB (the third picture). The second picture shows that the embed model used in the service_context takes 1.2 GB of GPU RAM. So in the end there are two copies of the same LLM taking 8.6 GB plus one embed model taking 1.2 GB. This is just a guess.
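If that's really what's happening, I think building each model once and passing the same objects into the service context should keep it to a single copy. A sketch, assuming llm, embed_model, and documents are the objects from my earlier code:

```python
from llama_index import ServiceContext, VectorStoreIndex

# Reuse the already-constructed models so nothing is loaded onto the GPU twice.
service_context = ServiceContext.from_defaults(
    llm=llm,                  # the HuggingFaceLLM instance created earlier
    embed_model=embed_model,  # the embedding model created earlier
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```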
A smaller file won't use the full context of the LLM either though, which is why it works.

If you run llm.complete("hi") I bet it would work and use a set amount of memory.

But if you run llm.complete("hi " * 50), the memory usage would increase, but level off if you ran it again. Increase that multiplier, though, and more memory will be allocated, since you are sending text through newer, un-allocated parts of the model.

This memory increase stops when the model has seen the largest input size possible and then levels off.

But you are running out of memory before it gets there
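Roughly, that experiment looks like this (a sketch; llm is the HuggingFaceLLM instance from earlier, and the multipliers are arbitrary):

```python
import torch

# Peak GPU memory grows with the longest input seen so far, then levels off.
for multiplier in (1, 50, 500, 2000):
    llm.complete("hi " * multiplier)
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"'hi ' * {multiplier:>4} -> peak GPU memory so far: {peak:.2f} GB")
```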
Thanks a lot. Maybe I can use multiple smaller docs instead of a big one. BTW, happy 2024!
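Something like this is what I have in mind, reusing documents and service_context from my earlier code (just a sketch; the import path and chunk sizes may need adjusting for your version):

```python
from llama_index import VectorStoreIndex
from llama_index.node_parser import SentenceSplitter

# Split the big document into smaller nodes so each call sees a short piece
# of text; chunk_size / chunk_overlap are token counts and only examples.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes, service_context=service_context)
```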