Find answers from the community

Updated 4 months ago

At a glance

A community member is using llama-index to build up their local knowledge base and has run into a memory issue when trying to use the GritLM/GritLM-7B model. The model should occupy roughly 28GB of memory, but actual consumption appears to exceed 100GB. They have shared their code snippet and are asking for insights or suggestions to explain the discrepancy.

In the comments, other community members suggest lowering the batch size by setting embed_batch_size to 1, and note that input length also drives memory consumption, recommending a max_length parameter on the HuggingFaceEmbedding model. After trying these suggestions, the original poster reports that the issue is resolved.
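For quick reference, the configuration the thread converges on looks roughly like this (a sketch assembled from the snippets further down; cache folders and other local details are omitted):

Plain Text
# Sketch of the working setup discussed below (assembled from the thread, not verbatim).
# embed_batch_size=1 keeps a single text per forward pass on the GPU;
# max_length=512 truncates each input so activation memory stays bounded.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="GritLM/GritLM-7B",
    device="cuda",
    embed_batch_size=1,
    max_length=512,
)
Settings.embed_model = embed_model
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)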

Hello everyone,

I recently started working with llama-index to enhance my local knowledge base, and I've found it to be a fantastic tool for my needs. However, I've encountered a memory issue that I'm struggling to solve.

I attempted to use the GritLM/GritLM-7B model, which, to my understanding, should occupy approximately 28GB of memory. Given that I'm using an A40 GPU, I anticipated that this would be sufficient. Surprisingly, when I run the model, the actual memory consumption seems to exceed 100GB (observed when I switched to device="cpu").
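(That 28GB figure is just the float32 weight footprint of a 7B-parameter model, i.e. weights only, ignoring activations:)

Plain Text
# Back-of-the-envelope weight memory for a 7B-parameter model (weights only).
params = 7e9
print(f"fp32: ~{params * 4 / 1e9:.0f} GB")  # ~28 GB
print(f"fp16: ~{params * 2 / 1e9:.0f} GB")  # ~14 GB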

I'm puzzled about where the problem lies. Below is my code snippet:

Plain Text
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.node_parser import SentenceSplitter, MarkdownNodeParser
from llama_index.extractors.entity import EntityExtractor
from llama_index.core import Settings
import tiktoken
from llama_index.core.extractors import TitleExtractor

llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name="xxx",
    api_key="xxx",
    azure_endpoint="https://xxx.openai.azure.com/",
    api_version="2023-07-01-preview",
)
embed_model = HuggingFaceEmbedding(
    model_name="GritLM/GritLM-7B",
    cache_folder="/home/username/model_cache",
    device="cuda",
)

Settings.llm = llm
Settings.transformations = [SentenceSplitter(chunk_size=4096, paragraph_separator="\n\n")]
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
Settings.embed_model = embed_model
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True)


I'm reaching out to see if anyone has experienced similar issues or if there's something I'm overlooking in my setup. Any insights or suggestions on how to address this memory discrepancy would be greatly appreciated.

Thank you in advance for your help!
18 comments
Try lowering the batch size
HuggingFaceEmbedding(..., embed_batch_size=1)
Each increment in the batch size increases memory usage by roughly another 28GB
Thank you Logan, I tried embed_batch_size=1 and insert_batch_size=10. However, there's still an OOM issue during embedding. Should I use a different store index, or maybe clear the cache after each batch?

BTW, I have two A40s with NVLink, but I have no idea whether I can share memory across them in llama_index. I would be grateful if you could give me some guidance. πŸ™‚
I don't think it's trivial to share memory between them

insert_batch_size is not really needed in this case, only embed_batch_size
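(For context, a rough sketch of where each of those two knobs lives; the values are just the ones from this thread:)

Plain Text
# embed_batch_size: texts encoded per forward pass -- this is what drives GPU memory.
# insert_batch_size: nodes sent to the index per insert call -- it has no GPU impact.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="GritLM/GritLM-7B",
    device="cuda",
    embed_batch_size=1,  # one text per GPU forward pass
)
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    insert_batch_size=10,  # batching of node inserts, not of embedding
    show_progress=True,
)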
You did

Plain Text
embed_model = HuggingFaceEmbedding(model_name="GritLM/GritLM-7B", cache_folder="/home/username/model_cache", device="cuda", embed_batch_size=1)
Settings.embed_model = embed_model
index = VectorStoreIndex.from_documents(documents, show_progress=True)


Right?
Here's my code:
Plain Text
embed_model = HuggingFaceEmbedding(model_name="GritLM/GritLM-7B", cache_folder="/home/u/model_cache", device="cuda", embed_batch_size=1)
Settings.llm = llm
Settings.transformations = [SentenceSplitter(paragraph_separator="\n\n")]
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
Settings.embed_model = embed_model
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, show_progress=True, insert_batch_size=10)
Ah, the other thing that uses memory is the input length
embed_model = HuggingFaceEmbedding(model_name="GritLM/GritLM-7B", cache_folder="/home/u/model_cache", device="cuda", embed_batch_size=1, max_length=512)
you might need something like that πŸ€”
I assume max_length is something like a max token count?
yea exactly -- basically the huggingface tokenizer will truncate inputs to a token length that matches max_length
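(To illustrate roughly what that truncation does, here's the Hugging Face tokenizer used directly, outside llama-index; the model name is just the one from this thread:)

Plain Text
# Rough illustration of max_length-style truncation with the HF tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GritLM/GritLM-7B")
long_text = "some very long paragraph " * 2000
encoded = tokenizer(long_text, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # at most 512 tokens reach the model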
I'll try to run this model using your parameter set!
It works!πŸ‘