Inference Time for Llamaindex/vdr-2b-multi-v1 on Mac M2

At a glance

A community member is running the llamaindex/vdr-2b-multi-v1 model on a Mac M2 and is experiencing slow inference times of 3 minutes for 10 PNGs. They are unsure if this is normal and ask if they are missing something in the configuration. Other community members suggest trying to set the device to mps for the Mac, but this leads to an out of memory error. The solution is to set the PYTORCH_MPS_HIGH_WATERMARK_RATIO environment variable to 0.0 in the .zshrc file to disable the memory allocation limit.

Hey! Is it normal that, when running the new llamaindex/vdr-2b-multi-v1 locally on a Mac M2, inference takes 3 minutes for 10 PNGs?

Am I missing something in the config?

This is my code
Plain Text
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cpu",  # "mps" for mac, "cuda" for nvidia GPUs
    trust_remote_code=True,
    cache_folder="cache",
)
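
A quick sanity check before switching devices, sketched around the same snippet (not from the thread itself): the standard torch.backends.mps.is_available() call reports whether the Metal backend can actually be used, so the device argument can be chosen conditionally instead of hard-coding "cpu".
Plain Text
import torch
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Fall back to CPU if the MPS (Metal) backend is not available on this machine.
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device=device,
    trust_remote_code=True,
    cache_folder="cache",
)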
8 comments
3 minutes is kind of crazy, but it's also a pretty big model (2.2B)
why not set mps?
should I delete the cache folder when changing any parameter?
No, it just changes how the model is loaded
mps throws:

Plain Text
MPS backend out of memory (MPS allocated: 18.12 GB, other allocations: 928.00 KB, max allowed: 18.13 GB). Tried to allocate 25.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).


Do you know of any way to pass a custom parameter to this interface? The API reference doesn't say much

On the web I found that setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 could work
Solved by setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 in .zshrc
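
An alternative to editing .zshrc, if you prefer keeping it in the script: set the variable with os.environ before PyTorch touches the MPS backend. This is a minimal sketch based on the variable named in the error message above; as that message warns, 0.0 removes the upper memory limit and may destabilize the system.
Plain Text
import os

# Must be set before torch allocates anything on MPS; 0.0 removes the cap
# (the MPS error message warns this may cause system failure).
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="mps",
    trust_remote_code=True,
    cache_folder="cache",
)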