Inference Time for Llamaindex/vdr-2b-multi-v1 on Mac M2

At a glance

A community member is running the llamaindex/vdr-2b-multi-v1 model on a Mac M2 and is experiencing slow inference times of 3 minutes for 10 PNGs. They are unsure if this is normal and ask if they are missing something in the configuration. Other community members suggest trying to set the device to mps for the Mac, but this leads to an out of memory error. The solution is to set the PYTORCH_MPS_HIGH_WATERMARK_RATIO environment variable to 0.0 in the .zshrc file to disable the memory allocation limit.

Hey! Is it normal that, when running the new llamaindex/vdr-2b-multi-v1 locally on a Mac M2, inference takes 3 minutes for 10 PNGs?

Am I missing something in the config?

This is my code
Plain Text
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cpu",  # "mps" for mac, "cuda" for nvidia GPUs
    trust_remote_code=True,
    cache_folder="cache",
)
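
A quick sanity check before switching devices, sketched around the same snippet (not from the thread itself): the standard torch.backends.mps.is_available() call reports whether the Metal backend can actually be used, so the device argument can be chosen conditionally instead of hard-coding "cpu".
Plain Text
import torch
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Fall back to CPU if the MPS (Metal) backend is not available on this machine.
device = "mps" if torch.backends.mps.is_available() else "cpu"

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device=device,
    trust_remote_code=True,
    cache_folder="cache",
)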
8 comments
3 minutes is kind of crazy, but it's also a pretty big model (2.2B)
why not set mps?
should I delete the cache folder when changing any parameter?
No, it just changes how the model is loaded
mps throws:

Plain Text
MPS backend out of memory (MPS allocated: 18.12 GB, other allocations: 928.00 KB, max allowed: 18.13 GB). Tried to allocate 25.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).


Do you know of any way to pass a custom parameter to this interface? The API reference doesn't say much

On the web I found that setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 could work
Solved by setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 in .zshrc
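
An alternative to editing .zshrc, if you prefer keeping it in the script: set the variable with os.environ before PyTorch touches the MPS backend. This is a minimal sketch based on the variable named in the error message above; as that message warns, 0.0 removes the upper memory limit and may destabilize the system.
Plain Text
import os

# Must be set before torch allocates anything on MPS; 0.0 removes the cap
# (the MPS error message warns this may cause system failure).
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="mps",
    trust_remote_code=True,
    cache_folder="cache",
)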