
Inference Time for Llamaindex/vdr-2b-multi-v1 on Mac M2

Hey! Is it normal that, when running the new llamaindex/vdr-2b-multi-v1 locally on a Mac M2, inference takes 3 minutes for 10 PNGs?

Am I missing something in the config?

This is my code:
Plain Text
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cpu",  # "mps" for mac, "cuda" for nvidia GPUs
    trust_remote_code=True,
    cache_folder="cache",
)
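For context, embedding a batch of PNGs with the model constructed above would look roughly like the loop below; the get_image_embedding call and the file paths are assumptions for illustration, not taken from the original post.
Plain Text
from pathlib import Path

# embed each page image with the model defined above;
# on CPU this per-image step is where the minutes go
embeddings = [
    model.get_image_embedding(str(png))
    for png in sorted(Path("pages").glob("*.png"))
]
print(len(embeddings), len(embeddings[0]))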
8 comments
3 minutes is kind of crazy, but it's also a pretty big model (2.2B)
why not set mps?
wait, let's try that
should I delete the cache folder when changing any parameter?
no, changing a parameter like the device just changes how the model is loaded; the cached weights stay valid
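The suggested change is only the device argument; a minimal sketch, keeping everything else from the original snippet:
Plain Text
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="mps",  # use Apple's Metal backend instead of the CPU
    trust_remote_code=True,
    cache_folder="cache",
)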
mps throws:

Plain Text
MPS backend out of memory (MPS allocated: 18.12 GB, other allocations: 928.00 KB, max allowed: 18.13 GB). Tried to allocate 25.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).


Do you know any way to pass a custom parameter to this interface? The API reference does not say much.

On the web I found that setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.7 could work
Solved by setting PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 in .zshrc
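For reference, the same effect can also be had from inside the script by setting the variable before PyTorch initializes the MPS allocator; a minimal sketch, assuming the environment variable is read when MPS is first touched (the .zshrc route above does the same thing shell-side):
Plain Text
import os

# equivalent of `export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0` in .zshrc;
# set it before any torch / llama_index import so MPS picks it up.
# 0.0 disables the upper memory limit (the error message warns this may cause system failure)
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="mps",
    trust_remote_code=True,
    cache_folder="cache",
)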