Hey everyone, I need to run HuggingFaceEmbedding with multi-GPU support. For this I tried the following code:
Plain Text
from injector import inject, singleton
from llama_index import MockEmbedding
from llama_index.embeddings.base import BaseEmbedding

from private_gpt.paths import models_cache_path
from private_gpt.settings.settings import settings

from torch.nn.parallel import DataParallel
from torch.nn.parallel import DistributedDataParallel

@singleton
class EmbeddingComponent:
    embedding_model: BaseEmbedding

    @inject
    def __init__(self) -> None:
        match settings.llm.mode:
            case "local":
                from llama_index.embeddings import HuggingFaceEmbedding

                embedding_model = HuggingFaceEmbedding(
                    model_name=settings.local.embedding_hf_model_name,
                    cache_folder=str(models_cache_path),
                    embed_batch_size = 20,
                )
                self.embedding_model = DataParallel(embedding_model)
            case "sagemaker":

                from private_gpt.components.embedding.custom.sagemaker import (
                    SagemakerEmbedding,
                )

                self.embedding_model = SagemakerEmbedding(
                    endpoint_name=settings.sagemaker.embedding_endpoint_name,
                )


I got the following exception when running this:
Plain Text
  File "/home/bennison/Documents/yavar/poc/privateGPT/private_gpt/components/embedding/embedding_component.py", line 25, in __init__
    self.embedding_model = DataParallel(embedding_model)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bennison/.cache/pypoetry/virtualenvs/private-gpt-_Dc3_tu1-py3.11/lib/python3.11/site-packages/torch/nn/parallel/data_parallel.py", line 148, in __init__
    self.module.to(self.src_device_obj)
    ^^^^^^^^^^^^^^
AttributeError: 'HuggingFaceEmbedding' object has no attribute 'to'
make: *** [Makefile:36: run] Error 1
Uhhhh, I don't think this will work, since HuggingFaceEmbedding is not a PyTorch model; it's a wrapper around a model

I think you'd have to do the parallelization with the underlying model. I thiiiiink you could wrap the model from huggingface in this?

So load the model with AutoModel and put it in this class you've created

Then you can pass the model in directly like HuggingFaceEmbedding(model=model) ?
I tried it as you said; here is the refactored code:
Plain Text
                model = AutoModel.from_pretrained( # BAAI/bge-small-en
                    settings.local.embedding_hf_model_name, cache_dir=models_cache_path
                )
                self.embedding_model = HuggingFaceEmbedding(
                    model=model,
                )


After updating the code I got the following exception
Plain Text
  File "/home/bennison/.cache/pypoetry/virtualenvs/private-gpt-_Dc3_tu1-py3.11/lib/python3.11/site-packages/llama_index/embeddings/huggingface.py", line 98, in __init__
    super().__init__(
  File "/home/bennison/.cache/pypoetry/virtualenvs/private-gpt-_Dc3_tu1-py3.11/lib/python3.11/site-packages/pydantic/v1/main.py", line 341, in __init__
    raise validation_error
pydantic.v1.error_wrappers.ValidationError: 1 validation error for HuggingFaceEmbedding
model_name
  none is not an allowed value (type=type_error.none.not_allowed)
make: *** [Makefile:36: run] Error 1


I got the exception on model_name. When I update the code like the above, is the model name required?
Ah yeah, try also passing in the model_name -- it will still use the model you pass in; the model_name is just there for tracking/observability.
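
For reference, here is a minimal self-contained sketch of that suggestion, assuming the older llama_index import path used elsewhere in this thread and the BAAI/bge-small-en model mentioned in the snippet above; treat the exact names as placeholders for your settings values.
Plain Text
from transformers import AutoModel
from llama_index.embeddings import HuggingFaceEmbedding

# Assumed model name, taken from the comment in the earlier snippet
model_name = "BAAI/bge-small-en"

# Load the underlying transformers model once
model = AutoModel.from_pretrained(model_name)

# Pass both the pre-loaded model and its name; the name is what the
# pydantic validator complained about, and is otherwise only used for
# tracking/observability
embed_model = HuggingFaceEmbedding(
    model=model,
    model_name=model_name,
)
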
Plain Text
                model = AutoModel.from_pretrained( # BAAI/bge-small-en
                    settings.local.embedding_hf_model_name, cache_dir=models_cache_path
                )
                self.embedding_model = HuggingFaceEmbedding(
                    model=model,
                )


Here in the above code I did not configure anything about CUDA. Will it use all available (multiple) GPUs automatically, or should I do any config for this (multi-GPU utilization)?
Hmmm yea something tricky is the inputs. With multiple GPUs, your inputs need to be on the same device

I know you can specify a specific device in the constructor

Plain Text
self.embedding_model = HuggingFaceEmbedding(
    model=model,
    device="cuda:0"
)


But not sure that entirely solves the issue 🤔
Hey man, can you explain more elaborately? I could not understand what you are saying. If I use the device parameter with cuda:0, will it use all available GPUs?
So you have multiple GPUs

You might have a GPU on cuda:0 and another GPU on cuda:1

So you need to ensure that your inputs are also moved to the same device
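
To illustrate the device point with plain PyTorch (a toy nn.Linear standing in for the embedding model, purely for illustration): DataParallel keeps the module's parameters on a source device and gathers results there, so the batch you feed it should be placed on that same device.
Plain Text
import torch
import torch.nn as nn

# Hypothetical toy module standing in for an embedding model
model = nn.Linear(384, 384)

if torch.cuda.is_available():
    # Replicas run on all visible GPUs; parameters live on cuda:0
    dp_model = nn.DataParallel(model).to("cuda:0")

    # Put the input batch on that same source device
    batch = torch.randn(32, 384, device="cuda:0")

    # The batch is split across the GPUs and the outputs are gathered back on cuda:0
    out = dp_model(batch)
    print(out.shape, out.device)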

Tbh I don't actually think this will work though.... too complicated.

I would try using something like text-embeddings-inference for proper multi-GPU support

https://github.com/huggingface/text-embeddings-inference
https://docs.llamaindex.ai/en/stable/examples/embeddings/text_embedding_inference.html
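
On the llama_index side, the usage would look roughly like the sketch below (adapted from the second link). It assumes a text-embeddings-inference server is already running locally, e.g. started from the Docker image in the first link; the base_url and port are assumptions to adjust for your setup.
Plain Text
from llama_index.embeddings import TextEmbeddingsInference

embed_model = TextEmbeddingsInference(
    model_name="BAAI/bge-small-en",     # used to format text/queries for the served model
    base_url="http://127.0.0.1:8080",   # wherever your TEI server is listening (assumed)
    embed_batch_size=20,                # matches the batch size used earlier in this thread
)

# Quick sanity check against the server
embedding = embed_model.get_text_embedding("hello world")
print(len(embedding))
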
I used the device parameter also; it still uses a single GPU, not all of them.
Is there any other way to do this?
I did link an alternative using TEI from huggingface

I am less knowledgeable about this raw torch.nn.parallel stuff. It not using all GPUs is likely related to this code not specifying any device?

Plain Text
case "local":
  from llama_index.embeddings import HuggingFaceEmbedding
  
  embedding_model = HuggingFaceEmbedding(
      model_name=settings.local.embedding_hf_model_name,
      cache_folder=str(models_cache_path),
      embed_batch_size = 20,
  )
  self.embedding_model = DataParallel(embedding_model)
Other than that, 🤷‍♂️ out of ideas
I don't have any idea about embedding with multi-GPU. Do you have any other ideas?
Here also, I don't know how to configure multi-GPU. Can you help me with this?
I'm pretty sure the docker command they give will use all gpus
docker run --gpus all ...
Isn't it enough to just install the package, or do I need to run it as a Docker container?
I've only ever used it as a Docker container. The instructions in the readme for running it as a local install seemed more complicated lol