GPU

Hi, I'm using a custom embedding. When updating the index, is there a way to make it use the GPU?
Because I have a lot of data, and updating the embeddings in the document store on the CPU takes really long:

Index the documents using LlamaIndex and the custom embedding:

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, service_context=service_context
)
I think if you are using the model via Hugging Face, then passing the kwargs should work.
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import ServiceContext, set_global_service_context

embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs={"device": "cuda"}
)


service_context = ServiceContext.from_defaults(embed_model=embed_model)

# optionally set a global service context
set_global_service_context(service_context)
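With that service context in place (set globally or passed explicitly), the indexing call from your snippet should run its embedding step on the GPU. A minimal sketch, assuming documents and storage_context are defined as in your code:

from llama_index import VectorStoreIndex

# embeddings are computed by the GPU-backed embed_model from the service context above
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)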
Hi, thank you for answering @WhiteFang_Jr. The model I'm using is loaded locally, it's not on the Hub.
I defined it as a custom embedder:
from typing import Any, List

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

from llama_index.bridge.pydantic import PrivateAttr
from llama_index.embeddings.base import BaseEmbedding


class CustomMPNetEmbeddings(BaseEmbedding):
    _model = PrivateAttr()
    _tokenizer = PrivateAttr()
    _instruction: str = PrivateAttr()

    def __init__(
        self,
        model_path: str,
        instruction: str = "Represent a document for semantic search:",
        **kwargs: Any,
    ) -> None:
        self._tokenizer = AutoTokenizer.from_pretrained(model_path)
        self._model = AutoModel.from_pretrained(model_path)
        self._instruction = instruction
        super().__init__(**kwargs)

    @classmethod
    def class_name(cls) -> str:
        return "mpnet_custom"

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)

    async def _aget_text_embedding(self, text: str) -> List[float]:
        return self._get_text_embedding(text)

    def _get_query_embedding(self, query: str) -> List[float]:
        return self._get_embedding(query)

    def _get_text_embedding(self, text: str) -> List[float]:
        return self._get_embedding(text)

    def _get_embedding(self, text: str) -> List[float]:
        # Tokenize text
        encoded_input = self._tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            model_output = self._model(**encoded_input)
        embedding = self._mean_pooling(model_output, encoded_input["attention_mask"])
        embedding = F.normalize(embedding, p=2, dim=1)
        return embedding.squeeze().tolist()

    def _mean_pooling(self, model_output, attention_mask):
        # standard sentence-transformers mean pooling (helper was not shown in the original message)
        token_embeddings = model_output[0]
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
            input_mask_expanded.sum(1), min=1e-9
        )

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        embeddings = []
        for text in texts:
            embeddings.append(self._get_embedding(text))
        return embeddings
Then I used the class and passed it in like this:

Create the service context

service_context = ServiceContext.from_defaults(
    embed_model=CustomMPNetEmbeddings(model_path=model_path),
    chunk_size=512,  # or any chunk size you prefer
)
Can you still use the GPU?
You'll have to check the BaseEmbedding class to verify whether you can use the GPU with a local model. I think that would be a good place to start looking.
Okay, thanks. One last question please, @WhiteFang_Jr: in the context of single query decomposition, following this tutorial:
https://github.com/run-llama/llama_index/blob/main/docs/examples/query_transformations/SimpleIndexDemo-multistep.ipynb
I want to use my embedding model for retrieval and the LLM as the node parser. How can I configure that?
Because in the tutorial I think it uses both.
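One way to configure this, as a rough sketch based on the linked notebook, is to set both models on the ServiceContext: the embed_model is used for retrieval, while the llm drives the query decomposition and synthesis. The OpenAI LLM, the summary string, and the exact import paths here are assumptions and may vary by llama_index version:

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI
from llama_index.indices.query.query_transform.base import StepDecomposeQueryTransform
from llama_index.query_engine.multistep_query_engine import MultiStepQueryEngine

# the LLM handles query decomposition and response synthesis,
# the custom embedder handles retrieval
llm = OpenAI(model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=CustomMPNetEmbeddings(model_path=model_path),
    chunk_size=512,
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# query decomposition as in the linked notebook
step_decompose_transform = StepDecomposeQueryTransform(llm=llm, verbose=True)
query_engine = MultiStepQueryEngine(
    query_engine=index.as_query_engine(),
    query_transform=step_decompose_transform,
    index_summary="Used to answer questions about the documents",
)
response = query_engine.query("your question here")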
I checked and sadly it doesn't. I guess I have to upload the model to HF so I can use the GPU.
You could have done self._model = AutoModel.from_pretrained(model_path).to('cuda')
But then you also have to move the model inputs to the GPU:
encoded_input = {key: val.to('cuda') for key, val in encoded_input.items()}
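Putting those two changes together, the GPU-related parts of the class could look like this (a sketch showing only the changed methods, reusing the imports from the class definition above; the CPU fallback via DEVICE is an assumption, not from the thread):

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


class CustomMPNetEmbeddings(BaseEmbedding):
    # ... same attributes and other methods as before ...

    def __init__(
        self,
        model_path: str,
        instruction: str = "Represent a document for semantic search:",
        **kwargs: Any,
    ) -> None:
        self._tokenizer = AutoTokenizer.from_pretrained(model_path)
        # move the model weights to the GPU once, at load time
        self._model = AutoModel.from_pretrained(model_path).to(DEVICE)
        self._instruction = instruction
        super().__init__(**kwargs)

    def _get_embedding(self, text: str) -> List[float]:
        encoded_input = self._tokenizer(
            text, return_tensors="pt", padding=True, truncation=True
        )
        # move the tokenized inputs to the same device as the model
        encoded_input = {key: val.to(DEVICE) for key, val in encoded_input.items()}
        with torch.no_grad():
            model_output = self._model(**encoded_input)
        embedding = self._mean_pooling(model_output, encoded_input["attention_mask"])
        embedding = F.normalize(embedding, p=2, dim=1)
        # bring the result back to the CPU before converting to a Python list
        return embedding.squeeze().cpu().tolist()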