
Hello, I am trying to figure out if it's possible to run the embeddings model on my GPU rather than the CPU. I have this simple script where VectorStoreIndex.from_documents(documents) is taking a long time to finish while maxing out my CPU.

Plain Text
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, set_global_service_context
from llama_index.llms import OpenAILike

# Local OpenAI-compatible LLM (served by llama.cpp)
llm = OpenAILike(max_tokens=3900)

# Local embedding model via the "local:" string shortcut
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en-v1.5", chunk_size=256, num_output=256)
set_global_service_context(service_context)

# Load documents and build the index (this is the slow, CPU-bound step)
documents = SimpleDirectoryReader('data2').load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./vector-storage-esic2")

It seems like one of the following is true:
  1. I haven't configured something properly (in Llama-Index?) that would push the embeddings to the GPU
  2. This is just how Llama-Index works, and it can only use the CPU for embeddings
Any wisdom is greatly appreciated!
8 comments
It should use the GPU automatically if you have CUDA installed 🤔

But otherwise, you can try specifying the embedding model with the class rather than the string

Plain Text
from llama_index.embeddings import HuggingFaceEmbedding

# loads BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="cuda")
service_context = ServiceContext.from_defaults(embed_model=embed_model, ...)
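Wired into the original script, that would look roughly like this (a sketch assuming the llm, chunk_size, and num_output settings from the question stay the same):

Plain Text
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAILike

llm = OpenAILike(max_tokens=3900)

# Construct the embedding model explicitly and pin it to the GPU
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="cuda")

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=256,
    num_output=256,
)
set_global_service_context(service_context)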
interesting, thanks for the reply!

I do have CUDA 11.8 installed and have been successfully using it to run llama.cpp in the "OpenAI-compatible web server" mode (outside of Llama-Index). So I'm not sure why Llama-Index is not picking it up automatically.

Let me give your suggestion a shot!
llama.cpp does not use CUDA the same way that huggingface does 🤔

As a quick test, if it's still not working, you can try running this:
Plain Text
import torch
print(torch.cuda.is_available())
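If that prints False, it can also help to check which torch build is installed, since CPU-only wheels report no CUDA version. These are standard torch attributes, nothing Llama-Index specific:

Plain Text
import torch

print(torch.__version__)          # a "+cpu" suffix usually means a CPU-only build
print(torch.version.cuda)         # None on CPU-only builds, e.g. "11.8" on CUDA builds
print(torch.cuda.device_count())  # number of GPUs torch can see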
Ooo, I get False. And this is in the same venv that runs the llama-cpp-python server, though I understand your point that HF uses CUDA differently
Ah, there's the issue then 👀 PyTorch has a pretty good install command generator. I would probably uninstall torch and reinstall it with the command this generates:
https://pytorch.org/get-started/locally/
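For example, for pip with CUDA 11.8 the generator produces something along these lines (worth checking the page for the current command rather than copying this verbatim):

Plain Text
pip uninstall torch
pip install torch --index-url https://download.pytorch.org/whl/cu118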
sick, I will give that a try and report back!
@Logan M Amazing, that worked! I now get torch.cuda.is_available() --> True and can see that:
  • my original script is attaching a process to the GPU
  • the CPU no longer jumps to 100% during the embeddings process
  • the VectorStoreIndex is created much more quickly than before, probably by an order of magnitude
Thanks again for the help!
Nice! Glad it worked!