
Updated 3 months ago

When I run the following code, I am getting an OutOfMemory error

When I run the following code, I am getting an OutOfMemory error. I can run Ollama without issue in the terminal, but this script is causing OOM ... what do I need to change here? Can someone also explain whether I'm doing the ServiceContext bit correctly? I'm not really sure I understand what it's supposed to be doing and was honestly just doing some copypasta there:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, ServiceContext, VectorStoreIndex
from llama_index.llms.ollama import Ollama

llm = Ollama(model="mistral", request_timeout=90.0)
Settings.llm = llm
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

resp = llm.complete(
    "Suppose that you could prove from first principles that no group of odd "
    "order could compute the majority function. Why would this be a major result?"
)

# docs is defined earlier in the script (not shown)
index = VectorStoreIndex.from_documents(docs, show_progress=True, service_context=service_context)

Why is this causing CUDA to run out of memory when the same model runs fine in the terminal?
2 comments
Try reducing the batch size for embeddings:

HuggingFaceEmbedding(..., embed_batch_size=1)

Also, you can remove the ServiceContext; if you keep it, you end up loading an embedding model twice.
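Putting both suggestions together, a rough sketch of the revised script could look like this (assuming `docs` is the list of documents you already load elsewhere):

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Configure the LLM and embedding model once via Settings; no ServiceContext needed.
Settings.llm = Ollama(model="mistral", request_timeout=90.0)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    embed_batch_size=1,  # smaller batches lower peak GPU memory during embedding
)

# The index picks up Settings automatically; `docs` is assumed to be loaded elsewhere.
index = VectorStoreIndex.from_documents(docs, show_progress=True)

With only one copy of the embedding model loaded and a batch size of 1, the embedding pass should fit in whatever GPU memory is left after Ollama loads mistral.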
thanks so much! that seems to be working 🙂