When I run the following code, I get an out-of-memory error. I can run Ollama without any issue in the terminal, but this script runs CUDA out of memory. What do I need to change here? Can someone also explain whether I'm using ServiceContext correctly? I'm not sure I understand what it's supposed to be doing and was honestly just copy-pasting that part:
from llama_index.core import Settings, ServiceContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

llm = Ollama(model="mistral", request_timeout=90.0)
Settings.llm = llm

# this is the part I'm unsure about (copied from an example I found)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

resp = llm.complete("Suppose that you could prove from first principles that no group of odd order could compute the majority function. Why would this be a major result?")

# docs is the list of Documents I load earlier (omitted here)
index = VectorStoreIndex.from_documents(docs, show_progress=True, service_context=service_context)
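In case it's relevant, this is how I was planning to check whether the embedding model is claiming GPU memory alongside Ollama. This is just a rough sketch on my part, assuming a torch backend; the device argument at the end is something I believe HuggingFaceEmbedding accepts but I haven't verified:

import torch
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

def report(label):
    # print how much CUDA memory this Python process has allocated through torch
    if torch.cuda.is_available():
        print(f"{label}: {torch.cuda.memory_allocated() / 1e6:.1f} MB allocated")

report("before loading embedding model")

# same embedding model as in my script; I believe it defaults to the GPU when one is visible
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

report("after loading embedding model")

# the workaround I wanted to try next: keep the embeddings on the CPU
# (assumption on my part that device="cpu" is supported)
cpu_embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2", device="cpu"
)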
Why is this running CUDA out of memory when the same model runs fine from the terminal?
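Also, am I even supposed to be using ServiceContext anymore? My (possibly wrong) understanding is that it's deprecated in newer llama_index releases in favour of the global Settings object, so the whole setup would collapse to something like the sketch below; please correct me if I've misread the docs:

from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# configure the LLM and embedding model once, globally; no ServiceContext at all
Settings.llm = Ollama(model="mistral", request_timeout=90.0)
Settings.embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# docs is the same list of Documents as above; the index picks up Settings automatically
index = VectorStoreIndex.from_documents(docs, show_progress=True)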