Hi, I'm running into an "OutOfMemoryError: CUDA out of memory" error. I have an RTX 3060 Ti with 8 GB of VRAM but still can't run this model. I also tried Kaggle with a 16 GB GPU and got the same error.
Here is my code for a simple LLM pipeline with LlamaIndex:
# Load your data
from llama_index.core import SimpleDirectoryReader, SummaryIndex
documents = SimpleDirectoryReader("/kaggle/input/dataset").load_data()
index = SummaryIndex.from_documents(documents)
from llama_index.core import PromptTemplate
# Transform a string into zephyr-specific input
def completion_to_prompt(completion):
    ...

# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    ...
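For reference, my two helpers follow the zephyr chat template from the LlamaIndex example; roughly they look like this (paraphrased from memory, so the exact strings may differ slightly):

def completion_to_prompt(completion):
    # wrap a plain completion in the zephyr chat template
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"

def messages_to_prompt(messages):
    # rebuild the zephyr chat template from a list of chat messages
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"
    # make sure a system turn is present and end with the assistant turn
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    return prompt + "<|assistant|>\n"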
import torch
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
Settings.llm = HuggingFaceLLM(
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    context_window=3900,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="auto",
)
# define embed model
Settings.embed_model = "local:BAAI/bge-base-en-v1.5"
# Query and print response
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do after his time at Y Combinator?")
print(response)
Is there any solution so I can use this model? Maybe by changing some lines in the code? What I don't understand is why it still won't run even on the 16 GB GPU on Kaggle.
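One thing I was thinking of trying next is loading the model in 4-bit (or at least half precision) through model_kwargs, roughly like this. This is just a sketch based on the transformers BitsAndBytesConfig docs, not tested, and it needs the bitsandbytes package installed:

import torch
from transformers import BitsAndBytesConfig
from llama_index.core import Settings
from llama_index.llms.huggingface import HuggingFaceLLM

# 4-bit quantization config for loading the model weights
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

Settings.llm = HuggingFaceLLM(
    model_name="StabilityAI/stablelm-tuned-alpha-3b",
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
    context_window=3900,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    device_map="auto",
    # model_kwargs should be forwarded to from_pretrained, if I read the docs right
    model_kwargs={"quantization_config": quantization_config},
)

Would that be the right direction, or is something else in the pipeline causing the OOM?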