
I need to run open-source LLMs on my Mac

I need to run open-source LLMs on my Mac M1 Pro.
But I keep getting the error: "No GPU found. A GPU is needed for quantization."

Has anyone had any luck running LLMs on macOS with an M1?

Is it possible to use Docker for that?
24 comments
Indeed the error is correct.

I would just use Ollama tbh
so much easier to configure
I am now trying to set up llama.cpp
I will give Ollama a try if I don't succeed with the llama-cpp-python binding
@Logan M Do you know if I can use llama.cpp or Ollama with the LlamaIndex framework?
you sure can, both work 👍
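For reference, here's a minimal sketch of wiring Ollama into LlamaIndex using the legacy 0.9-style imports that appear later in this thread; the model name, data directory, and example query are placeholders, not anything from this conversation:

from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms import Ollama

# Ollama runs as a local server; this assumes the model was pulled beforehand (ollama pull llama2)
llm = Ollama(model="llama2")

# attach the local LLM (plus a local embedding model) to LlamaIndex
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
print(query_engine.query("What is this document about?"))

Swapping a LlamaCPP instance in for the llm works the same way, which is why both options plug into the framework.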
Last question: do you know if they work well with Docker on M1? 😄
I would love to just have a Docker setup with, for instance, llama.cpp that works
yeah, not that I know of 😅 I'm sure it's maybe out there.

Ollama is really simple though, it's like a one-click install
llama-cpp is so complex 😅
okay, I will play with it. Thanks 😄
I am able to set up Llama 2 locally and downloaded the model codellama-7b.Q4_0.gguf, but when I run the examples it just repeats the prompt and doesn't answer the question. I'm not running on GPU and have turned off the GPU setting with model_kwargs={"n_gpu_layers": 0}. Any ideas? The output just loops like this:

<<[/INST]>>

<<[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible and follow ALL given instructions. Do not speculate or make up information. Do not reference any given instructions or context.
<</SYS>>

Can you write me a poem about fast cars? [/INST]

<<[/INST]>>

<<[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible and follow ALL given instructions. Do not speculate or make up information. Do not reference any given instructions or context.
<</SYS>>

Can you write me a poem about fast cars? [/INST]

<<[/INST]>>

<<[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible and follow ALL given instructions. Do not speculate
How did you set up the LLM object?
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

model_url = None  # since I downloaded the model already
model_path = "./codellama-7b.Q4_0.gguf"

llm = LlamaCPP(
    # you can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_path,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 0},  # <---- changed from 1
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
Remove the messages_to_prompt and completion_to_prompt kwargs; codellama has no prompt formatting according to the model card.
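For example, a sketch of the same constructor with those two hooks dropped, so the raw prompt goes straight to codellama (everything else kept as in the snippet above):

llm = LlamaCPP(
    model_path="./codellama-7b.Q4_0.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    # CPU-only here; raise n_gpu_layers if llama-cpp-python was built with Metal support
    model_kwargs={"n_gpu_layers": 0},
    # no messages_to_prompt / completion_to_prompt: the prompt is passed through unchanged
    verbose=True,
)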
Bruh! That was it. Prompt template: None, just {prompt}. No way I would've figured that out. But without a prompt template, does the method by which I synthesize change? (i.e. using the results from index.as_query_engine() and
res = index.query("Tell me about communication strategies?")). How would I pass "res" as context, ahead of the user's prompt, into the LLM?
This isn't setting the prompt template to none, but rather, it's disabling any processing of existing templates in llama-index
some models require very specific prompt formatting, which is what those two hooks are for
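As a rough illustration of what those hooks do (hypothetical formatters, not the library's exact implementation): a Llama-2-chat style model wants its [INST]/<<SYS>> markers, whereas a model whose card says "Prompt template: None" needs no wrapping at all.

# hypothetical hooks for a Llama-2-chat style model, for illustration only
def completion_to_prompt(completion: str) -> str:
    # wrap a plain completion request in the model's expected chat format
    return f"[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n{completion} [/INST]"

def messages_to_prompt(messages) -> str:
    # flatten a list of ChatMessage objects into a single formatted string
    return "\n".join(f"{m.role}: {m.content}" for m in messages)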
I changed the model and I'm unsure whether to leverage the messages_to_prompt / completion_to_prompt parameters. Instead I created the prompt based on the model card details. I thought there was a way to source the information better, so you could tell whether the information came from the query rather than from the LLM.

# model_path = "./llama-2-7b-chat.Q4_0.gguf"

from llama_index.prompts import PromptTemplate

template = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]
"""

qa_template = PromptTemplate(template)

# you can create a text prompt (for the completion API)

context_str = """You're asking indirected indirect open-ended
questions to explore and invite them to self-assess? With a, there is a
...
that either in a lesson or as a special training on its own. Yeah, I'm
here. | thought the. Okay well let me back it up a little bit. A talk
with um, and | feel like that we had Line. """

query_str = "what strategies are used to speak inspirationally"

# Combine context and query into a single prompt
combined_prompt = f"Context: {context_str}\nQuery: {query_str}"

# Create the final prompt using the template
final_prompt = template.format(prompt=combined_prompt)

response_iter = llm.stream_complete(final_prompt)
for response in response_iter:
    print(response.delta, end="", flush=True)
@Logan M Any thoughts on this? I'm not convinced that it's synthesizing any of the info from the index query. This is my first experience using LlamaIndex.
So many prompts lol

Calling llm.stream_complete does not query the index though. Unless I'm missing something
Sorry, I left out the index query to keep it short. context_str was the output from:

index2 = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context, storage_context=storage_context)
context_str = index2.as_query_engine()
res2 = xxx.query("Tell me about interview strategies?")
print(res2)
combined_prompt = f"Context: {context_str}\nQuery: {query_str}"
The chunk size is set to 1024 for each node. When I review the retrieved node, it doesn't seem like the info is synthesized in the response from the LLM.
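For what it's worth, here is a sketch of letting the query engine do both retrieval and synthesis with the local LLM, rather than building combined_prompt by hand. It reuses the vector_store / service_context / storage_context names from the snippets above and assumes the service_context carries the LlamaCPP llm, so the retrieved nodes actually reach it:

from llama_index import VectorStoreIndex

index2 = VectorStoreIndex.from_vector_store(
    vector_store,
    service_context=service_context,
    storage_context=storage_context,
)
query_engine = index2.as_query_engine(similarity_top_k=3, streaming=True)

res2 = query_engine.query("Tell me about interview strategies?")
res2.print_response_stream()

# inspect what was actually retrieved and handed to the LLM
for node in res2.source_nodes:
    print(node.score, node.node.get_content()[:200])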