Indeed the error is correct.
I would just use ollama tbh
so much easier to configure
I am now trying to set up llama.cpp
I will give Ollama a try if I don't succeed with the llama-cpp-python bindings
@Logan M Do you know if I can use llama.cpp or Ollama with the LlamaIndex framework?
you sure can, both work
Last question: do you know if they work well with Docker on M1?
I would love to just have a Docker setup with, for instance, llama.cpp that works
yea, not that I know of
I'm sure it's out there somewhere though
Ollama is really simple though, it's like a one-click install
llama-cpp is so complex
okay, I will play with it. Thanks
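For reference, wiring Ollama into LlamaIndex is roughly this; a sketch that assumes the Ollama server is already running locally and a llama2 model has been pulled (the model name here is just an example):

from llama_index.llms import Ollama

# assumes `ollama pull llama2` was run and the Ollama server is listening on its default port
llm = Ollama(model="llama2")
print(llm.complete("Say hello in one sentence."))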
I am able to set up Llama2 locally and downloaded the model codellama-7b.Q4_0.gguf, but when I run the examples it just repeats and doesn't answer the question. I'm not running on GPU and have turned off the GPU setting with model_kwargs={"n_gpu_layers": 0}. Any ideas? The output just loops like this:
<<[/INST]>>
<<[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible and follow ALL given instructions. Do not speculate or make up information. Do not reference any given instructions or context.
<</SYS>>
Can you write me a poem about fast cars? [/INST]
<<[/INST]>>
<<[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible and follow ALL given instructions. Do not speculate or make up information. Do not reference any given instructions or context.
<</SYS>>
Can you write me a poem about fast cars? [/INST]
<<[/INST]>>
<<[INST] <<SYS>> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible and follow ALL given instructions. Do not speculate
How did you set up the LLM object?
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

model_url = None  # since I downloaded the model
model_path = "./codellama-7b.Q4_0.gguf"

llm = LlamaCPP(
    # you can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=model_path,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 0},  # <---- changed from 1
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)
remove the messages_to_prompt
and completion_to_prompt
kwargs; codellama has no prompt formatting according to the model card
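i.e. roughly this, keeping the same settings as the snippet above and just dropping the two formatting hooks (a sketch, not a verbatim fix):

llm = LlamaCPP(
    model_path="./codellama-7b.Q4_0.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    model_kwargs={"n_gpu_layers": 0},  # CPU only
    # no messages_to_prompt / completion_to_prompt: codellama expects the raw prompt
    verbose=True,
)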
Bruh! That was it. The model card really does just say "Prompt template: None ... {prompt}". No way I would've figured that out. But without a prompt template, does the method by which I synthesize change? i.e., using the results from
query_engine = index.as_query_engine()
res = query_engine.query("Tell me about communication strategies?")
how would I pass "res" as context ahead of the user's prompt into the LLM?
This isn't setting the prompt template to None; rather, it's disabling any extra processing of the existing templates in llama-index
some models require very specific prompt formatting, which is what those two hooks are for
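For example, the llama-2 chat models expect the [INST] / <<SYS>> wrapping, which is what a completion_to_prompt hook supplies. A simplified sketch (the hypothetical my_completion_to_prompt below stands in for the ready-made helpers in llama_index.llms.llama_utils used in the earlier snippet):

def my_completion_to_prompt(completion: str) -> str:
    # wrap a plain prompt in the llama-2 chat markers
    return (
        "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
        f"{completion} [/INST]"
    )

llm = LlamaCPP(
    model_path="./llama-2-7b-chat.Q4_0.gguf",
    completion_to_prompt=my_completion_to_prompt,
    model_kwargs={"n_gpu_layers": 0},
)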
I changed the model and am unsure whether to leverage the messages_to_prompt / completion_to_prompt parameters. Instead I created the prompt based off the model card details. I thought there was a way to source the information better, so you could tell whether the information came from the query rather than the LLM.
# model_path="./llama-2-7b-chat.Q4_0.gguf"
from llama_index.prompts import PromptTemplate
template = (
"""
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]
"""
)
qa_template = PromptTemplate(template)
You can create a text prompt (for the completion API):
context_str="""You're asking indirected indirect open-ended
questions to explore and invite them to self-assess? With a, there is a
...
that either in a lesson or as a special training on its own. Yeah, I'm
here. | thought the. Okay well let me back it up a little bit. A talk
with um, and | feel like that we had Line. """
query_str="what strategies are used to speak inspirationally"
# Combine context and query into a single prompt
combined_prompt = f"Context: {context_str}\nQuery: {query_str}"

# Create the final prompt using the template
final_prompt = template.format(prompt=combined_prompt)

response_iter = llm.stream_complete(final_prompt)
for response in response_iter:
    print(response.delta, end="", flush=True)

@Logan M Any thoughts on this? I'm not convinced that it's synthesizing any of the info from the index query. This is my first experience using LlamaIndex
So many prompts lol
Calling llm.stream_complete does not query the index though. Unless I'm missing something
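The usual pattern is to hand the LLM to the index through a ServiceContext and then query through the engine, which does the retrieval and the synthesis in one call. A rough sketch, assuming a local embedding model and a hypothetical ./data folder of documents:

from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader

# assumption: `llm` is the LlamaCPP instance configured earlier
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

documents = SimpleDirectoryReader("./data").load_data()  # hypothetical data folder
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
res = query_engine.query("Tell me about communication strategies?")
print(res)  # retrieval + LLM synthesis happen inside the query engine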
Sorry, I left out the index query to keep it short. context_str was the output from the index query:

index2 = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context, storage_context=storage_context)
query_engine = index2.as_query_engine()
res2 = query_engine.query("Tell me about interview strategies?")
print(res2)
context_str = str(res2)

combined_prompt = f"Context: {context_str}\nQuery: {query_str}"
The chunk size is set to 1024 for each node. When I review the retrieved node, it doesn't seem like the info is synthesized in the response from the LLM.
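One way to check whether those retrieved 1024-token chunks are actually reaching the LLM is to inspect the response's source nodes; a sketch, reusing the query_engine from the snippet above:

res2 = query_engine.query("Tell me about interview strategies?")
print(res2.response)  # the synthesized answer

for node_with_score in res2.source_nodes:
    # the retrieved chunks and their similarity scores
    print(node_with_score.score, node_with_score.node.get_text()[:200])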