Hi, I'm following the llama.cpp example in the documentation, but I get an error when trying to use a Hugging Face model. I'm running on an Intel CPU.

https://gpt-index.readthedocs.io/en/v0.9.2/examples/llm/llama_2_llama_cpp.html

model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 0},  # <-- I put this to 0 as I don't have a GPU
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

gguf_init_from_file: invalid magic characters tjgg.
error loading model: llama_model_loader: failed to load model from /tmp/llama_index/models/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_load_model_from_file: failed to load model
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

Does anybody know if I should change the version of the model or of the llama-cpp-python package?
I've tried, for instance, with this version, but it also didn't work:
!pip install llama-cpp-python==0.1.78
Not sure if I'm right, but I have seen this pattern πŸ˜…

AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

And maybe, as Logan had mentioned, your model is going out of memory while loading.

MAYBE!
Strange. I'm running on CPU with 64GB of RAM. I've seen a message related to the same error output that suggested using a GGUF model instead of a GGMLv3 one...

I'll try with this model:
model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q2_K.gguf"
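For anyone hitting the same error, this is roughly the CPU-only setup I'm aiming for with the GGUF model (a sketch, assuming the llama-index 0.9.x import paths from the linked docs page):

from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q2_K.gguf"

llm = LlamaCPP(
    model_url=model_url,               # GGUF model, downloaded automatically
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 0},  # 0 = CPU only
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)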
Yeah, could be. That's why I added MAYBE πŸ˜…
thanks for the indication πŸ™‚
it's working with the gguf format
[Attachment: image.png]
Inference time is decent given that I'm using an Intel 1165G7 CPU.
yeah it's actually good πŸ‘
Do you know if it's possible to install llama.cpp on another laptop with an M1 Pro CPU and make the call from the less powerful laptop with the Intel CPU?
I could do something similar with Ollama, but in the Ollama section of the docs I don't see that LlamaIndex has anything similar. With Ollama one uses the requests package like this:

url = "http://192.168.1.xyz:11435/api/generate"
data = {
    "model": "llama2-uncensored",
    "prompt": prompt,
    "stream": False,
}
result = requests.post(url, json=data)
json_data = json.loads(result.text)
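If LlamaIndex can do the same thing, I guess it would look roughly like this (just a sketch, assuming the Ollama LLM class in llama-index 0.9.x and its base_url parameter; not tested from the remote laptop):

from llama_index.llms import Ollama

# point the client at the Ollama server running on the other laptop
llm = Ollama(
    model="llama2-uncensored",
    base_url="http://192.168.1.xyz:11435",  # same placeholder host/port as above
    request_timeout=120.0,
)

response = llm.complete(prompt)
print(response.text)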
thanks for the help @WhiteFang_Jr
For a 7B-parameter model there was a significant gain from changing the CPU.
[Attachment: image.png]
This is way better @davidp πŸŽ‰
I'm trying to build the whole RAG pipeline with local resources. For the embeddings, Ollama doesn't have that functionality, so I've used LangChain to load a model from Hugging Face. It's working pretty well, but do you know if it's a good idea to mix embeddings from a model X (bge-base-en) and then use a model Y (llama2 7B) to generate the final answer?
[Attachment: image.png]
Yes, it's fine, as the embedding model works separately from the LLM.

It converts all the docs into embeddings, retrieves the related chunks based on the user query, and then passes them to the LLM.

Also, you can use the HF embeddings directly from LlamaIndex: https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface.html#huggingfaceembedding
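Roughly like this (a sketch for llama-index 0.9.x; "BAAI/bge-base-en" here just mirrors the bge-base model you mentioned):

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

# embedding model X (retrieval) and LLM Y (generation) are configured independently
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en")

service_context = ServiceContext.from_defaults(
    llm=llm,  # e.g. the LlamaCPP or Ollama llm from above
    embed_model=embed_model,
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()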
thanks. It also works fine
[Attachment: image.png]
Actually, I was wondering what the final query passed to the Ollama model from LlamaIndex looks like at the generation stage. Theoretically, the retrieval gets some documents or chunks of documents, but what is then told to Ollama? Something like "make me a summary of the following chunks:"? Is it possible to see the prompt engineering behind it?
Before these tests with LlamaIndex, I was using Weaviate with GPT4All embeddings, and for the generation stage I was calling Ollama with a prompt telling it to make a summary of all the retrieved chunks.
Hi @WhiteFang_Jr, I decided to use Traceloop to see the constructed prompt that is passed to the LLM based on the query and the retrieved documents. The Traceloop team had to update their library to support Ollama, but it's now working.
My remaining question is whether the constructed prompt can also be seen from the command line, with some option on the query engine or some other class...
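One thing I might try as well is the simple global handler, which should print the LLM inputs/outputs to stdout while queries run (a sketch, assuming llama-index 0.9.x still exposes set_global_handler):

import llama_index

# prints every prompt/response pair to the console as queries run
llama_index.set_global_handler("simple")

response = query_engine.query("What does the document say about X?")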
[Attachment: Screenshot_2023-11-25_at_14.30.48.png]
at least the prompt shape can be obtained like this:
[Attachment: image.png]
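In code, that screenshot corresponds to something like this (a sketch, assuming the query engine exposes get_prompts() as in recent llama-index releases):

prompts_dict = query_engine.get_prompts()

for key, prompt in prompts_dict.items():
    print(key)
    print(prompt.get_template())
    print("---")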