Hi, I'm following the llama.cpp example in the documentation, but I get an error when trying to use a Hugging Face model. I'm running on an Intel CPU.

https://gpt-index.readthedocs.io/en/v0.9.2/examples/llm/llama_2_llama_cpp.html

model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 0},  # <-- I put this to 0 as I don't have a GPU
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

gguf_init_from_file: invalid magic characters tjgg.
error loading model: llama_model_loader: failed to load model from /tmp/llama_index/models/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_load_model_from_file: failed to load model
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

Does anybody know if I should change the version of the model or of the llama-cpp-python package?
I've tried, for instance, with this version, but it also didn't work:
!pip install llama-cpp-python==0.1.78
Not sure if I'm right, but I have seen this pattern πŸ˜…

AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

And maybe, as Logan had mentioned, your model is going out of memory while loading.

MAYBE!
Strange. I'm running on CPU with 64GB of RAM. I've seen a message related to the same error output that suggested using a GGUF model instead of a GGMLv3 one...

I'll try with this model:
model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q2_K.gguf"
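For anyone hitting the same error, this is roughly the CPU-only setup I'm aiming for with the GGUF model (a sketch, assuming the llama-index 0.9.x import paths from the linked docs page):

from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

model_url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q2_K.gguf"

llm = LlamaCPP(
    model_url=model_url,               # GGUF model, downloaded automatically
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 0},  # 0 = CPU only
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)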
Yeah, could be. That's why I added MAYBE πŸ˜…
thanks for the indication πŸ™‚
it's working with the gguf format
[Attachment: image.png]
Inference time is decent given that I'm using an Intel 1165G7 CPU.
yeah it's actually good πŸ‘
Do you know if it's possible to install llama.cpp on another laptop with an M1 Pro CPU and make the call from the less powerful laptop with the Intel CPU?
I could do something similar with Ollama, but in the Ollama section of the docs I don't see that LlamaIndex has anything similar. With Ollama one uses the requests package like this:

url = "http://192.168.1.xyz:11435/api/generate"
data = {
    "model": "llama2-uncensored",
    "prompt": prompt,
    "stream": False,
}
result = requests.post(url, json=data)
json_data = json.loads(result.text)
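If LlamaIndex can do the same thing, I guess it would look roughly like this (just a sketch, assuming the Ollama LLM class in llama-index 0.9.x and its base_url parameter; not tested from the remote laptop):

from llama_index.llms import Ollama

# point the client at the Ollama server running on the other laptop
llm = Ollama(
    model="llama2-uncensored",
    base_url="http://192.168.1.xyz:11435",  # same placeholder host/port as above
    request_timeout=120.0,
)

response = llm.complete(prompt)
print(response.text)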
thanks for the help @WhiteFang_Jr
For a 7B-parameter model there was a significant gain from changing the CPU.
[Attachment: image.png]
This is way better @davidp πŸŽ‰
I'm trying to build the whole RAG pipeline with local resources. For the embeddings, Ollama doesn't have that functionality, so I've used LangChain to load a model from Hugging Face. It's working pretty well, but do you know if it's a good idea to mix embeddings from a model X (bge-base-en) and then use a model Y (llama2 7B) to generate the final answer?
[Attachment: image.png]
Yes, it's fine, as the embedding model works separately from the LLM.

It converts all the docs into embeddings, retrieves the related chunks based on the user query, and then passes them to the LLM.

Also, you can use the HF embeddings directly from LlamaIndex: https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface.html#huggingfaceembedding
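Roughly like this (a sketch for llama-index 0.9.x; "BAAI/bge-base-en" here just mirrors the bge-base model you mentioned):

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

# embedding model X (retrieval) and LLM Y (generation) are configured independently
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en")

service_context = ServiceContext.from_defaults(
    llm=llm,  # e.g. the LlamaCPP or Ollama llm from above
    embed_model=embed_model,
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()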
thanks. It also works fine
[Attachment: image.png]
Actually, I was wondering what the final query passed to the Ollama model from LlamaIndex looks like at the generation stage. Theoretically, the retrieval gets some documents or chunks of documents, but what is then told to Ollama? Something like "make me a summary of the following chunks:"? Is it possible to see the prompt engineering behind it?
Before these tests with LlamaIndex, I was using Weaviate with GPT4All embeddings, and for the generation stage I was calling Ollama with a prompt telling it to make a summary of all the retrieved chunks.
Hi @WhiteFang_Jr, I decided to use Traceloop to see the constructed prompt that is passed to the LLM based on the query and the retrieved documents. The Traceloop team had to update their library to support Ollama, but it's now working.
My remaining question is whether the constructed prompt can also be seen from the command line, with some option on the query engine or some other class...
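One thing I might try as well is the simple global handler, which should print the LLM inputs/outputs to stdout while queries run (a sketch, assuming llama-index 0.9.x still exposes set_global_handler):

import llama_index

# prints every prompt/response pair to the console as queries run
llama_index.set_global_handler("simple")

response = query_engine.query("What does the document say about X?")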
[Attachment: Screenshot_2023-11-25_at_14.30.48.png]
at least the prompt shape can be obtained like this:
[Attachment: image.png]
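In code, that screenshot corresponds to something like this (a sketch, assuming the query engine exposes get_prompts() as in recent llama-index releases):

prompts_dict = query_engine.get_prompts()

for key, prompt in prompts_dict.items():
    print(key)
    print(prompt.get_template())
    print("---")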