Try updating your llama-cpp-python installation
which version should i use?
Whatever is the latest/newest
It was added a few weeks ago
are you in a notebook? try restarting the kernel
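after the restart you can double check which version the kernel actually picked up, something like this (I think the package exposes a version string, if not just re-run pip show):

import llama_cpp
print(llama_cpp.__version__)  # should match the version you just installed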
File ~/miniconda3/lib/python3.11/site-packages/llama_cpp/llama.py:2111, in Llama.n_ctx(self)
2109 def n_ctx(self) -> int:
2110 """Return the context window size."""
-> 2111 return self._ctx.n_ctx()
File ~/miniconda3/lib/python3.11/site-packages/llama_cpp/llama.py:428, in _LlamaContext.n_ctx(self)
427 def n_ctx(self) -> int:
--> 428 assert self.ctx is not None
429 return llama_cpp.llama_n_ctx(self.ctx)
AssertionError:
so not the same error
not sure how n_ctx is None, it's definitely defaulting to a value
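for reference, n_ctx comes from context_window on the llama-index side. Going from memory of the 0.9-era API, the wrapper looks roughly like this, so treat it as a sketch (the model path is a placeholder):

from llama_index.llms import LlamaCPP

# sketch only -- swap the placeholder path for your actual gguf file
llm = LlamaCPP(
    model_path="/path/to/llama-2-13b-chat.Q4_0.gguf",
    context_window=3900,   # this is what should end up as n_ctx in llama-cpp
    max_new_tokens=256,
)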
pip show llama-cpp-python
what do you see?
Name: llama_cpp_python
Version: 0.2.18
Summary: Python bindings for the llama.cpp library
Home-page:
Author:
Author-email: Andrei Betlen <abetlen@gmail.com>
License: MIT
Location: /Users/ilanpinto/miniconda3/lib/python3.11/site-packages
Requires: diskcache, numpy, typing-extensions
Required-by:
did a kernel restart in between
@Logan M i am out for 1H but feel free to reply back
i am stuck with this issue
now i am getting this
gguf_init_from_file: invalid magic characters tjgg.
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
error loading model: llama_model_loader: failed to load model from /Users/ilanpinto/Library/Caches/llama_index/models/llama-2-13b-chat.ggmlv3.q4_0.bin
llama_load_model_from_file: failed to load model
maybe delete the cache folder
notice that the downloaded model is ggml, but we want a gguf model actually
this /Users/ilanpinto/Library/Caches/llama_index/models one?
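yep, that one. Something like this should clear it (path copied from your error message, double-check it before deleting anything):

import shutil
from pathlib import Path

# cache dir taken from the traceback above -- verify before removing
shutil.rmtree(Path.home() / "Library" / "Caches" / "llama_index" / "models", ignore_errors=True)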
what's the difference between gguf and ggml?
Ggml was an old file type they stopped supporting
After 0.1.79 only gguf works
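if you ever want to check which one a file is, the first four bytes tell you: gguf files start with the ASCII magic GGUF, the older ggml family uses other magics (that "tjgg" in your error is presumably one of the old ones). Quick check, with a placeholder path:

# read the magic bytes of the downloaded file (placeholder path)
with open("/path/to/model.bin", "rb") as f:
    print(f.read(4))  # b"GGUF" means gguf; anything else is the old ggml-era format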
my colleague Erik Jacobs says hi
seems to work thank you!!!
got the same error
File ~/miniconda3/lib/python3.11/site-packages/llama_cpp/llama.py:2111, in Llama.n_ctx(self)
2109 def n_ctx(self) -> int:
2110 """Return the context window size."""
-> 2111 return self._ctx.n_ctx()
File ~/miniconda3/lib/python3.11/site-packages/llama_cpp/llama.py:428, in _LlamaContext.n_ctx(self)
427 def n_ctx(self) -> int:
--> 428 assert self.ctx is not None
429 return llama_cpp.llama_n_ctx(self.ctx)
AssertionError:
I have no idea anymore tbh, I would have to spin up llama-cpp and debug, but I don't really have time at the moment
I suggest getting it working without llama-index (loading and creating the model with llama-cpp directly), and then we can figure it out the differences from there
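roughly something like this, where the path is a placeholder for whatever gguf file you have locally:

from llama_cpp import Llama

# placeholder path -- point it at your local gguf file
llm = Llama(model_path="/path/to/llama-2-13b-chat.Q4_0.gguf", n_ctx=3900)
print(llm("Q: Name the planets in the solar system. A:", max_tokens=64))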
could that be related to jupyter?
I mean... maybe? Maybe try running in a .py script and see if that makes a difference
checking, in the meanwhile
another question
are the below versions compatible?
llama-index==0.9.0
llama_cpp_python==0.2.18
ok just to sanity check, running this myself now. Downloading the model file currently
in a fresh env I installed llama-index and llama-cpp-python, and this code ran fine (abbreviated for reading)
>>> from llama_index.llms.utils import resolve_llm
>>> llm = resolve_llm("local")
Downloading url https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf to path /tmp/llama_index/models/llama-2-13b-chat.Q4_0.gguf
total size (MB): 7365.83
7025it [03:46, 31.02it/s]
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /tmp/llama_index/models/llama-2-13b-chat.Q4_0.gguf (version GGUF V2)
...
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
Then
>>> llm.complete("Hello!")
llama_print_timings: load time = 9481.09 ms
llama_print_timings: sample time = 23.21 ms / 78 runs ( 0.30 ms per token, 3361.05 tokens per second)
llama_print_timings: prompt eval time = 9480.97 ms / 67 tokens ( 141.51 ms per token, 7.07 tokens per second)
llama_print_timings: eval time = 29942.15 ms / 77 runs ( 388.86 ms per token, 2.57 tokens per second)
llama_print_timings: total time = 39609.32 ms
CompletionResponse(text=" Hello! I'm here to assist you with any questions or tasks you may have. Please feel free to ask me anything, and I will do my best to provide a helpful and accurate response. I am programmed to be respectful, honest, and to follow all given instructions. Please go ahead and ask your question or provide the task you would like me to complete.", additional_kwargs={}, raw={'id': 'cmpl-90b687a9-13f8-4428-ac0f-2437c7cd173b', 'object': 'text_completion', 'created': 1700170879, 'model': '/tmp/llama_index/models/llama-2-13b-chat.Q4_0.gguf', 'choices': [{'text': " Hello! I'm here to assist you with any questions or tasks you may have. Please feel free to ask me anything, and I will do my best to provide a helpful and accurate response. I am programmed to be respectful, honest, and to follow all given instructions. Please go ahead and ask your question or provide the task you would like me to complete.", 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 67, 'completion_tokens': 77, 'total_tokens': 144}}, delta=None)
Maybe start over with a fresh venv
now i have other errors but i think i can handle it
thank you very much!!! @Logan M
Awesome, sounds good! (Well, not good, but better!)