Eventually I got it to run, but two issues remain. First, I get this warning:

.conda/lib/python3.12/site-packages/llama_cpp/llama.py:1138: RuntimeWarning: Detected duplicate leading "<s>" in prompt, this will likely reduce response quality, consider removing it... warnings.warn(
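My understanding is that llama-cpp-python prepends the BOS token itself when it tokenizes, so if the prompt string also starts with a literal "<s>" (as some tutorials write it), the model sees two. A minimal sketch of the fix, assuming a Llama-2/Mistral-style template (the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/model.gguf")  # placeholder path

# llama-cpp-python adds the BOS token ("<s>") during tokenization,
# so the prompt string should NOT start with a literal "<s>".
# bad:  prompt = "<s>[INST] Hello [/INST]"
prompt = "[INST] Hello [/INST]"  # assumed Llama-2 style template
out = llm(prompt, max_tokens=64)
print(out["choices"][0]["text"])
```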
Second, it was killing my CPU: usage went up to 99%. I rarely had issues like this with Ollama, even with much larger LLMs, which is very strange, since I thought llama-index was well optimized. But I am following the tutorial code very closely, so I am not sure what I am missing.
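For what it's worth, my guess is that the tutorial code never sets n_gpu_layers, which defaults to 0 in llama-cpp-python, so inference runs entirely on the CPU (Ollama offloads to the GPU automatically, which would explain the difference). A sketch of what I plan to try, assuming a CUDA/Metal build of llama-cpp-python (path and thread count are placeholders):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 offloads all layers to the GPU; 0 (the default) keeps everything on the CPU
    n_threads=8,      # cap CPU threads for whatever stays on the CPU
)
```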
My goal is to have much more granular control over my hardware, so that I can customize GPU layers and run benchmarks or other applications from Python.
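Concretely, the kind of benchmark I have in mind is something like this (a sketch only; the model path, prompt, and layer counts are placeholders):

```python
import time
from llama_cpp import Llama

MODEL = "./models/model.gguf"  # placeholder path
PROMPT = "Explain GGUF quantization in one paragraph."

# Reload the model for each offload setting and time a fixed generation.
for n_gpu_layers in (0, 16, 32, -1):  # -1 = offload every layer
    llm = Llama(model_path=MODEL, n_gpu_layers=n_gpu_layers, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.perf_counter() - start
    n_tokens = out["usage"]["completion_tokens"]
    print(f"n_gpu_layers={n_gpu_layers}: {n_tokens / elapsed:.1f} tok/s")
    del llm  # free the model before loading the next configuration
```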