The post asks if anyone has tried using Llama.cpp with LlamaIndex to access quantized models on a CPU-only setup, and whether using CustomLLM would work. The comments suggest that the LlamaIndex team plans to add native support for Llama.cpp, but in the meantime, a community member can instantiate the LlamaCpp class from LangChain and wrap it in the LangChainLLM wrapper provided by LlamaIndex. Another community member notes that Llama2 is picky about prompt formatting and suggests using the LlamaIndex Prompt class to create custom system and prompt formats. The final comment recommends wrapping Llama.cpp with the CustomLLM class, which allows customizing both the completion and chat endpoints, and using the utility functions provided by LlamaIndex.
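For reference, here is a minimal sketch of the LangChain route described in the summary, not an official example. It assumes llama-cpp-python, langchain, and llama-index are installed; the model path, import locations, and the `embed_model="local"` shortcut are illustrative and may differ between versions.

```python
from langchain.llms import LlamaCpp
from llama_index.llms import LangChainLLM
from llama_index import ServiceContext

# Load a local quantized model on CPU via llama.cpp (path is illustrative)
lc_llm = LlamaCpp(model_path="./models/llama-2-7b-chat.q4_0.bin", n_ctx=2048)

# Wrap the LangChain LLM so LlamaIndex can treat it like any other LLM
llm = LangChainLLM(llm=lc_llm)

# Use a local embedding model too, since a CPU-only setup has no OpenAI key
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
```

Any index or query engine built with this service context will then run its completions through llama.cpp.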
Curious if anyone has tried using Llama.cpp with LlamaIndex so they can easily access quantized models with a CPU-only setup. Will using CustomLLM do the trick?
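As a rough idea of how CustomLLM could do the trick, here is a sketch based on the generic custom-LLM pattern from the LlamaIndex docs. Class names, import paths, and the decorator location vary by version, and the model path is illustrative; llama-cpp-python is assumed for the actual inference.

```python
from typing import Any

from llama_cpp import Llama
from llama_index.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback

# Kept at module level so the pydantic-based CustomLLM class does not
# need to declare it as a field (model path is illustrative)
_llama = Llama(model_path="./models/llama-2-7b-chat.q4_0.bin", n_ctx=2048)


class LlamaCppCustomLLM(CustomLLM):
    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=2048, num_output=256, model_name="llama-2-7b-chat"
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        out = _llama(prompt, max_tokens=256)
        return CompletionResponse(text=out["choices"][0]["text"])

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # Simplest possible fallback: yield the full completion in one chunk
        yield self.complete(prompt, **kwargs)
```

Per the summary above, the chat endpoint can be customized in the same way if the model's chat format matters.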
Sorry for the late reply. I was able to load the model with LangChain, but Llama2 is very picky about prompt format; it needs things like [INST]. Do I just use the LlamaIndex Prompt class to create my own system and prompt formats? I know the HuggingFaceLLM class takes both as parameters.
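If it helps, here is one way the Prompt class could be used to add Llama2's [INST]/<<SYS>> wrapping. The template text is only an example, and in newer llama-index releases the class is called PromptTemplate.

```python
from llama_index import Prompt

# Wrap the system message, context, and question in Llama-2's chat markers
LLAMA2_QA_TEMPLATE = (
    "[INST] <<SYS>>\n"
    "You are a helpful assistant. Answer using only the provided context.\n"
    "<</SYS>>\n\n"
    "Context information:\n{context_str}\n\n"
    "Question: {query_str} [/INST]"
)
text_qa_template = Prompt(LLAMA2_QA_TEMPLATE)

# Passed to a query engine, every request to the model gets this wrapping,
# e.g. index.as_query_engine(text_qa_template=text_qa_template)
```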