I myself use OpenAILike instead of HuggingFaceLLM, so I don't actually know how most of the parameters you listed would affect output quality. For example, I don't know whether HuggingFaceLLM takes care of prompt wrapping for you, so there is a chance that your query_wrapper_prompt is doing more harm than good.
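To illustrate the concern (a purely hypothetical sketch assuming a Mistral-style [INST] template; I don't know what HuggingFaceLLM actually does internally):

# Hypothetical illustration only: what the final prompt could look like if both
# your query_wrapper_prompt and the model's own chat template add instruction tags.
query_wrapper_prompt = "[INST] {query_str} [/INST]"  # your wrapper
wrapped_once = query_wrapper_prompt.format(query_str="What do I enjoy drinking?")
wrapped_twice = "[INST] " + wrapped_once + " [/INST]"  # a second, unwanted wrap
print(wrapped_twice)
# [INST] [INST] What do I enjoy drinking? [/INST] [/INST]  <- likely to confuse the model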
Since you are running models locally anyway, would you like to try OpenAILike and serve the LLM with LM Studio? Here's how:
from llama_index.llms import ChatMessage, OpenAILike

llm = OpenAILike(
    api_base="http://localhost:1234/v1",  # LM Studio's default local server endpoint
    timeout=600,  # secs
    api_key="loremIpsum",  # placeholder; the local server doesn't check the key
    is_chat_model=True,
    context_window=32768,
)
chat_history = [
    ChatMessage(role="system", content="You are a bartender."),
    ChatMessage(role="user", content="What do I enjoy drinking?"),
]
output = llm.chat(chat_history)
print(output)
(Copied from here: https://lmy.medium.com/comparing-langchain-and-llamaindex-with-4-tasks-2970140edf33)
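If you then want LlamaIndex's query engines to pick this LLM up, you can plug it into a ServiceContext. This is just a rough sketch assuming the pre-0.10 llama_index layout implied by the import above, plus a hypothetical ./data folder and sentence-transformers installed for the local embeddings:

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

# embed_model="local" keeps embeddings on your machine too (needs sentence-transformers)
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
documents = SimpleDirectoryReader("data").load_data()  # hypothetical ./data folder
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("What do I enjoy drinking?"))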