Looks like the Inference API doesn't actually use those function hooks
If you have a conversational model though, it should automatically handle formatting
Hm, it seems to me that any slight change in the prompt changes the response drastically. My issue now is that the model sometimes returns the metadata with the response (file name, context), other queries and their answers, or even the question itself inside the answer. It is very weird and only happens with Llama3/Mistral via HuggingFaceInferenceAPI on some queries, while GPT-4 via Azure OpenAI is fine
Also, small changes in the prompt can make the chatbot go from a correct answer to a wrong one for the same question
This is the current prompt:
Always answer the query {query_str} using the provided context information {context_str}, and not prior knowledge.
If you do not know the answer, you should say so.
Some rules you must follow:
1. Do not include context information or metadata.
2. Keep your answers concise but comprehensive.
3. Only provide the answer to the query asked and do not provide additional information.
Answer:
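For context, this is roughly how I'm attaching it to the query engine (a minimal sketch, not my exact code; assumes llama-index >= 0.10, an already-built `index`, and the default text_qa_template prompt key):
```python
from llama_index.core import PromptTemplate

# Custom QA prompt, same text as above.
qa_prompt = PromptTemplate(
    "Always answer the query {query_str} using the provided context "
    "information {context_str}, and not prior knowledge.\n"
    "If you do not know the answer, you should say so.\n"
    "Some rules you must follow:\n"
    "1. Do not include context information or metadata.\n"
    "2. Keep your answers concise but comprehensive.\n"
    "3. Only provide the answer to the query asked and do not provide additional information.\n"
    "Answer:"
)

query_engine = index.as_query_engine()
# "response_synthesizer:text_qa_template" is the prompt key used by the
# default response synthesizer.
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": qa_prompt}
)
```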
GPT-4 is leagues ahead of llama3
So that difference would make sense
I don't think llama3/mistral is conversational by default, so it might not be auto-formatting
Actually I lied: looking at the code, you might be able to provide messages_to_prompt, assuming you have a function that transforms the messages into the appropriate format (llama3's format is pretty complex)
I would also set is_chat_model=True
so that this chat() function is used consistently
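Something like this, roughly (just a sketch; I'm assuming the llama-index-llms-huggingface integration here, the import path can differ between versions, and the header tokens are llama3's standard chat format):
```python
# Import path may differ depending on your llama-index version.
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

def messages_to_prompt(messages):
    # Convert llama-index ChatMessage objects into llama3's chat format.
    prompt = "<|begin_of_text|>"
    for message in messages:
        prompt += (
            f"<|start_header_id|>{message.role.value}<|end_header_id|>\n\n"
            f"{message.content}<|eot_id|>"
        )
    # Leave the assistant header open so the model writes the answer.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

llm = HuggingFaceInferenceAPI(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    token="hf_...",  # your HF token (placeholder)
    messages_to_prompt=messages_to_prompt,
    is_chat_model=True,  # so llm.chat() is used consistently
)
```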
Well yeah, but I just want it to freaking stick to the prompt and not give out context information or even file info. Like, it sometimes tells you the reference file size lol
I'm actually using a query engine and not a chat engine. While I am indeed building a chatbot, it seems that the query engine performs better.
I'm using Llama3-8B-instruct, is that the case for that too?
what about the query function?
.chat() is used for query engines as well. Just depends on is_chat_model=True/False
Otherwise you can format the prompt template in the actual format expected for llama3
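Something along these lines (again just a sketch, reusing the same update_prompts hook; the special tokens are llama3's standard chat template, and the wording of the template itself is up to you):
```python
from llama_index.core import PromptTemplate

# Bake llama3's chat tokens straight into the text QA template so that even a
# plain completion call sends a well-formed llama3 prompt.
llama3_qa_prompt = PromptTemplate(
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Answer the query using only the provided context, not prior knowledge. "
    "Do not include context information or metadata in the answer."
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "Context:\n{context_str}\n\nQuery: {query_str}"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": llama3_qa_prompt}
)
```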
imo the Inference API is so confusing to use. I would just use Ollama
But thats just me
I initially used Ollama, but it takes a long time to run without a GPU. Say I use Ollama, do I need to format the template or not?
Interesting, I think your documentation can def be improved and I would love to contribute once I finish this project
Wait, just to verify: are .query() and .chat() practically references to the same function, the only difference being is_chat_model? I am asking because when I use my query engine with .chat(), I get an error
Ollama handles all the formatting for you. But yea, without any kind of GPU, it will be a tad slow
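The Ollama route is pretty small, for comparison (sketch; assumes the llama-index-llms-ollama package and that you've pulled the model locally with `ollama pull llama3`):
```python
from llama_index.llms.ollama import Ollama

# Ollama applies the model's own chat template server-side, so there is no
# messages_to_prompt or manual template formatting to worry about.
llm = Ollama(model="llama3", request_timeout=120.0)
query_engine = index.as_query_engine(llm=llm)
```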
Sorry, that's on the LLM, not the query engine. You'd still do query_engine.query()
, and under the hood, it will use llm.chat()
or llm.complete()
depending on is_chat_model
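So from your side nothing changes (sketch; the question string is just a placeholder):
```python
# You keep calling .query() on the query engine; whether llm.chat() or
# llm.complete() runs underneath is decided by is_chat_model on the LLM.
response = query_engine.query("What does the document say about X?")
print(response)
```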
Oh alright, makes sense. I will try passing is_chat_model and see the behavior, thank you very much.
btw, the prompt I'm using is clearly having an effect compared to the default prompt, which makes me suspect that the Inference API does do some formatting after all