I have just completed fine-tuning GPT-3.5-turbo following the LlamaIndex guide "Fine-tuning to Memorize Knowledge". Here are the results, if anyone is weighing fine-tuning vs RAG:
- first off, shout-out to the LlamaIndex team for putting together this well-thought-out and very comprehensive guide and eval framework, and for helping me resolve some of the issues. There are some minor hiccups with the code, but you can resolve those easily if they get flagged on your machine;
- I fine-tuned on a legal textbook, "Legal Research, Analysis & Writing". It is a very foundational treatise if you want to become a lawyer;
- ~1800 question/answer pairs, split 70/30 into train/val;
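For anyone reproducing this, the 70/30 split is just a seeded shuffle and slice. A minimal sketch (the `qa_pairs` list here is a hypothetical stand-in for the generated QA dataset):

```python
import random

def split_dataset(qa_pairs, train_frac=0.7, seed=42):
    """Shuffle QA pairs deterministically and split into train/val subsets."""
    pairs = list(qa_pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]

# ~1800 pairs -> roughly 1260 train / 540 val
qa_pairs = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(1800)]
train, val = split_dataset(qa_pairs)
print(len(train), len(val))  # 1260 540
```

Seeding the shuffle keeps the split reproducible across runs, which matters when you fine-tune twice against the same data.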
- did two iterations: first, fine-tuned using the train/val datasets, then fine-tuned the already fine-tuned model again using the complete dataset for training only;
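The guide's scripts handle the data prep, but for context, OpenAI chat fine-tuning expects JSONL records of chat messages; both passes use the same format, just with different subsets. A sketch of that conversion (the system prompt and helper names are my own, not from the guide):

```python
import json

def to_openai_messages(qa_pairs, system_prompt="You are a legal research expert."):
    """Convert QA pairs into the chat-format records OpenAI fine-tuning expects."""
    for pair in qa_pairs:
        yield {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["answer"]},
            ]
        }

def write_jsonl(records, path):
    """Write one JSON object per line, as the fine-tuning endpoint requires."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# e.g. write_jsonl(to_openai_messages(train_pairs), "train.jsonl")
```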
- you will see in the code that the ground truth (gt) comes from GPT-4, while the base is GPT-3.5. While I am a bit hesitant to compare apples to oranges, for the main question I was after (is the RAG framework a solid approach?), it worked;
- objective results:
'ft_rag_score': 0.775,
'ft_score': 0.725,
'rag_score': 0.825,
'base_score': 0.675
as you can see, 'rag only' wins;
- the temperature was 0 for all models;
- in terms of legal style, vocabulary, and coherence, GPT-3.5-turbo is already quite good on the 'how' part (it can explain how to write a legal memo, how to research a case, etc.), but still hallucinates a lot on the 'what' part (what is case X about, when was statute Y enacted, etc.);
- I was surprised to see that ft_rag is slightly worse than rag only, but it is good to know that grounding models in existing knowledge works great;
- happy to help or answer any questions. Thanks again, LlamaIndex team, for doing what you are doing.