The community members discuss how scores for faithfulness and relevancy are generated. One community member confirms that these scores come from an outside large language model (LLM) rather than a formula. Another community member asks whether other metrics like BLEU or ROUGE can be used, and is told that the maintainers are open to adding them, but that those metrics require ground truth data, which most people don't have, so they were a lower priority. The discussion also covers the limitations of ROUGE scores and why the current metrics were chosen. The community members are open to accepting contributions that add new metrics to the system.
So for things like faithfulness and relevancy, the scores are generated by some outside LLM, right? There's no formula that's used to generate that value?
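For anyone wondering what "generated by an LLM" can look like in practice, here is a minimal sketch, not the library's actual implementation. The prompt wording, the `faithfulness_score` function, and the `judge` callable (a stand-in for whatever LLM client you use) are all hypothetical names made up for illustration.

```python
from typing import Callable

# Hypothetical prompt; a real evaluator's wording will differ.
PROMPT = (
    "Context:\n{context}\n\n"
    "Response:\n{response}\n\n"
    "Is the response supported by the context? Answer YES or NO."
)


def faithfulness_score(judge: Callable[[str], str], context: str, response: str) -> float:
    """Ask a judge LLM for a YES/NO verdict and map it to 1.0 or 0.0."""
    verdict = judge(PROMPT.format(context=context, response=response))
    # No formula here: the score is whatever the judge model decides.
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
```

Here `judge` is any function that takes a prompt string and returns the model's text reply, so the same sketch works with different LLM backends.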
Thanks! I saw that cosine similarity can be switched for things like dot product and Euclidean distance, but is there any way to use other metrics like BLEU or ROUGE?
Also interesting. Is there a reason why you chose those metrics (were they good enough for measuring semantic similarity), or was it just a speed/ease of implementation situation?
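For reference, the three measures mentioned above can be sketched in plain Python as follows. This is only an illustration of the math on two embedding vectors, with function names chosen here, not the library's own code or API.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product normalized by both vector lengths (direction only)."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Note that Euclidean distance is a distance rather than a similarity, so a lower value means the embeddings are closer.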
In most cases people don't have ground truth to compare to, so it was a lower priority. Also, imo, they are a tad less helpful? maybe a hot take hahaha
There are so many ways to write a response. A ROUGE score of 30 isn't really that informative, even if it's what academia has clung to for the past few years
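To illustrate the point about wording, here is a simplified sketch of a ROUGE-1 style F1 score (unigram overlap on whitespace tokens, on a 0 to 1 scale rather than 0 to 100). It is not the official ROUGE implementation, which also handles stemming and other variants, and `rouge1_f1` is a name made up for this example.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the meeting was moved to friday"
paraphrase = "they rescheduled it for the end of the week"

# The paraphrase is a reasonable answer but shares only one token
# ("the") with the reference, so its overlap score is low.
print(rouge1_f1(paraphrase, reference))
print(rouge1_f1(reference, reference))
```

A response can be perfectly acceptable and still score low against a single reference, which is why the raw number is hard to interpret on its own.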