so for things like faithfulness and relevancy, the scores are generated by some outside LLM right? There's no formula that's used to generate that value?
Thanks! I saw that cosine similarity can be swapped for things like dot product and Euclidean distance, but is there any way to use other metrics like BLEU or ROUGE?
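For anyone following along, here's a rough sketch of what those three similarity modes compute on raw embedding vectors. It's plain Python, the function names are made up for illustration (not the library's API), and negating the Euclidean distance so that "larger means more similar" is an assumption on my part.

```python
import math

def dot_product(a, b):
    # Sum of element-wise products; sensitive to vector magnitude.
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product of the vectors after normalizing by their lengths,
    # so only direction matters. Assumes neither vector is all zeros.
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )

def euclidean_similarity(a, b):
    # Assumption: negate the distance so a higher score means "closer".
    return -math.dist(a, b)

a = [1.0, 0.0]
b = [2.0, 0.0]
print(cosine_similarity(a, b), dot_product(a, b), euclidean_similarity(a, b))
```

Same direction, different length: cosine treats the two vectors as identical, while dot product and Euclidean distance both react to the difference in magnitude.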
Also interesting. Is there a reason why you chose those metrics (were they good enough for measuring semantic similarity), or was it just a speed/ease-of-implementation situation?
In most cases people don't have ground truth to compare to, so it was a lower priority. Also, imo, they are a tad less helpful? maybe a hot take hahaha
There are so many ways to write a response. A ROUGE score of 30 isn't really that informative, even if it's what academia has clung to for the past few years.
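To make the "so many ways to write a response" point concrete, here's a minimal sketch of a ROUGE-1-style F1 (unigram overlap). It's a simplified stand-in, not an official ROUGE implementation: whitespace tokenization, no stemming, and the function name and example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    # Count unigrams in each text (lowercased, whitespace-split).
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it
    # appears in both texts.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the meeting was moved to friday"
paraphrase = "they rescheduled it for the end of the week"

print(rouge1_f1(reference, reference))   # exact copy of the reference
print(rouge1_f1(reference, paraphrase))  # reasonable paraphrase, little word overlap
```

The paraphrase says roughly the same thing as the reference but shares only one word with it, so it scores far below the exact copy.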