Is there a built-in method to evaluate a set of responses against a test set? I've seen the evaluation pipeline, but that seems geared toward telling you which source produced the response and self-checking whether the response is grounded (or hallucinated). I have a set of FAQs with expected answers, and I want to compare each generated response to its expected response.
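
To make the question concrete, here's a rough sketch of the loop I'm trying to avoid hand-rolling. Everything in it is a placeholder: `get_response`, the toy test set, and the `SequenceMatcher` metric are stand-ins I made up, not anything from the library.

```python
from difflib import SequenceMatcher

# Toy FAQ test set; in practice this would be loaded from my FAQ data.
test_set = [
    {"question": "How do I reset my password?",
     "expected": "Go to Settings > Account > Reset Password."},
]

def get_response(question: str) -> str:
    # Placeholder: this is where the actual pipeline call would go.
    return "Open Settings, then Account, then choose Reset Password."

def similarity(response: str, expected: str) -> float:
    # Stand-in metric (character-level ratio); a real evaluator would
    # presumably use embeddings or an LLM-as-judge instead.
    return SequenceMatcher(None, response, expected).ratio()

for case in test_set:
    response = get_response(case["question"])
    print(f"{case['question']}: {similarity(response, case['expected']):.2f}")
```

Is there something built in that does this comparison (with a better metric than string similarity), or is rolling my own the expected approach?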