Can someone please explain the evaluation code logic?

At a glance

The post asks for an explanation of the evaluation code logic. The comments indicate that the code is evaluating a retrieval system, comparing the performance of a base retriever and an ensemble retriever. The evaluation involves using various evaluators (correctness, faithfulness, semantic similarity) to assess the responses generated by the retrievers. However, the community members express confusion about the specific flow of the evaluation process, how the retrievers are used to evaluate the pre-created dataset, and the overall purpose of the comparison. There is no explicitly marked answer, but the community members suggest looking at the source code for the evaluators to better understand the logic.
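
At a high level, the evaluation discussed in the thread below follows this flow (a simplified sketch only; `query_engine` stands for either retrieval pipeline, and `evaluators` is a dict of hypothetical scoring callables standing in for the llama_index evaluator objects used in the actual code):

Python
def evaluate_pipeline(query_engine, questions, reference_answers, evaluators):
    """Answer each pre-generated question with the given query engine, then have
    each evaluator score the generated answer against its reference answer."""
    scores = {name: [] for name in evaluators}
    for question, reference in zip(questions, reference_answers):
        generated = query_engine.query(question)  # fresh retrieval + generation
        for name, score_fn in evaluators.items():
            scores[name].append(score_fn(question, str(generated), reference))
    # Average each metric over all questions
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}

# The same procedure is run twice, once per pipeline (base retriever and
# ensemble retriever), and the average scores are compared side by side.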

13 comments
Python
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)

from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.core.query_engine import RetrieverQueryEngine

import numpy as np

# LLM-as-judge evaluators; eval_llm is the LLM used as the judge
evaluator_c = CorrectnessEvaluator(llm=eval_llm)   # generated answer vs. reference answer
evaluator_s = SemanticSimilarityEvaluator()        # embedding similarity to the reference answer
evaluator_r = RelevancyEvaluator(llm=eval_llm)     # is the answer relevant to the query?
evaluator_f = FaithfulnessEvaluator(llm=eval_llm)  # is the answer grounded in the retrieved context?

pairwise_evaluator = PairwiseComparisonEvaluator(llm=eval_llm)  # head-to-head comparison of two answers


max_samples = 5  # evaluate only the first 5 questions

# Questions and reference answers were pre-generated into eval_dataset (with GPT-4)
eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
ref_response_strs = [r for (_, r) in qr_pairs]  # keep just the reference answers

# Baseline pipeline: plain vector retrieval, top-2 nodes
base_query_engine = vector_indices[-1].as_query_engine(similarity_top_k=2)

# Candidate pipeline: ensemble retriever followed by a reranker
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=[reranker])

# Each query engine answers the same eval questions fresh (retrieval + generation)
base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)

pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]

# Run these three evaluators over every (query, generated response, reference answer) triple
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=1, show_progress=True)

# Score the ensemble pipeline's answers against the reference answers
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

# Score the base pipeline's answers the same way
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

# Aggregate the per-question scores and show the two pipelines side by side
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Ensemble Retriever", "Base Retriever"],
    ["correctness", "faithfulness", "semantic_similarity"],
)
display(results_df)
I don't really understand it. Please explain in detail what logic it follows, with comments.
Also, this is a response comparison. In what flow is the response comparison used? If you are comparing responses, why are you evaluating a base retriever and an ensemble retriever? I don't really understand.
I'm not really sure what you mean? This code runs each evaluator and then displays the results.

Correctness and faithfulness both evaluate the response.
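
For reference, here is roughly what a single evaluation step looks like for one question (a minimal sketch assuming the standard `aevaluate` interface of the evaluators; `eval_qs`, `pred_responses`, `ref_response_strs`, and the evaluator objects come from the snippet above):

Python
# One question, one generated response, one reference answer.
question = eval_qs[0]
generated = pred_responses[0]        # Response object produced by the query engine
reference = ref_response_strs[0]     # reference answer from the pre-built eval dataset

# Correctness: the judge LLM compares the generated answer to the reference answer.
correctness = await evaluator_c.aevaluate(
    query=question, response=str(generated), reference=reference
)

# Faithfulness: the judge LLM checks the generated answer against the retrieved context.
faithfulness = await evaluator_f.aevaluate(
    query=question,
    response=str(generated),
    contexts=[n.node.get_content() for n in generated.source_nodes],
)

# Semantic similarity: embedding similarity between generated and reference answers.
similarity = await evaluator_s.aevaluate(response=str(generated), reference=reference)

print(correctness.score, faithfulness.passing, similarity.score)

BatchEvalRunner simply repeats this for every (question, response, reference) triple and collects the scores.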
In what flow is the evaluation carried out?
I created an eval dataset using GPT-4 and am curious about how it is used for the evaluation.

The questions and answers have already been created with the eval LLM. What flow is used to compare them? Does the retriever generate answers to the questions again, or something? I really don't understand, please explain.
Yeah, I know that. But what flow is used to request an evaluation from the LLM? And what is the evaluation flow? There is a dataset that has already been created. Through what flow do the base retriever and ensemble retriever evaluate it?
omg, bro, I understand now. Thanks, and sorry, I'm Korean, so my English is so bad 😢
No worries, your English is great!
Oh, I have more questions. I understand how the evaluation proceeds, but how do the retrievers evaluate an already-created dataset?

This is a comparison of responses. How do the retrievers compare responses?

And why isn't this a retriever comparison?
That is what I'm really curious about.
I'm not sure I know what you mean 👀 The code you posted evaluates the entire RAG pipeline as an end-to-end system; retrieval is one step in that system.
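
In other words, the pre-created dataset only supplies the questions and the reference answers; the retrievers do not evaluate anything themselves. Conceptually, the flow is roughly the following (a simplified sketch of what `get_responses` plus `BatchEvalRunner` do, not the real batched, asynchronous implementation; names come from the snippet above):

Python
# Conceptual flow: each pipeline re-answers the pre-generated questions,
# and the evaluators score those fresh answers against the stored references.
for engine_name, engine in [("Base Retriever", base_query_engine),
                            ("Ensemble Retriever", query_engine)]:
    for question, reference in zip(eval_qs[:max_samples], ref_response_strs[:max_samples]):
        response = engine.query(question)  # retrieve nodes and generate a brand-new answer
        # The evaluators then compare this new answer with the reference answer
        # (correctness, semantic similarity) and with the retrieved context
        # (faithfulness), and the per-pipeline scores are averaged.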
So what I'm curious about is how the retrievers evaluate this dataset.