Can someone please explain the evaluation code logic?

At a glance

The post asks for an explanation of the evaluation code logic. The comments indicate that the code is evaluating a retrieval system, comparing the performance of a base retriever and an ensemble retriever. The evaluation involves using various evaluators (correctness, faithfulness, semantic similarity) to assess the responses generated by the retrievers. However, the community members express confusion about the specific flow of the evaluation process, how the retrievers are used to evaluate the pre-created dataset, and the overall purpose of the comparison. There is no explicitly marked answer, but the community members suggest looking at the source code for the evaluators to better understand the logic.
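
At a high level, the evaluation discussed in the thread below follows this flow (a simplified sketch only; `query_engine` stands for either retrieval pipeline, and `evaluators` is a dict of hypothetical scoring callables standing in for the llama_index evaluator objects used in the actual code):

Python
def evaluate_pipeline(query_engine, questions, reference_answers, evaluators):
    """Answer each pre-generated question with the given query engine, then have
    each evaluator score the generated answer against its reference answer."""
    scores = {name: [] for name in evaluators}
    for question, reference in zip(questions, reference_answers):
        generated = query_engine.query(question)  # fresh retrieval + generation
        for name, score_fn in evaluators.items():
            scores[name].append(score_fn(question, str(generated), reference))
    # Average each metric over all questions
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}

# The same procedure is run twice, once per pipeline (base retriever and
# ensemble retriever), and the average scores are compared side by side.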

13 comments
Python
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)

from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.core.query_engine import RetrieverQueryEngine

import numpy as np

# LLM-as-judge evaluators; eval_llm is the LLM used as the judge
evaluator_c = CorrectnessEvaluator(llm=eval_llm)   # generated answer vs. reference answer
evaluator_s = SemanticSimilarityEvaluator()        # embedding similarity to the reference answer
evaluator_r = RelevancyEvaluator(llm=eval_llm)     # is the answer relevant to the query?
evaluator_f = FaithfulnessEvaluator(llm=eval_llm)  # is the answer grounded in the retrieved context?

pairwise_evaluator = PairwiseComparisonEvaluator(llm=eval_llm)  # head-to-head comparison of two answers


max_samples = 5  # evaluate only the first 5 questions

# Questions and reference answers were pre-generated into eval_dataset (with GPT-4)
eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
ref_response_strs = [r for (_, r) in qr_pairs]  # keep just the reference answers

# Baseline pipeline: plain vector retrieval, top-2 nodes
base_query_engine = vector_indices[-1].as_query_engine(similarity_top_k=2)

# Candidate pipeline: ensemble retriever followed by a reranker
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=[reranker])

# Each query engine answers the same eval questions fresh (retrieval + generation)
base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)

pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]

# Run these three evaluators over every (query, generated response, reference answer) triple
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=1, show_progress=True)

# Score the ensemble pipeline's answers against the reference answers
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

# Score the base pipeline's answers the same way
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

# Aggregate the per-question scores and show the two pipelines side by side
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Ensemble Retriever", "Base Retriever"],
    ["correctness", "faithfulness", "semantic_similarity"],
)
display(results_df)
I don't really understand it. Please explain in detail what logic it follows, with comments.
Also, this is a response comparison. In what flow is the response comparison used? If you are comparing responses, why are you evaluating a base retriever and an ensemble retriever? I don't really understand.
I'm not really sure what you mean? This code runs each evaluator and then displays the results.

Correctness and faithfulness both evaluate the response.
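
For reference, here is roughly what a single evaluation step looks like for one question (a minimal sketch assuming the standard `aevaluate` interface of the evaluators; `eval_qs`, `pred_responses`, `ref_response_strs`, and the evaluator objects come from the snippet above):

Python
# One question, one generated response, one reference answer.
question = eval_qs[0]
generated = pred_responses[0]        # Response object produced by the query engine
reference = ref_response_strs[0]     # reference answer from the pre-built eval dataset

# Correctness: the judge LLM compares the generated answer to the reference answer.
correctness = await evaluator_c.aevaluate(
    query=question, response=str(generated), reference=reference
)

# Faithfulness: the judge LLM checks the generated answer against the retrieved context.
faithfulness = await evaluator_f.aevaluate(
    query=question,
    response=str(generated),
    contexts=[n.node.get_content() for n in generated.source_nodes],
)

# Semantic similarity: embedding similarity between generated and reference answers.
similarity = await evaluator_s.aevaluate(response=str(generated), reference=reference)

print(correctness.score, faithfulness.passing, similarity.score)

BatchEvalRunner simply repeats this for every (question, response, reference) triple and collects the scores.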
In what flow is the evaluation carried out?
I created an eval dataset using GPT-4 and am curious about how it is used for the evaluation.

The questions and answers have already been created with the eval LLM. What flow is used to compare them? Does the retriever generate answers to the questions again, or something? I really don't understand, please explain.
Yeah, I know that. But what flow is used to request an evaluation from the LLM? And what is the evaluation flow? There is a dataset that has already been created. Through what flow do the base retriever and ensemble retriever evaluate it?
omg, bro, I understand now. Thanks, and sorry, I'm Korean, so my English is so bad 😢
No worries, your English is great!
Oh, I have more questions. I understand how the evaluation proceeds, but how do the retrievers evaluate an already-created dataset?

This is a comparison of responses. How do the retrievers compare responses?

And why isn't this a retriever comparison?
That is what I'm really curious about.
I'm not sure I know what you mean 👀 The code you posted evaluates the entire RAG pipeline as an end-to-end system; retrieval is one step in that system.
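
In other words, the pre-created dataset only supplies the questions and the reference answers; the retrievers do not evaluate anything themselves. Conceptually, the flow is roughly the following (a simplified sketch of what `get_responses` plus `BatchEvalRunner` do, not the real batched, asynchronous implementation; names come from the snippet above):

Python
# Conceptual flow: each pipeline re-answers the pre-generated questions,
# and the evaluators score those fresh answers against the stored references.
for engine_name, engine in [("Base Retriever", base_query_engine),
                            ("Ensemble Retriever", query_engine)]:
    for question, reference in zip(eval_qs[:max_samples], ref_response_strs[:max_samples]):
        response = engine.query(question)  # retrieve nodes and generate a brand-new answer
        # The evaluators then compare this new answer with the reference answer
        # (correctness, semantic similarity) and with the retrieved context
        # (faithfulness), and the per-pipeline scores are averaged.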
So what I'm curious about is how the retrievers evaluate this dataset.