Can someone please explain the evaluation code logic?
Plain Text
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)

from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner

# Query engine that wraps a custom retriever (used for the ensemble pipeline below)
from llama_index.core.query_engine import RetrieverQueryEngine

import numpy as np

# Each evaluator grades a different aspect of a generated answer;
# `eval_llm` (defined earlier) is the LLM acting as the judge.
evaluator_c = CorrectnessEvaluator(llm=eval_llm)  # is the answer correct vs. the reference answer?
evaluator_s = SemanticSimilarityEvaluator()  # embedding similarity between answer and reference
evaluator_r = RelevancyEvaluator(llm=eval_llm)  # does the answer address the question?
evaluator_f = FaithfulnessEvaluator(llm=eval_llm)  # is the answer supported by the retrieved context?

# Compares two answers to the same question head-to-head (defined here but not used below)
pairwise_evaluator = PairwiseComparisonEvaluator(llm=eval_llm)


# Evaluate only the first few questions to keep the run cheap
max_samples = 5

# The eval dataset (generated earlier with GPT-4) supplies the questions and the
# reference ("gold") answers that the generated answers are graded against
eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
ref_response_strs = [r for (_, r) in qr_pairs]  # reference answer for each question

# Baseline pipeline: plain vector retrieval, top-2 chunks
base_query_engine = vector_indices[-1].as_query_engine(similarity_top_k=2)

# Candidate pipeline: ensemble retriever plus a reranker postprocessor
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=[reranker])

# Re-run every eval question through each pipeline so that each one generates a
# fresh answer; the dataset answers are not reused here, they only serve as the
# reference to grade against
base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)

pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)

# Plain-string versions of the generated answers
pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]

# Run the chosen evaluators over every (question, generated answer) pair
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=1, show_progress=True)

# Grade the ensemble pipeline's generated answers against the reference answers
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

# Grade the baseline pipeline's generated answers the same way
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

# Summarize the mean score per metric for each pipeline side by side
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Ensemble Retriever", "Base Retriever"],
    ["correctness", "faithfulness", "semantic_similarity"],
)
display(results_df)
I don't really understand it. Please explain in detail what logic it follows, with comments.
Also, this is a response comparison. In what flow is the response comparison used? If you are comparing responses, why are you evaluating a base retriever and an ensemble retriever? I don't really understand.
I'm not really sure what you mean? This code runs each evaluator and then displays the results

correctness and faithfulness both evaluate the response
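
For example, this is roughly what a single evaluator call looks like, using the objects from your snippet (a minimal sketch, not the actual BatchEvalRunner internals, and exact arguments may differ slightly by llama_index version). Correctness grades the generated answer against the reference answer from your dataset, while faithfulness checks whether the answer is supported by the retrieved context.
Plain Text
# Minimal sketch: grading ONE generated response
response = query_engine.query(eval_qs[0])  # the pipeline generates a fresh answer

correctness_result = evaluator_c.evaluate(
    query=eval_qs[0],
    response=str(response),
    reference=ref_response_strs[0],  # gold answer from the GPT-4 dataset
)
faithfulness_result = evaluator_f.evaluate_response(
    query=eval_qs[0],
    response=response,  # checked against the response's retrieved source nodes
)

print(correctness_result.score, faithfulness_result.passing)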
In what flow is the evaluation carried out?
I created an eval dataset using GPT-4 and am curious how it is used for the evaluation.

The questions and answers have already been created with the eval LLM. What flow is used to compare them? Does the retriever generate answers to the questions again? Or something else? I really don't understand, please explain.
Yeah, I know that. But in what flow is the evaluation requested from the LLM? And what is the evaluation flow? There is a dataset that has already been created. Through what flow do the base retriever and the ensemble retriever evaluate it?
omg, bro, I understand now. Thanks, and sorry, I'm Korean, so my English is bad 😢
No worries, your English is great!
Oh, I have more questions. I understand how the evaluation proceeds. But how do the retrievers evaluate the already-created dataset?

This is a comparison of responses. How do the retrievers compare responses?

And why isn't this a retriever comparison?
That's what I'm curious about, I'm very curious about this.
I'm not sure I know what you mean 👀 The code you posted above evaluates the entire RAG pipeline as an end-to-end system; retrieval is just one step in that system
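
To lay the flow out explicitly: for each question in your GPT-4 dataset, a query engine (retriever + reranker + LLM synthesizer) generates a new answer, and the evaluators then grade that answer against the reference answer from the dataset (or against the retrieved context, for faithfulness). The retrievers never evaluate anything themselves; they only change what context gets fetched, and the end-to-end scores show which pipeline produced better answers. Conceptually the code above does something like this (an illustrative sketch, not the real get_responses/BatchEvalRunner internals):
Plain Text
# Illustrative sketch of the end-to-end evaluation flow
for name, engine in [("Base Retriever", base_query_engine), ("Ensemble Retriever", query_engine)]:
    for question, reference in zip(eval_qs[:max_samples], ref_response_strs[:max_samples]):
        # Step 1: this pipeline retrieves context and synthesizes a NEW answer
        response = engine.query(question)

        # Step 2: each evaluator grades that generated answer
        for metric, evaluator in evaluator_dict.items():
            result = evaluator.evaluate_response(
                query=question, response=response, reference=reference
            )
            print(name, metric, result.score)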
So what I'm curious about is: how do the retrievers evaluate this dataset?