Can someone please explain the evaluation code logic?
Plain Text
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)

from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.core.query_engine import RetrieverQueryEngine

import numpy as np

# eval_llm, eval_dataset, vector_indices, retriever, reranker are assumed
# to be defined earlier in the notebook
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
evaluator_s = SemanticSimilarityEvaluator()  # embedding-based, no judge LLM needed
evaluator_r = RelevancyEvaluator(llm=eval_llm)
evaluator_f = FaithfulnessEvaluator(llm=eval_llm)

pairwise_evaluator = PairwiseComparisonEvaluator(llm=eval_llm)


max_samples = 5  # only evaluate the first 5 questions

# Questions and reference ("ground-truth") answers from the eval dataset
eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
ref_response_strs = [r for (_, r) in qr_pairs]

# Baseline: plain vector-index query engine with top-2 similarity retrieval
base_query_engine = vector_indices[-1].as_query_engine(similarity_top_k=2)

# Candidate: the ensemble retriever wrapped in a query engine, plus a reranker
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=[reranker])

# Generate answers to the eval questions with each engine
base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)

pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]

evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "semantic_similarity": evaluator_s,
}
# Run every evaluator in the dict over each (query, response, reference) triple
batch_runner = BatchEvalRunner(evaluator_dict, workers=1, show_progress=True)

# Score the ensemble engine's responses against the reference answers
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

# Score the baseline engine's responses the same way, for comparison
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

# Tabulate mean scores for the two setups side by side
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Ensemble Retriever", "Base Retriever"],
    ["correctness", "faithfulness", "semantic_similarity"],
)
display(results_df)
I don't really understand it. Could you explain in detail what logic it follows, with comments?
Also, why does it go through GPT while generating the responses? I'm curious about this logic too. Please explain it in a little more detail.
Sorry, but please explain the logic to me in a bit more detail ๐Ÿฅน
The entire code, or a specific part you're interested in?
When this code is executed, communication with OpenAI occurs. May I know why that is happening?
Could you explain in more detail how this evaluation logic is used?

And is this a retriever evaluation comparison?

Plain Text
base_pred_responses = get_responses(
     eval_qs[:max_samples], base_query_engine, show_progress=True
)
pred_responses = get_responses(
     eval_qs[:max_samples], query_engine, show_progress=True
)
eval_results = await batch_runner.aevaluate_responses(
     queries=eval_qs[:max_samples],
     responses=pred_responses[:max_samples],
     reference=ref_response_strs[:max_samples],
)
base_eval_results = await batch_runner.aevaluate_responses(
     queries=eval_qs[:max_samples],
     responses=base_pred_responses[:max_samples],
     reference=ref_response_strs[:max_samples],
)
Please explain it in more detail, as if you were explaining it to a complete non-expert.
I'm a non-major, but I'm very interested in this field, and I would appreciate your help.
Yeah, don't worry. I'll try my best to explain.

So evaluators basically help us evaluate, right? (In our case, the evaluators are here to check whether the response we received at the end of index.as_query_engine().query() is actually what we wanted or not.)

Now there are two types of evaluators:
  • One computes semantic similarity between the response and the source nodes (the nodes the LLM used to generate the response), or between the response and a reference answer you provide.
https://docs.llamaindex.ai/en/stable/examples/evaluation/semantic_similarity_eval/#embedding-similarity-evaluator

  • For the second one, you basically provide an LLM, your generated response, and the source nodes (or a reference dataset), and ask the LLM to judge whether the generated response is correct or supported. (See the minimal sketch of both kinds right below.)
https://docs.llamaindex.ai/en/stable/examples/evaluation/faithfulness_eval/
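
To make the two kinds concrete, here's a minimal sketch (not from your notebook; it assumes an OpenAI API key is set, the llama-index-llms-openai package is installed, and the question/answer strings are made up):

Plain Text
from llama_index.core.evaluation import (
    SemanticSimilarityEvaluator,
    FaithfulnessEvaluator,
)
from llama_index.llms.openai import OpenAI

# 1) Embedding-based: embed the generated answer and a reference answer,
#    then score them by cosine similarity. No "judge" LLM is involved;
#    only the embedding model is called.
sim_evaluator = SemanticSimilarityEvaluator()
sim_result = sim_evaluator.evaluate(
    response="The capital of France is Paris.",
    reference="Paris is France's capital city.",
)
print(sim_result.score, sim_result.passing)

# 2) LLM-as-judge: send the answer plus the retrieved source text to an LLM
#    and ask whether the answer is actually supported by those sources.
judge_llm = OpenAI(model="gpt-4o-mini")  # hypothetical choice of judge model
faith_evaluator = FaithfulnessEvaluator(llm=judge_llm)
faith_result = faith_evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    contexts=["France's capital and largest city is Paris."],
)
print(faith_result.passing, faith_result.feedback)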



For the OpenAI communication that you mentioned: a few evaluators, like faithfulness, use the second approach I described (an LLM as the judge), so an interaction with OpenAI happens there. Check the code here: https://docs.llamaindex.ai/en/stable/examples/evaluation/faithfulness_eval/

This will give more clarity

The code that you posted is combining multiple evaluators with BatchEvalRunner and reporting the results together: https://docs.llamaindex.ai/en/stable/examples/evaluation/batch_eval/
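
If you want to peek at what the batch run returns in your own code, here's a rough way to summarize it (this assumes the eval_results dict produced by aevaluate_responses above; the keys match the names you used in evaluator_dict):

Plain Text
import numpy as np

# eval_results is a dict like {"correctness": [EvaluationResult, ...], ...}
for name, results in eval_results.items():
    scores = [r.score for r in results if r.score is not None]
    print(f"{name}: mean score = {np.mean(scores):.3f}")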
I hope this makes it clearer; if not, feel free to ask.
@WhiteFang_Jr I understand it to some extent, but it's still a little difficult for me to follow.

Sorry, but could you please add comments to the code I provided above? Showing the flow and that sort of thing.


Plain Text
base_pred_responses = get_responses(
     eval_qs[:max_samples], base_query_engine, show_progress=True
)
pred_responses = get_responses(
     eval_qs[:max_samples], query_engine, show_progress=True
)
eval_results = await batch_runner.aevaluate_responses(
     queries=eval_qs[:max_samples],
     responses=pred_responses[:max_samples],
     reference=ref_response_strs[:max_samples],
)
base_eval_results = await batch_runner.aevaluate_responses(
     queries=eval_qs[:max_samples],
     responses=base_pred_responses[:max_samples],
     reference=ref_response_strs[:max_samples],
)
Plain Text
# This collects responses from the baseline query engine for the set of
# questions passed in via `eval_qs[:max_samples]`
# (see the sketch after this block for what "collecting responses" means)
base_pred_responses = get_responses(
     eval_qs[:max_samples], base_query_engine, show_progress=True
)

# Same questions, but answered by the ensemble query engine
pred_responses = get_responses(
     eval_qs[:max_samples], query_engine, show_progress=True
)

# Both calls below are async; this one scores the ensemble engine's
# responses against the reference answers
eval_results = await batch_runner.aevaluate_responses(
     queries=eval_qs[:max_samples],
     responses=pred_responses[:max_samples],
     reference=ref_response_strs[:max_samples],
)
# ...and this one scores the baseline engine's responses the same way
base_eval_results = await batch_runner.aevaluate_responses(
     queries=eval_qs[:max_samples],
     responses=base_pred_responses[:max_samples],
     reference=ref_response_strs[:max_samples],
)
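
And to demystify get_responses: conceptually it is just a loop that asks the query engine each question and keeps the Response objects (the real helper also runs the queries asynchronously in batches and shows a progress bar). A rough stand-in, for illustration only:

Plain Text
# Conceptual stand-in for get_responses (illustration, not the library code)
def collect_responses(questions, engine):
    # ask the engine each question and keep the Response objects it returns
    return [engine.query(q) for q in questions]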


Also, I didn't find the above code in the documentation. If you want to follow the latest doc for batch eval: https://docs.llamaindex.ai/en/stable/examples/evaluation/batch_eval/

Follow that one.
I would suggest checking the latest batch eval code; it makes it easier to get an idea
of how this works.
So is the code I wrote response evaluation? Or something else, like retriever evaluation?
And this is a response comparison, right? In what flow is the response comparison used? If you are comparing responses, why are you evaluating a base retriever and an ensemble retriever? I don't really understand.