I have the following code:
# imports (paths may differ slightly depending on your llama_index version)
from llama_index.core import load_index_from_storage
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset, RetrieverEvaluator
from llama_index.llms.openai import OpenAI

print('loading index')
index = load_index_from_storage(storage_context)  # storage_context was created earlier from the persisted index
print('loading dataset')
qa_dataset = EmbeddingQAFinetuneDataset.from_json("./data/100qa_evaluation_dataset.json")
retriever = index.as_retriever(similarity_top_k=5)
llm = OpenAI(model="gpt-3.5-turbo")
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)
Testing retriever.retrieve(query) on its own returns the top 5 relevant documents for the query, but
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
returns zero for both metrics?
nodes = retriever.retrieve(sample_query)
-- and this matches sample_expected for sure?
It's checking node_id, not the actual text contents
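Conceptually, the hit_rate / MRR computation boils down to something like this (a rough sketch of the idea, not the actual library code):
retrieved_ids = [n.node_id for n in retriever.retrieve(sample_query)]

# hit_rate: did any expected node ID show up in the retrieved IDs?
hit_rate = 1.0 if any(doc_id in retrieved_ids for doc_id in sample_expected) else 0.0

# mrr: reciprocal rank of the first retrieved ID that is in the expected set
mrr = 0.0
for rank, node_id in enumerate(retrieved_ids, start=1):
    if node_id in sample_expected:
        mrr = 1.0 / rank
        break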
I would say the retriever works ok:
nodes = retriever.retrieve(sample_query)
>>> len(nodes)
5
sample_expected
['0e576fbe-0a8f-45b3-a030-15e252ad30c4']
and eyeballing the data, the query and the returned node_ids (along with the node contents) look correct
print([x.node_id for x in nodes])
Do any of those match sample_expected?
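e.g. a quick set intersection (an empty set means no overlap):
print(set(x.node_id for x in nodes) & set(sample_expected))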
Turns out that no, none of them match the expected sample. I honestly don't understand
I've used a 4096 chunk size to create the QA pairs (a larger chunk size to get more meaning into the questions)
I've used a sentence-window index for the query engine
Cool! So at least the metrics are working as intended
I thiiiink I know what's going on here?
Did you save the dataset to disk, but not the actual index used to create the dataset?
If you load the same data again and re-create the index (rather than loading the saved index), the node IDs will be regenerated and won't match the IDs in the dataset
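i.e. the IDs only stay stable if you persist the index once and then load it back, something like this (just a sketch; import paths depend on your llama_index version, and "./storage" is a placeholder):
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage

# build once and persist -- the node IDs are now fixed on disk
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")

# later: load the persisted index instead of rebuilding it from the documents
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)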
Ah, ok, I think I understand, and it could very well be what you just said (I can't remember the exact steps I took). Does that mean the dataset I generated is basically useless now, at least for the retrieval evaluation?
So what would be the best practice?
- generate the index + store it to disk
- generate the dataset from that same index + save the dataset to JSON
- perform the eval with the dataset from step 2?
yea that's the recommended flow -- I realize this isn't exactly spelled out in the docs though, I've made a note to improve this
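Roughly, the end-to-end flow looks like this (a sketch assuming the generate_question_context_pairs helper; the chunk size, paths, and import locations here are placeholders / version-dependent):
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.core.evaluation import (
    EmbeddingQAFinetuneDataset,
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# documents = your already-loaded Document objects (e.g. from SimpleDirectoryReader)

# 1. parse nodes, build the index, and persist it to disk
nodes = SentenceSplitter(chunk_size=2048).get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
index.storage_context.persist(persist_dir="./storage")

# 2. generate the QA dataset from the *same* nodes and save it to JSON
llm = OpenAI(model="gpt-3.5-turbo")
qa_dataset = generate_question_context_pairs(nodes, llm=llm, num_questions_per_chunk=1)
qa_dataset.save_json("./data/qa_dataset.json")

# 3. later: load the persisted index and the saved dataset, then evaluate
index = load_index_from_storage(StorageContext.from_defaults(persist_dir="./storage"))
qa_dataset = EmbeddingQAFinetuneDataset.from_json("./data/qa_dataset.json")
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=index.as_retriever(similarity_top_k=5)
)
for query_id, query in qa_dataset.queries.items():
    print(retriever_evaluator.evaluate(query, qa_dataset.relevant_docs[query_id]))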
So does that mean one dataset is only good for a specific index?
For example, my goal is to evaluate multiple retrieval systems with/without reranking.
If I generate my dataset based on a 2048 chunk size, then the dataset will never work for sentence-window retrievers, right? Because 1000 documents that become 2000 nodes at a 2048 chunk size will become 30k nodes with sentence windows.
I guess I can still use the questions for the answer-generation part, but the retrieval evaluation specifically must be done with individually generated datasets?
Hmm, let me see if I can get @nerdai to jump in on this -- he's much more of an expert on the eval code compared to me
Hey @JAX, just reading up on this thread now. Yeah, your thinking is correct with the current implementation of EmbeddingQAFinetuneDataset. The dataset is strongly coupled to the index. So if your index changes for whatever reason, you'll need to create a new EmbeddingQAFinetuneDataset to properly evaluate a retriever. If you're trying to evaluate multiple retrieval systems, you can reuse the same dataset from step 2 as long as step 1 doesn't change.
An alternative approach would be to consider the raw text content of the nodes rather than the IDs assigned at index creation time. If you compare raw text, it may not matter how you construct your index: you can, for example, just compute a text similarity between the expected context text and the retrieved context text. Depending on your use case this may or may not be suitable, but I'm throwing it out there as another option to consider.
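Something like this, as a very rough sketch (the similarity measure and the 0.8 threshold are arbitrary choices here; for sentence-window retrievers a containment check probably makes more sense than full-string similarity):
from difflib import SequenceMatcher

def text_similarity(a: str, b: str) -> float:
    # simple character-level similarity; swap in embedding cosine similarity if you prefer
    return SequenceMatcher(None, a, b).ratio()

expected_texts = [qa_dataset.corpus[doc_id] for doc_id in sample_expected]
retrieved_texts = [n.node.get_content() for n in retriever.retrieve(sample_query)]

# count a "hit" if any retrieved chunk is contained in, or closely matches, an expected chunk
hit = any(
    r.strip() in e or text_similarity(r, e) > 0.8
    for r in retrieved_texts
    for e in expected_texts
)
print(hit)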
Understood, @nerdai!
Thank you as always, @Logan M and @nerdai, for the help