
Updated 4 months ago

hello!

I've been trying to use the retriever evaluation but I can't seem to get it working.
No matter what I run, it always returns this:
Metrics: {'mrr': 0.0, 'hit_rate': 0.0}
I have the following code:

print('loading index')
index = load_index_from_storage(storage_context)
print('loading dataset')
qa_dataset = EmbeddingQAFinetuneDataset.from_json("./data/100qa_evaluation_dataset.json")
retriever = index.as_retriever(similarity_top_k=5)
llm = OpenAI(model="gpt-3.5-turbo")
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)
Testing retriever.retrieve(query) on its own returns the top 5 relevant documents for the query.
any idea @Logan M ?
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected) returns zero for both?
nodes = retriever.retrieve(sample_query) -- and this matches sample_expected for sure?
It's checking node_id, not the actual text contents
I would say the retriever works ok:

>>> nodes = retriever.retrieve(sample_query)
>>> len(nodes)
5
>>> sample_expected
['0e576fbe-0a8f-45b3-a030-15e252ad30c4']
and looking at the data by eye, the returned nodes (node_id along with the node information) are the right ones for the query
print([x.node_id for x in nodes])
Do any of those match sample_expected ?
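For anyone following along, here is a minimal sketch of what ID-based hit rate and MRR boil down to for a single query. It is plain Python, not the library's evaluator code, and the function names and the short retrieved IDs are made up for illustration. It only shows why zero overlap in IDs gives 0.0 for both metrics, even when the retrieved text looks right.

```python
def hit_rate(retrieved_ids, expected_ids):
    """Return 1.0 if any expected ID is among the retrieved IDs, else 0.0."""
    expected = set(expected_ids)
    return 1.0 if any(rid in expected for rid in retrieved_ids) else 0.0


def mrr(retrieved_ids, expected_ids):
    """Return the reciprocal rank of the first retrieved ID that is expected."""
    expected = set(expected_ids)
    for rank, rid in enumerate(retrieved_ids, start=1):
        if rid in expected:
            return 1.0 / rank
    return 0.0


expected = ["0e576fbe-0a8f-45b3-a030-15e252ad30c4"]
# Hypothetical IDs: the right text may be in here, but under different IDs.
retrieved = ["id-a", "id-b", "id-c", "id-d", "id-e"]

print(hit_rate(retrieved, expected), mrr(retrieved, expected))  # 0.0 0.0
```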
Turns out that no, none of them match the expected sample. I honestly don't understand 🤷‍♂️
I used a 4096 chunk size to create the QA pairs (a larger chunk size to get more meaning into the questions).
I used sentence window as the index for the query engine.
Cool! So at least the metrics are working as intended 😅

I thiiiink I know what's going on here?

Did you save the dataset to disk, but not the actual index used to create the dataset?

If you load the same data again and re-create the index (rather than loading the saved index), the node ids will be regenerated and not match the IDs in the dataset
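A small sketch of that effect, assuming node IDs are random UUIDs assigned when the nodes are created (the ID in the sample above has that shape). `build_nodes` is a hypothetical stand-in for index construction, not a library call.

```python
import uuid


def build_nodes(chunks):
    """Stand-in for building an index: every chunk gets a fresh random ID."""
    return {str(uuid.uuid4()): text for text in chunks}


chunks = ["first chunk of text", "second chunk of text"]
first_build = build_nodes(chunks)   # the index the dataset was generated from
second_build = build_nodes(chunks)  # the same data, re-indexed later

same_text = sorted(first_build.values()) == sorted(second_build.values())
shared_ids = set(first_build) & set(second_build)

# The text is identical, but no ID carries over between the two builds.
print(same_text, len(shared_ids))
```

A dataset that stores IDs from the first build has nothing to match against in the second, so every ID-based metric comes out as zero.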
Ah, OK, I think I understand, and it could very well be what you just said (I can't remember the exact steps I took). Does that mean the dataset I generated is basically useless now, at least for the retrieval evaluation?
so what would be the best practices?

  1. generate index + store to disk
  2. generate dataset from the same index + save dataset to json
  3. perform eval based on dataset from step 2?
yea that's the recommended flow -- I realize this isn't exactly spelled out in the docs though, I've made a note to improve this
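One cheap way to catch a mismatch before running the evaluation is to check that every expected ID in the dataset exists in the loaded index. This is a sketch with a hypothetical helper name. With the objects from this thread, `relevant_docs` would be `qa_dataset.relevant_docs`, and the node IDs would come from wherever your loaded index keeps its nodes (for example the keys of `index.docstore.docs`, if it uses a docstore; treat that as an assumption).

```python
def missing_expected_ids(relevant_docs, index_node_ids):
    """Map each query ID to the expected node IDs that are not in the index."""
    known = set(index_node_ids)
    missing = {}
    for query_id, expected_ids in relevant_docs.items():
        absent = [node_id for node_id in expected_ids if node_id not in known]
        if absent:
            missing[query_id] = absent
    return missing


# Toy data: "q2" expects a node that the index does not contain.
relevant_docs = {"q1": ["n1"], "q2": ["n9"]}
index_node_ids = ["n1", "n2", "n3"]

print(missing_expected_ids(relevant_docs, index_node_ids))  # {'q2': ['n9']}
```

If this returns anything, the dataset and the index were not built together, and the ID-based metrics will be misleading.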
so does that mean that one dataset is good enough only for a specific index?

for example my goal is to evaluate multiple retrieval systems with / without reranking
If I generate my dataset with a 2048 chunk size, then it will never work for sentence window retrievers, right? Because 1000 documents that become 2000 nodes with a 2048 chunk size will become 30k nodes with sentence window.
I mean, I guess I can use the questions for the answer generation part, but the retrieval evaluation specifically must be done with individually generated datasets?
Hmm, let me see if I can get @nerdai to jump in on this -- he's much more of an expert on the eval code compared to me 😅
Hey @JAX, just reading up on this thread now. Yeah, your thinking is correct with the current implementation of EmbeddingQAFinetuneDataset. This dataset is strongly coupled to the index. So, if your index changes for whatever reason, then you'll need to create a new EmbeddingQAFinetuneDataset to properly evaluate a retriever. So, if you're trying to evaluate multiple retrieval systems, you can use the same dataset from step 2 so long as the index from step 1 doesn't change 🙂

An alternative approach would be to consider the raw text content of the nodes rather than the IDs assigned to them at index creation time. If you consider the raw text, then it may not matter how you construct your index. You can, for example, just compute a text similarity between the expected context text and the retrieved context text. Though, depending on your use case, this may or may not be suitable. Just throwing it out there as another option to potentially consider.
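A rough sketch of that text-based option, using only the standard library. The similarity measure (token overlap divided by the smaller token set), the 0.8 threshold, and the function names are all assumptions chosen for illustration, not something the library provides. Dividing by the smaller set lets a short retrieved sentence match a much larger expected chunk, which is the sentence window versus large chunk situation described above.

```python
import re


def tokens(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))


def overlap(a, b):
    """Shared tokens divided by the size of the smaller token set."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def text_hit_and_mrr(retrieved_texts, expected_texts, threshold=0.8):
    """Hit rate and reciprocal rank for one query, matching on text, not IDs."""
    for rank, text in enumerate(retrieved_texts, start=1):
        if any(overlap(text, exp) >= threshold for exp in expected_texts):
            return 1.0, 1.0 / rank
    return 0.0, 0.0


expected = ["The retriever evaluator compares node IDs. It does not look at the text."]
retrieved = ["Unrelated sentence about chunk sizes.", "It does not look at the text."]

print(text_hit_and_mrr(retrieved, expected))  # (1.0, 0.5)
```

Token overlap is a loose measure: very short retrieved texts made of common words can match by accident, so the threshold (or the similarity function itself) would need tuning on real data.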
Understood, @nerdai!
thank you, as always, @Logan M and @nerdai for the help