
Updated last year

Embedding benchmark

At a glance

The community members discuss how to benchmark multiple embedding models on their data. They suggest checking the embedding model leaderboard on Hugging Face, specifically the retrieval category, to find the best-performing models. However, the asker needs benchmarks for German, so the others recommend multilingual models such as multilingual-e5-large and suggest creating a custom benchmark dataset by rewriting the QA generation prompt in German. One community member shares a sample prompt and code for generating such a dataset. Overall, the thread provides guidance on evaluating embedding performance and recommends exploring different approaches to find the best model for the given language and use case.

Useful resources
how could I benchmark multiple embedding models on my data?
Seems like the embedding model is the most important part. If the retrieved context isn't precise, the LLM has to go through more data, which increases the time needed. And if the correct context isn't found, there obviously is no answer.
You can check the embedding model leaderboard to find the best working models here: https://huggingface.co/spaces/mteb/leaderboard


All the models are tested across different tasks.
what category would I have to check?
Clustering?
I guess retrieval 🤔 there's also a link to the paper there that explains the metrics in more detail.
sadly I need benchmarks for German :(
all the multilingual models work with German, I think
if you are looking for local models, look at multilingual-e5-large
it's like ada from OpenAI
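
A minimal sketch of loading multilingual-e5-large as a local embedding model, assuming a recent LlamaIndex version with the llama-index-embeddings-huggingface package installed:

Python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load multilingual-e5-large locally from Hugging Face (downloaded on first use).
embed_model = HuggingFaceEmbedding(model_name="intfloat/multilingual-e5-large")

# Quick sanity check: embed a German sentence.
vector = embed_model.get_text_embedding("Wie ist das Wetter heute in Berlin?")
print(len(vector))  # e5-large produces 1024-dimensional embeddings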
you could create your own benchmark dataset

We have some guides for measuring embedding performance here

https://docs.llamaindex.ai/en/stable/module_guides/evaluating/usage_pattern_retrieval.html
For you, you'd probably want to modify the QA generation prompt when creating a dataset.
Here is the default prompt. You probably want to rewrite it in German so it generates German data.

Python
from llama_index.core.evaluation import generate_question_context_pairs

# Default QA-generation prompt; rewrite it in German to get German questions.
my_prompt = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.
"""

# nodes and llm are assumed to be defined earlier (parsed document nodes and an LLM instance).
dataset = generate_question_context_pairs(
    nodes,
    llm=llm,
    qa_generate_prompt_tmpl=my_prompt,
    num_questions_per_chunk=2,
)
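
Once you have the dataset, you could loop over candidate embedding models and score each one with LlamaIndex's RetrieverEvaluator, roughly as in the retrieval evaluation guide linked above. A minimal sketch, assuming a recent LlamaIndex version and that nodes and dataset come from the snippet above; the model names in the candidates list are just examples:

Python
import asyncio

from llama_index.core import VectorStoreIndex
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Candidate models to compare, illustrative placeholders only.
candidates = [
    "intfloat/multilingual-e5-large",
    "sentence-transformers/sentence-t5-large",
]

for model_name in candidates:
    # Build an index over the same nodes with this embedding model.
    embed_model = HuggingFaceEmbedding(model_name=model_name)
    index = VectorStoreIndex(nodes, embed_model=embed_model)
    retriever = index.as_retriever(similarity_top_k=2)

    # Score the retriever against the generated question/context pairs.
    evaluator = RetrieverEvaluator.from_metric_names(
        ["mrr", "hit_rate"], retriever=retriever
    )
    results = asyncio.run(evaluator.aevaluate_dataset(dataset))

    # Average the per-query metrics for a simple comparison.
    hit_rate = sum(r.metric_vals_dict["hit_rate"] for r in results) / len(results)
    mrr = sum(r.metric_vals_dict["mrr"] for r in results) / len(results)
    print(f"{model_name}: hit_rate={hit_rate:.3f}, mrr={mrr:.3f}")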
yes they do, but I observed different results/quality.
currently I'm running sentence-transformers/sentence-t5-large
and got really solid results
thank you!
will try that
yeah, I think making your own benchmark will indeed be best for your language. Good luck!