I am evaluating various embedding models

I am evaluating various embedding models. What's the best way to modify this guide and change embedding models only while keeping all other variables constant? https://docs.llamaindex.ai/en/stable/examples/evaluation/retrieval/retriever_eval/
thank you
if you wanna set the embed model globally, then:

Plain Text
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

embed_model = OpenAIEmbedding()

Settings.embed_model = embed_model


And if you wanna compare retrievers using different embedding models, then pass the embed_model to the constructor of VectorStoreIndex

Plain Text
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
index = VectorStoreIndex(nodes, embed_model=embed_model)
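If the goal is to compare embedding models while keeping everything else constant, one option (just a sketch; the model names below are placeholders, and the HuggingFace one assumes the llama-index-embeddings-huggingface package is installed) is to build one index per model from the same nodes, so chunking stays identical:

Plain Text
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# reuse the same `nodes` for every index so chunking stays constant
embed_models = {
    "openai-small": OpenAIEmbedding(model="text-embedding-3-small"),
    "bge-small": HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
}

retrievers = {}
for name, embed_model in embed_models.items():
    index = VectorStoreIndex(nodes, embed_model=embed_model)
    retrievers[name] = index.as_retriever(similarity_top_k=2)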
Thank you. I've been consulting MTEB a lot. FWIW, some of the models that appear to be working quite well on legal data are not that highly ranked, but I want to evaluate based on objective metrics.
Understood, and ... interesting. I think some folks might be gaming the leaderboard rankings; I've seen people complaining about that.
I dunno, I've played with many embedding models; some might be better on one type of data but worse on another. Legal and medical might be a bit tricky.
It could be that the benchmarks MTEB uses are very general.
True, that's why it's good to run the evaluations on a dataset relevant to your use case, instead of relying on benchmarks only 👍
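As a sketch of what that can look like with the linked guide's evaluator (assuming `qa_dataset` was generated from your own nodes and `retrievers` is a dict like the one above):

Plain Text
from llama_index.core.evaluation import RetrieverEvaluator

# hit_rate and mrr are the metrics used in the retriever_eval guide
for name, retriever in retrievers.items():
    evaluator = RetrieverEvaluator.from_metric_names(
        ["hit_rate", "mrr"], retriever=retriever
    )
    # aevaluate_dataset is async, so run this in an async context (e.g. a notebook)
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    print(name, eval_results[0].metric_vals_dict)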
I wonder if anyone has done research on which of these variables contributes the most to retrieval quality: embedding model, chunking, or indexing.
Thank you. Very useful.
'llm' is used only for generating the QA dataset, correct? It is not used in any evaluation step, right?
Can I modify the instructions for generate_question_context_pairs? Some of the questions, if I use the default settings, are not formulated as questions.
yes, you're right, the llm passed to the generate_question_context_pairs function is used to generate synthetic questions for all of the nodes.

And you can update the instructions prompt too.
Plain Text
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2,
    qa_generate_prompt_tmpl=custom_instructions
)

The custom prompt has to contain two template variables, {context_str} and {num_questions_per_chunk}.

This is the default one:
Plain Text
DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
"""
Thank you @Rohan. One thing you may want to change in the generate_qa_embeddings_pairs function is the default "qa_finetune_dataset.json" file. It took me a while to figure out why I was receiving Lyft/Uber questions even though I had created new eval and train nodes from my own documents.
@Rohan finetuning some of the SentenceTransformer embed models, even on a limited dataset, brings about a 15-20% improvement in hit rate and MRR [I used LlamaIndex libraries for finetuning and evaluating], and it takes very little time. Which makes me wonder: why not fine-tune for each large embedding job using just a sample from that dataset? Of course you'd need to use the same fine-tuned model for queries, but that's feasible too. What do you think?
I see, thanks for pointing this out
True, LlamaIndex offers two methods for finetuning embeddings: fine-tune the model itself, or finetune an adapter (if you don't wanna re-embed the docs). If you use the first method, then yeah, you'll have to embed all nodes with the fine-tuned model and embed the query with the same model. The adapter finetuning method is more flexible, but definitely at a cost; that is, the first finetuning method yields better results.

Resources:
https://clusteredbytes.pages.dev/posts/2023/llamaindex-embedding-finetuning/
https://clusteredbytes.pages.dev/posts/2023/llamaindex-adapter-finetuning/
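A minimal sketch of the first method (assuming `train_dataset` and `val_dataset` are EmbeddingQAFinetuneDataset objects built from your own nodes; the base model id and output path below are placeholders):

Plain Text
from llama_index.finetuning import SentenceTransformersFinetuneEngine

# fine-tune a local sentence-transformers model on your own QA pairs
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en-v1.5",      # placeholder base model
    model_output_path="finetuned_model",    # placeholder output dir
    val_dataset=val_dataset,
)
finetune_engine.finetune()

# the same fine-tuned model must then embed both the nodes and the queries
embed_model = finetune_engine.get_finetuned_model()

The adapter route in the second link uses EmbeddingAdapterFinetuneEngine instead, which trains a small transform on top of a frozen base model so the existing document embeddings don't need to be recomputed.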
Thanks. How much of a role do the prompt and the format of the Q/A pairs play? Should they ideally be aligned with how users will most likely be querying the dataset semantically?
Is it possible / does it make sense to also add an option to generate hard-negative Q/A pairs for the training set?