I am evaluating various embedding models

I am evaluating various embedding models. What's the best way to modify this guide and change embedding models only while keeping all other variables constant? https://docs.llamaindex.ai/en/stable/examples/evaluation/retrieval/retriever_eval/
thank you
if you wanna set the embed model globally, then:

Plain Text
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

embed_model = OpenAIEmbedding()

Settings.embed_model = embed_model


And if you wanna compare retrievers using different embedding models, then pass the embed_model to the constructor of VectorStoreIndex

Plain Text
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
index = VectorStoreIndex(nodes, embed_model=embed_model)
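If the goal is to compare embedding models while keeping everything else constant, one option (just a sketch; the model names below are placeholders, and the HuggingFace one assumes the llama-index-embeddings-huggingface package is installed) is to build one index per model from the same nodes, so chunking stays identical:

Plain Text
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# reuse the same `nodes` for every index so chunking stays constant
embed_models = {
    "openai-small": OpenAIEmbedding(model="text-embedding-3-small"),
    "bge-small": HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
}

retrievers = {}
for name, embed_model in embed_models.items():
    index = VectorStoreIndex(nodes, embed_model=embed_model)
    retrievers[name] = index.as_retriever(similarity_top_k=2)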
Thank you. I've been consulting MTEB a lot. FWIW, some of the models that appear to be working quite well on legal data are not that highly ranked, but I want to evaluate based on objective metrics.
Understood, and ... interesting. I think some folks might be gaming the leaderboard rankings; I've seen people complaining about that.
I dunno, I've played with many embedding models; some might be better on one type of data but worse on another. Legal and medical might be a bit tricky.
It could be that the benchmarks MTEB uses are very general.
True, that's why it's good to run the evaluations on a dataset relevant to your use case, instead of relying on benchmarks only 👍
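As a sketch of what that can look like with the linked guide's evaluator (assuming `qa_dataset` was generated from your own nodes and `retrievers` is a dict like the one above):

Plain Text
from llama_index.core.evaluation import RetrieverEvaluator

# hit_rate and mrr are the metrics used in the retriever_eval guide
for name, retriever in retrievers.items():
    evaluator = RetrieverEvaluator.from_metric_names(
        ["hit_rate", "mrr"], retriever=retriever
    )
    # aevaluate_dataset is async, so run this in an async context (e.g. a notebook)
    eval_results = await evaluator.aevaluate_dataset(qa_dataset)
    print(name, eval_results[0].metric_vals_dict)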
I wonder if anyone has done research on which of these variables contributes the most to retrieval quality: embedding model, chunking, or indexing.
Thank you. Very useful.
'llm' is used only for generating the QA dataset, correct? It is not used in any evaluation step, right?
Can I modify the instructions for generate_question_context_pairs? Some of the questions, if I use the default settings, are not formulated as questions.
yes, you're right, the llm passed to the generate_question_context_pairs function is used to generate synthetic questions for all of the nodes.

And you can update the instructions prompt too.
Plain Text
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2,
    qa_generate_prompt_tmpl=custom_instructions
)

The custom prompt has to contain two template variables, {context_str} and {num_questions_per_chunk}.

This is the default one:
Plain Text
DEFAULT_QA_GENERATE_PROMPT_TMPL = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
{num_questions_per_chunk} questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."
"""
Thank you @Rohan. One thing you may want to change in the generate_qa_embeddings_pairs function is the default "qa_finetune_dataset.json" file. It took me a while to figure out why I was receiving Lyft/Uber questions even though I had created new eval and train nodes from my own documents.
@Rohan finetuning some of the SentenceTransformer embed models, even on a limited dataset, brings about a 15-20% improvement in hit rate and MRR [I used LlamaIndex libraries for finetuning and evaluating], and it takes very little time. Which makes me wonder: why not fine-tune for each large embedding job using just a sample from that dataset? Of course you'd need to use the same fine-tuned model for queries, but that's feasible too. What do you think?
I see, thanks for pointing this out
True, LlamaIndex offers two methods for finetuning embeddings: fine-tune the model itself, or finetune an adapter (if you don't wanna re-embed the docs). If you use the first method, then yeah, you'll have to embed all nodes with the fine-tuned model and embed the query with the same model. The adapter finetuning method is more flexible, but definitely at a cost; that is, the first finetuning method yields better results.

Resources:
https://clusteredbytes.pages.dev/posts/2023/llamaindex-embedding-finetuning/
https://clusteredbytes.pages.dev/posts/2023/llamaindex-adapter-finetuning/
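A minimal sketch of the first method (assuming `train_dataset` and `val_dataset` are EmbeddingQAFinetuneDataset objects built from your own nodes; the base model id and output path below are placeholders):

Plain Text
from llama_index.finetuning import SentenceTransformersFinetuneEngine

# fine-tune a local sentence-transformers model on your own QA pairs
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en-v1.5",      # placeholder base model
    model_output_path="finetuned_model",    # placeholder output dir
    val_dataset=val_dataset,
)
finetune_engine.finetune()

# the same fine-tuned model must then embed both the nodes and the queries
embed_model = finetune_engine.get_finetuned_model()

The adapter route in the second link uses EmbeddingAdapterFinetuneEngine instead, which trains a small transform on top of a frozen base model so the existing document embeddings don't need to be recomputed.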
Thanks. How much of a role do the prompt and the format of the Q/A pairs play? Should they ideally be aligned with how users will most likely be querying the dataset semantically?
Is it possible / does it make sense to also add an option to generate hard-negative Q/A pairs for the training set?