Hi
Is it possible to do the following?
First, for all the chunks in the document, call the embedding model, calculate the embeddings, and store them in a dict / list.
Second, for each query, calculate its embedding.
Third, for each query, do cosine similarity with all the embeddings of the document.
And while doing all of that, can we also measure the embedding time for each query and the embedding time for the whole dataset?
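A minimal sketch of that flow, assuming a LlamaIndex embedding model (the `HuggingFaceEmbedding` model name, the chunks, and the queries are placeholders, swap in whatever you actually use):

```python
import time
import numpy as np
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Placeholder model; any LlamaIndex BaseEmbedding should expose the same methods.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

chunks = ["chunk one ...", "chunk two ..."]    # placeholder document chunks
queries = ["what does chunk one say?"]         # placeholder queries

# 1. Embed every chunk once and time the whole pass over the dataset.
t0 = time.perf_counter()
chunk_embs = np.array(embed_model.get_text_embedding_batch(chunks))
dataset_time = time.perf_counter() - t0
print(f"dataset embedding time: {dataset_time:.4f}s")

# 2 + 3. Embed each query, time it, and score it against every chunk.
chunk_norms = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
for q in queries:
    t0 = time.perf_counter()
    q_emb = np.array(embed_model.get_query_embedding(q))
    query_time = time.perf_counter() - t0

    scores = chunk_norms @ (q_emb / np.linalg.norm(q_emb))  # cosine similarity
    top_k = np.argsort(scores)[::-1][:3]
    print(f"{q!r}: embed {query_time:.4f}s, top chunks {top_k.tolist()}")
```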
  1. This happens in the first stage, when you pass in your data to create the index.
  2. This happens when you ask a query with the help of query_engine or chat_engine.
  3. Cosine similarity is calculated on all docs, but only the top_k is picked (a rough sketch of steps 1-3 is below).
  4. You can use the Instrumentation module from LlamaIndex to get verbose detail on everything.
https://docs.llamaindex.ai/en/stable/examples/instrumentation/instrumentation_observability_rundown/
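For points 1-3, a rough sketch in LlamaIndex terms (the `./data` folder, the question, and the top_k value are placeholders):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Load and chunk the documents; chunk embeddings are computed when the index is built.
documents = SimpleDirectoryReader("./data").load_data()   # placeholder data folder
index = VectorStoreIndex.from_documents(documents)

# 2 + 3. Each query is embedded, scored against all stored chunks,
#        and only the top_k most similar chunks are used for the answer.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does the document say about X?")
print(response)
```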
So I will be able to calculate the time it takes to compute the embeddings for each module?
@WhiteFang_Jr
I am sorry, how will the Instrumentation module help in calculating the time for embedding a query or a whole dataset?
Yeah, check the link shared above; it has an embedding start event and end event.
The Instrumentation module captures all the events that happen inside the LlamaIndex universe.

Since embedding starts and ends in this universe, you can get those details here too.
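A minimal sketch of such an event handler, along the lines of that notebook. The wall-clock timing via module-level lists is an assumption of this sketch (start/end events from concurrent embedding calls may interleave, so treat the numbers as approximate):

```python
import time
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation.events.embedding import (
    EmbeddingStartEvent,
    EmbeddingEndEvent,
)

timings = []    # completed (end - start) durations in seconds
_pending = []   # start times waiting for their matching end event

class EmbeddingTimer(BaseEventHandler):
    """Records wall-clock time between embedding start and end events."""

    @classmethod
    def class_name(cls) -> str:
        return "EmbeddingTimer"

    def handle(self, event) -> None:
        if isinstance(event, EmbeddingStartEvent):
            _pending.append(time.perf_counter())
        elif isinstance(event, EmbeddingEndEvent) and _pending:
            timings.append(time.perf_counter() - _pending.pop())

# Attach to the root dispatcher so every embedding call is captured.
get_dispatcher().add_event_handler(EmbeddingTimer())

# ... build your index / run your queries here ...
# print(f"embedding calls: {len(timings)}, total: {sum(timings):.4f}s")
```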
Thanks a lot, man
I also wanted to know if there is a way we can support all the datasets that MTEB retrieval has?
@WhiteFang_Jr
I want to run evaluation over all the datasets supported by MTEB via LlamaIndex.
You can check out the evaluation section: https://docs.llamaindex.ai/en/stable/module_guides/evaluating/

There are lots of modules under it for different kinds of evaluations.
@WhiteFang_Jr I had another doubt
Assume I have a text file which I want to use as the dataset
I want to create chunks of each and every line
Like for each sentence there is one chunk
Is it a good strategy to do this?
1 line per chunk
It'll depend on what you are evaluating.
Are you evaluating the response or the retrieved context?
retrieved context
If you are evaluating the retrieved context, then you'd want to match it against the ground truth in your dataset for a positive scenario.

So if your ground truth is covered in a single line, then it's fine; otherwise the quality of the evaluation will drop.
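If you do go with one chunk per line, a minimal sketch of building one node per line (the file path and top_k are placeholders):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

# One TextNode per non-empty line of the text file (placeholder path).
with open("dataset.txt", encoding="utf-8") as f:
    nodes = [TextNode(text=line.strip()) for line in f if line.strip()]

index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=3)
```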
Is there any embedding benchmarking tool which uses LlamaIndex libraries and gives out metrics like hit rate and MRR?
@WhiteFang_Jr
Did you check the retrieval example in this link?
It does cover the metrics you have mentioned!
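For reference, a hedged sketch of what that retrieval evaluation looks like with `RetrieverEvaluator`; the `index`, `nodes`, and `llm` objects are assumed to exist already, and the docs page above has the full example:

```python
import asyncio
from llama_index.core.evaluation import (
    RetrieverEvaluator,
    generate_question_context_pairs,
)

# `nodes` is the node list the index was built from and `llm` is any
# LlamaIndex LLM; both are assumptions of this sketch.
qa_dataset = generate_question_context_pairs(nodes, llm=llm, num_questions_per_chunk=1)

retriever = index.as_retriever(similarity_top_k=5)
evaluator = RetrieverEvaluator.from_metric_names(
    ["hit_rate", "mrr"], retriever=retriever
)

# In a notebook, `await evaluator.aevaluate_dataset(qa_dataset)` directly.
eval_results = asyncio.run(evaluator.aevaluate_dataset(qa_dataset))
for r in eval_results[:3]:
    print(r.metric_vals_dict)
```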