The post asks about the custom_embedding_strs field in the new QueryBundle feature. Community members explain that this field allows modifying the string(s) used for calculating the embedding of the query, which are separate from the string used for the final LLM output. The main use case is HyDE (hypothetical document embeddings). Community members also discuss how the embeddings are calculated and aggregated, and provide suggestions for reducing API token costs when using GPT Index for a legal assistant application.
Right now the main use case for this feature is to support HyDE (hypothetical document embeddings). You can take a look at this tweet thread for more explanation and examples: https://twitter.com/jerryjliu0/status/1626255140209717248
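For reference, here is a minimal sketch of passing a hypothetical answer through a QueryBundle. The import paths and the documents/index setup are assumptions based on the gpt_index package at the time; check them against your installed version.

```python
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader
from gpt_index.indices.query.schema import QueryBundle

# Build an index over local documents (assumed setup, for illustration only).
documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents)

query_str = "What did the author do growing up?"

# HyDE-style: first have an LLM hallucinate a plausible answer, then embed
# that answer instead of the raw question for retrieval.
hypothetical_answer = (
    "Growing up, the author spent most of their time writing short stories "
    "and programming on an early home computer."
)

query_bundle = QueryBundle(
    query_str=query_str,                          # used for the final LLM call
    custom_embedding_strs=[hypothetical_answer],  # used only for the embedding
)

response = index.query(query_bundle)
print(response)
```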
Currently the default logic is to embed each string separately and use the "mean" embedding when calculating similarity.
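Concretely, with multiple custom embedding strings the default aggregation works like this (a plain numpy illustration with toy vectors, not the library's internal code):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dim "embeddings" of two custom embedding strings.
emb_1 = np.array([0.1, 0.9, 0.0])
emb_2 = np.array([0.3, 0.5, 0.2])

# Default aggregation: element-wise mean of the per-string embeddings.
query_embedding = np.mean([emb_1, emb_2], axis=0)  # -> [0.2, 0.7, 0.1]

# Similarity against a stored node embedding.
node_embedding = np.array([0.2, 0.8, 0.1])
print(cosine_similarity(query_embedding, node_embedding))
```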
We support customizing the aggregation function from "mean" to something else, but that configuration is not exposed at the Index API level yet. It's possible to subclass BaseEmbedding to implement your desired behavior, though.
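A rough sketch of what such a subclass could look like, assuming the get_agg_embedding_from_queries hook and the OpenAIEmbedding import path from the gpt_index embeddings module; verify both against your installed version:

```python
from typing import Callable, List, Optional

import numpy as np

# Import path is an assumption based on the gpt_index layout at the time.
from gpt_index.embeddings.openai import OpenAIEmbedding


class MaxAggEmbedding(OpenAIEmbedding):
    """OpenAI embeddings, but aggregate multiple query embeddings with an
    element-wise max instead of the default mean."""

    def get_agg_embedding_from_queries(
        self,
        queries: List[str],
        agg_fn: Optional[Callable] = None,  # ignored: max is hard-coded here
    ) -> List[float]:
        embeddings = [self.get_query_embedding(q) for q in queries]
        return np.max(np.array(embeddings), axis=0).tolist()
```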
So in the HyDE Twitter thread example, is embeddings_strs[0] equivalent to custom_embedding_strs[0]? And is HyDE hallucinating context to pass in for #1 / k-nearest retrieval?
I'm trying to create a legal assistant for question & answer over ~1 million case/legislation documents using GPT Index, e.g. "Summarize this law & cite relevant cases".
Any insight on what tools/classes to use to get the best answers per API token spend? e.g. GPTSimpleVectorIndex in combination with ___
I think we use Davinci for the LLM call by default, which costs $0.0200 / 1K tokens. I'd recommend trying a cheaper LLM model and seeing if the quality is still acceptable.
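For example, you can swap in a cheaper model via LLMPredictor, the LangChain-based pattern gpt_index used at the time (model names and pricing may have changed since):

```python
from gpt_index import GPTSimpleVectorIndex, LLMPredictor, SimpleDirectoryReader
from langchain.llms import OpenAI

# Curie was roughly 10x cheaper per token than Davinci (pricing may change).
llm_predictor = LLMPredictor(llm=OpenAI(model_name="text-curie-001"))

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)

response = index.query("Summarize this law & cite relevant cases")
print(response)
```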