Updated 2 years ago

Can anyone explain the custom embedding

Can anyone explain the 'custom_embedding_strs' in the new QueryBundle feature?

I don't quite get what 'list of strings used for embedding the query' means.
[Attachment: image.png]
hey @Blake, I'll try to give more clarity here.
Using GPTSimpleVectorIndex as an example, the original query string is used in two ways:
  1. we calculate an embedding for the query string, and retrieve the best-matching nodes via embedding similarity
  2. we pass the query string to the LLM to synthesize the final answer.
Sometimes you want these two to be different. custom_embedding_strs lets you override the string(s) used for step 1.
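To make the split between steps 1 and 2 concrete, here is a minimal stand-in sketch of the idea, not the library's actual implementation (the attribute names mirror the QueryBundle feature being discussed, but this dataclass is illustrative only):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class QueryBundle:
    # String sent to the LLM for the final answer (step 2 above)
    query_str: str
    # Optional alternate strings used only for embedding retrieval (step 1 above)
    custom_embedding_strs: Optional[List[str]] = None

    @property
    def embedding_strs(self) -> List[str]:
        # Fall back to the query string itself when no custom strings are given
        if self.custom_embedding_strs is None:
            return [self.query_str]
        return self.custom_embedding_strs

# Retrieval embeds embedding_strs; the LLM only ever sees query_str
bundle = QueryBundle(
    query_str="Summarize this law and cite relevant cases",
    custom_embedding_strs=["hypothetical case summary about this law"],
)
```

With no custom strings, retrieval and synthesis both use the original query; with them, only retrieval changes.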
Right now the main use-case for this feature is to support HyDE (hypothetical document embeddings). You can take a look at this tweet thread for more explanation/examples: https://twitter.com/jerryjliu0/status/1626255140209717248
And apologies, documentation could be improved!
interesting!

say i have:
[Attachment: image.png]
Would these 3 strings be grouped into a list and generate one embedding vector to grab the k-nearest?

Or is it something different, like these 3 are embedded separately, and the middle/equidistant vector is used to grab the k-nearest?
Currently the default logic is to embed each string separately, and use the "mean" embedding for calculating similarity.

We support customizing the aggregation function from "mean" to something else, but that configuration is not exposed at the Index API level yet. It's possible to subclass BaseEmbedding to implement your desired behavior, though.
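To illustrate that default: each string gets its own embedding vector, and the vectors are averaged dimension-wise into a single query vector before the similarity search. A rough sketch in plain Python (the function names here are illustrative, not the library's API):

```python
import math
from statistics import fmean

def aggregate_embeddings(embeddings, mode="mean"):
    """Combine several per-string embedding vectors into one query vector.

    Only "mean" is implemented here; swapping in another reducer is the
    customization point mentioned above.
    """
    if mode != "mean":
        raise ValueError(f"unknown aggregation mode: {mode}")
    # Dimension-wise average across all vectors
    return [fmean(dim) for dim in zip(*embeddings)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Three embedding strings -> three vectors -> one mean query vector
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = aggregate_embeddings(vecs)  # [0.666..., 0.666...]
sim = cosine_similarity(query_vec, [1.0, 1.0])  # ~1.0: same direction as mean
```

So the three strings produce one retrieval vector, not three separate lookups.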
so in the HyDE twitter thread example, is embedding_strs[0] equivalent to custom_embedding_strs[0], and HyDE is hallucinating context to pass in for step 1 (k-nearest retrieval)?
[Attachment: image.png]
Yes, exactly.
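In code, the HyDE flow confirmed above boils down to: ask the LLM to write a plausible (possibly wrong) answer passage, then embed that hallucinated passage instead of the raw question for retrieval. A sketch with stubbed-out LLM calls, where every function name is hypothetical:

```python
def generate_hypothetical_doc(question: str) -> str:
    # Stub for an LLM call that hallucinates a plausible answer passage.
    # In a real HyDE setup this would be a completion request.
    return f"A hypothetical passage answering: {question}"

def build_hyde_embedding_strs(question: str, include_original: bool = True):
    # The hallucinated doc drives nearest-neighbor retrieval (step 1);
    # optionally keep the original question in the mix as well.
    strs = [generate_hypothetical_doc(question)]
    if include_original:
        strs.append(question)
    return strs

# strs[0] plays the role of custom_embedding_strs[0] in the thread above;
# the original question is still what the LLM answers in step 2.
embedding_strs = build_hyde_embedding_strs("What did the court hold here?")
```

The intuition: a hallucinated answer often lands closer in embedding space to real answer documents than the question itself does.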
1 more question:

I'm trying to create a legal assistant question & answer on ~1 million cases/legislation documents using GPT Index. e.g. "Summarize this law & cite relevant cases"

Any insight on what tools/classes to use to get the best answers per api token spend? e.g. GPTSimpleVectorIndex in combination with ___
I've tested including a knowledge graph tree string in my custom_embedding_strs but it didn't seem to improve answers
[Attachment: image.png]
Could you maybe describe what the current failure case is? Would be super helpful for me to help diagnose
Great question
1) I'm trying to reduce my query API spend without reducing the quality of answers much, so cost is an issue
e.g. I'm spending roughly $0.10 per query right now
[Attachment: image.png]
looks like the cost is mostly coming from the LLM call (since embedding calls are quite a bit cheaper)
I think we use Davinci for the LLM call by default, which costs $0.02 / 1K tokens. I'd recommend trying a cheaper LLM model and seeing if the quality is still acceptable.
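A back-of-envelope check on that $0.10 figure, using the price quoted above (the 10x-cheaper model is an assumed example, not a specific recommendation):

```python
DAVINCI_COST_PER_1K = 0.02  # $ per 1K tokens, as quoted above
cost_per_query = 0.10       # observed spend per query

# How many tokens per query would explain the spend if it were all LLM cost?
tokens_per_query = cost_per_query / DAVINCI_COST_PER_1K * 1000  # ~5,000 tokens

# Hypothetical: a model priced at a tenth of Davinci's rate
cheaper_cost = tokens_per_query / 1000 * (DAVINCI_COST_PER_1K / 10)  # ~$0.01
```

So roughly 5K tokens of LLM usage per query accounts for the cost, which is why swapping the LLM model moves the bill far more than changing the embedding setup.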
will do, thank you!