
What is the use case of SentenceEmbeddingOptimizer?

I have used it with GPTPineconeIndex in a query, but I see that token consumption has increased instead of decreasing.
Here are the results with the following parameters:
Original -> (LLM: 2342, Embedding: 7)
  1. threshold_cutoff=0.7 - Total LLM token usage increased (LLM: 2720, Embedding: 7)
  2. percentile_cutoff=0.5, threshold_cutoff=0.7 - Total LLM token usage is lower than in test 1, but still more than the original consumption
  3. percentile_cutoff=0.8, threshold_cutoff=0.7 - Token consumption is lower than the original, but the model hallucinated and generated a wrong answer (LLM: 2248, Embedding: 7)
  4. threshold_cutoff=0.8 - Error: optimizer returned zero sentences
Any help here on reducing token consumption?
The embedding optimizer (as far as I'm aware) reduces token usage for embeddings, while trying to make the embeddings better represent the text by cleaning it up

For reducing LLM token usage, your best bet is playing with the chunk size. But tbh ~2000 tokens per query is already pretty low.
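
For example, a minimal sketch of lowering the chunk size via the ServiceContext, assuming the gpt_index API already used elsewhere in this thread (512 is just an illustrative value, not a recommendation from the thread):

Python
from gpt_index import ServiceContext

# Smaller chunks mean fewer tokens sent to the LLM per retrieved node
service_context = ServiceContext.from_defaults(chunk_size_limit=512)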

You could also look into caching queries (and maybe llama index will have internal caching someday too 🙏 )
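
If you go the caching route, a hypothetical exact-match cache could be as simple as the sketch below (query_cache and cached_query are illustrative names, not llama_index APIs):

Python
# Naive application-side cache for exact-repeat queries;
# a sketch, not a built-in llama_index feature
query_cache = {}

def cached_query(index, query_str, **kwargs):
    if query_str not in query_cache:
        query_cache[query_str] = index.query(query_str, **kwargs)
    return query_cache[query_str]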
The 2000 token usage is from the QA prompt; refine then uses another two queries to generate the final response
@Abhishek interesting. yeah the sentence embedding optimizer is meant to reduce LLM token usage, by using embedding similarity to filter out irrelevant sentences. do you have a snippet of code showing how you're using this? i'm surprised the LLM usage is higher for some cases; this shouldn't happen
@jerryjliu0 apologies for not sharing the code snippet before

Python
from gpt_index import GPTPineconeIndex
from gpt_index.optimization.optimizer import SentenceEmbeddingOptimizer

# Build the index on top of an existing Pinecone index
index = GPTPineconeIndex(
    [],
    pinecone_index=self.pinecone_index,
    namespace=organisation,
)
# Query with the optimizer attached
response = index.query(
    query_str="What is difference between sparse and dense vectors?",
    similarity_top_k=3,
    text_qa_template=load_chat_prompt(),
    service_context=service_context,
    optimizer=SentenceEmbeddingOptimizer(
        percentile_cutoff=0.5,
        threshold_cutoff=0.7,
    ),
)

Here is the code snippet. Thanks @jerryjliu0 @ravitheja!
@Abhishek @kartik9257 @ravitheja do you actually have test data i can play around with? I have a slight suspicion about what might be the issue, but it's not confirmed
some followup comments / questions:
  • Can you try using the SentenceEmbeddingOptimizer on its own? You can call optimizer.optimize(QueryBundle(query_str), source_text). You can fetch source_text from response.source_nodes in the above response. This way you can see on which specific source texts the optimizer uses more tokens (see the sketch after this list)
  • do you need to use the optimizer? is this mostly just to save token usage/cost? we will definitely try to fix the issue, but in the meantime i wonder if it's blocking you in some way
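
For reference, a minimal sketch of running the optimizer standalone as suggested above, assuming the gpt_index import paths used later in this thread and reusing the `response` object from the earlier query (the loop and prints are illustrative):

Python
from gpt_index.indices.query.schema import QueryBundle
from gpt_index.optimization.optimizer import SentenceEmbeddingOptimizer

optimizer = SentenceEmbeddingOptimizer(
    percentile_cutoff=0.5,
    threshold_cutoff=0.7,
)
query_str = "What is difference between sparse and dense vectors?"

# `response` comes from the earlier index.query(...) call
for source_node in response.source_nodes:
    source_text = source_node.source_text
    shortened = optimizer.optimize(QueryBundle(query_str), source_text)
    # Compare lengths to see which source texts the optimizer actually shrinks
    print(len(source_text), "->", len(shortened))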
yes, it's mostly to save token cost; we're spending $350/day so we want to save money
cc @Hongyi Shi as well if he happens to be free, since he's the orig author
Hmm, I don't think percentile_cutoff and threshold_cutoff are meant to be used together, but it's a minor issue
I generally use only percentile_cutoff=0.5, for example, to pull only half the sentences from a chunk
Also I need to double-check whether it works with the Pinecone index; I haven't tried it before
Thanks for your help @Hongyi Shi @jerryjliu0 !
@Hongyi Shi Can you let me know your test results with Pinecone?
@Abhishek here are my results; the optimizer seems to be working fine. I suggest using only percentile_cutoff for now, as that seems to work best for directly reducing token usage
Python
from gpt_index.optimization.optimizer import SentenceEmbeddingOptimizer
import logging

# Enable INFO logging so the token counters are printed
logger = logging.getLogger()
logger.setLevel(logging.INFO)

print("Without optimization")
response = city_indices["Boston"].query(
    "Tell me about the arts and culture of Boston",
    service_context=service_context
)
print(str(response))

print("With optimization")
response = city_indices["Boston"].query(
    "Tell me about the arts and culture of Boston",
    service_context=service_context,
    optimizer=SentenceEmbeddingOptimizer(percentile_cutoff=0.5)
)
print(str(response))

Plain Text
 Without optimization
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 4213 tokens
> [query] Total LLM token usage: 4213 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens
> [query] Total embedding token usage: 9 tokens
With optimization
INFO:gpt_index.optimization.optimizer:> [optimize] Total embedding token usage: 0 tokens
> [optimize] Total embedding token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 1940 tokens
> [query] Total LLM token usage: 1940 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens
> [query] Total embedding token usage: 9 tokens
This is using PineconeIndex.query
Thank you @Hongyi Shi, it works when using only percentile_cutoff
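
For completeness, a sketch of the configuration that resolved the thread, reusing the earlier GPTPineconeIndex query with only percentile_cutoff set (0.5 here, following Hongyi's example):

Python
response = index.query(
    query_str="What is difference between sparse and dense vectors?",
    similarity_top_k=3,
    text_qa_template=load_chat_prompt(),
    service_context=service_context,
    # threshold_cutoff omitted per the discussion above
    optimizer=SentenceEmbeddingOptimizer(percentile_cutoff=0.5),
)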