
What is the use case of SentenceEmbeddingOptimizer?

I have used it with GPTPineconeIndex in a query, but I see that token consumption has increased instead of decreasing.
Here are the results with the following parameters:
Original -> (LLM: 2342, Embedding: 7)
  1. threshold_cutoff=0.7 - Total LLM token usage increased (LLM: 2720, Embedding: 7)
  2. percentile_cutoff=0.5, threshold_cutoff=0.7 - Total LLM token usage is lower than in test 1, but still more than the original consumption
  3. percentile_cutoff=0.8, threshold_cutoff=0.7 - Token consumption is lower than the original, but the model hallucinated and generated a wrong answer (LLM: 2248, Embedding: 7)
  4. threshold_cutoff=0.8 - Error: optimizer returned zero sentences
Any help here on reducing token consumption?
The embedding optimizer (as far as I'm aware) reduces token usage for embeddings, while trying to make the embeddings better represent the text by cleaning it up

For reducing LLM token usage, your best bet is playing with the chunk size. But tbh ~2000 tokens per query is already pretty low.
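
For example, a minimal sketch of lowering the chunk size via the ServiceContext, assuming the gpt_index API already used elsewhere in this thread (512 is just an illustrative value, not a recommendation from the thread):

Python
from gpt_index import ServiceContext

# Smaller chunks mean fewer tokens sent to the LLM per retrieved node
service_context = ServiceContext.from_defaults(chunk_size_limit=512)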

You could also look into caching queries (and maybe llama index will have internal caching someday too 🙏 )
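
If you go the caching route, a hypothetical exact-match cache could be as simple as the sketch below (query_cache and cached_query are illustrative names, not llama_index APIs):

Python
# Naive application-side cache for exact-repeat queries;
# a sketch, not a built-in llama_index feature
query_cache = {}

def cached_query(index, query_str, **kwargs):
    if query_str not in query_cache:
        query_cache[query_str] = index.query(query_str, **kwargs)
    return query_cache[query_str]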
The 2000 token usage is from the QA prompt; refine then uses another two queries to generate the final response
@Abhishek interesting. yeah the sentence embedding optimizer is meant to reduce LLM token usage, by using embedding similarity to filter out irrelevant sentences. do you have a snippet of code showing how you're using this? i'm surprised the LLM usage is higher for some cases; this shouldn't happen
@jerryjliu0 apologies for not sharing the code snippet before

Python
from gpt_index import GPTPineconeIndex
from gpt_index.optimization.optimizer import SentenceEmbeddingOptimizer

# Build the index on top of an existing Pinecone index
index = GPTPineconeIndex(
    [],
    pinecone_index=self.pinecone_index,
    namespace=organisation,
)
# Query with the optimizer attached
response = index.query(
    query_str="What is difference between sparse and dense vectors?",
    similarity_top_k=3,
    text_qa_template=load_chat_prompt(),
    service_context=service_context,
    optimizer=SentenceEmbeddingOptimizer(
        percentile_cutoff=0.5,
        threshold_cutoff=0.7,
    ),
)

Here is the code snippet. Thanks @jerryjliu0 @ravitheja!
@Abhishek @kartik9257 @ravitheja do you actually have test data i can play around with? I have a slight suspicion about what might be the issue, but it's not confirmed
some followup comments / questions:
  • Can you try using the SentenceEmbeddingOptimizer on its own? You can call optimizer.optimize(QueryBundle(query_str), source_text). You can fetch source_text from response.source_nodes in the above response. This way you can see on which specific source texts the optimizer uses more tokens (see the sketch after this list)
  • do you need to use the optimizer? is this mostly just to save token usage/cost? we will definitely try to fix the issue, but in the meantime i wonder if it's blocking you in some way
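
For reference, a minimal sketch of running the optimizer standalone as suggested above, assuming the gpt_index import paths used later in this thread and reusing the `response` object from the earlier query (the loop and prints are illustrative):

Python
from gpt_index.indices.query.schema import QueryBundle
from gpt_index.optimization.optimizer import SentenceEmbeddingOptimizer

optimizer = SentenceEmbeddingOptimizer(
    percentile_cutoff=0.5,
    threshold_cutoff=0.7,
)
query_str = "What is difference between sparse and dense vectors?"

# `response` comes from the earlier index.query(...) call
for source_node in response.source_nodes:
    source_text = source_node.source_text
    shortened = optimizer.optimize(QueryBundle(query_str), source_text)
    # Compare lengths to see which source texts the optimizer actually shrinks
    print(len(source_text), "->", len(shortened))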
yes, it's mostly to save token cost; we're spending $350/day so we want to save money
cc @Hongyi Shi as well if he happens to be free, since he's the orig author
Hmm, I don't think percentile_cutoff and threshold_cutoff are meant to be used together, but it's a minor issue
I generally use only percentile_cutoff=0.5, for example, to pull only half the sentences from a chunk
Also I need to double-check whether it works with the Pinecone index; I haven't tried it before
Thanks for your help @Hongyi Shi @jerryjliu0 !
@Hongyi Shi Can you let me know your test results with Pinecone?
@Abhishek here are my results; the optimizer seems to be working fine. I suggest using only percentile_cutoff for now, as that seems to work best for directly reducing token usage
Python
from gpt_index.optimization.optimizer import SentenceEmbeddingOptimizer
import logging

# Enable INFO logging so the token counters are printed
logger = logging.getLogger()
logger.setLevel(logging.INFO)

print("Without optimization")
response = city_indices["Boston"].query(
    "Tell me about the arts and culture of Boston",
    service_context=service_context
)
print(str(response))

print("With optimization")
response = city_indices["Boston"].query(
    "Tell me about the arts and culture of Boston",
    service_context=service_context,
    optimizer=SentenceEmbeddingOptimizer(percentile_cutoff=0.5)
)
print(str(response))

Plain Text
 Without optimization
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 4213 tokens
> [query] Total LLM token usage: 4213 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens
> [query] Total embedding token usage: 9 tokens
With optimization
INFO:gpt_index.optimization.optimizer:> [optimize] Total embedding token usage: 0 tokens
> [optimize] Total embedding token usage: 0 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 1940 tokens
> [query] Total LLM token usage: 1940 tokens
INFO:gpt_index.token_counter.token_counter:> [query] Total embedding token usage: 9 tokens
> [query] Total embedding token usage: 9 tokens
This is using PineconeIndex.query
Thank you @Hongyi Shi, it works when using only percentile_cutoff
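
For completeness, a sketch of the configuration that resolved the thread, reusing the earlier GPTPineconeIndex query with only percentile_cutoff set (0.5 here, following Hongyi's example):

Python
response = index.query(
    query_str="What is difference between sparse and dense vectors?",
    similarity_top_k=3,
    text_qa_template=load_chat_prompt(),
    service_context=service_context,
    # threshold_cutoff omitted per the discussion above
    optimizer=SentenceEmbeddingOptimizer(percentile_cutoff=0.5),
)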