Token cost prediction

I am having a devilish time attempting to analyze the cost of a query with MockLLMPredictor. I keep getting:
Plain Text
 --------------------------------------------------------------------------
AuthenticationError                       Traceback (most recent call last)
File c:\Users\happy\Documents\Projects\askLavinia\.venv\lib\site-packages\tenacity\__init__.py:382, in Retrying.__call__(self, fn, *args, **kwargs)
    381 try:
--> 382     result = fn(*args, **kwargs)
    383 except BaseException:  # noqa: B902

File c:\Users\happy\Documents\Projects\askLavinia\.venv\lib\site-packages\llama_index\embeddings\openai.py:106, in get_embedding(text, engine, **kwargs)
    105 text = text.replace("\n", " ")
--> 106 return openai.Embedding.create(input=[text], model=engine, **kwargs)["data"][0][
    107     "embedding"
    108 ]

File c:\Users\happy\Documents\Projects\askLavinia\.venv\lib\site-packages\openai\api_resources\embedding.py:33, in Embedding.create(cls, *args, **kwargs)
     32 try:
---> 33     response = super().create(*args, **kwargs)
     35     # If a user specifies base64, we'll just return the encoded string.
     36     # This is only for the default case.

File c:\Users\happy\Documents\Projects\askLavinia\.venv\lib\site-packages\openai\api_resources\abstract\engine_api_resource.py:149, in EngineAPIResource.create(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)
    127 @classmethod
    128 def create(
    129     cls,
   (...)
    136     **params,
...
--> 326     raise retry_exc from fut.exception()
    328 if self.wait:
    329     sleep = self.wait(retry_state)

RetryError: RetryError[] 
Yet the query works fine when I set up the query engine with st.session_state['query_engine'] = index.as_query_engine(verbose=True). Has anyone managed to retrieve the token counts and then figure out the cost? Thank you.
24 comments
It says authentication error, on the embeddings it looks like πŸ€”

Did you also setup a mock embedding model?
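For example, something like this (a minimal sketch, assuming the same llama_index version used elsewhere in this thread; note that both mocks have to be registered on the service context):
Plain Text
from llama_index import MockEmbedding, MockLLMPredictor, ServiceContext

# mock both the LLM and the embedding model so neither call hits the OpenAI API
llm_predictor = MockLLMPredictor(max_tokens=256)
embed_model = MockEmbedding(embed_dim=1536)

# if the mock embedding model is left out, the real embedding model is still used
# (and it needs a valid OpenAI key), which would explain the AuthenticationError
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, embed_model=embed_model
)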
OK. Digging deeper, it is as if, when it comes to the embedding, it doesn't find an OpenAI key, although in a prior step I had os.environ["OPENAI_API_KEY"] set. Honestly, I just wish this were a property of each API call that needs it; then we'd know when we're being charged. OK, moving on: do you know why the Mock stuff mocks out my OpenAI key? Thank you.
There have been a lot of issues with OpenAI keys lately; I think it might be related to the langchain/openai packages. Just an FYI, especially when running in a notebook, I usually also set

Plain Text
import openai
openai.api_key = "sk-...."
Thank you, your reply is a good point. However, I am so very, very forgetful that I would just end up exposing all my "secret" keys...
@Logan M But you know what? That was the problem. I bet the way I had just copy-pasted between .py and .ipynb lost the string. Further proving my teachers' point that there is a slight sniff of stewpidity in the air. Thank you.
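(As an aside, to avoid hardcoding the key in source, a small sketch of reading it from the environment instead; this assumes the key has already been exported as OPENAI_API_KEY:)
Plain Text
import os
import openai

# read the key from the environment rather than pasting it into the source
openai.api_key = os.environ["OPENAI_API_KEY"]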
@Logan M OpenAI pricing is different for input than for output. What does the returned number represent when
Plain Text
llm_predictor = MockLLMPredictor(max_tokens=256)
embed_model = MockEmbedding(embed_dim=1536)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
are used as parameters for the query engine? (Ideally there would be a way to get both the input and output counts...) Or are these just for retrieval, and then there are more tokens for the "real" query? I am confused.
What version of llama-index are you on? The current version of token counting is technically deprecated

The new method gets both prompt and completion token counts, and lets you set the proper tokenizer (the old one just mixes everything into one count)
https://gpt-index.readthedocs.io/en/latest/how_to/callbacks/token_counting_migration.html

I'm realizing now I should update the docs page for token counting lol
Wow... isn't that fast and furious... I was using this reference doc: https://gpt-index.readthedocs.io/en/latest/how_to/analysis/cost_analysis.html (note that it is the latest, and you need token counting for costing...). Thank you.
Yea, that's the doc page I just noticed is outdated, sorry about that!
@Logan M Writing a great doc is very difficult. On to my new challenge. It will take two messages. Message 1, the code:
Plain Text
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler
# you can set a tokenizer directly, or optionally let it default 
# to the same tokenizer that was used previously for token counting
# NOTE: The tokenizer should be a function that takes in text and returns a list of tokens
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("text-davinci-003").encode,
    verbose=True  # set to true to see usage printed to the console
)
callback_manager = CallbackManager([token_counter])
service_context = ServiceContext.from_defaults(callback_manager=callback_manager)

# also track prompt, completion, and total LLM tokens, in addition to embeddings
response = index.as_query_engine().query("What are the overtime policies?")
print('Embedding Tokens: ', token_counter.total_embedding_token_count, '\n',
      'LLM Prompt Tokens: ', token_counter.prompt_llm_token_count, '\n',
      'LLM Completion Tokens: ', token_counter.completion_llm_token_count, '\n',
      'Total LLM Token Count: ', token_counter.total_llm_token_count)
It will take two messages -> what do you mean here?
Oh, I just noticed there was a bug in the doc code:
Plain Text
callback_manager = CallbackManager([token_counter])

service_context = ServiceContext.from_defaults(callback_manager=callback_manager)

documents = SimpleDirectoryReader("./data").load_data()

# if verbose is turned on, you will see embedding token usage printed
index = VectorStoreIndex.from_documents(documents)
See https://gpt-index.readthedocs.io/en/latest/how_to/callbacks/token_counting_migration.html: the as_query_engine() call doesn't pass in the service_context. No wonder I was getting zeros when I copied and pasted.
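The fix is simply to pass the service context into the query engine explicitly (a minimal sketch, reusing the service_context variable from above):
Plain Text
# pass the service context holding the callback manager into the query engine,
# otherwise the TokenCountingHandler never sees the LLM or embedding calls
query_engine = index.as_query_engine(service_context=service_context)
response = query_engine.query("What are the overtime policies?")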
That's what I get for copy-pasting πŸ€¦β€β™‚οΈ

The notebook version sets a global service context, hence that got missed in the smaller demo
https://gpt-index.readthedocs.io/en/latest/examples/callbacks/TokenCountingHandler.html
Oh my. If only I had stumbled on that page...
@Logan M question on prompt tokens. Here is the query:
Plain Text
response = index.as_query_engine(verbose=True, service_context=service_context).query("What are the overtime policies?")
Given the rough word-to-token rule of thumb (that I naively use), I'm thinking the prompt token usage would be about 5-7, but here is what I get:
Plain Text
LLM Prompt Token Usage: 1945
So there is more to the prompt. Is there an easy way to show what the full prompt is?
When you query, it retrieves text from your index, and inserts that text + your query string into a prompt template.

Hence, the token usage is larger than 5-7 πŸ˜‰

Plain Text
# Get the prompt text of the last LLM call
token_counter.llm_token_counts[-1].prompt

# Get the completion text of the last LLM call
token_counter.llm_token_counts[-1].completion
oh. Right. Thank you.
@Logan M that really helped. Thank you very much.
Here is my class for token counting. Thank you for your help:
Plain Text
import tiktoken
from llama_index.callbacks import CallbackManager, TokenCountingHandler


class TokenCost:
    """
    A class used to calculate the token cost of an LLM model.

    Attributes
    ----------
    token_counter : TokenCountingHandler
        a TokenCountingHandler object that counts tokens in the model
    callback_manager : CallbackManager
        a CallbackManager object that manages callbacks for the token counter
    """

    def __init__(self, model_name, verbose=True):
        """
        Initializes the TokenCost object.

        Parameters
        ----------
        model_name : str
            The name of the model to be token counted. A common name is 'text-davinci-003'.
        verbose : bool, optional
            Whether to print the token counting progress to the console. Default is True.
        """
        self._callback_manager = None
        # Set up the token-counting callback with the tokenizer for this model
        self.token_counter = TokenCountingHandler(
            tokenizer=tiktoken.encoding_for_model(model_name).encode, verbose=verbose
        )
        self.callback_manager = CallbackManager([self.token_counter])

    @property
    def callback_manager(self):
        return self._callback_manager

    @callback_manager.setter
    def callback_manager(self, value):
        self._callback_manager = value

    @property
    def embedding_token_count(self):
        return self.token_counter.total_embedding_token_count

    @property
    def prompt_token_count(self):
        return self.token_counter.prompt_llm_token_count

    @property
    def completion_token_count(self):
        return self.token_counter.completion_llm_token_count

    @property
    def total_token_count(self):
        return self.token_counter.total_llm_token_count

    @property
    def prompt(self):
        return self.token_counter.llm_token_counts[-1].prompt

    @property
    def completion(self):
        return self.token_counter.llm_token_counts[-1].completion
test code:
Plain Text
from myutils import TokenCost, utils_load_index
from llama_index import ServiceContext, Prompt

model_name = "text-davinci-003"

token_cost = TokenCost(model_name, verbose=False)
service_context = ServiceContext.from_defaults(
    callback_manager=token_cost.callback_manager
)
index = utils_load_index("indices/vector_index")
# also track prompt, completion, and total LLM tokens, in addition to embeddings
PROMPT_TMPL_STR = (
    "Given this context information --> {context_str} <-- \n\n"
    "and no prior knowledge, "
    "answer the question: {query_str}. The response should be formatted as a list of bullet points.  Adhere to these guidelines:\n"
    "- bullet points start on new lines\n"
    "- each bullet point includes a fact and the article number where the fact is discussed\n"
    "- the text should be comprehensible to a high school student\n"
)

QA_TEMPLATE = Prompt(PROMPT_TMPL_STR)
response = index.as_query_engine(
    verbose=True, service_context=service_context, text_qa_template=QA_TEMPLATE
).query("What are the overtime policies?")
print(
    f"""
Embedding Tokens: {token_cost.embedding_token_count}
LLM Prompt Tokens: {token_cost.prompt_token_count}
LLM Completion Tokens: {token_cost.completion_token_count}
Total LLM Token Count: {token_cost.total_token_count}
{'*' * 50}
Prompt: 
{token_cost.prompt}
{'*' * 50}
Completion: 
{token_cost.completion}
"""
)
Very nice! πŸ’ͺπŸ’ͺ
With that said, that's just the token count. To get the cost, it is unfortunate there isn't an API to look up each model's pricing, but I think this is right (prices are per token, in USD):
Plain Text
{ "openai_LLMs": {
    "text-davinci-003":{"prompt":0.00002, "completion": 0.00002}, 
    "gpt4": {"prompt":0.00003, "completion": 0.00006 },
    "gpt-4-32k": {"prompt":0.00006, "completion": 0.00012},
    "gpt-3.5-turbo": {"prompt":0.0000015 , "completion": 0.000002 },
    "gpt-3.5-16K" : {"prompt":0.000003 , "completion": 0.000004  }
    }
}
Then we can calculate the cost:
Plain Text
import json


def calculate_cost(model_name, num_prompt_tokens, num_completion_tokens):
    """Calculate the total cost for a specified model based on the number of prompt and completion tokens."""

    # Load data from JSON file
    with open("openai_costs.json") as f:
        data = json.load(f)

    try:
        costs = data["openai_LLMs"][model_name]
        prompt_cost = costs["prompt"]
        completion_cost = costs["completion"]

        # Calculate total cost
        total_cost = (prompt_cost * num_prompt_tokens) + (
            completion_cost * num_completion_tokens
        )
        return total_cost
    except KeyError:
        return "Model not found in data."
OK, I'll shut up now.
Oh wait, before I shut up... this was helpful... it gives the names and encodings for the available models:
Plain Text
from tiktoken.model import MODEL_TO_ENCODING

# Now MODEL_TO_ENCODING is a dictionary where keys are model names and values are their encodings
# We can transform it into a list of dictionaries

list_of_dicts = [{"model": k, "encoding": v} for k, v in MODEL_TO_ENCODING.items()]
print(list_of_dicts)
Yea very nice! Those costs look correct to me πŸ™‚