Costs

Hello,
I am a little confused about pricing while using llama-index.
Currently, I think with each request I am being charged for "text-davinci", which is $0.02/1000 tokens, and "text-embedding-ada-002-v2", which is $0.0004/1000 tokens. I gave a research document PDF as input and was charged $0.10 for one request (roughly 4000 tokens for text-davinci and 16000 tokens for "text-embedding-ada-002-v2"?).
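As a quick back-of-the-envelope check, those token estimates roughly add up to the observed charge:

# Rough cost check using the prices quoted above; the token counts are my estimates.
davinci_cost = 4000 / 1000 * 0.02        # $0.08 for "text-davinci" completion tokens
embedding_cost = 16000 / 1000 * 0.0004   # $0.0064 for "text-embedding-ada-002-v2"
print(davinci_cost + embedding_cost)     # ~$0.086, close to the $0.1 charged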

Will I be charged for embedding for each request?
Also, if I use "gpt-3.5-turbo", will it be a better option?
Finally, how can I reduce the number of tokens used so that I am charged less? One approach I am thinking of is to use a vector search engine (like Qdrant or FAISS) to cache the questions asked and the responses given by GPT. Then, if a similar question is asked by any user next time, the answer can be served directly from the cache instead of spending tokens.
Any help would be appreciated. Thank you.
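A minimal sketch of that caching idea, assuming FAISS and a normalized embedding vector per question (the embed and ask_gpt helpers and the 0.92 threshold are illustrative placeholders, not a tested recipe):

import faiss

DIM = 1536                      # dimension of text-embedding-ada-002 vectors
index = faiss.IndexFlatIP(DIM)  # inner product over normalized vectors = cosine similarity
answers = []                    # answers[i] belongs to the i-th cached question vector

def embed(text):
    # Placeholder: return a normalized float32 vector of shape (DIM,),
    # e.g. from the OpenAI embeddings API.
    raise NotImplementedError

def ask_gpt(question):
    # Placeholder: the usual llama-index / GPT query path that costs tokens.
    raise NotImplementedError

def cached_answer(question, threshold=0.92):
    vec = embed(question).reshape(1, -1)
    if index.ntotal > 0:
        scores, ids = index.search(vec, 1)
        if scores[0][0] >= threshold:
            return answers[ids[0][0]]  # similar question seen before: serve from cache
    answer = ask_gpt(question)         # cache miss: spend tokens once
    index.add(vec)
    answers.append(answer)
    return answer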
Only for index construction will you see a large embedding cost. Then, queries only need to embed the query text (usually very short)

The easiest way to lower token costs (for davinci and gpt-3.5) is to lower the chunk size in the service context

from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(..., chunk_size_limit=2000)

But the smaller the chunk size, the more you might need to increase the top k to find the right answer. It's a bit of a balancing act.
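For example, something like this, assuming the llama-index API from around the time of this thread (the chunk size and top k here are just starting points to tune):

from llama_index import GPTSimpleVectorIndex, ServiceContext, SimpleDirectoryReader

# Smaller chunks mean fewer tokens sent to davinci/gpt-3.5 per query,
# but you may need a larger top k to retrieve enough context.
documents = SimpleDirectoryReader("data").load_data()
service_context = ServiceContext.from_defaults(chunk_size_limit=512)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
response = index.query("what are the symptoms of covid?", similarity_top_k=3)
print(response)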
@mathada @Logan M To add to this, use response.source_nodes to inspect your selected nodes systematically. If not enough relevant context is retrieved, increase top k; if too much, decrease it.
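For example (the exact attributes on each source node vary across llama-index versions):

# Check which chunks were actually retrieved for the query; if the relevant
# passage is missing, raise similarity_top_k; if there is mostly noise, lower it.
response = index.query("who issued this management protocol guidelines?", similarity_top_k=2)
for source in response.source_nodes:
    print(source)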
Thanks @Logan M and @BioHacker .
A few observations:
  • Using "chunk_size_limit" and "similarity_top_k" increases response time.
  • The "gpt-3.5-turbo" model takes a really, really long time to respond (average response time of 30 seconds, compared to 3 seconds for the "text-davinci" model).
  • The "gpt-3.5-turbo" results are much worse than the "text-davinci" model's (a typical setup is sketched below).
Please provide details on the document and your query. My experience has been the opposite while using a wide variety of queries and docs.
@BioHacker
This was the document used.
https://www.wbhealth.gov.in/uploaded_files/corona/Management_Protocol_for_COVID-19_-_WB_(1).pdf
I converted this document to a Word file and then used that.

Queries asked:
  1. "what are the symptoms of covid?" The "text-davinci" model gives the correct answer, while "gpt-3.5-turbo" cannot; it gives a very generic answer and beats around the bush.
  2. "who issued this management protocol guidelines?" Both give the correct answer.
  3. "what are the precautions to take in time of pregnancy?" The "text-davinci" model gives the correct answer, while "gpt-3.5-turbo" says it is an out-of-context question.
The screenshot below (image.png) shows the "gpt-3.5-turbo" code along with its response.

What's your experience, @BioHacker? Does "gpt-3.5-turbo" give you better results, and in less time? I would like to know. Thank you.
@mathada ohhh, you are running on Colab? I've heard many people complain about slow OpenAI API usage on Colab. I think OpenAI is throttling traffic from their servers, tbh 😅 Locally, I think you'll have a better experience in terms of speed.
Symptoms of COVID:
The symptoms of COVID-19 include fever, cough, sore throat, malaise, myalgia, difficulty breathing, respiratory rate greater than 24/min, SpO2 less than 94% in room air, altered sensorium (drowsiness, confusion, stupor), infiltrates on chest X-ray, altered liver or renal function test, and hyperthermia or hypothermia.
Pregnancy:
The precautions to take in time of pregnancy include: avoiding contact with people who are sick, washing hands frequently, avoiding large gatherings, wearing a face mask when in public, and avoiding travel. Additionally, pregnant women should seek medical advice if they experience any symptoms of COVID-19, such as fever, cough, or difficulty breathing. If a pregnant woman tests positive for COVID-19, she should follow the management protocol for all COVID-19 patients, which includes proper hand hygiene, wearing a three-layered medical mask, and proper cleaning of her breast and nipple before each feeding. Additionally, if vaginal delivery is feasible, it should be recommended, and if urgent delivery by Caesarean Section is needed, spinal anaesthesia should be used to minimise the need for general anaesthesia.
@mathada How are these answers?
The token usage for the first answer is 931, in under 5 seconds. The token usage for the second is 1200, also in under 5 seconds.
@Logan M I will certainly try running it locally.

@BioHacker perfect answers. These are from the "gpt-3.5-turbo" model? 😱
Can you kindly share the code you used?
What mistake am I making? 🤔🤔
haha, I know, right? You can get crazy outcomes by getting the settings right.
I'll share with you in the DMs.
If there are magic numbers, I’m interested too!
well, they're not that magical if you know what you are doing. But if you are using all default values, the results won't be optimized. A lot of trial and error goes into all this, so just experiment 🙂
Definitely have. My findings: larger chunks are less accurate; smaller ones sometimes lose context. 2048 works well, but it sometimes splits mid-paragraph and mid-sentence, as per the other thread 😉 The embedding chunk size has to be 4095 (!) if you're on Azure.
The bigger the chunk, the fewer "seams" (e.g. splits in the middle of a paragraph), but the less accurate the results.