Hello, I have a little confusion over pricing while using llama-index. Currently, I think that with each request I am getting charged for "text-davinci" ($0.02/1,000 tokens) and "text-embedding-ada-002-v2" ($0.0004/1,000 tokens). I gave a research-paper PDF as input and was charged $0.10 for one request (roughly 4,000 tokens for "text-davinci" and 16,000 tokens for "text-embedding-ada-002-v2"?).
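For reference, this is the back-of-the-envelope arithmetic behind that guess, using the per-1,000-token rates quoted above:

```python
# Rough cost check using the rates quoted above (token counts are my estimates).
davinci_cost = 4_000 / 1_000 * 0.02        # $0.080 for the completion model
embedding_cost = 16_000 / 1_000 * 0.0004   # $0.0064 for text-embedding-ada-002-v2
total = davinci_cost + embedding_cost      # ~= $0.086, in the ballpark of the $0.10 charged
print(total)
```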
Will I be charged for embeddings on every request? Also, would using "gpt-3.5-turbo" be a better option? Finally, how can I reduce the number of tokens used so that I am charged less? One approach I am thinking of is to use a vector search engine (like Qdrant or FAISS) to cache the questions asked and the responses given by GPT. Then, if a similar question is asked by any user later, the answer can be served directly from the cache instead of spending tokens; a sketch of this idea is below. Any help would be appreciated. Thank you.
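Something like the following is what I have in mind. This is just a rough sketch, assuming the pre-1.0 `openai` Python client (the embedding call changed in later versions) and a hypothetical 0.90 similarity cutoff for a cache hit:

```python
import faiss
import numpy as np
import openai  # pre-1.0 openai client; the Embedding.create call changed later

DIM = 1536        # text-embedding-ada-002 vectors have 1536 dimensions
THRESHOLD = 0.90  # hypothetical cutoff for treating a question as already seen

cache_index = faiss.IndexFlatIP(DIM)  # inner product == cosine on unit vectors
cached_answers: list[str] = []

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    vec = np.asarray(resp["data"][0]["embedding"], dtype="float32")
    return vec / np.linalg.norm(vec)  # normalize so inner product = cosine similarity

def answer(question: str, ask_gpt) -> str:
    q = embed(question).reshape(1, -1)
    if cache_index.ntotal > 0:
        scores, ids = cache_index.search(q, 1)
        if scores[0][0] >= THRESHOLD:
            return cached_answers[ids[0][0]]  # cache hit: no completion tokens spent
    response = ask_gpt(question)  # cache miss: pay for the LLM call once
    cache_index.add(q)
    cached_answers.append(response)
    return response
```

Note that the cache lookup itself still embeds each incoming question, so it costs embedding tokens, but those are about 50x cheaper than the completion tokens it saves on a hit.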
Hello all, is there a way to print only the top-k chunks retrieved for a user query using llama-index?
As far as I understand, llama-index creates an embedding of the user query and finds the <similarity_top_k> most similar chunks (using cosine similarity). Is there a way to print what these top-k chunks are?
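In case it helps anyone searching later, here is a minimal sketch of one way to do this, assuming a recent llama_index release where the imports live under `llama_index.core` (older releases import the same names from `llama_index` directly); the "data" directory and the example question are placeholders:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Build an index over local documents ("data" is a placeholder directory).
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve the top-k chunks directly, without calling the LLM at all.
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("What is the main contribution of the paper?")

for node_with_score in nodes:
    print(f"score={node_with_score.score:.4f}")
    print(node_with_score.node.get_content())
    print("-" * 40)
```

If you are going through a query engine instead, the same chunks should be available on the response object as `response.source_nodes`.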