I also have a few concerns about concurrency. I have an app with about 200k daily active users, about 30k users online in an hour, and 1.5k online in a minute. I want to give users some AI-related functionality backed by llama-index + the OpenAI API. Assuming 10% of active users will use it, that works out to 150 calls a minute, 3k calls an hour, and 20k calls a day to llama-index. Is this even practical? What level of hardware will be needed? Thanks in advance for any suggestions!
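A quick sanity check of those numbers (the 10% adoption rate is my assumption):

```python
# Back-of-envelope load estimate from the figures above.
daily_active = 200_000   # active users per day
hourly_online = 30_000   # online users per hour
minute_online = 1_500    # online users per minute
adoption = 0.10          # assumed share of users who touch the AI feature

calls_per_min = minute_online * adoption
calls_per_hour = hourly_online * adoption
calls_per_day = daily_active * adoption

print(calls_per_min, calls_per_hour, calls_per_day)  # 150.0 3000.0 20000.0
```

So peak load is only around 2-3 requests per second, which is modest on the app side.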
Could be practical; it depends on what your llama-index app is doing.
You're saying 20k calls a day to llama-index, but llama-index will pass those off to OpenAI, so the biggest bottleneck / price point will probably be there.
It also depends on what you're worried about: the OpenAI error rate, pricing, the DB (vector stores), etc.
Thanks a ton, @bmax! What about latency? Will the latency mostly depend on the OpenAI API as well? Won't llama-index itself contribute much to the total latency?
It's an app designed for sports fans, where users can come together to discuss, interact, and compete with each other. Now I'm planning to introduce a chatbot into the app to answer some of the common questions from users.