Why is that contradictory? In the majority of applications, you'll need to scale it somehow for production use (e.g. in a k8s cluster, like EKS on AWS).
llama.cpp is cool, but it couldn't support a production app right now: it can only process requests sequentially, has no dynamic batching, and is slower than proper CUDA implementations.
It's contradictory because people who want to run AI apps locally probably do it out of concerns about cloud-computing costs, data privacy, and security.
Putting your app on Amazon's property seemingly goes against that idea.
Fair enough. AWS was a suggestion because it's easy to spin up. You could of course run a local k8s cluster instead, but you'll need a decent amount of hardware depending on the traffic you want to handle (that's the nice thing about k8s, though: it runs on any server).
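For what it's worth, a self-hosted setup like that can be sketched as a minimal k8s Deployment plus Service for an inference server. This is just an illustration under assumptions: the image tag, model path, and resource limits are placeholders you'd swap for your own, and the port follows the common convention of an OpenAI-compatible server listening on 8080.

```yaml
# Hypothetical sketch: serving an LLM inference server on a local k8s cluster.
# Image, model path, and resource numbers are placeholders, not recommendations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server
spec:
  replicas: 2                      # scale horizontally across your nodes
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
        - name: server
          image: your-registry/llm-server:latest   # placeholder image
          args: ["--model", "/models/model.gguf"]  # placeholder model path
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: "8"
              memory: 32Gi         # size to your model; GPUs need a device plugin
---
apiVersion: v1
kind: Service
metadata:
  name: llm-server
spec:
  selector:
    app: llm-server
  ports:
    - port: 80
      targetPort: 8080
```

The same manifests apply unchanged whether the cluster is EKS or a box under your desk running k3s or kind, which is the "runs on any server" point above.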