I'm trying to come up with a set of steps to deploy an inference model on a GPU in Azure that can scale to zero when not in use and spin up when the endpoint is called.
This seems like it would be a common problem, because many companies want a closed-off ChatGPT with RAG to prevent data leakage. Additionally, GPUs are expensive, so ideally you pay only for the compute that is actually used. I am assuming that Kubernetes is the best approach, but Kubernetes is not easy to work with directly for many reasons. I would therefore expect that there is an existing framework or solution that makes this process easy. Are there some simple solutions to this problem?
@Logan M thank you. As far as I can tell, there are no managed solutions on Azure at the moment, aside from maybe NVIDIA Triton Inference Server, which can be launched on Azure. I'm not certain it can scale to zero, but from what I can tell it offers easier control over setup and scaling. After some more searching, it seems there are frameworks I could launch on Azure that sit on top of Kubernetes and make them easier to manage. I'm currently researching Fermyon and Knative.
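For anyone finding this later: if Knative pans out, here is a minimal sketch of what the scale-to-zero part might look like on AKS, using the kubernetes Python client to create a Knative Service. This assumes Knative Serving is already installed on the cluster and that your kubeconfig has credentials for it; the service name, namespace, container image, and GPU count are placeholders, not a tested recipe.

```python
# Sketch: create a Knative Service that scales to zero when idle.
# Assumes Knative Serving is installed on the AKS cluster and a local
# kubeconfig points at it. Image name, namespace, and GPU count are
# hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # load local kubeconfig for the AKS cluster

# Knative Services are custom resources in the serving.knative.dev group.
knative_service = {
    "apiVersion": "serving.knative.dev/v1",
    "kind": "Service",
    "metadata": {"name": "llm-inference", "namespace": "default"},
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    # min-scale "0" lets Knative remove all pods when idle;
                    # the next request to the endpoint triggers a cold start.
                    "autoscaling.knative.dev/min-scale": "0",
                    "autoscaling.knative.dev/max-scale": "3",
                }
            },
            "spec": {
                "containers": [
                    {
                        "image": "myregistry.azurecr.io/llm-inference:latest",
                        "ports": [{"containerPort": 8080}],
                        # Request one GPU per pod; the AKS node pool must
                        # have GPU nodes (e.g. an NC-series VM size).
                        "resources": {"limits": {"nvidia.com/gpu": "1"}},
                    }
                ]
            },
        }
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="serving.knative.dev",
    version="v1",
    namespace="default",
    plural="services",
    body=knative_service,
)
```

The key piece is the autoscaling.knative.dev/min-scale annotation set to "0", which lets Knative scale the deployment down to zero pods when idle and cold-start one on the next request. One caveat to expect: for large models, a cold start includes pulling the image and loading weights onto the GPU, which can take minutes, so it's worth testing whether that latency is acceptable for your use case.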