Hello, I'm trying to deploy my LLM and I have a couple of questions about it. 1) First, I've seen the LlamaIndex starter pack and I was wondering if it's compatible with Kubernetes for scaling? 2) I'm using the Llama 2 7B and 13B versions, and if I want to support approximately 10 simultaneous users (at most), do you know what infrastructure size should be used (for example, 2x A100 40 GB), or at least how much GPU I should dedicate per user? Thanks a lot for the help 🙂
The starter pack is just something I threw together as an example. I probably wouldn't use it for production 🙂 I'd recommend using something like FastAPI for the server, and a vector DB integration to hold your index data (Qdrant, Postgres, Weaviate, Chroma, etc.) — something like the sketch below.
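Very roughly, something like this (just a sketch, assuming Qdrant as the vector store and the llama-index Qdrant integration; the collection name and Qdrant endpoint are placeholders, and it assumes you've already configured an LLM/embedding model, e.g. via Settings):

```python
# pip install llama-index llama-index-vector-stores-qdrant qdrant-client fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
import qdrant_client

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

app = FastAPI()

# Connect to a running Qdrant instance and load the existing index from it.
# Host/port and collection name are placeholders for your own setup.
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="my_docs")
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()

class Query(BaseModel):
    text: str

@app.post("/query")
def query(q: Query):
    # Each request hits the shared query engine; since the index data lives
    # in Qdrant rather than on local disk, you can run multiple replicas of
    # this server behind a load balancer (e.g. on Kubernetes).
    response = query_engine.query(q.text)
    return {"answer": str(response)}
```

Because the index state is externalized to the vector DB, the FastAPI layer itself is stateless and scales horizontally just fine.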
Alright, thanks for the help. I guess I'll do a rough estimate of the GPU consumption and will probably limit the number of iterations for chat purposes, something like clearing the cache after 10 iterations.
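Roughly what I have in mind for the iteration cap (just a sketch; generate(), MAX_TURNS, and the per-user history dict are my own placeholder names, not from any library):

```python
MAX_TURNS = 10
histories: dict[str, list[dict]] = {}

def generate(history: list[dict]) -> str:
    # Placeholder: swap in the actual model call (vLLM, TGI, llama.cpp, ...).
    return "..."

def chat(user_id: str, message: str) -> str:
    history = histories.setdefault(user_id, [])
    history.append({"role": "user", "content": message})

    answer = generate(history)
    history.append({"role": "assistant", "content": answer})

    # Cap memory growth: once a user hits MAX_TURNS exchanges (two entries
    # per turn), drop their history so the prompt/KV cache stays bounded.
    if len(history) >= 2 * MAX_TURNS:
        histories[user_id] = []
    return answer
```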