Hello

Hello,
I'm trying to deploy my LLM and I have a couple of questions about it.
1) First, I've seen the LlamaIndex starter pack and was wondering whether it is compatible with Kubernetes for scaling?
2) I'm using the Llama 2 7B and 13B versions, and if I want to support approximately 10 simultaneous users (at most), do you know what infrastructure size I should use (for example, 2x A100 40 GB), or at least how much GPU I should dedicate per user?
Thanks a lot for the help 🙂
The starter pack is just something I threw together as an example. I probably wouldn't use it for production 😅 I'd recommend using something like FastAPI for the server, and a vector DB integration to hold your index data (Qdrant, Postgres, Weaviate, Chroma, etc.)
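A minimal sketch of that setup, assuming the pre-0.10 `llama_index` package layout; the Qdrant host, the `my_docs` collection name, and the `./data` path are placeholders:

```python
# Minimal sketch: FastAPI server serving a LlamaIndex index stored in Qdrant.
# Assumes the legacy (pre-0.10) llama_index package layout.
import qdrant_client
from fastapi import FastAPI
from llama_index import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores import QdrantVectorStore

app = FastAPI()

# Connect to a Qdrant instance (e.g. a Service inside the cluster).
client = qdrant_client.QdrantClient(host="qdrant", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="my_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the index from local documents at startup.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
query_engine = index.as_query_engine()

@app.get("/query")
def query(q: str):
    # Every replica talks to the same external Qdrant, so the pods stay stateless.
    response = query_engine.query(q)
    return {"answer": str(response)}
```

Because the index data lives in Qdrant rather than on the pod's disk, the FastAPI replicas stay stateless and can be scaled horizontally with an ordinary Kubernetes Deployment behind a Service, which is what makes this more Kubernetes-friendly than the starter pack.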
Not sure on the resource requirements there for llama2 🤔
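For a rough sense of the second question, here's a back-of-envelope VRAM estimate; a sketch only, assuming fp16 weights and KV cache, one full 4096-token context per user, no quantization or paged attention, and the published Llama 2 shapes (7B: 32 layers / 4096 hidden; 13B: 40 layers / 5120 hidden):

```python
# Back-of-envelope VRAM estimate for serving Llama 2 to N concurrent users.
# Assumptions (not measurements): fp16 weights and KV cache, one full
# 4096-token context per user, no quantization or paged attention.
GIB = 1024**3

def vram_estimate_gib(n_params, n_layers, hidden_dim, n_users, ctx_len=4096):
    weights = n_params * 2                      # 2 bytes per fp16 parameter
    # KV cache per token: keys + values, one pair per layer, fp16.
    kv_per_token = 2 * n_layers * hidden_dim * 2
    kv_cache = kv_per_token * ctx_len * n_users
    return (weights + kv_cache) / GIB

# Published Llama 2 shapes: 7B -> 32 layers / 4096 hidden, 13B -> 40 / 5120.
print(f"7B,  10 users: ~{vram_estimate_gib(7e9, 32, 4096, 10):.0f} GiB")
print(f"13B, 10 users: ~{vram_estimate_gib(13e9, 40, 5120, 10):.0f} GiB")
```

On those assumptions, 7B with 10 full contexts lands around 33 GiB and 13B around 56 GiB, so 13B would want two A100 40 GB (or one A100 80 GB). In practice, quantization or serving stacks with paged KV caches can cut this substantially, and most users won't hold a full 4096-token context at once.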
Alright, thanks for the help. I guess I'll do a rough estimate of the GPU consumption and will probably limit the number of turns for chat purposes, something like clearing the cache after 10 turns.
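A minimal sketch of that turn-capping idea, assuming the legacy `llama_index` chat engine and reusing the `index` built in the earlier sketch; `MAX_TURNS = 10` is just the figure mentioned above:

```python
# Sketch: reset the chat engine's memory every N turns so the prompt/KV
# size stays bounded. Assumes `index` is the VectorStoreIndex from above.
from llama_index.memory import ChatMemoryBuffer

MAX_TURNS = 10

memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)
turns = 0

def chat(message: str) -> str:
    global turns
    turns += 1
    if turns > MAX_TURNS:
        chat_engine.reset()  # drop accumulated history after 10 turns
        turns = 1
    return str(chat_engine.chat(message))
```

Capping turns mainly bounds per-conversation prompt growth; note that the token-limited memory buffer already achieves a similar effect without a hard turn count.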