Hello, I'm trying to deploy my LLM and I have a couple of questions about it. 1) First, I've seen the LlamaIndex starter pack and I was wondering if it's compatible with Kubernetes for scaling? 2) I'm using the Llama 2 7B and 13B versions, and if I want to support approximately 10 simultaneous users (at most), do you know what infrastructure size should be used (for example, 2x A100 40 GB), or at least how much GPU I should dedicate per user? Thanks a lot for the help 🙂
The starter pack is just something I threw together as an example. I probably wouldn't use it for production 🙂 I'd recommend using something like FastAPI for the server, and a vector DB integration to hold your index data (Qdrant, Postgres, Weaviate, Chroma, etc.) — something like the sketch below.
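Very roughly, something like this (just a sketch, assuming Qdrant as the vector store and the llama-index Qdrant integration; the collection name and Qdrant endpoint are placeholders, and it assumes you've already configured an LLM/embedding model, e.g. via Settings):

```python
# pip install llama-index llama-index-vector-stores-qdrant qdrant-client fastapi uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
import qdrant_client

from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

app = FastAPI()

# Connect to a running Qdrant instance and load the existing index from it.
# Host/port and collection name are placeholders for your own setup.
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="my_docs")
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine()

class Query(BaseModel):
    text: str

@app.post("/query")
def query(q: Query):
    # Each request hits the shared query engine; since the index data lives
    # in Qdrant rather than on local disk, you can run multiple replicas of
    # this server behind a load balancer (e.g. on Kubernetes).
    response = query_engine.query(q.text)
    return {"answer": str(response)}
```

Because the index state is externalized to the vector DB, the FastAPI layer itself is stateless and scales horizontally just fine.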
Alright, thanks for the help. I guess I'll do a rough estimate of the GPU consumption and will probably limit the number of iterations for chat purposes, something like clearing the cache after 10 iterations.
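Roughly what I have in mind for the iteration cap (just a sketch; generate(), MAX_TURNS, and the per-user history dict are my own placeholder names, not from any library):

```python
MAX_TURNS = 10
histories: dict[str, list[dict]] = {}

def generate(history: list[dict]) -> str:
    # Placeholder: swap in the actual model call (vLLM, TGI, llama.cpp, ...).
    return "..."

def chat(user_id: str, message: str) -> str:
    history = histories.setdefault(user_id, [])
    history.append({"role": "user", "content": message})

    answer = generate(history)
    history.append({"role": "assistant", "content": answer})

    # Cap memory growth: once a user hits MAX_TURNS exchanges (two entries
    # per turn), drop their history so the prompt/KV cache stays bounded.
    if len(history) >= 2 * MAX_TURNS:
        histories[user_id] = []
    return answer
```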