Hello all, I’m wondering if there is a way to run LLM calls in parallel when using a local model. I’m using Llama 13B on a single V100 GPU and I’ve tried multithreading, but it gives the same runtime as when I was using asynchronous calls. Has anyone successfully managed to run several processes at the same time on a single GPU? Thanks in advance
I think it's a general limitation with PyTorch/Transformers models: parallel inference usually means running parallel copies of the model, or some other sneaky tricks that libraries like Ray can do.
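To illustrate what "parallel copies of the model" looks like in practice, here is a minimal sketch using Ray actors with a Hugging Face model. The model id, replica count, and prompts are placeholders, and note the catch: two copies of a 13B model generally won't fit in a single V100's memory, which is exactly the limitation being described, so this pattern mostly pays off with smaller models or multiple GPUs.

```python
# Minimal sketch: parallel inference via parallel model copies using Ray actors.
# Assumes ray, torch, and transformers are installed, and that the model actually
# fits twice on the GPU (usually NOT the case for 13B on a single V100).
import ray
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ray.init()

@ray.remote(num_gpus=0.5)  # two replicas share the single GPU
class LlamaWorker:
    def __init__(self, model_name="meta-llama/Llama-2-13b-hf"):  # placeholder model id
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float16
        ).cuda()

    def generate(self, prompt, max_new_tokens=64):
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        output = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.tokenizer.decode(output[0], skip_special_tokens=True)

# Each actor holds its own copy of the model and serves requests independently.
workers = [LlamaWorker.remote() for _ in range(2)]
prompts = ["First prompt ...", "Second prompt ..."]
futures = [w.generate.remote(p) for w, p in zip(workers, prompts)]
print(ray.get(futures))
```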