
Hello all,
I’m wondering if there is a way to run LLM calls in parallel when using a local model. I’m using Llama 13B on a single V100 GPU and I’ve tried multithreading, but it gives the same runtime as asynchronous calls. Did someone successfully manage to run several processes at the same time on a single GPU? Thanks in advance
Hmm, it's pretty tricky
you may have to look into using something like Ray
tbh I'm not even sure if it's possible lol
I think you can open an issue here to start:
I think it's a general limitation with PyTorch/transformer models: parallel inference usually means parallel copies of the model, or some other sneaky tricks that libraries like Ray can do
I see, would batching make inference faster?
batching would be faster, yes (assuming you have the memory for it)
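For reference, batched generation with plain transformers looks roughly like this (a minimal sketch; the model name, prompts, and generation settings are placeholders, not from this thread):

```python
# Minimal sketch: one batched generate() call instead of several sequential calls.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # placeholder 13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompts = [
    "Summarize the theory of relativity in one sentence.",
    "Write a haiku about GPUs.",
    "Explain why batching speeds up LLM inference.",
]

# Tokenize all prompts together; padding lets them share one forward pass.
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```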

Normally, people run their LLM behind a server that handles batching (TorchServe, Hugging Face TGI, vLLM)
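If you go the vLLM route, its offline batched entry point looks roughly like this (a minimal sketch; the model name and sampling settings are assumptions, and vLLM also provides an OpenAI-compatible server if you'd rather call it over HTTP):

```python
# Minimal sketch of vLLM offline batched inference.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-hf")  # placeholder 13B checkpoint
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Summarize the theory of relativity in one sentence.",
    "Write a haiku about GPUs.",
    "Explain why batching speeds up LLM inference.",
]

# generate() takes the whole list at once and schedules the requests together.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```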
Are there examples out there with vLLM and LlamaIndex, or are you guys considering implementing vLLM support for llama_index?
in progress: https://github.com/run-llama/llama_index/pull/7973

But you can use the vLLM integration from LangChain in LlamaIndex too
it's just a bit more... quirky that way haha, but it basically works out of the box (see the sketch below)
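Wiring it up looks roughly like this (a rough sketch; the import paths match LlamaIndex/LangChain versions around the time of this thread and may have moved since, and the model name, data directory, and query are placeholders):

```python
# Rough sketch: LangChain's vLLM wrapper plugged into LlamaIndex via LangChainLLM.
from langchain.llms import VLLM
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import LangChainLLM

lc_llm = VLLM(
    model="meta-llama/Llama-2-13b-hf",  # placeholder 13B checkpoint
    max_new_tokens=256,
    temperature=0.8,
)

# Wrap the LangChain LLM so LlamaIndex can use it like any other LLM.
llm = LangChainLLM(llm=lc_llm)

# embed_model="local" keeps embeddings on a local model instead of an API.
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data dir
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

print(index.as_query_engine().query("What do these docs say about batching?"))
```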
Great, thanks, I’ll try it this way!