
Hello all,
I’m wondering if there is a way to run LLM calls in parallel when using a local model. I’m using Llama 13B on a single V100 GPU and I’ve tried multithreading, but it gives the same runtime as asynchronous calls. Did someone successfully manage to run several processes at the same time on a single GPU? Thanks in advance
Hmm, it's pretty tricky
you may have to look into using something like Ray
tbh I'm not even sure if it's possible lol
I think you can open an issue here to start:
I think it's a general limitation with PyTorch/transformer models: parallel inference usually means parallel copies of the model, or some other sneaky tricks that libraries like Ray can do
I see, would batching make inference faster?
batching would be faster, yes (assuming you have the memory for it)
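For reference, batched generation with plain transformers looks roughly like this (a minimal sketch; the model name, prompts, and generation settings are placeholders, not from this thread):

```python
# Minimal sketch: one batched generate() call instead of several sequential calls.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-hf"  # placeholder 13B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

prompts = [
    "Summarize the theory of relativity in one sentence.",
    "Write a haiku about GPUs.",
    "Explain why batching speeds up LLM inference.",
]

# Tokenize all prompts together; padding lets them share one forward pass.
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)

for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```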

Normally, people run their LLM behind a server that handles batching (TorchServe, Hugging Face TGI, vLLM)
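If you go the vLLM route, its offline batched entry point looks roughly like this (a minimal sketch; the model name and sampling settings are assumptions, and vLLM also provides an OpenAI-compatible server if you'd rather call it over HTTP):

```python
# Minimal sketch of vLLM offline batched inference.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-hf")  # placeholder 13B checkpoint
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Summarize the theory of relativity in one sentence.",
    "Write a haiku about GPUs.",
    "Explain why batching speeds up LLM inference.",
]

# generate() takes the whole list at once and schedules the requests together.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```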
Are there examples out there with vLLM and LlamaIndex, or are you guys considering implementing vLLM support for llama_index?
in progress: https://github.com/run-llama/llama_index/pull/7973

But you can use the vLLM integration from LangChain in LlamaIndex too
it's just a bit more... quirky that way haha, but it basically works out of the box (see the sketch below)
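Wiring it up looks roughly like this (a rough sketch; the import paths match LlamaIndex/LangChain versions around the time of this thread and may have moved since, and the model name, data directory, and query are placeholders):

```python
# Rough sketch: LangChain's vLLM wrapper plugged into LlamaIndex via LangChainLLM.
from langchain.llms import VLLM
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import LangChainLLM

lc_llm = VLLM(
    model="meta-llama/Llama-2-13b-hf",  # placeholder 13B checkpoint
    max_new_tokens=256,
    temperature=0.8,
)

# Wrap the LangChain LLM so LlamaIndex can use it like any other LLM.
llm = LangChainLLM(llm=lc_llm)

# embed_model="local" keeps embeddings on a local model instead of an API.
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data dir
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

print(index.as_query_engine().query("What do these docs say about batching?"))
```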
Great, thanks, I’ll try it this way!