Hey people.

How can I take advantage of vLLM's continuous batching with the llama_index LLM? I want to do metadata extraction and summarization on a large number of documents on a rented A100 as fast as possible.

Using Mistral-7B-Instruct-v0.2 as my LLM with the following container.

Plain Text
docker run --gpus all -p 8000:8000 ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --tensor-parallel-size 1
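
In case it's useful, here's a rough sketch of pointing a llama_index LLM at that server, assuming the container exposes vLLM's OpenAI-compatible API on port 8000. The OpenAILike import path depends on your llama_index version, so treat this as a starting point rather than a definitive recipe:

Python
# Sketch: connect llama_index to the vLLM OpenAI-compatible server started above.
# Import path varies by llama_index version; this assumes a recent release.
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    api_base="http://localhost:8000/v1",  # the container from the docker command
    api_key="not-needed",                 # vLLM doesn't check the key by default
    is_chat_model=True,
    max_tokens=512,
)

print(llm.complete("Summarize: vLLM uses continuous batching to serve many requests."))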
tbh I have no idea how their batching works

But I do know that metadata extraction will now make a bunch of requests async. It's limited to a set num_workers, though (the default is 4 concurrent requests).
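
If that's the limiter, something like this sketch should raise the concurrency so vLLM's continuous batching has a deeper queue to chew on. It assumes the llm object from the snippet above, a documents list you've already loaded, and recent llama_index import paths; the built-in extractors do take num_workers:

Python
# Sketch: raise the number of concurrent extraction requests per extractor.
from llama_index.core.extractors import SummaryExtractor, QuestionsAnsweredExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024),
        SummaryExtractor(llm=llm, num_workers=16),           # default is 4
        QuestionsAnsweredExtractor(llm=llm, num_workers=16),
    ]
)

nodes = pipeline.run(documents=documents)  # LLM calls run async inside each extractor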
I am using the DocumentSummaryIndex, so I believe the bottleneck is the response synthesis in tree_summarize mode.

Running a vLLM server on an A100
Plain Text
Time to synthesize: 211.95126565685496
Time to add to docstore: 0.6406434010714293
Time to embed: 2.62248971988447


Also, looking at the indexes, there is probably an opportunity to add some more async support for insert so that the asynthesize and aembed methods are called on insert.
I think it makes sense to split all these things up into separate scripts and try to batch them.
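
A rough sketch of what that batching script could look like: fire the summary calls concurrently so vLLM's continuous batching stays busy. It assumes the OpenAILike llm from above (which supports acomplete) and a documents list; the prompt and concurrency cap are just placeholders:

Python
# Sketch: batch the summarization step with asyncio so many requests are in
# flight at once; vLLM's continuous batching handles scheduling on the GPU.
import asyncio

async def summarize_all(llm, texts, max_in_flight=32):
    sem = asyncio.Semaphore(max_in_flight)  # cap concurrent requests

    async def one(text):
        async with sem:
            resp = await llm.acomplete(f"Summarize the following:\n\n{text}")
            return resp.text

    return await asyncio.gather(*(one(t) for t in texts))

# summaries = asyncio.run(summarize_all(llm, [d.text for d in documents]))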
I think if you pass in a TreeSummarize response synthesizer with use_async=True set, it should be quite a bit faster.
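
Roughly like this (a sketch, assuming the llm and documents from earlier and recent llama_index import paths):

Python
# Sketch: build the DocumentSummaryIndex with an async tree_summarize synthesizer.
from llama_index.core import DocumentSummaryIndex, get_response_synthesizer

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    use_async=True,  # fan out the intermediate summary calls concurrently
)

index = DocumentSummaryIndex.from_documents(
    documents,
    llm=llm,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)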
Currently already doing that.
Hi @Wizboar, I'm currently looking for a resource on using vLLM with llama_index and Llama 2. Can you share some pointers on where to start? If possible, a sample notebook would help.