Hey people.

How can I take advantage of vLLM's continuous batching with the llama_index LLM? I want to do metadata extraction and summarization on a large number of documents on a rented A100 as fast as possible.

Using Mistral-7B-Instruct-v0.2 as my LLM with the following container.

Plain Text
docker run --gpus all -p 8000:8000 ghcr.io/mistralai/mistral-src/vllm:latest \
    --host 0.0.0.0 \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --tensor-parallel-size 1
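
In case it's useful, here's a rough sketch of pointing a llama_index LLM at that server, assuming the container exposes vLLM's OpenAI-compatible API on port 8000. The OpenAILike import path depends on your llama_index version, so treat this as a starting point rather than a definitive recipe:

Python
# Sketch: connect llama_index to the vLLM OpenAI-compatible server started above.
# Import path varies by llama_index version; this assumes a recent release.
from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    api_base="http://localhost:8000/v1",  # the container from the docker command
    api_key="not-needed",                 # vLLM doesn't check the key by default
    is_chat_model=True,
    max_tokens=512,
)

print(llm.complete("Summarize: vLLM uses continuous batching to serve many requests."))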
tbh I have no idea how their batching works

But I do know that metadata extraction will now make a bunch of requests async. It's limited to a set num_workers, though (the default is 4 concurrent requests).
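
If that's the limiter, something like this sketch should raise the concurrency so vLLM's continuous batching has a deeper queue to chew on. It assumes the llm object from the snippet above, a documents list you've already loaded, and recent llama_index import paths; the built-in extractors do take num_workers:

Python
# Sketch: raise the number of concurrent extraction requests per extractor.
from llama_index.core.extractors import SummaryExtractor, QuestionsAnsweredExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024),
        SummaryExtractor(llm=llm, num_workers=16),           # default is 4
        QuestionsAnsweredExtractor(llm=llm, num_workers=16),
    ]
)

nodes = pipeline.run(documents=documents)  # LLM calls run async inside each extractor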
I am using the DocumentSummaryIndex, so I believe the bottleneck is the response synthesis in tree_summarize mode.

Running a vLLM server on an A100
Plain Text
Time to synthesize: 211.95126565685496
Time to add to docstore: 0.6406434010714293
Time to embed: 2.62248971988447


Also, looking at the indexes, there is probably an opportunity to add some more async support for insert so that the asynthesize and aembed methods are called on insert.
I think it makes sense to split all these things up into separate scripts and try to batch them.
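
A rough sketch of what that batching script could look like: fire the summary calls concurrently so vLLM's continuous batching stays busy. It assumes the OpenAILike llm from above (which supports acomplete) and a documents list; the prompt and concurrency cap are just placeholders:

Python
# Sketch: batch the summarization step with asyncio so many requests are in
# flight at once; vLLM's continuous batching handles scheduling on the GPU.
import asyncio

async def summarize_all(llm, texts, max_in_flight=32):
    sem = asyncio.Semaphore(max_in_flight)  # cap concurrent requests

    async def one(text):
        async with sem:
            resp = await llm.acomplete(f"Summarize the following:\n\n{text}")
            return resp.text

    return await asyncio.gather(*(one(t) for t in texts))

# summaries = asyncio.run(summarize_all(llm, [d.text for d in documents]))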
I think if you pass in a TreeSummarize response synthesizer with use_async=True set, it should be quite a bit faster.
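
Roughly like this (a sketch, assuming the llm and documents from earlier and recent llama_index import paths):

Python
# Sketch: build the DocumentSummaryIndex with an async tree_summarize synthesizer.
from llama_index.core import DocumentSummaryIndex, get_response_synthesizer

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
    use_async=True,  # fan out the intermediate summary calls concurrently
)

index = DocumentSummaryIndex.from_documents(
    documents,
    llm=llm,
    response_synthesizer=response_synthesizer,
    show_progress=True,
)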
Currently already doing that.
Hi @Wizboar, I'm currently looking for a resource on using vLLM with llama_index and Llama 2. Can you share some pointers on where to start? If possible, a sample notebook would help.