vLLM

I am trying to get a chat engine running with my vLLM OpenAI docker.
I can't find anything in the docs about using the vLLM OpenAI docker, but it should behave like using OpenAI.
But nothing I try works the way plain OpenAI does according to the LlamaIndex docs.
I can't use the OpenAI class from LlamaIndex, because I have to use the credentials from the vLLM docker, which has an empty API key, and that is not allowed.
I also can't pass in the OpenAI client from the openai package, which I do use for the vLLM docker and which does work.
The reason I can't use it is that I get this error:
Plain Text
service_context = ServiceContext.from_defaults(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/service_context.py", line 184, in from_defaults
    llm_metadata=llm_predictor.metadata,
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llm_predictor/base.py", line 148, in metadata
    return self._llm.metadata
AttributeError: 'OpenAI' object has no attribute 'metadata'

My script is:
Use the OpenAILike class. You might have to set a dummy API key
Been meaning to add docs, but
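(For context: the AttributeError above comes from handing the raw openai-package client to ServiceContext, which expects a LlamaIndex LLM wrapper with a .metadata property. A minimal sketch of the OpenAILike suggestion, assuming the 0.9.x-style import used later in this thread and reusing the base URL and model name from the script further down:)
Plain Text
from llama_index.llms import OpenAILike

# Point OpenAILike at the vLLM OpenAI-compatible server.
# vLLM ignores the key, but the client refuses an empty one,
# so any dummy string works.
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base="http://172.20.0.3:8000/v1",
    api_key="fake",
)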
I believe I already found a way, I am testing it:
Plain Text
from llama_index.llms.vllm import Vllm
vllm = Vllm(api_url=openai_api_base, model=model)
service_context = ServiceContext.from_defaults(
    #llm = client,
    llm = vllm,
    embed_model=embed_model,
)

I tried something similar before, but I believe I also added the api_key, which is not actually needed.
Now I am getting this with a 7B LLM on a 24 GB GPU:
Plain Text
INFO 12-17 17:57:43 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.1', tokenizer='mistralai/Mistral-7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 12-17 17:58:17 llm_engine.py:222] # GPU blocks: 0, # CPU blocks: 2048
Traceback (most recent call last):
  File "/home/Josh-ee_Llama_RAG/vllm-openai.py", line 35, in <module>
    vllm =Vllm(api_url=openai_api_base, model=model)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 158, in __init__
    self._client = VLLModel(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 246, in from_engine_args
    engine = cls(*engine_configs,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 226, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

Hope I am doing it right now, gonna check how I can fix this error
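(Aside: the knob that error names could, assuming this llama_index version's Vllm wrapper forwards engine kwargs via vllm_kwargs, be raised as sketched below, though it is unlikely to help here since the Docker server already holds most of the 24 GB:)
Plain Text
from llama_index.llms.vllm import Vllm

# Hypothetical tweak: let the locally spawned engine claim 90% of GPU memory.
vllm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    vllm_kwargs={"gpu_memory_utilization": 0.90},
)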
Current script
Yea, that seems like it's hitting vLLM now; now it's just a matter of setting up your vLLM server to not crash πŸ˜…
I think it is trying to load the LLM again, instead of interfacing with it? I don't know if that's how it should run. vLLM already loads the model before I call this script, and I feel like it's loading it again locally, or maybe something else loads?
Anyway, my PC can't handle it and crashes most of the time
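(That reading looks right: in this llama_index version, Vllm builds a local vllm.LLM engine and loads the weights in-process, which is why it fights the Docker container for the same GPU, while VllmServer is just an HTTP client for an already-running server. A rough sketch of the difference, with the model name and address used elsewhere in this thread; note that VllmServer targets the plain vllm api_server endpoint, not the OpenAI-compatible one:)
Plain Text
from llama_index.llms.vllm import Vllm, VllmServer

# Loads the model locally in this process -- competes with the Docker
# server for the same 24 GB GPU, hence "# GPU blocks: 0".
local_llm = Vllm(model="mistralai/Mistral-7B-Instruct-v0.1")

# Only sends HTTP requests; nothing is loaded in this process.
# It expects the vllm.entrypoints.api_server response format ({"text": [...]}).
remote_llm = VllmServer(
    api_url="http://172.20.0.3:8000/generate",
    model="mistralai/Mistral-7B-Instruct-v0.1",
)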
Awesome, I don't know how I messed that one up when I tried it before. Thank you! I should look more at the GitHub page ^^
I am nearly there; I get a similar error in my original code and in this example code when trying to run the chat engine now

This is the error from my example code:
Plain Text
response = chat_engine.chat("What did Paul Graham do growing up")
  File "/usr/local/lib/python3.10/dist-packages/llama_index/callbacks/utils.py", line 39, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/chat_engine/condense_plus_context.py", line 283, in chat
    chat_response = self._llm.chat(chat_messages)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 97, in wrapped_llm_chat
    f_return_val = f(_self, messages, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 225, in chat
    completion_response = self.complete(prompt, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 223, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 350, in complete
    output = get_response(response)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm_utils.py", line 9, in get_response
    return data["text"]
KeyError: 'text'
This is what I see on the vLLM server when running it
[Attachment: image.png]
Hmmm seems like api url is not quite correct? Seems like it didn't hit /v1/chat/completions
Ah okay, I see it needs the full URL, unlike the other method I used. I cleaned up the script to use the new method
Right now I am using the correct full URL, but it gives me 400 Bad Request. I don't know how I can check the request; I know the prompt and the model have to be passed
[Attachment: image.png]
Plain Text
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://172.20.0.3:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print(model)

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model_name = 'BAAI/bge-small-en-v1.5'
embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    device='cuda',
    normalize=True
)

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms.vllm import VllmServer

vllm = VllmServer(api_url=openai_api_base+"/chat/completions", model="mistralai/Mistral-7B-Instruct-v0.1")

service_context = ServiceContext.from_defaults(
    llm = vllm,
    embed_model=embed_model,
)

path = '/RAG_VectorDB/test/'
data = SimpleDirectoryReader(input_dir=path).load_data()
index = VectorStoreIndex.from_documents(data, service_context=service_context)

from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=3900)
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
    context_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about an essay discussing Paul Grahams life."
        "Here are the relevant documents for the context:\n"
        "{context_str}"
        "\nInstruction: Use the previous chat history, or the context above, to interact and help the user."
    ),
    verbose=False,
)

response = chat_engine.chat("What did Paul Graham do growing up")
print(response)
Still giving the same error, even though it should have the correct URL now
Plain Text
Traceback (most recent call last):
  File "/home/Josh-ee_Llama_RAG/vllm-openai.py", line 54, in <module>
    response = chat_engine.chat("What did Paul Graham do growing up")
  File "/usr/local/lib/python3.10/dist-packages/llama_index/callbacks/utils.py", line 39, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/chat_engine/condense_plus_context.py", line 283, in chat
    chat_response = self._llm.chat(chat_messages)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 97, in wrapped_llm_chat
    f_return_val = f(_self, messages, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 225, in chat
    completion_response = self.complete(prompt, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 223, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 350, in complete
    output = get_response(response)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm_utils.py", line 9, in get_response
    return data["text"]
KeyError: 'text'
Bad Request tells me the data passed to the server isn't in the right format or is missing something, but I don't know how to check it
I tried to make it work, but I can't figure out how, so I will wait until you have time. If you are going to call it directly, can you also show me the snippet? I must be doing something wrong πŸ˜„
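(One way to see why the endpoint answers 400 is to post the two payload shapes by hand: VllmServer sends an api_server-style body, a bare prompt plus sampling params, and then looks for a "text" key in the reply, while /v1/chat/completions wants an OpenAI-style body with model and messages and returns choices. A rough sketch of that mismatch, treating the exact fields as assumptions:)
Plain Text
import requests

url = "http://172.20.0.3:8000/v1/chat/completions"

# OpenAI-style body: this is what the server expects.
ok = requests.post(url, json={
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "What did Paul Graham do growing up?"}],
})
print("chat/completions:", ok.status_code)

# Roughly what VllmServer sends: a bare prompt plus sampling params.
# The OpenAI endpoint rejects it (400), and even a 200 would not contain
# the "text" key that vllm_utils.get_response looks for.
bad = requests.post(url, json={"prompt": "What did Paul Graham do growing up?", "max_tokens": 256})
print("prompt-style:", bad.status_code, bad.text)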
Will dig into this today!
That would be very lovely. I tried to call the vllm.complete function after initializing the VllmServer(api_url, model), and I also tried to call the HTTP endpoint directly; I think I got a bad request with one of them. Maybe I didn't specify the messages correctly, I don't know how the engine handles that anyway tbh, I probably should check that, but I hoped it would "just" work ^^ Anyway, I'm curious what you will find and how you find it
In my original code the error is slightly different. I didn't post that error, so here it is; maybe it gives more insight into what goes wrong:
Plain Text
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/chainlit/utils.py", line 39, in wrapper
    return await user_function(**params_values)
  File "/home/Josh-ee_Llama_RAG/test-gpu.py", line 170, in main
    for token in response.response_gen:
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/llm.py", line 46, in gen
    for response in completion_response_gen:
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 228, in wrapped_gen
    for x in f_return_val:
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 373, in gen
    yield CompletionResponse(text=data["text"][0])
KeyError: 'text'
I changed the chat_engine.chat function to chat_engine._query_engine.query (which I use in my original code)
And that change does give me an output. When I print(vars(chat_engine)) I get this:
I am still unsure what is being sent to the server. Since it keeps saying Bad Request, I believe it has to do with what is being sent
[Attachment: image.png]
I am stuck on figuring out how to troubleshoot the generator, if that is where it goes wrong
I will be happy to just find out how to get a basic chat or query engine working on vLLM
Well, I have no idea what's wrong with the streaming, BUT if you start the server in OpenAI mode, you can use OpenAILike, which works much better


In another terminal
Plain Text
python -m vllm.entrypoints.openai.api_server --model "mistralai/Mistral-7B-Instruct-v0.1" --trust-remote-code



Then in your code
Plain Text
from llama_index.llms import OpenAILike
from llama_index.prompts import PromptTemplate

llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base="http://localhost:8000/v1",
    api_key="fake",
    api_type="fake",
    max_tokens=256,
    temperature=0.5,
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n")
)


I included the query wrapper template needed for Mistral Instruct. Chat models can also use the messages_to_prompt callback hook. There are some examples in the notebooks here for these settings with different LLMs:
https://docs.llamaindex.ai/en/stable/module_guides/models/llms.html#open-source-llms
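(For the messages_to_prompt hook mentioned above, a hedged sketch of what a Mistral-Instruct-style formatter could look like, assuming the hook is accepted as a constructor argument here as the linked docs show for other LLMs; the exact template is an assumption, so check the model card for the canonical format:)
Plain Text
from llama_index.llms import OpenAILike

def messages_to_prompt(messages):
    # Very rough Mistral-Instruct formatting: wrap non-assistant turns in
    # [INST] ... [/INST] and close assistant turns with </s>.
    prompt = "<s>"
    for m in messages:
        if m.role == "assistant":
            prompt += f" {m.content}</s>"
        else:
            prompt += f"[INST] {m.content} [/INST]"
    return prompt

llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base="http://localhost:8000/v1",
    api_key="fake",
    messages_to_prompt=messages_to_prompt,
)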
Probably VllmServer should just extend OpenAILike to make life easy
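(For completeness, the working end state presumably looks like the earlier script with only the LLM swapped out for the OpenAILike instance above; a sketch under that assumption, reusing the embed_model, data, and memory from that script:)
Plain Text
service_context = ServiceContext.from_defaults(
    llm=llm,  # the OpenAILike instance from the snippet above
    embed_model=embed_model,
)
index = VectorStoreIndex.from_documents(data, service_context=service_context)
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context", memory=memory)
print(chat_engine.chat("What did Paul Graham do growing up"))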
Thank you so much ❀️ It is working as expected; I just gotta clean up a few things, but this is going in the right direction. I can now ask questions from multiple clients and it works just fine πŸ˜„ Thank you so much for your time and patience