vLLM

I am trying to get a chat engine running with my vLLM OpenAI docker.
I can't find anything in the docs about using the vLLM OpenAI docker, but it should behave like using OpenAI.
But nothing I try works the way plain OpenAI does according to the LlamaIndex docs.
I can't use the OpenAI class from LlamaIndex, because I have to use the credentials from the vLLM docker, which has an empty API key, and that is not allowed.
I also can't pass in the OpenAI client from the openai package, which I do use for the vLLM docker and which does work.
The reason I can't use it is that I get this error:
Plain Text
service_context = ServiceContext.from_defaults(
  File "/usr/local/lib/python3.10/dist-packages/llama_index/service_context.py", line 184, in from_defaults
    llm_metadata=llm_predictor.metadata,
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llm_predictor/base.py", line 148, in metadata
    return self._llm.metadata
AttributeError: 'OpenAI' object has no attribute 'metadata'

My script is:
Use the OpenAILike class. You might have to set a dummy API key
Been meaning to add docs, but
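(For context: the AttributeError above comes from handing the raw openai-package client to ServiceContext, which expects a LlamaIndex LLM wrapper with a .metadata property. A minimal sketch of the OpenAILike suggestion, assuming the 0.9.x-style import used later in this thread and reusing the base URL and model name from the script further down:)
Plain Text
from llama_index.llms import OpenAILike

# Point OpenAILike at the vLLM OpenAI-compatible server.
# vLLM ignores the key, but the client refuses an empty one,
# so any dummy string works.
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base="http://172.20.0.3:8000/v1",
    api_key="fake",
)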
I believe I already found a way, I am testing it:
Plain Text
from llama_index.llms.vllm import Vllm
vllm = Vllm(api_url=openai_api_base, model=model)
service_context = ServiceContext.from_defaults(
    #llm = client,
    llm = vllm,
    embed_model=embed_model,
)

I tried something similar before, but I believe I also added the api_key, which is not actually needed.
Now I am getting this with a 7B LLM on a 24 GB GPU:
Plain Text
INFO 12-17 17:57:43 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.1', tokenizer='mistralai/Mistral-7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
INFO 12-17 17:58:17 llm_engine.py:222] # GPU blocks: 0, # CPU blocks: 2048
Traceback (most recent call last):
  File "/home/Josh-ee_Llama_RAG/vllm-openai.py", line 35, in <module>
    vllm =Vllm(api_url=openai_api_base, model=model)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 158, in __init__
    self._client = VLLModel(
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 246, in from_engine_args
    engine = cls(*engine_configs,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 226, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

Hope I am doing it right now, gonna check how I can fix this error
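(Aside: the knob that error names could, assuming this llama_index version's Vllm wrapper forwards engine kwargs via vllm_kwargs, be raised as sketched below, though it is unlikely to help here since the Docker server already holds most of the 24 GB:)
Plain Text
from llama_index.llms.vllm import Vllm

# Hypothetical tweak: let the locally spawned engine claim 90% of GPU memory.
vllm = Vllm(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    vllm_kwargs={"gpu_memory_utilization": 0.90},
)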
Current script
Yea, that seems like it's hitting vLLM now; now it's just a matter of setting up your vLLM server to not crash πŸ˜…
I think it is trying to load the LLM again, instead of interfacing with it? I don't know if that's how it should run. vLLM already loads the model before I call this script, and I feel like it's loading it again locally, or maybe something else loads?
Anyway, my PC can't handle it and crashes most of the time
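(That reading looks right: in this llama_index version, Vllm builds a local vllm.LLM engine and loads the weights in-process, which is why it fights the Docker container for the same GPU, while VllmServer is just an HTTP client for an already-running server. A rough sketch of the difference, with the model name and address used elsewhere in this thread; note that VllmServer targets the plain vllm api_server endpoint, not the OpenAI-compatible one:)
Plain Text
from llama_index.llms.vllm import Vllm, VllmServer

# Loads the model locally in this process -- competes with the Docker
# server for the same 24 GB GPU, hence "# GPU blocks: 0".
local_llm = Vllm(model="mistralai/Mistral-7B-Instruct-v0.1")

# Only sends HTTP requests; nothing is loaded in this process.
# It expects the vllm.entrypoints.api_server response format ({"text": [...]}).
remote_llm = VllmServer(
    api_url="http://172.20.0.3:8000/generate",
    model="mistralai/Mistral-7B-Instruct-v0.1",
)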
Awesome, I don't know how I messed that one up when I tried it before. Thank you! I should look more at the GitHub page ^^
I am nearly there; I get a similar error in my original code and in this example code when trying to run the chat engine now

This is the error from my example code:
Plain Text
response = chat_engine.chat("What did Paul Graham do growing up")
  File "/usr/local/lib/python3.10/dist-packages/llama_index/callbacks/utils.py", line 39, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/chat_engine/condense_plus_context.py", line 283, in chat
    chat_response = self._llm.chat(chat_messages)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 97, in wrapped_llm_chat
    f_return_val = f(_self, messages, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 225, in chat
    completion_response = self.complete(prompt, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 223, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 350, in complete
    output = get_response(response)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm_utils.py", line 9, in get_response
    return data["text"]
KeyError: 'text'
This is what I see on the vLLM server when running it
[Attachment: image.png]
Hmmm seems like api url is not quite correct? Seems like it didn't hit /v1/chat/completions
Ah okay, I see it needs the full URL, unlike the other method I used. I cleaned up the script to use the new method
Right now I am using the correct full URL, but it gives me 400 Bad Request. I don't know how I can check the request; I know the prompt and the model have to be passed
[Attachment: image.png]
Plain Text
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://172.20.0.3:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id
print(model)

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model_name = 'BAAI/bge-small-en-v1.5'
embed_model = HuggingFaceEmbedding(
    model_name=embed_model_name,
    device='cuda',
    normalize=True
)

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms.vllm import VllmServer

vllm = VllmServer(api_url=openai_api_base+"/chat/completions", model="mistralai/Mistral-7B-Instruct-v0.1")

service_context = ServiceContext.from_defaults(
    llm = vllm,
    embed_model=embed_model,
)

path = '/RAG_VectorDB/test/'
data = SimpleDirectoryReader(input_dir=path).load_data()
index = VectorStoreIndex.from_documents(data, service_context=service_context)

from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=3900)
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
    context_prompt=(
        "You are a chatbot, able to have normal interactions, as well as talk"
        " about an essay discussing Paul Grahams life."
        "Here are the relevant documents for the context:\n"
        "{context_str}"
        "\nInstruction: Use the previous chat history, or the context above, to interact and help the user."
    ),
    verbose=False,
)

response = chat_engine.chat("What did Paul Graham do growing up")
print(response)
Still giving the same error, even though it should have the correct URL now
Plain Text
Traceback (most recent call last):
  File "/home/Josh-ee_Llama_RAG/vllm-openai.py", line 54, in <module>
    response = chat_engine.chat("What did Paul Graham do growing up")
  File "/usr/local/lib/python3.10/dist-packages/llama_index/callbacks/utils.py", line 39, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/chat_engine/condense_plus_context.py", line 283, in chat
    chat_response = self._llm.chat(chat_messages)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 97, in wrapped_llm_chat
    f_return_val = f(_self, messages, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 225, in chat
    completion_response = self.complete(prompt, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 223, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 350, in complete
    output = get_response(response)
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm_utils.py", line 9, in get_response
    return data["text"]
KeyError: 'text'
Bad Request tells me the data passed to the server isn't in the right format or is missing something, but I don't know how to check it
I tried to make it work, but I can't figure out how, so I will wait until you have time. If you are going to call it directly, can you also show me the snippet? I must be doing something wrong πŸ˜„
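(One way to see why the endpoint answers 400 is to post the two payload shapes by hand: VllmServer sends an api_server-style body, a bare prompt plus sampling params, and then looks for a "text" key in the reply, while /v1/chat/completions wants an OpenAI-style body with model and messages and returns choices. A rough sketch of that mismatch, treating the exact fields as assumptions:)
Plain Text
import requests

url = "http://172.20.0.3:8000/v1/chat/completions"

# OpenAI-style body: this is what the server expects.
ok = requests.post(url, json={
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "What did Paul Graham do growing up?"}],
})
print("chat/completions:", ok.status_code)

# Roughly what VllmServer sends: a bare prompt plus sampling params.
# The OpenAI endpoint rejects it (400), and even a 200 would not contain
# the "text" key that vllm_utils.get_response looks for.
bad = requests.post(url, json={"prompt": "What did Paul Graham do growing up?", "max_tokens": 256})
print("prompt-style:", bad.status_code, bad.text)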
Will dig into this today!
That would be very lovely. I tried to call the vllm.complete function after initializing the VllmServer(api_url, model), and I also tried to call the HTTP endpoint directly; I think I got a bad request with one of them. Maybe I didn't specify the messages correctly, I don't know how the engine handles that anyway tbh, I probably should check that, but I hoped it would "just" work ^^ Anyway, I'm curious what you will find and how you find it
In my original code the error is slightly different. I didn't post that error, so here it is; maybe it gives more insight into what goes wrong:
Plain Text
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/chainlit/utils.py", line 39, in wrapper
    return await user_function(**params_values)
  File "/home/Josh-ee_Llama_RAG/test-gpu.py", line 170, in main
    for token in response.response_gen:
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/llm.py", line 46, in gen
    for response in completion_response_gen:
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/base.py", line 228, in wrapped_gen
    for x in f_return_val:
  File "/usr/local/lib/python3.10/dist-packages/llama_index/llms/vllm.py", line 373, in gen
    yield CompletionResponse(text=data["text"][0])
KeyError: 'text'
I changed the chat_engine.chat function to chat_engine._query_engine.query (which I use in my original code)
And that change does give me an output. When I print(vars(chat_engine)) I get this:
I am still unsure what is being sent to the server. Since it keeps saying Bad Request, I believe it has to do with what is being sent
[Attachment: image.png]
I am stuck on figuring out how to troubleshoot the generator, if that is where it goes wrong
I will be happy to just find out how to get a basic chat or query engine working on vLLM
Well, I have no idea what's wrong with the streaming, BUT if you start the server in OpenAI mode, you can use OpenAILike, which works much better


In another terminal
Plain Text
python -m vllm.entrypoints.openai.api_server --model "mistralai/Mistral-7B-Instruct-v0.1" --trust-remote-code



Then in your code
Plain Text
from llama_index.llms import OpenAILike
from llama_index.prompts import PromptTemplate

llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base="http://localhost:8000/v1",
    api_key="fake",
    api_type="fake",
    max_tokens=256,
    temperature=0.5,
    query_wrapper_prompt=PromptTemplate("<s>[INST] {query_str} [/INST] </s>\n")
)


I included the query wrapper template needed for Mistral Instruct. Chat models can also use the messages_to_prompt callback hook. There are some examples in the notebooks here for these settings with different LLMs:
https://docs.llamaindex.ai/en/stable/module_guides/models/llms.html#open-source-llms
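(For the messages_to_prompt hook mentioned above, a hedged sketch of what a Mistral-Instruct-style formatter could look like, assuming the hook is accepted as a constructor argument here as the linked docs show for other LLMs; the exact template is an assumption, so check the model card for the canonical format:)
Plain Text
from llama_index.llms import OpenAILike

def messages_to_prompt(messages):
    # Very rough Mistral-Instruct formatting: wrap non-assistant turns in
    # [INST] ... [/INST] and close assistant turns with </s>.
    prompt = "<s>"
    for m in messages:
        if m.role == "assistant":
            prompt += f" {m.content}</s>"
        else:
            prompt += f"[INST] {m.content} [/INST]"
    return prompt

llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base="http://localhost:8000/v1",
    api_key="fake",
    messages_to_prompt=messages_to_prompt,
)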
Probably VllmServer should just extend OpenAILike to make life easy
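(For completeness, the working end state presumably looks like the earlier script with only the LLM swapped out for the OpenAILike instance above; a sketch under that assumption, reusing the embed_model, data, and memory from that script:)
Plain Text
service_context = ServiceContext.from_defaults(
    llm=llm,  # the OpenAILike instance from the snippet above
    embed_model=embed_model,
)
index = VectorStoreIndex.from_documents(data, service_context=service_context)
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context", memory=memory)
print(chat_engine.chat("What did Paul Graham do growing up"))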
Thank you so much ❀️ It is working as expected; I just gotta clean up a few things, but this is going in the right direction. I can now ask questions from multiple clients and it works just fine πŸ˜„ Thank you so much for your time and patience