Use vLLM over the OpenAI API -- if you aren't serving vLLM behind a server, async is not possible
@Logan M I’m using vLLM over the OpenAI API. It works fine in LangChain, but I don’t know why I can’t make the streaming work in LlamaIndex
from llama_index.llms import OpenAILike

llm = OpenAILike(
    model="Qwen/Qwen-1_8B",
    api_base="http://localhost:8000/v1",
    api_key="fake",
    api_type="fake",
    max_tokens=256,
    temperature=0.5,
)
You defined the LLM something like that, using the LlamaIndex LLM class?
@Logan M I’ve defined it using the LangChain vLLM class and then the LangChainLLM wrapper in LlamaIndex, could it be because of this?
Also, I’m wondering why you defined a check in the LlamaIndex OpenAI class that requires the model name to be an OpenAI model? I think the vLLM API works directly with that class, no?
that could definitely be the issue
I'm not sure what you mean by this? If you are using an openai-like API, use the OpenAILike class
It’s still not working. I don’t know why it gets into an endless loop where the model never replies. I’m adding the code below in case I did something wrong
from openai import OpenAI

from llama_index import ServiceContext, StorageContext, load_index_from_storage
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAILike
from llama_index.prompts import PromptTemplate
from llama_index.indices.postprocessor import SentenceTransformerRerank

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id

llm = OpenAILike(
    api_key=openai_api_key,
    api_base=openai_api_base,
    model=model,
    max_tokens=250,
)

qa_prompt = PromptTemplate(
    "[INST] <<SYS>> You are a helpful assistant <</SYS>> \n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: [/INST]"
)

storage_context = StorageContext.from_defaults(persist_dir="vector-store")
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
service_context = ServiceContext.from_defaults(chunk_size=256, chunk_overlap=0.2, llm=llm, embed_model=embed_model)
new_index = load_index_from_storage(storage_context, service_context=service_context)

rerank = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_n=3)
query_engine = new_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[rerank],
    text_qa_template=qa_prompt,
    streaming=True,
    use_async=True,
)
chat_engine = new_index.as_chat_engine(streaming=True, chat_mode="context")

############ THIS DOESN'T WORK ############
streaming_response = await chat_engine.astream_chat("Your message here")
text = ""
async for token in streaming_response.async_response_gen():
    text += token
    print(text, end="")

############ THIS WORKS ############
streaming_response = chat_engine.stream_chat("Your message here")
for token in streaming_response.response_gen:
    print(token, end="")

############ THIS WORKS AS WELL ############
print(await chat_engine.achat("Your message here"))
I think vLLM applies templates automatically if you declare it as a chat model
Try this instead
llm = OpenAILike(
    api_key=openai_api_key,
    api_base=openai_api_base,
    model=model,
    max_tokens=250,
    is_chat_model=True,
)

...

chat_engine = new_index.as_chat_engine(
    chat_mode="condense_plus_context",
    similarity_top_k=15,
    node_postprocessors=[rerank],
    streaming=True,
    use_async=True,
)
Still no luck. I’m watching the vLLM API and it doesn’t even trigger an event with async streaming for the chat engine. Do you think it’s an issue linked to vLLM, or rather to the astream_chat function?
hmmm, let me try to reproduce
I’m using the base mistral instruct v2
I think the issue is less model related and more usage related?
This works fine but it’s not async, is it?
Because what I’m trying to do is astream_chat, not just stream_chat, which works just fine!
Thanks, I’ll try this tomorrow!
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAILike

embedding = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="cpu")

llm = OpenAILike(
    model="Qwen/Qwen-1_8B",
    api_base="http://127.0.0.1:8000/v1",
    api_key="fake",
    api_type="fake",
    max_tokens=256,
    temperature=0.5,
    is_chat_model=True,
)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embedding,
)
set_global_service_context(service_context)

# index built/loaded earlier
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")

response = await chat_engine.astream_chat("What did the author do growing up?")
async for token in response.async_response_gen():
    print(token, end="", flush=True)
This seems to work :eyesshaking:
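One caveat: the top-level await there only works in a notebook. In a plain script you’d wrap it in asyncio.run, roughly like this (just a sketch, same chat_engine as above):
import asyncio

async def main():
    # chat_engine defined as in the snippet above
    response = await chat_engine.astream_chat("What did the author do growing up?")
    async for token in response.async_response_gen():
        print(token, end="", flush=True)

asyncio.run(main())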
I’ve tried the notebook, but it’s really laggy when I use async and I don’t understand why. It takes between 45 seconds and 3 minutes to generate the answer, but when I just stream synchronously it takes less than 10 seconds.
Why would async make it so much slower?
Is it slower in the notebook, or in your actual code?
I don't really know either -- it's just wrapping the openai client 🤔 so there's not really any crazy code going on (and openai is by far our most tested module)
In the notebook; in my code it doesn’t run at all.
The bug seems to happen only with a local LLM though. I haven’t got my hands on an OpenAI API key, but I might try later to see if it works fine (which it should, since it works for the sec example)
Hello @Logan M,
I’ve updated my version of LlamaIndex, but the PR does not seem to fix anything for me :/
I’m not sure if the issue is my usage of the functions or an issue between vLLM and LlamaIndex.
I really wish I knew what was wrong, but I really don't.
If I scrape the corners of my debugging brain:
a) when you run async, is another blocking process taking over when you await? That would explain why non-async is faster, since it doesn't give up control of the event loop (see the toy sketch after this list)
b) are you hosting vllm on the same machine you are running llama-index on? There could be some weird interaction there maybe?
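To illustrate (a) -- this is just a toy sketch, nothing LlamaIndex-specific: a synchronous call inside a coroutine holds the whole event loop, so an otherwise-fast async token stream can't make progress until it finishes.
import asyncio
import time

async def stream_tokens():
    # pretend token stream -- would normally print quickly
    for token in ["one ", "two ", "three "]:
        await asyncio.sleep(0.1)
        print(token, end="", flush=True)

async def blocking_work():
    # time.sleep() is synchronous, so it blocks the event loop;
    # stream_tokens() is stalled until it returns
    time.sleep(3)

async def main():
    await asyncio.gather(stream_tokens(), blocking_work())

asyncio.run(main())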
If you use the llm directly, does it reproduce?
resp = await llm.astream_complete("Tell me a poem about raining cats and dogs")
async for r in resp:
    print(r.delta, end="", flush=True)
a) No, there’s no other blocking process. It works fine in async when the stream is off. Speed is equal when I stream or use async, it's just slower when I use both.
b) Yes, but I can try to host vLLM on my GPU and use another machine for querying.
c) When using the LLM directly it works perfectly. It’s just when using a chat_engine.
My next step would be to fork the repo and make some changes to the OpenAI class directly to allow a vLLM connection. Right now there is a check for the model name that prevents this. Hopefully this would solve the issue, otherwise I’ve got no idea where the problem could be
Oh, if you use the OpenAILike class and pass in the model name, it works fine.
Hmm, so if it works fine with the LLM directly, then yeah, the issue is with the chat engine. I suspect it's likely related to how writing to chat history and streaming interact
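Just to illustrate the kind of interplay I mean (a hypothetical sketch, not the actual LlamaIndex code): the write to chat history typically can't happen until the caller has fully drained the async generator, and that coupling is exactly the sort of place a stream can stall or get consumed in the wrong spot.
import asyncio
from typing import AsyncGenerator, List

async def token_stream() -> AsyncGenerator[str, None]:
    # stand-in for the LLM's async token stream
    for token in ["It", " was", " a", " dark", " night."]:
        await asyncio.sleep(0.05)
        yield token

async def astream_with_history(history: List[str]) -> AsyncGenerator[str, None]:
    # yield tokens to the caller while accumulating the final message
    parts: List[str] = []
    async for token in token_stream():
        parts.append(token)
        yield token
    # this only runs once the caller has fully consumed the generator
    history.append("".join(parts))

async def main() -> None:
    history: List[str] = []
    async for token in astream_with_history(history):
        print(token, end="", flush=True)
    print("\nhistory:", history)

asyncio.run(main())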
I wish I had more time to debug this, it's been a wild few weeks. We are merging/releasing a 2-million-line PR today lol (v0.10.0)
I was thinking maybe the problem was coming from this class since it seems to work fine with the OpenAI class.
Okay no worries, I can create an issue on GitHub if you want so you can check it later
If you check out the source code, it's a super light wrapper. I don't think it would be causing the issues we are seeing
Yea a github issue works too. And if you find out anything else from debugging, let me know as well 🙂
I agree, but it was the only explanation I could find as to why it works with OpenAI API and not vLLM API 😅