Use vLLM over the OpenAI API -- if you aren't serving vLLM behind a server, async is not possible
@Logan M I’m using vLLM over the OpenAI API. It works fine in LangChain, but I don’t know why I can’t make the streaming work in LlamaIndex
from llama_index.llms import OpenAILike

llm = OpenAILike(
    model="Qwen/Qwen-1_8B",
    api_base="http://localhost:8000/v1",
    api_key="fake",
    api_type="fake",
    max_tokens=256,
    temperature=0.5,
)
You defined the LLM something like that, using the LlamaIndex LLM class?
@Logan M I’ve defined it using the LangChain vLLM class and then the LangChainLLM wrapper in LlamaIndex, could it be because of this?
Also, I’m wondering why you defined a check in the LlamaIndex OpenAI class that requires the model name to be an OpenAI model? I think the vLLM API works directly with that class, no?
that could definitely be the issue
I'm not sure what you mean by this? If you are using an openai-like API, use the OpenAILike class
It’s still not working. I don’t know why it gets into an endless loop where the model never replies. I’m adding the code below in case I did something wrong
from openai import OpenAI

from llama_index import ServiceContext, StorageContext, load_index_from_storage
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAILike
from llama_index.prompts import PromptTemplate
from llama_index.indices.postprocessor import SentenceTransformerRerank

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id

llm = OpenAILike(
    api_key=openai_api_key,
    api_base=openai_api_base,
    model=model,
    max_tokens=250,
)

qa_prompt = PromptTemplate(
    "[INST] <<SYS>> You are a helpful assistant <</SYS>> \n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: [/INST]"
)

storage_context = StorageContext.from_defaults(persist_dir="vector-store")
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
service_context = ServiceContext.from_defaults(chunk_size=256, chunk_overlap=0.2, llm=llm, embed_model=embed_model)
new_index = load_index_from_storage(storage_context, service_context=service_context)

rerank = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-12-v2", top_n=3)
query_engine = new_index.as_query_engine(
    similarity_top_k=15,
    node_postprocessors=[rerank],
    text_qa_template=qa_prompt,
    streaming=True,
    use_async=True,
)
chat_engine = new_index.as_chat_engine(streaming=True, chat_mode="context")

############ THIS DOESN'T WORK ############
streaming_response = await chat_engine.astream_chat("Your message here")
text = ""
async for token in streaming_response.async_response_gen():
    text += token
    print(text, end="")

############ THIS WORKS ############
streaming_response = chat_engine.stream_chat("Your message here")
for token in streaming_response.response_gen:
    print(token, end="")

############ THIS WORKS AS WELL ############
print(await chat_engine.achat("Your message here"))
I think vLLM applies templates automatically if you declare it as a chat model
Try this instead
llm = OpenAILike(
    api_key=openai_api_key,
    api_base=openai_api_base,
    model=model,
    max_tokens=250,
    is_chat_model=True,
)

...

chat_engine = new_index.as_chat_engine(
    chat_mode="condense_plus_context",
    similarity_top_k=15,
    node_postprocessors=[rerank],
    streaming=True,
    use_async=True,
)
Still no luck. I’m watching the vLLM API and it doesn’t even trigger an event with async streaming for the chat engine. Do you think it’s an issue linked to vLLM, or rather to the astream_chat function?
hmmm, let me try to reproduce
I’m using the base mistral instruct v2
I think the issue is less model related and more usage related?
This works fine but it’s not async, is it?
Because what I’m trying to do is astream_chat, not just stream_chat, which works just fine!
Thanks, I’ll try this tomorrow!
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAILike

embedding = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="cpu")

llm = OpenAILike(
    model="Qwen/Qwen-1_8B",
    api_base="http://127.0.0.1:8000/v1",
    api_key="fake",
    api_type="fake",
    max_tokens=256,
    temperature=0.5,
    is_chat_model=True,
)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embedding,
)
set_global_service_context(service_context)

# index built/loaded earlier
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context")

response = await chat_engine.astream_chat("What did the author do growing up?")
async for token in response.async_response_gen():
    print(token, end="", flush=True)
This seems to work :eyesshaking:
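One caveat: the top-level await there only works in a notebook. In a plain script you’d wrap it in asyncio.run, roughly like this (just a sketch, same chat_engine as above):
import asyncio

async def main():
    # chat_engine defined as in the snippet above
    response = await chat_engine.astream_chat("What did the author do growing up?")
    async for token in response.async_response_gen():
        print(token, end="", flush=True)

asyncio.run(main())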
I’ve tried the notebook, but it’s really laggy when I use async and I don’t understand why. It takes between 45 seconds and 3 minutes to generate the answer, but when I just stream synchronously it takes less than 10 seconds.
Why would async make it so much slower?
Is it slower in the notebook, or in your actual code?
I don't really know either -- it's just wrapping the openai client 🤔 so there's not really any crazy code going on (and openai is by far our most tested module)
In the notebook; in my code it doesn’t run at all.
The bug seems to happen only with a local LLM though. I haven’t got my hands on an OpenAI API key, but I might try later to see if it works fine (which it should, since it works for the sec example)
Hello @Logan M,
I’ve updated my version of LlamaIndex, but the PR does not seem to fix anything for me :/
I’m not sure if the issue is my usage of the functions or an issue between vLLM and LlamaIndex.
I really wish I knew what was wrong, but I really don't.
If I scrape the corners of my debugging brain:
a) when you run async, is another blocking process taking over when you await? That would explain why non-async is faster, since it doesn't give up control of the event loop (see the toy sketch after this list)
b) are you hosting vllm on the same machine you are running llama-index on? There could be some weird interaction there maybe?
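To illustrate (a) -- this is just a toy sketch, nothing LlamaIndex-specific: a synchronous call inside a coroutine holds the whole event loop, so an otherwise-fast async token stream can't make progress until it finishes.
import asyncio
import time

async def stream_tokens():
    # pretend token stream -- would normally print quickly
    for token in ["one ", "two ", "three "]:
        await asyncio.sleep(0.1)
        print(token, end="", flush=True)

async def blocking_work():
    # time.sleep() is synchronous, so it blocks the event loop;
    # stream_tokens() is stalled until it returns
    time.sleep(3)

async def main():
    await asyncio.gather(stream_tokens(), blocking_work())

asyncio.run(main())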
If you use the llm directly, does it reproduce?
resp = await llm.astream_complete("Tell me a poem about raining cats and dogs")
async for r in resp:
    print(r.delta, end="", flush=True)
a) No, there’s no other blocking process. It works fine in async when the stream is off. Speed is equal when I stream or use async, it's just slower when I use both.
b) Yes, but I can try to host vLLM on my GPU and use another machine for querying.
c) When using the LLM directly it works perfectly. It’s just when using a chat_engine.
My next step would be to fork the repo and make some changes to the OpenAI class directly to allow a vLLM connection. Right now there is a check for the model name that prevents this. Hopefully this would solve the issue, otherwise I’ve got no idea where the problem could be
Oh, if you use the OpenAILike class and pass in the model name, it works fine.
Hmm, so if it works fine with the LLM directly, then yeah, the issue is with the chat engine. I suspect it's likely related to how writing to chat history and streaming interact
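Just to illustrate the kind of interplay I mean (a hypothetical sketch, not the actual LlamaIndex code): the write to chat history typically can't happen until the caller has fully drained the async generator, and that coupling is exactly the sort of place a stream can stall or get consumed in the wrong spot.
import asyncio
from typing import AsyncGenerator, List

async def token_stream() -> AsyncGenerator[str, None]:
    # stand-in for the LLM's async token stream
    for token in ["It", " was", " a", " dark", " night."]:
        await asyncio.sleep(0.05)
        yield token

async def astream_with_history(history: List[str]) -> AsyncGenerator[str, None]:
    # yield tokens to the caller while accumulating the final message
    parts: List[str] = []
    async for token in token_stream():
        parts.append(token)
        yield token
    # this only runs once the caller has fully consumed the generator
    history.append("".join(parts))

async def main() -> None:
    history: List[str] = []
    async for token in astream_with_history(history):
        print(token, end="", flush=True)
    print("\nhistory:", history)

asyncio.run(main())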
I wish I had more time to debug this, it's been a wild few weeks. We are merging/releasing a 2-million-line PR today lol (v0.10.0)
I was thinking maybe the problem was coming from this class since it seems to work fine with the OpenAI class.
Okay no worries, I can create an issue on GitHub if you want so you can check it later
If you check out the source code, it's a super light wrapper. I don't think it would be causing the issues we are seeing
Yea a github issue works too. And if you find out anything else from debugging, let me know as well 🙂
I agree, but it was the only explanation I could find as to why it works with OpenAI API and not vLLM API 😅