
Updated 2 years ago

Streaming

At a glance
I am running into problems when trying to work with a streamed query response. Everything works correctly when using the following code with the Flask development server:

Plain Text
response = query_engine.query(question)
full_answer = ""
for token in response.response_gen:
    full_answer += token              # accumulate the full answer server-side
    emit("answer", {"token": token})  # push each token to the client as it arrives


However, when using gunicorn with an eventlet or gevent worker (I need to use one of those because I want websockets to stream the response to the client), the code hangs at the for loop and no iteration of the loop is executed. I assume the code needs to be written differently to work with the gunicorn workers. Does anybody have any experience with this?
8 comments
I haven't used gunicorn, but I know internally we used fastapi for streaming and it worked fine πŸ˜…
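Something along these lines (a minimal sketch, not our exact code - the data directory, index setup, and route are illustrative):

Python
# Minimal sketch: stream llama-index tokens over HTTP with FastAPI.
# The data directory, index setup, and endpoint path are illustrative.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True)

app = FastAPI()

@app.get("/query")
def query(question: str):
    response = query_engine.query(question)
    # response_gen is a plain generator of token strings; StreamingResponse
    # iterates it and flushes each chunk to the client as it is produced.
    return StreamingResponse(response.response_gen, media_type="text/plain")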
to make sure I am not doing anything wrong, I just tried streaming directly with the OpenAI library in the same environment (as per https://github.com/openai/openai-cookbook/blob/main/examples/How_to_stream_completions.ipynb) and it worked
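for reference, the cookbook pattern boils down to roughly this (pre-1.0 openai package, as in that notebook):

Python
# Rough sketch of the cookbook's streaming pattern (pre-1.0 openai package).
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,  # ask the API to send chunks as they are generated
)
for chunk in response:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:  # first/last chunks may carry no content
        print(delta["content"], end="", flush=True)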
hm... I am pretty new to Python, so perhaps using gunicorn + flask + flask-socketio was not the best choice 😄 will try fastapi
I wonder if it's related to the threading in gunicorn conflicting with something in llama-index? hmm
I think so too - the main async gunicorn workers, eventlet and gevent, both rely on monkey patching to make everything work, and from some reading I did it's all very fragile... fastapi looks like a much better option, so I will try rewriting our app in it to see if it helps (hopefully it will, since it worked for you :D)
haha let me know how it goes! I used to use flask a lot too, fastapi is pretty similar but better built/supported, hope it works well πŸ™ πŸ‘
so, just to confirm - with fastapi everything works as expected, thanks for the tip!
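for anyone finding this later, the websocket version looks roughly like this in fastapi (illustrative sketch - the route and payload shape are assumptions, and query_engine is a streaming query engine built as in the earlier snippet):

Python
# Sketch of the websocket variant; route and payload shape are illustrative.
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def ws_query(websocket: WebSocket):
    await websocket.accept()
    question = await websocket.receive_text()
    response = query_engine.query(question)  # streaming engine, built as above
    for token in response.response_gen:
        # Mirrors the flask-socketio emit("answer", {"token": token}) payload.
        # Note: response_gen is a sync generator, so it blocks the event loop
        # between tokens; fine for a sketch, worth offloading in production.
        await websocket.send_json({"token": token})
    await websocket.close()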
ayyy nice! πŸ’ͺ