Streaming chat

Got a few questions:
1) is there anything special I need to do to go from an as_query_engine call to an as_chat_engine call?
meaning:
Plain Text
query_engine = index.as_query_engine(
        node_postprocessors=[SentenceEmbeddingOptimizer(threshold_cutoff=threshold_cutoff, percentile_cutoff=percentile_cutoff)],
        retriever_mode="embedding",
        service_context=service_context,
        similarity_top_k=similarity_top_k,
        streaming=True,
        text_qa_template=qa_template
    )
if I just change that to .as_chat_engine will all those features work just fine?
2) if I'm setting streaming=True in the above (#1), then why do I need to call .stream_chat instead of .chat? 🤔 shouldn't it already know that?
3) my coworker attempted to use the class directly:
Plain Text
chat_engine = CondenseQuestionChatEngine.from_defaults(
        query_engine=query_engine, 
        condense_question_prompt=custom_prompt,
        streaming=True
    )
but it is unhappy about the return value from .stream_chat not being iterable (meaning it is not a streaming response) so... is that just not the/a proper way to do that?
Do you mean to stop the chat history from overflowing? The newest version of llama index now has a basic window buffer for the chat history 👍
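Rough sketch of using it, assuming a version of llama_index that exposes the memory module (the token limit here is arbitrary):
Plain Text
from llama_index.memory import ChatMemoryBuffer

# keep roughly the last 1500 tokens of history; older turns fall out of the window
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(memory=memory)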
It's more involved than that but good to know
Alright @Logan M ...
Plain Text
    store = MongoDBAtlasVectorSearch(get_db(), db_name=config["db_name"], collection_name=config["collection_name"], index_name=config["index_name"])
    index = VectorStoreIndex.from_vector_store(vector_store=store)
    service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=config["temperature"], model=config["model_name"]), num_output=config["num_output"])
    chat_engine = index.as_chat_engine(
        node_postprocessors=[SentenceEmbeddingOptimizer(threshold_cutoff=config["threshold_cutoff"], percentile_cutoff=config["percentile_cutoff"])],
        retriever_mode="embedding",
        service_context=service_context,
        similarity_top_k=config["similarity_top_k"],
        text_qa_template=qa_template,
        streaming=True,
        condense_question_prompt=custom_prompt,
    )
    streaming_response = chat_engine.stream_chat(prompt, chat_history=modified_chat_history)
Plain Text
ValueError: Streaming is not enabled. Please use chat() instead.
How am I supposed to set it up for streaming properly if streaming=True is insufficient? 🤔
(btw I think I'm still on 0.7.4-ish if that matters)
Ok, I have time to look at this now lol give me a few mins
@Rubenator you are missing streaming=True on the LLM definition I think. That error is raised because the query engine isn't returning a streaming response
We didn't have to do that for as_query_engine though, so... what's the difference? 🤔
Also, I just added streaming=True to the llm, and have the same error
ok, let me make an example first lol
Well, I was able to avoid the error you got. But also seems like the streaming may still be buggy beyond that tbh (at least in the latest version)

Will try to patch in a bit here
What did you do to avoid the error? Or did you just, not encounter it? 🤔
I just didn't encounter it 😅 but I was using the latest version
I wish I could patch this now, but I'm out all weekend with family 💀
All good. I'll try updating on Monday and see if it helps
My coworker says that updating to 0.7.9 did not fix the error, although I will double check myself rn
Yeah, same issue ValueError: Streaming is not enabled. Please use chat() instead
Yea I never did hit that error, which is a little weird.

But also that reminds me, I need to fix this in general (the streaming for condense question engine is still borked, besides this issue)
@Logan M while you're at it, small feature request... we'd like to be able to grab the request and response that the condense question engine does to condense the question (primarily for token usage tracking). And, in a similar vein... being able to directly grab token usage data from the OpenAI requests in general would be nice (no rush though, it is just a potential nice-to-have).
Have you tried using the token counting callback handler?

https://gpt-index.readthedocs.io/en/latest/examples/callbacks/TokenCountingHandler.html

If you set a global service context, it should not only track the tokens for the condense question step, but also track the inputs and outputs
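Rough sketch of the setup (untested here; swap in whatever model you're actually using):
Plain Text
import tiktoken
from llama_index import ServiceContext, set_global_service_context
from llama_index.callbacks import CallbackManager, TokenCountingHandler
from llama_index.llms import OpenAI

# count tokens with the same encoding the model uses
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),
    callback_manager=CallbackManager([token_counter]),
)

# global service context, so the condense-question LLM call gets tracked too
set_global_service_context(service_context)

# ... run chat_engine.chat(...) / stream_chat(...) as usual ...

print(token_counter.total_llm_token_count)  # total tokens across all LLM calls
print(token_counter.llm_token_counts)       # per-call events, including prompts/completions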
okay sweet, guess I didn't find that ty
New release cut that worked fine for me. Hope it works well for you!

You shouldn't need to set streaming=True anywhere now
Things aren't quite working right, but I'm checking some stuff -- seems like you changed some more things, though:
Plain Text
File "/root/pytest/venv/lib/python3.10/site-packages/llama_index/indices/base.py", line 389, in as_chat_engine
    return OpenAIAgent.from_tools(
TypeError: OpenAIAgent.from_tools() got an unexpected keyword argument 'node_postprocessors'
yeah, this gripe is happening for a majority of the arguments we are passing in to as_chat_engine -- where are they supposed to go instead?
I actually didn't change anything, a colleague fixed some stuff for streaming with condense engine.

The kwargs you are passing in will work for condense question chat mode I think, but for other modes I can see this being an issue due to kwarg abuse in general.

Workaround here is either a) setting chat_mode="condense_question" or b) just creating the agent yourself, rather than using as_chat_engine
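Rough sketch of option (b), reusing your existing query_engine and custom_prompt (exact kwargs may vary a bit between versions):
Plain Text
from llama_index.chat_engine import CondenseQuestionChatEngine

# build the query engine explicitly, then wrap it in the condense-question engine yourself
query_engine = index.as_query_engine(
    service_context=service_context,
    similarity_top_k=config["similarity_top_k"],
    text_qa_template=qa_template,
    streaming=True,
)

chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    condense_question_prompt=custom_prompt,
    service_context=service_context,
)

streaming_response = chat_engine.stream_chat(prompt)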
where exactly would we set the chat_mode?
index.as_chat_engine(chat_mode="condense_question")
The default chat mode changed to agents (since that's generally a better user experience tbh)
Oh so, it has to be the very first argument
yea, due to the function api

Plain Text
 def as_chat_engine(
        self, chat_mode: ChatMode = ChatMode.BEST, **kwargs: Any
    ) -> BaseChatEngine:
okay great thank you 🙂
Actually... tried that... still the same error about streaming not being enabled 🤔
Plain Text
    store = MongoDBAtlasVectorSearch(get_db(), db_name=config["db_name"], collection_name=config["collection_name"], index_name=config["index_name"])
    index = VectorStoreIndex.from_vector_store(vector_store=store)
    service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=config["temperature"], model=config["model_name"]), num_output=config["num_output"])
    chat_engine = index.as_chat_engine(chat_mode="condense_question",
        node_postprocessors=[SentenceEmbeddingOptimizer(threshold_cutoff=config["threshold_cutoff"], percentile_cutoff=config["percentile_cutoff"])],
        retriever_mode="embedding",
        service_context=service_context,
        similarity_top_k=config["similarity_top_k"],
        text_qa_template=qa_template,
        condense_question_prompt=custom_prompt,
    )
    streaming_response = chat_engine.stream_chat(prompt)
@Logan M this looks kinda suspect:
edit: oops wrong function haha
still getting the error nonetheless
Are you sure you upgraded? I feel like this is impossible haha
Can you try a more slimmed down example? Is it something to do with all the kwargs?

Personally, this notebook runs perfectly for me locally (streaming example at the bottom)

https://github.com/jerryjliu/llama_index/blob/6d44fe02bab6f6104b59dba095828388f009722f/docs/examples/chat_engine/chat_engine_condense_question.ipynb
There must be some difference we aren't seeing
Plain Text
    service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=0, model=config["model_name"]))
    chat_engine = index.as_chat_engine(chat_mode="condense_question", service_context=service_context)
this fails^
this does not:
Plain Text
    service_context = ServiceContext.from_defaults(llm=OpenAI(model=config["model_name"]))
    chat_engine = index.as_chat_engine(chat_mode="condense_question", service_context=service_context)
ok let me try lol
oh sorry @Logan M -- the as_chat_engine also has service_context=service_context in both xD
This worked for me just now
[two image attachments]
oh wait one sec
wrong chat mode lol
Hmm I'm still not able to replicate the original error

I think I was duped by this in my testing last night though, the streaming just hangs :PSadge: back to the grindstone lol
you're also using a different model -- we're on "gpt-3.5-turbo"
Just changed it, same result -- streaming just hangs forever
although I get the same error on the other version
If you can humor me for like 1 sec

Plain Text
cd ~/
python -m venv sanity_env
source sanity_env/bin/activate
pip install llama-index


This env should not have the "streaming not enabled" error (but also, streaming probably will just hang, like mine is)
Nope, same error
I had to install python-dotenv and pymongo in addition to that to run my code but that's all
:PepeHands:
Can you stream a normal query engine response?

Plain Text
response = index.as_query_engine(streaming=True).query("test")
print(type(response))
I thought you said no more streaming=True?
or is that only for chat engine?
Not for as query engine (the interfaces are a little out of alignment)
Yea, since chat engines have specific stream endpoints (there's no stream_query on the query engines yet)
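Roughly, the two interfaces look like this right now (quick sketch):
Plain Text
# query engine: streaming is a constructor flag, .query() returns a StreamingResponse
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("test")
streaming_response.print_response_stream()

# chat engine: no flag needed, call the dedicated streaming endpoint instead
chat_engine = index.as_chat_engine(chat_mode="condense_question")
streaming_chat_response = chat_engine.stream_chat("test")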
Plain Text
INFO:numexpr.utils:NumExpr defaulting to 6 threads.
NumExpr defaulting to 6 threads.
Generating response
<class 'llama_index.response.schema.Response'>
yup
that's not a streaming response
yes, like the error said ;p
well, narrowing down the issue haha
it's not because of as_chat_engine
Can I see the code+imports that you have for setting up the service context? I feel like you've shared this before, but just double checking
Hmmm or maybe it's related to using mongodb
Just need to narrow the example down to something simple
lemme just... comment out basically everything
this is the most minimal example I can think of

Plain Text
from llama_index import ListIndex, Document, ServiceContext
from llama_index.llms import OpenAI

index = ListIndex.from_documents(
    [Document.example()],
    service_context=ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0)),
)

response = index.as_query_engine(streaming=True).query("test")
print(type(response))
mkay... so... I'm not even using a service context
if I shuffle stuff around so as not to do:
Plain Text
from dotenv import load_dotenv
load_dotenv()
then I get your behavior
but as soon as I put that back in
I get a Response instead of a StreamingResponse
er wait no... but I'm close
Okay @Logan M I have fully narrowed it down -- I get a Response back instead of waiting forever when the OPENAI_API_KEY environment variable is set (with or without load_dotenv)
So here is my min repro:
Plain Text
import os
os.environ["OPENAI_API_KEY"] = "****"
config = {"mongo_uri":'****', "db_name":'****'}
import pymongo
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.indices.vector_store.base import VectorStoreIndex

db = pymongo.MongoClient(config["mongo_uri"])[config["db_name"]]
store = MongoDBAtlasVectorSearch(db)
index = VectorStoreIndex.from_vector_store(vector_store=store)
response = index.as_query_engine(streaming=True).query("test")
print(type(response))
Does it reproduce without the mongodb too?
Oh and, the key must be valid... when it is not valid, it waits forever
No -- when I use the ListIndex.from_documents([Document.example()], ...) example I get a StreamingResponse back
Let me know if you are able to reproduce or not ^_^;
ah, so it's possibly related to mongodb then 🤔 Hmmm
We've been trying all sorts of things over here... any thoughts on what it could be? 🤔
(noticed some of the fixes in the latest changelogs but, still seeing the same issue)
@Logan M just figured it out
Plain Text
store = MongoDBAtlasVectorSearch(get_db(), db_name=config["db_name"], collection_name=config["collection_name"], index_name=config["index_name"])

Due to some minor refactoring, our get_db function was returning a mongodb['db_name_here'] instead of just mongodb.
This caused 0 nodes to get returned... but at no point was that fact caught.
And then it ultimately results in the response being a None type, which results in a very boring and empty Response getting created and returned instead of a streaming response:
[image attachment]
So, proposed solution would be to:
A) throw an error if that first MongoDBAtlasVectorSearch argument is not a database object instance
B) throw an error (or something like that) if the query returns no nodes (rather than letting it get past all the string checks)
it is actually rather miraculous that it makes it all the way to returning a response at all, but ultimately it is just due to these sorts of things not getting checked.
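Something like this is what I'm picturing for (A) -- just a sketch; I think the store actually expects the MongoClient itself, and the real parameter/default names in the class may differ:
Plain Text
from pymongo import MongoClient

# hypothetical guard at the top of MongoDBAtlasVectorSearch.__init__
def __init__(self, mongodb_client=None, db_name="default_db", **kwargs):
    if mongodb_client is not None and not isinstance(mongodb_client, MongoClient):
        raise TypeError(
            "MongoDBAtlasVectorSearch expects a pymongo MongoClient as its first "
            f"argument, got {type(mongodb_client).__name__}"
        )
    ...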
Anyway, with that out of the way...
we are now surprised to see:
TypeError: 'StreamingAgentChatResponse' object is not iterable
Is that intentionally not iterable? 🤔
yea, should use response.response_gen to get the iterator

The response has other things on it, like sources, which gives you access to the raw query engine response under the hood
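Quick sketch, using the names from your snippet above:
Plain Text
streaming_response = chat_engine.stream_chat(prompt)

# iterate the token generator rather than the response object itself
for token in streaming_response.response_gen:
    print(token, end="", flush=True)

# the raw query engine output(s) are available on the response as sources
print(streaming_response.sources)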
Definitely would be an easy PR to make if you had the bandwidth! Great detective skills here 🧠
Sure I'm down once we get our thing out the door next week, but I'll probably ask for some help double checking that I'm doing the checks in the correct places -- there's a lot of class inheritance happening 😅
Thankfully all the vector store stuff is inside a single file (i.e. there's one file for each vectordb integration)!

Here's the mongo file https://github.com/jerryjliu/llama_index/blob/main/llama_index/vector_stores/mongodb.py