Streaming chat

Got a few questions:
1) is there anything special I need to do to go from an as_query_engine call to an as_chat_engine call?
meaning:
Plain Text
query_engine = index.as_query_engine(
        node_postprocessors=[SentenceEmbeddingOptimizer(threshold_cutoff=threshold_cutoff,percentile_cutoff=percentile_cutoff)],
        retriever_mode="embedding",
        service_context=service_context,
        similarity_top_k=similarity_top_k,
        streaming=True,
        text_qa_template=qa_template
    )
if I just change that to .as_chat_engine will all those features work just fine?
2) if I'm setting streaming=True in the above (#1), then why do I need to call .stream_chat instead of .chat? πŸ€” shouldn't it already know that?
3) my coworker attempted to use the class directly:
Plain Text
chat_engine = CondenseQuestionChatEngine.from_defaults(
        query_engine=query_engine, 
        condense_question_prompt=custom_prompt,
        streaming=True
    )
but it is unhappy about the return value from .stream_chat not being iterable (meaning it is not a streaming response) so... is that just not the/a proper way to do that?
Do you mean to stop the chat history from overflowing? The newest version of llama index now has a basic window buffer for the chat history πŸ‘
It's more involved than that but good to know
Alright @Logan M ...
Plain Text
    store = MongoDBAtlasVectorSearch(get_db(), db_name=config["db_name"],collection_name=config["collection_name"], index_name=config["index_name"])
    index = VectorStoreIndex.from_vector_store(vector_store=store)
    service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=config["temperature"], model=config["model_name"]), num_output=config["num_output"])
    chat_engine = index.as_chat_engine(
        node_postprocessors=[SentenceEmbeddingOptimizer(threshold_cutoff=config["threshold_cutoff"],percentile_cutoff=config["percentile_cutoff"])],
        retriever_mode="embedding",
        service_context=service_context,
        similarity_top_k=config["similarity_top_k"],
        text_qa_template=qa_template,
        streaming=True,
        condense_question_prompt=custom_prompt,
    )
    streaming_response = chat_engine.stream_chat(prompt, chat_history=modified_chat_history)
Plain Text
ValueError: Streaming is not enabled. Please use chat() instead.
How am I supposed to set it up for streaming properly if streaming=True is insufficient? πŸ€”
(btw I think I'm still on 0.7.4-ish if that matters)
Ok, I have time to look at this now lol give me a few mins
@Rubenator you are missing streaming=True on the LLM definition I think. That error is raised because the query engine isn't returning a streaming response
We didn't have to do that for as_query_engine though, so... what's the difference? πŸ€”
Also, I just added streaming=True to the llm, and have the same error
ok, let me make an example first lol
Well, I was able to avoid the error you got. But also seems like the streaming may still be buggy beyond that tbh (at least in the latest version)

Will try to patch in a bit here
What did you do to avoid the error? Or did you just, not encounter it? πŸ€”
I just didn't encounter it πŸ˜… but I was using the latest version
I wish I could patch this now, but I'm out all weekend with family πŸ’€
All good. I'll try updating on Monday and see if it helps
My coworker says that updating to 0.7.9 did not fix the error, although I will double check myself rn
Yeah, same issue: ValueError: Streaming is not enabled. Please use chat() instead
Yea I never did hit that error, which is a little weird.

But also that reminds me, I need to fix this in general (the streaming for condense question engine is still borked, besides this issue)
@Logan M while you're at it, small feature request... we'd like to be able to grab the request and response that the condense question engine makes when condensing the question (primarily for token usage tracking). And in a similar vein... being able to directly grab token usage data from the OpenAI requests in general would be nice (no rush though, it is just a potential nice-to-have).
Have you tried using the token counting callback handler?

https://gpt-index.readthedocs.io/en/latest/examples/callbacks/TokenCountingHandler.html

If you set a global service context, it should not only track the tokens for the condense question step, but also track the inputs and outputs
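Something along these lines should do it (a rough sketch assuming the 0.7.x-era callbacks API; attribute names may differ slightly in your version):
Plain Text
import tiktoken
from llama_index import ServiceContext, set_global_service_context
from llama_index.callbacks import CallbackManager, TokenCountingHandler
from llama_index.llms import OpenAI

# count tokens with the same tokenizer the model uses
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),
    callback_manager=CallbackManager([token_counter]),
)
set_global_service_context(service_context)

# ... run your chat_engine.chat() / stream_chat() calls here ...

print(token_counter.prompt_llm_token_count)      # includes the condense-question call
print(token_counter.completion_llm_token_count)
print(token_counter.total_llm_token_count)
# token_counter.llm_token_counts holds the per-call events (prompts and responses)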
okay sweet, guess I didn't find that ty
New release cut that worked fine for me. Hope it works well for you!

You shouldn't need to set streaming=True anywhere now
Things aren't acting right yet -- I'm still checking some stuff, but it seems like you changed some more things:
Plain Text
File "/root/pytest/venv/lib/python3.10/site-packages/llama_index/indices/base.py", line 389, in as_chat_engine
    return OpenAIAgent.from_tools(
TypeError: OpenAIAgent.from_tools() got an unexpected keyword argument 'node_postprocessors'
yeah, this gripe is happening for the majority of the arguments we are passing into as_chat_engine -- where are they supposed to go instead?
I actually didn't change anything -- a colleague fixed some stuff for streaming with the condense engine.

The kwargs you are passing in will work for condense question chat mode I think, but for other modes I can see this being an issue due to kwarg abuse in general.

Workaround here is either a) setting chat_mode="condense_question" or b) just creating the agent yourself, rather than using as_chat_engine
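For (b), something roughly like this (just a sketch assuming the QueryEngineTool/OpenAIAgent pieces and reusing your existing query_engine; the tool name and description here are made up):
Plain Text
from llama_index.agent import OpenAIAgent
from llama_index.llms import OpenAI
from llama_index.tools import QueryEngineTool, ToolMetadata

# wrap the existing query engine (postprocessors, templates, etc. included) as a tool
tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(name="docs", description="Answers questions about the indexed docs"),
)
agent = OpenAIAgent.from_tools([tool], llm=OpenAI(model="gpt-3.5-turbo"), verbose=True)
streaming_response = agent.stream_chat("test")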
where exactly would we set the chat_mode?
index.as_chat_engine(chat_mode="condense_question")
The default chat mode changed to agents (since that's generally a better user experience tbh)
Oh so, it has to be the very first argument
yea, due to the function api

Plain Text
 def as_chat_engine(
        self, chat_mode: ChatMode = ChatMode.BEST, **kwargs: Any
    ) -> BaseChatEngine:
okay great thank you πŸ™‚
Actually... tried that... still the same error about streaming not being enabled πŸ€”
Plain Text
    store = MongoDBAtlasVectorSearch(get_db(), db_name=config["db_name"],collection_name=config["collection_name"], index_name=config["index_name"])
    index = VectorStoreIndex.from_vector_store(vector_store=store)
    service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=config["temperature"], model=config["model_name"]), num_output=config["num_output"])
    chat_engine = index.as_chat_engine(chat_mode="condense_question",
        node_postprocessors=[SentenceEmbeddingOptimizer(threshold_cutoff=config["threshold_cutoff"],percentile_cutoff=config["percentile_cutoff"])],
        retriever_mode="embedding",
        service_context=service_context,
        similarity_top_k=config["similarity_top_k"],
        text_qa_template=qa_template,
        condense_question_prompt=custom_prompt,
    )
    streaming_response = chat_engine.stream_chat(prompt)
@Logan M this looks kinda suspect:
edit: oops wrong function haha
still getting the error nonetheless
Are you sure you upgraded? I feel like this is impossible haha
Can you try a more slimmed down example? Is it something to do with all the kwargs?

Personally, this notebook runs perfectly for me locally (streaming example at the bottom)

https://github.com/jerryjliu/llama_index/blob/6d44fe02bab6f6104b59dba095828388f009722f/docs/examples/chat_engine/chat_engine_condense_question.ipynb
There must be some difference we aren't seeing
Plain Text
    service_context = ServiceContext.from_defaults(llm=OpenAI(temperature=0, model=config["model_name"]))
    chat_engine = index.as_chat_engine(chat_mode="condense_question", service_context=service_context)
this fails^
this does not:
Plain Text
    service_context = ServiceContext.from_defaults(llm=OpenAI(model=config["model_name"]))
    chat_engine = index.as_chat_engine(chat_mode="condense_question", service_context=service_context)
ok let me try lol
oh sorry @Logan M -- the as_chat_engine also has service_context=service_context in both xD
This worked for me just now
[Attachments: two screenshots]
oh wait one sec
wrong chat mode lol
Hmm I'm still not able to replicate the original error

I think I was duped by this in my testing last night though, the streaming just hangs :PSadge: back to the grindstone lol
you're also using a different model -- we're on "gpt-3.5-turbo"
Just changed it, same result -- streaming just hangs forever
although I get the same error on the other version
If you can humor me for like 1 sec

Plain Text
cd ~/
python -m venv sanity_env
source sanity_env/bin/activate
pip install llama-index


This env should not have the "streaming not enabled" error (but also, streaming probably will just hang, like mine is)
Nope, same error
I had to install python-dotenv and pymongo in addition to that to run my code but that's all
:PepeHands:
Can you stream a normal query engine response?

Plain Text
response = index.as_query_engine(streaming=True).query("test")
print(type(response))
I thought you said no more streaming=True?
or is that only for chat engine?
Not for as query engine (the interfaces are a little out of alignment)
Yea, since chat engines have specific stream endpoints (there's no stream_query on the query engines yet)
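Roughly, the asymmetry looks like this (just a sketch, reusing the index from above):
Plain Text
# query engines: streaming is requested up front, and .query() then
# returns a StreamingResponse you can print/iterate
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("test")
streaming_response.print_response_stream()

# chat engines: built normally, with separate .chat() / .stream_chat() endpoints
chat_engine = index.as_chat_engine(chat_mode="condense_question")
streaming_chat_response = chat_engine.stream_chat("test")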
Plain Text
INFO:numexpr.utils:NumExpr defaulting to 6 threads.
NumExpr defaulting to 6 threads.
Generating response
<class 'llama_index.response.schema.Response'>
yup
that's not a streaming response
yes, like the error said ;p
well, narrowing down the issue haha
it's not because of as_chat_engine
Can I see the code+imports that you have for setting up the service context? I feel like you've shared this before, but just double checking
Hmmm or maybe it's related to using mongodb
Just need to narrow the example down to something simple
lemme just... comment out basically everything
this is the most minimal example I can think of

Plain Text
from llama_index import ListIndex, Document, ServiceContext
from llama_index.llms import OpenAI

index = ListIndex.from_documents([Document.example()], service_context=ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0)))

response = index.as_query_engine(streaming=True).query("test")
print(type(response))
mkay... so... I'm not even using a service context
if I shuffle stuff around so as not to do:
Plain Text
from dotenv import load_dotenv
load_dotenv()
then I get your behavior
but as soon as I put that back in
I get a Response instead of a StreamingResponse
er wait no... but I'm close
Okay @Logan M I have fully narrowed it down -- I get a Response back (instead of it waiting forever) when the OPENAI_API_KEY environment variable is set (with or without load_dotenv)
So here is my min repro:
Plain Text
import os
os.environ["OPENAI_API_KEY"] = "****"
config = {"mongo_uri":'****', "db_name":'****'}
import pymongo
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.indices.vector_store.base import VectorStoreIndex

db = pymongo.MongoClient(config["mongo_uri"])[config["db_name"]]
store = MongoDBAtlasVectorSearch(db)
index = VectorStoreIndex.from_vector_store(vector_store=store)
response = index.as_query_engine(streaming=True).query("test")
print(type(response))
Does it reproduce without the mongodb too?
Oh and, the key must be valid... when it is not valid, it just waits forever
No -- when I use ListIndex.from_documents([Document.example()], ...) etc., I get a StreamingResponse back
Let me know if you are able to reproduce or not ^_^;
ah, so it's possibly related to mongodb then πŸ€” Hmmm
We've been trying all sorts of things over here... any thoughts on what it could be? πŸ€”
(noticed some of the fixes in the latest changelogs, but still seeing the same issue)
@Logan M just figured it out
Plain Text
store = MongoDBAtlasVectorSearch(get_db(), db_name=config["db_name"],collection_name=config["collection_name"], index_name=config["index_name"])

Due to some minor refactoring, our get_db function was returning mongodb['db_name_here'] instead of just mongodb.
This caused 0 nodes to get returned... but at no point was that fact caught.
And then it ultimately results in the response being None, which means a very boring, empty Response gets created and returned instead of a streaming response:
[Attachment: screenshot]
So, proposed solution would be to:
A) throw an error if that first MongoDBAtlasVectorSearch argument is not a MongoClient instance
B) throw an error (or something like that) if the query returns no nodes (rather than letting it get past all the string checks)
It is actually rather miraculous that it makes it all the way to returning a response at all, but ultimately that is just due to these sorts of things not getting checked.
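To illustrate the kind of checks I mean (hypothetical helpers, not the actual mongodb.py code):
Plain Text
from pymongo import MongoClient

def _validate_client(mongodb_client):
    """(A) Fail fast if the first argument isn't actually a pymongo client."""
    if not isinstance(mongodb_client, MongoClient):
        raise ValueError(f"Expected a pymongo MongoClient, got {type(mongodb_client)}")

def _validate_query_result(nodes):
    """(B) Fail loudly instead of silently building an empty response."""
    if not nodes:
        raise ValueError("Vector store query returned 0 nodes -- check db/collection/index names")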
Anyway, with that out of the way...
we are now surprised to see:
TypeError: 'StreamingAgentChatResponse' object is not iterable
Is that intentionally not iterable? πŸ€”
yea, should use response.response_gen to get the iterator

The response has other things on it, like sources, which gives you access to the raw query engine response under the hood
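i.e. something like this (a sketch reusing the variables from your earlier snippet):
Plain Text
streaming_response = chat_engine.stream_chat(prompt, chat_history=modified_chat_history)

# iterate the token generator rather than the response object itself
for token in streaming_response.response_gen:
    print(token, end="", flush=True)

# the underlying query engine output is still available afterwards
print(streaming_response.sources)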
Definitely would be an easy PR to make if you had the bandwidth! Great detective skills here 🧠
Sure, I'm down once we get our thing out the door next week, but I'll probably ask for some help double checking that I'm doing the checks in the correct places -- there's a lot of class inheritance happening πŸ˜…
Thankfully all the vector store stuff is inside a single file (i.e. there's one file for each vectordb integration)!

Here's the mongo file https://github.com/jerryjliu/llama_index/blob/main/llama_index/vector_stores/mongodb.py