Am I supposed to be using stream_with_context
or no?
that's supposed to stream it properly
return Response(stream_with_context(response_stream(message)))
Even doing that it still doesn't return word by word
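here's roughly the route im testing with, stripped down so a dummy generator stands in for the actual model call (just a sketch to check whether the route itself streams word by word):

import time
from flask import Flask, Response, request, stream_with_context

app = Flask(__name__)

@app.route("/chat", methods=["POST"])
def chat():
    message = request.json.get("message", "")

    def response_stream(message):
        # the real version would iterate the query engine's token generator,
        # e.g. for token in streaming_response.response_gen: yield token
        # dummy words here just prove whether the flask route itself streams
        for word in ("testing", "that", "this", "comes", "out", "word", "by", "word"):
            yield word + " "
            time.sleep(0.2)

    return Response(stream_with_context(response_stream(message)),
                    mimetype="text/plain")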
u know what. i just realized
it's not doing that for me either anymore
I wonder if it was an update?
i just moved everything from sagemaker to an ec2 instance
it was working fine in sagemaker
wait i think i broke mine cause of the nginx server
i had to set up nginx in order to get the flask server to run
Ah gotcha, does it work now?
i have to fix it in nginx probably
but right now it waits for the whole response stream from flask and then sends the entire thing at once, which is pointless
How did you fix it? Mine is still waiting until the whole response is done
mine was a problem with nginx
by default nginx waits for the whole response
i just turned that off and it worked
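for reference this is roughly the relevant bit of my nginx site config (port and path are placeholders for whatever your setup uses), the proxy_buffering line is the part that matters:

location / {
    proxy_pass http://127.0.0.1:5000;
    # stop nginx from buffering the whole response before forwarding it
    proxy_buffering off;
}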
Ahh I see, so yours is in deployment? I'm not too familiar with nginx yet
there was no other way to use it lol
or else the moment i get out of ssh, it shuts off lol
i still need to figure out the css problem, which im pretty sure is what im having, or maybe the js problem
That makes sense. I'm still so confused why mine is just returning the whole thing and not word by word
Yes but currently I'm just trying to get the backend to work
I'm using next.js for the frontend
for a website u need js and to set up a loop
u need to have a stream reader
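if u just want to sanity check the endpoint outside the browser, the same loop idea in python looks roughly like this (url and payload are placeholders for whatever your flask route expects):

import requests

# stream=True makes requests hand over chunks as they arrive instead of
# waiting for the whole body; the browser version is the same loop done
# with a ReadableStream reader
with requests.post("http://localhost:5000/chat",
                   json={"message": "hello"}, stream=True) as r:
    for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)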
But curl is waiting until the whole response is done, it doesn't return it word by word
This is what it looks like
The same thing happens when I just do it in the terminal with curl
for openai, do u have streaming=True?
yea theres smth u need to do for that to work
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", streaming=True))
its different than a custom model
I thought the LLMPredictor
is model agnostic?
So wouldn't the configuration still be the same
@Logan M would love to get your thoughts here if possible. Is there any extra configuration I have to set up to get streaming to work if I'm using gpt-3.5-turbo?
im not using llm predictor
im using HuggingFaceLLMPredictor
Assuming you are on a newer version of llama index, streaming works with gpt-3.5
I've tested it and had it working locally 👍
No extra setup besides setting streaming=True in the ChatOpenAI class and as_query_engine call
Do you happen to have the code? To my knowledge, I've configured everything correctly
One sec, let me see if I have something
If u google streaming for llamaindex there's so many examples
>>> from llama_index import SimpleDirectoryReader, GPTVectorStoreIndex, ServiceContext, LLMPredictor
>>> documents = SimpleDirectoryReader("./paul_graham").load_data()
>>> from langchain.chat_models import ChatOpenAI
>>> llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0, streaming=True))
>>> service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)
>>> index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
>>> response = index.as_query_engine(streaming=True).query("what did the author do growing up?")
>>> response.print_response_stream()
The author wrote short stories and tried programming on an IBM 1401 in 9th grade.
I've tried the ones on the website and even those don't work when I do response.print_response_stream()
for example
I'll try running this, thanks!
So weird, for me it doesn't work.. it just waits for the full response
What llama index version do you have?
Sheesh lol and it doesn't work even as a standalone script?
oh wait I lied, it's also not working for me... I thought I just looked away and it printed fast because it was short lol
not even text-davinci-003 is streaming... i swear this worked just the other day. It's like it's streaming entire chunks. Just now it printed 3 paragraphs for a response, but waited in between each paragraph
It’s never us, always something else
I FIXEDDD the streamingggg
Were you able to get it to work?
Yea it works now haha. I think it was because I was running it in a python shell rather than a script. Pushed a change yesterday to help with that too, when using print_response_stream()
just ran now and it streams fine 👀
Just tested it with the new update and it works too, thank you!
For my frontend app I want to return bullet points and list the specific sources (timestamps in a doc) of where it found the info. Originally I was planning on using StructuredLLMPredictor to achieve this and return a JSON object. However you can't stream with StructuredLLMPredictor.
For this kind of use case do you still suggest returning a JSON object or am I able to retrieve the specific sources (timestamps within specific docs) another way?
Wooo glad it works now! 🫡💪
Is the info you need available from response.source_nodes? It can maybe be parsed from that list?
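Something like this is what I mean, building on the query engine example above (assuming the timestamps were stored in each node's extra_info when the index was built; attribute names can shift a bit between versions):

# continuing from the index built earlier
response = index.as_query_engine(streaming=True).query("what did the author do growing up?")
for source in response.source_nodes:
    extra_info = source.node.extra_info or {}
    # e.g. {"timestamp": "00:12:34"} if that's what you attached to the docs
    print(extra_info.get("timestamp"), source.score, source.node.get_text()[:100])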
Niceee glad to see it works for u now!
I'm testing that out now.. however, chatGPT just keeps throwing "Not enough context is provided to answer this question" or "As an AI language model I cannot determine.." or some variation of that. I provided a lot of context though. To fix this, is it just prompt engineering?
Smells like gpt-3.5 to me lol
Is the answer actually in response.source_nodes? If so, I suspect it's the refine prompt causing issues
Ok just got the prompt to work, unfortunately it doesn't list out all the sources of where it found things. If I'm generating JSON to get info like timestamps then do you suggest using StructuredLLMPredictor?