LangChain

You seem to be playing on both fields (LangChain and LlamaIndex). I'm trying to stream output from an agent that uses ChatOpenAI as an LLM, and send it back as a stream from FastAPI. There's no guide on how to do this. I've found some leads on GitHub on ways to do it with Streamlit, but none with FastAPI.

https://github.com/hwchase17/chat-langchain/issues/39

I suppose most people are building with a FastAPI - LangChain - Next.js stack. Have you come across anything similar? I'm having to dig into the LangChain codebase; any pointers on streaming would be helpful.

Not sure, it's also throwing

Plain Text
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler

llm = ChatOpenAI(
    model='gpt-3.5-turbo',
    streaming=True,
    callbacks=[FinalStreamingStdOutCallbackHandler()],
    temperature=0,
)


Plain Text
raise OutputParserException(f"Could not parse LLM output: `{text}`")


It seems like I need to pass in prefix_tokens, but there's literally no elaborate guide on how to deal with streaming.
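
For reference, a minimal sketch of passing the prefix tokens, assuming your LangChain version's FinalStreamingStdOutCallbackHandler accepts an answer_prefix_tokens keyword (the token list below is illustrative and depends on the agent's prompt format):

Plain Text
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler
from langchain.chat_models import ChatOpenAI

# Only stream the tokens that come after the agent's final-answer prefix.
handler = FinalStreamingStdOutCallbackHandler(answer_prefix_tokens=["Final", "Answer", ":"])
llm = ChatOpenAI(
    model='gpt-3.5-turbo',
    streaming=True,
    callbacks=[handler],
    temperature=0,
)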
19 comments
Lol I actually work at LlamaIndex, but I do know a thing here and there about LangChain

Streaming is a little annoying. I'm pretty sure FastAPI works best with a generator object for streaming text

In llama-index, we create a generator from langchain like this, using a thread:
https://github.com/jerryjliu/llama_index/blob/d394ffd5b57b976192f002f52fc9315401b4aa09/llama_index/llm_predictor/base.py#L261
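
Roughly, a minimal sketch of that thread-plus-queue pattern (not the exact llama_index code; the class and variable names are illustrative, and `agent` and `prompt` are assumed to be defined elsewhere):

Plain Text
from queue import Empty, Queue
from threading import Event, Thread

from langchain.callbacks.base import BaseCallbackHandler
from langchain.chat_models import ChatOpenAI


class QueueCallbackHandler(BaseCallbackHandler):
    """Push streamed tokens onto a queue so another thread can consume them."""

    def __init__(self):
        self._token_queue = Queue()
        self._done = Event()

    def on_llm_new_token(self, token, **kwargs):
        self._token_queue.put_nowait(token)

    def on_llm_end(self, response, **kwargs):
        # Note: this fires at the end of every LLM call, not just the final one.
        self._done.set()

    def get_response_gen(self):
        # Yield tokens until the handler is marked done and the queue is drained.
        while not self._done.is_set() or not self._token_queue.empty():
            try:
                yield self._token_queue.get(timeout=0.1)
            except Empty:
                continue


handler = QueueCallbackHandler()
# The streaming LLM (and the agent built on top of it) gets the handler attached.
llm = ChatOpenAI(model='gpt-3.5-turbo', streaming=True, callbacks=[handler], temperature=0)

# Run the agent in a background thread so the generator can be consumed right away.
Thread(target=agent.run, args=(prompt,), daemon=True).start()
token_gen = handler.get_response_gen()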

That error at the end is pretty common with LangChain. It's nothing you did wrong, I think; it's just the LLM not following instructions and breaking the output parsing in LangChain
Man, you're awesome. Let me check this out. Seems straightforward. If I can get a generator, I can stream it back from FastAPI. Thanks.
Yea shouldn't be too bad, good luck!
@Logan M I've gotten the streaming to work. Wasn't hard. I'm using Vercel's AI SDK to deal with printing the stream on the frontend. I can easily get the stream out of the generator, but the problem is it's streaming the entire agent's thought process as well.

Is there a way to put a check when it yields new tokens to know if it's the agent's final output?
Attachment
Screenshot_2023-07-01_at_4.51.03_AM.png
Yea, that's the annoying part with streaming LangChain agents lol

The final output starts with a specific prefix right? Something like "AI:"?

You could detect that substring and only start yielding once you have it?

This is an extremely bad, probably broken example, but it was the first thing that came to mind lol

Plain Text
start = False
token_buffer = ""
while True:
  if not self._token_queue.empty():
    token = self._token_queue.get_nowait()
    if not start:
      # Buffer tokens until the final-answer prefix shows up.
      token_buffer += token
      if token_buffer.endswith("AI:"):
        start = True
        yield "AI:"
    else:
      yield token
Neat. Got it to work.
Attachment
Screenshot_2023-07-01_at_5.24.26_AM.png
Wow! It actually worked! 😆 👍
Generally, a modified version of StreamingGeneratorCallbackHandler from LlamaIndex does the job. This could be included in LlamaIndex or perhaps LangChain.

Streaming is very annoying lol. Just in case someone else comes across this: by processing the prompt through the agent with a queue-based callback, we can get a generator object which can be returned from FastAPI (see the sketch below the screenshots). The solution was neat but hairy to figure out.
Attachments
Screenshot_2023-07-01_at_5.27.28_AM.png
Screenshot_2023-07-01_at_5.28.07_AM.png
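
A minimal sketch of the FastAPI side, assuming a token_generator(prompt) built along the lines of the queue-based handler above (the endpoint and function names are illustrative):

Plain Text
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


@app.get("/chat")
def chat(q: str):
    # token_generator(q) is assumed to run the agent in a background thread and
    # yield tokens from the callback handler's queue, as in the sketch above.
    return StreamingResponse(token_generator(q), media_type="text/plain")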
@Logan M Sorry, there's another issue. Not sure if you can give me some clues to fix it.

While the inference is streaming into the buffer, there's a sudden cut-off when it queries SerpAPI or uses a tool. Sometimes it goes through fine, but sometimes there's a cut-off. Basically, self._done.is_set() goes true. Is this a known issue, or is there a workaround?
Attachments
Screenshot_2023-07-01_at_12.28.09_PM.png
Screenshot_2023-07-01_at_12.29.20_PM.png
We are in uncharted territory here haha I have no idea tbh, especially since you are checking if the queue is empty or not 🤔

I'm actually working on adding streaming to the native llama-index agents right now. But it's not using anything from LangChain 😅 Weird coincidence
Maybe add a time.sleep() or something before setting is_done 😆 Although that might not work if everything is running on a single thread
Hmm, I see. It's a pain to diagnose streaming in LangChain out of the box. Let me study the code more; maybe I'll be able to figure out what's happening. Do you think it's got something to do with how LangChain turns this off when making another call? Like it sometimes works, and sometimes doesn't.

If I can keep the stream open while querying a tool, it'll be good. Not sure where the issue might reside. All the tools I'm using are created with tool = LlamaIndexTool.from_tool_config(tool_config).

---

Or maybe a good pointer would be if I can manually turn off the stream once the agent completes an answer.
Yea since you are using callbacks, you should be able to set is_done outside of the handler, once the agent returns 👀
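
A minimal sketch of that idea, assuming the queue-based handler from the earlier sketch with on_llm_end no longer setting the event (`agent`, `prompt`, and `handler` are assumed to exist from above):

Plain Text
from threading import Thread


def run_agent(prompt, handler):
    # Tool calls fire on_llm_end mid-run, which is what was cutting the stream,
    # so mark the stream finished only once the agent has fully returned.
    try:
        agent.run(prompt)
    finally:
        handler._done.set()


Thread(target=run_agent, args=(prompt, handler), daemon=True).start()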
Voila! In the end, every problem is bound to be solved lol.

The answer was that simple. Hopefully it doesn't break in some other way.
Attachment
Screenshot_2023-07-01_at_4.01.33_PM.png
I had to modify the callback that comes out of the box from LangChain to read in the tokens. It was probably just a matter of understanding how it works. Got that out of the way.
Nice, thanks for sharing that! 💪