LangChain

You seem to be playing on both fields (LangChain and LlamaIndex). I'm trying to stream output from an agent that uses ChatOpenAI as an LLM, and send it back as a stream from FastAPI. There's no guide on how to do this. I've found some leads on GitHub on ways to do it with Streamlit, but none with FastAPI.

https://github.com/hwchase17/chat-langchain/issues/39

I suppose most people are building with a FastAPI - LangChain - Next.js stack. Have you come across anything similar? I'm having to dig into the LangChain codebase; any pointers on streaming would be helpful.

Not sure, it's also throwing

Plain Text
from langchain.chat_models import ChatOpenAI
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler

llm = ChatOpenAI(
    model='gpt-3.5-turbo',
    streaming=True,
    callbacks=[FinalStreamingStdOutCallbackHandler()],
    temperature=0,
)


Plain Text
raise OutputParserException(f"Could not parse LLM output: `{text}`")


It seems like I need to pass in prefix_tokens, but there's literally no elaborate guide on how to deal with streaming.
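
For reference, a minimal sketch of passing the prefix tokens, assuming your LangChain version's FinalStreamingStdOutCallbackHandler accepts an answer_prefix_tokens keyword (the token list below is illustrative and depends on the agent's prompt format):

Plain Text
from langchain.callbacks.streaming_stdout_final_only import FinalStreamingStdOutCallbackHandler
from langchain.chat_models import ChatOpenAI

# Only stream the tokens that come after the agent's final-answer prefix.
handler = FinalStreamingStdOutCallbackHandler(answer_prefix_tokens=["Final", "Answer", ":"])
llm = ChatOpenAI(
    model='gpt-3.5-turbo',
    streaming=True,
    callbacks=[handler],
    temperature=0,
)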
19 comments
Lol I actually work at LlamaIndex, but I do know a thing here and there about LangChain

Streaming is a little annoying. I'm pretty sure FastAPI works best with a generator object for streaming text

In llama-index, we create a generator from langchain like this, using a thread:
https://github.com/jerryjliu/llama_index/blob/d394ffd5b57b976192f002f52fc9315401b4aa09/llama_index/llm_predictor/base.py#L261
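
Roughly, a minimal sketch of that thread-plus-queue pattern (not the exact llama_index code; the class and variable names are illustrative, and `agent` and `prompt` are assumed to be defined elsewhere):

Plain Text
from queue import Empty, Queue
from threading import Event, Thread

from langchain.callbacks.base import BaseCallbackHandler
from langchain.chat_models import ChatOpenAI


class QueueCallbackHandler(BaseCallbackHandler):
    """Push streamed tokens onto a queue so another thread can consume them."""

    def __init__(self):
        self._token_queue = Queue()
        self._done = Event()

    def on_llm_new_token(self, token, **kwargs):
        self._token_queue.put_nowait(token)

    def on_llm_end(self, response, **kwargs):
        # Note: this fires at the end of every LLM call, not just the final one.
        self._done.set()

    def get_response_gen(self):
        # Yield tokens until the handler is marked done and the queue is drained.
        while not self._done.is_set() or not self._token_queue.empty():
            try:
                yield self._token_queue.get(timeout=0.1)
            except Empty:
                continue


handler = QueueCallbackHandler()
# The streaming LLM (and the agent built on top of it) gets the handler attached.
llm = ChatOpenAI(model='gpt-3.5-turbo', streaming=True, callbacks=[handler], temperature=0)

# Run the agent in a background thread so the generator can be consumed right away.
Thread(target=agent.run, args=(prompt,), daemon=True).start()
token_gen = handler.get_response_gen()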

That error at the end is pretty common with LangChain. It's nothing you did wrong, I think; it's just the LLM not following instructions and breaking the output parsing in LangChain
Man, you're awesome. Let me check this out. Seems straightforward. If I can get a generator, I can stream it back from FastAPI. Thanks.
Yea shouldn't be too bad, good luck!
@Logan M I've gotten the streaming to work. Wasn't hard. I'm using Vercel's AI SDK to deal with printing the stream on the frontend. I can easily get the stream out of the generator, but the problem is it's streaming the entire agent's thought process as well.

Is there a way to put a check when it yields new tokens to know if it's the agent's final output?
Attachment
Screenshot_2023-07-01_at_4.51.03_AM.png
Yea, that's the annoying part with streaming LangChain agents lol

The final output starts with a specific prefix right? Something like "AI:"?

You could detect that substring and only start yielding once you have it?

This is an extremely bad, probably broken example, but it was the first thing that came to mind lol

Plain Text
start = False
token_buffer = ""
while True:
  if not self._token_queue.empty():
    token = self._token_queue.get_nowait()
    if not start:
      # Buffer tokens until the final-answer prefix shows up.
      token_buffer += token
      if token_buffer.endswith("AI:"):
        start = True
        yield "AI:"
    else:
      yield token
Neat. Got it to work.
Attachment
Screenshot_2023-07-01_at_5.24.26_AM.png
Wow! It actually worked! 😆 👍
Generally, a modified version of StreamingGeneratorCallbackHandler from LlamaIndex does the job. This could be included in LlamaIndex or perhaps LangChain.

Streaming is very annoying lol. Just in case someone else comes across this: by processing the prompt through the agent with a queue-based callback, we can get a generator object which can be returned from FastAPI (see the sketch below the screenshots). The solution was neat but hairy to figure out.
Attachments
Screenshot_2023-07-01_at_5.27.28_AM.png
Screenshot_2023-07-01_at_5.28.07_AM.png
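
A minimal sketch of the FastAPI side, assuming a token_generator(prompt) built along the lines of the queue-based handler above (the endpoint and function names are illustrative):

Plain Text
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


@app.get("/chat")
def chat(q: str):
    # token_generator(q) is assumed to run the agent in a background thread and
    # yield tokens from the callback handler's queue, as in the sketch above.
    return StreamingResponse(token_generator(q), media_type="text/plain")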
@Logan M Sorry, there's another issue. Not sure if you can give me some clues to fix it.

While the inference is streaming into the buffer, there's a sudden cut-off when it queries SerpAPI or uses a tool. Sometimes it goes through fine, but sometimes there's a cut-off. Basically, self._done.is_set() goes true. Is this a known issue, or is there a workaround?
Attachments
Screenshot_2023-07-01_at_12.28.09_PM.png
Screenshot_2023-07-01_at_12.29.20_PM.png
We are in uncharted territory here haha I have no idea tbh, especially since you are checking if the queue is empty or not 🤔

I'm actually working on adding streaming to the native llama-index agents right now. But it's not using anything from LangChain 😅 Weird coincidence
Maybe add a time.sleep() or something before setting is_done 😆 Although that might not work if everything is running on a single thread
Hmm, I see. It's a pain to diagnose streaming in LangChain out of the box. Let me study the code more; maybe I'll be able to figure out what's happening. Do you think it's got something to do with how LangChain turns this off when making another call? Like it sometimes works, and sometimes doesn't.

If I can keep the stream open while querying a tool, it'll be good. Not sure where the issue might reside. All the tools I'm using are created with tool = LlamaIndexTool.from_tool_config(tool_config).

---

Or maybe a good pointer would be if I can manually turn off the stream once the agent completes an answer.
Yea since you are using callbacks, you should be able to set is_done outside of the handler, once the agent returns 👀
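
A minimal sketch of that idea, assuming the queue-based handler from the earlier sketch with on_llm_end no longer setting the event (`agent`, `prompt`, and `handler` are assumed to exist from above):

Plain Text
from threading import Thread


def run_agent(prompt, handler):
    # Tool calls fire on_llm_end mid-run, which is what was cutting the stream,
    # so mark the stream finished only once the agent has fully returned.
    try:
        agent.run(prompt)
    finally:
        handler._done.set()


Thread(target=run_agent, args=(prompt, handler), daemon=True).start()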
Voila! In the end, every problem is bound to be solved lol.

The answer was that simple. Hopefully it doesn't break in some other way.
Attachment
Screenshot_2023-07-01_at_4.01.33_PM.png
I had to modify the callback that comes out of the box from LangChain to read in the tokens. It was probably just a matter of understanding how it works. Got that out of the way.
Nice, thanks for sharing that! 💪