Hello. I've looked in the docs and asked kapa but I couldn't find this answer. I am interested in using QueryPipeline to encapsulate the entire RAG chain in one object. Is it possible to stream the final LLM answer as a result of QueryPipeline().run(), i.e., yield a generator from which chunks can be retrieved?

So far I've only managed to do it by removing the LLM from the pipeline (so it only retrieves docs and prepares a prompt) and calling the LLM separately, like:
Python
output, intermediates = pipeline.run_with_intermediates(
    input="Who is a good boy?"
)
# `output` here is the prompt/messages prepared by the last pipeline step
response = llm_azure.stream_chat(messages=output)
for chunk in response:
    print(chunk.delta, end="", flush=True)
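
For reference, a rough sketch of what streaming from inside the pipeline might look like, based on the streaming example in the LlamaIndex query pipeline docs. It assumes the LLM component accepts streaming=True via as_query_component, reuses llm_azure from the snippet above, and uses a placeholder prompt_tmpl whose variable name must match the run() kwarg; with streaming on, run() should return a generator of chunks rather than a finished response:
Python
from llama_index.core import PromptTemplate
from llama_index.core.query_pipeline import QueryPipeline

# placeholder prompt template; the run() kwarg below must match its variable name
prompt_tmpl = PromptTemplate("Answer the question: {query}")

# keep the LLM inside the pipeline, but ask its component to stream
pipeline = QueryPipeline(
    chain=[prompt_tmpl, llm_azure.as_query_component(streaming=True)]
)

# with streaming enabled, run() yields response chunks instead of a final answer
output = pipeline.run(query="Who is a good boy?")
for chunk in output:
    print(chunk.delta, end="", flush=True)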
4 comments
Although... I hope the intermediates thing doesn't exhaust the stream before you get it 😅
Cool, it works, thanks a lot! It didn't occur to me to go to the examples; normally I look at the documentation/API reference for that kind of thing.
In a library that moves this fast, the examples are probably your best bet 😅

The API reference has gotten a lot better lately, but it's still not perfect.