----------------------
To use a streaming response with LlamaIndex, you need to ensure that both your LLM and your query engine are configured to support streaming.
- Configure an LLM that supports streaming: pick an LLM with streaming support and set `streaming=True` on it. Currently, streaming is supported by the OpenAI and HuggingFace LLMs. Here is an example of how to set up such an LLM:
```python
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext

# Wrap a streaming-enabled chat model in an LLMPredictor and register it in the service context
llm_predictor = LLMPredictor(
    llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", streaming=True)
)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor
)
```
- Configure your query engine to use streaming: if you are using the high-level API, set `streaming=True` when building the query engine. Here is an example:
```python
query_engine = index.as_query_engine(
    streaming=True,
    similarity_top_k=1
)
```
If you are using the low-level API to compose the query engine, pass `streaming=True` when constructing the `ResponseSynthesizer`:
```python
synth = ResponseSynthesizer.from_args(streaming=True, ...)
query_engine = RetrieverQueryEngine(response_synthesizer=synth, ...)
```
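For orientation, here is a rough end-to-end sketch of that low-level composition under the legacy (0.6-era) API, reusing the `service_context` from the first step. The data directory, the `VectorIndexRetriever` arguments, and the exact import paths are assumptions for illustration and may differ in your version:

```python
from llama_index import GPTVectorStoreIndex, ResponseSynthesizer, SimpleDirectoryReader
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import VectorIndexRetriever

# Build an index that picks up the streaming-enabled LLM via the service context
documents = SimpleDirectoryReader("./data").load_data()  # assumed data folder
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

# Retrieve the single most similar node and synthesize a streaming answer from it
retriever = VectorIndexRetriever(index=index, similarity_top_k=1)
synth = ResponseSynthesizer.from_args(streaming=True, service_context=service_context)
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synth)
```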
After setting up streaming, you can use the `query` method of your query engine to get a streaming response:
```python
response_stream = query_engine.query("Your query here")
```
And then print the streaming response:
```python
response_stream.print_response_stream()
```
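Alternatively, instead of printing, you can consume the tokens yourself; in the legacy API the streaming response exposes a `response_gen` generator of text chunks (a small sketch, assuming that attribute is available in your version):

```python
# Stream tokens to stdout as they arrive, rather than waiting for the full answer
for token in response_stream.response_gen:
    print(token, end="", flush=True)
```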