I have a RAG app that I'd like to expose

I have a RAG app, and I'd like to expose an API endpoint that just sends back the enriched prompt. I'm assuming that by the time the query_engine has been created, the prompt template has had the context_str and query_str pulled from the index and substituted into the template. Is there any way to get the completed prompt as a string from the query_engine?
hmm, I think the only way to get the prompt (filled in) is to set up a callback handler to catch the LLM events
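
For reference, here's a minimal sketch of that callback approach using the built-in LlamaDebugHandler. This assumes the legacy ServiceContext-era API and that documents has already been loaded; the exact payload keys can vary between versions:

Plain Text
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.callbacks.schema import EventPayload

# Record all events, including LLM calls and their payloads
llama_debug = LlamaDebugHandler()
callback_manager = CallbackManager([llama_debug])
service_context = ServiceContext.from_defaults(callback_manager=callback_manager)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

# Each pair is (start_event, end_event) for one LLM call
start_event, end_event = llama_debug.get_llm_inputs_outputs()[0]
# Completion-style LLMs typically put the filled-in prompt under
# EventPayload.PROMPT; chat-style LLMs use EventPayload.MESSAGES instead
filled_prompt = start_event.payload.get(EventPayload.PROMPT)
print(filled_prompt)
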
Thanks! I'll check the docs for how to set up the callback. It should be similar in complexity, since it just has to send the text back in response to a POST request in FastAPI πŸ™‚
@Logan M One question: this example seems to also print the completion. I don't need it to actually call the LLM, just to fill in the prompt template with the retrieved information. Is it any easier to just get the prompt from the query_engine object?
Hm. The easiest solution is probably setting a MockLLM in the service context
Ahh, interesting approach! Just to confirm, then: the prompt template is only filled in when the query function is called, correct? The query_engine doesn't fill it in first, right?

So I need to make a MockLLM that returns nothing, and use a callback to return the filled-in prompt?
Yea! We actually have a mock LLM built in. It's activated when you set the llm to None:

Plain Text
service_context = ServiceContext.from_defaults(llm=None)  # llm=None activates the built-in MockLLM
index = VectorStoreIndex.from_documents(documents, service_context=service_context)


Technically, the prompts are formatted inside the response synthesizer, which is inside the query engine πŸ‘
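
Putting the pieces together, here's a rough end-to-end sketch of the original goal: a FastAPI endpoint that returns the enriched prompt without a real completion. The route and request model names are made up for illustration, and note that LlamaDebugHandler accumulates events globally, so a production service would want per-request isolation:

Plain Text
from fastapi import FastAPI
from pydantic import BaseModel

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.callbacks.schema import EventPayload

# llm=None swaps in the built-in MockLLM, so no real completion happens
llama_debug = LlamaDebugHandler(print_trace_on_end=False)
service_context = ServiceContext.from_defaults(
    llm=None,
    callback_manager=CallbackManager([llama_debug]),
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()

app = FastAPI()

class PromptRequest(BaseModel):  # hypothetical request model
    query: str

@app.post("/prompt")  # hypothetical route
def get_filled_prompt(req: PromptRequest) -> dict:
    # Running the query formats the template inside the response synthesizer;
    # the MockLLM makes the "completion" step a no-op
    query_engine.query(req.query)
    start_event, _ = llama_debug.get_llm_inputs_outputs()[-1]
    return {"prompt": start_event.payload.get(EventPayload.PROMPT)}
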