I'm using LlamaIndex to build a question-answering bot on top of my private knowledge base (a set of prepared question-answer pairs in a CSV file). The knowledge base is embedded into LlamaIndex's vector store. Everything works well so far, except for the latency caused by the LLM API calls. I want to improve it like this: when a user asks a question, the bot should search the vector store first; if there is a good match, return the matched answer directly without calling the LLM; if there isn't a good match, fall back to the LLM for an answer. The goal is to reduce unnecessary LLM calls. Does anyone know how to do this? Thanks a ton!
You can set response_mode to no_text to fetch only the similar documents without making an LLM call. Obviously those chunks will be pretty raw and not nicely formatted like an LLM response. There is also a similarity filter you can apply to the fetched nodes: https://gpt-index.readthedocs.io/en/latest/understanding/querying/querying.html
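Here's a minimal sketch of how the two pieces could fit together: a no_text query engine with a SimilarityPostprocessor does the cheap lookup, and only if nothing passes the cutoff do you fall back to the regular LLM-backed query engine. The imports assume a recent llama_index release (older versions import from `llama_index` instead of `llama_index.core`), and the 0.8 cutoff is an arbitrary placeholder you'd tune for your embedding model.

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor


def answer(index: VectorStoreIndex, question: str, cutoff: float = 0.8) -> str:
    # Step 1: retrieval only. response_mode="no_text" skips LLM synthesis,
    # and the postprocessor drops any node scoring below the cutoff.
    lookup = index.as_query_engine(
        response_mode="no_text",
        similarity_top_k=1,
        node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=cutoff)],
    )
    result = lookup.query(question)

    # Step 2: if a node survived the cutoff, return its stored text directly
    # (for a Q/A-pair knowledge base this is the prepared answer).
    if result.source_nodes:
        return result.source_nodes[0].node.get_content()

    # Step 3: no good match -- fall back to the usual LLM-synthesized answer.
    return str(index.as_query_engine().query(question))
```

The fallback query engine call is the only place the LLM gets invoked, so strong matches are answered purely from the vector store.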