
Long Context RAG: is RAG dead?

Appreciated 💪 I think this topic came up before in a talk by Lance from LangChain. Summary:
(https://www.youtube.com/watch?v=UlmyyYQGhzc)
  • Context lengths for LLMs are increasing, raising questions about the necessity of external retrieval systems like RAG, especially when massive amounts of context can be fed directly into LLMs.
  • Greg Kamradt's Needle in a Haystack analysis tested LLMs' ability to retrieve specific facts from varying context lengths and placements within documents, revealing limitations in retrieval, particularly towards the start of longer documents (a rough sketch of such a test follows after this list).
  • RAG use cases often require multi-fact retrieval: pulling several facts out of one context at once. Google's recent 100-needle retrieval demonstrates the need for efficient multi-needle retrieval for comprehensive understanding.
  • Retrieval from long contexts doesn't guarantee retrieval of multiple facts, especially with increasing context size and number of needles.
  • Cost of long-context tests can be managed effectively, with careful budgeting enabling meaningful research without significant financial strain.
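To make the needle test concrete, here is a rough multi-needle sketch in Python. This is my own minimal version, not Greg's or LangChain's actual harness, and `call_llm` is a hypothetical stand-in for whatever chat-completion client you use:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client here."""
    raise NotImplementedError

def build_haystack(filler: str, needles: list[str], context_words: int) -> str:
    """Pad filler text to roughly `context_words` words and bury the needles
    at evenly spaced depths (start, middle, end)."""
    words = filler.split()
    words = (words * (context_words // max(len(words), 1) + 1))[:context_words]
    step = max(len(words) // (len(needles) + 1), 1)
    for i, needle in enumerate(needles, start=1):
        words.insert(i * step, needle)
    return " ".join(words)

def multi_needle_recall(filler: str, needles: list[str], context_words: int) -> float:
    """Ask the model to list every buried fact and score the fraction it recovers."""
    context = build_haystack(filler, needles, context_words)
    prompt = f"{context}\n\nList every 'secret fact' mentioned in the text above, verbatim."
    answer = call_llm(prompt)
    return sum(needle in answer for needle in needles) / len(needles)

# Example (hypothetical numbers): 10 needles buried in ~100k words of filler
# needles = [f"Secret fact: the code for vault {i} is {1000 + 7 * i}." for i in range(10)]
# print(multi_needle_recall(open("filler.txt").read(), needles, context_words=100_000))
```

Running this over a grid of context sizes and needle counts is basically what the multi-needle experiments do, just far more carefully.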
Limitations of longer contexts:
  • No retrieval guarantees: multiple facts are not guaranteed to be retrieved, especially as the number of needles and the context size increase.
    GPT-4o tends to miss facts placed near the start of the document, with fewer failures on bigger datasets.
  • Specific prompting is needed for larger contexts.
  • Performance degrades when the LLM is asked to reason about retrieved facts, and more so the longer the context.
  • Longer contexts are pricey and take longer to generate.
My take: in the future there will be less focus on indexing/chunking and more focus on improving retrieval while reducing hallucinations. DSPy could be interesting for this (rough sketch below).
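For reference, a RAG pipeline in DSPy looks roughly like this. It's adapted from the intro example in DSPy's docs; you still have to configure an LM and a retriever via `dspy.settings.configure`, which I leave out here:

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question using only the provided context."""
    context = dspy.InputField(desc="retrieved passages, possibly noisy")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="short answer grounded in the context")

class RAG(dspy.Module):
    def __init__(self, num_passages: int = 3):
        super().__init__()
        # Uses whatever retriever you configured via dspy.settings.configure(rm=...)
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)
```

The interesting part is that the prompt for `GenerateAnswer` isn't hand-written: an optimizer like BootstrapFewShot can tune it against a metric that penalizes answers not grounded in the retrieved context, which is exactly the "better retrieval, fewer hallucinations" angle.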
Since black-box LLMs are pretrained on unknown datasets, the leakage of evaluation datasets may occur. Especially, some of the evaluation datasets are based on Wikipedia, which has likely been seen by LLMs during training. In some cases, we find that the model may predict the correct answer using exactly the same words as the ground truth (e.g. "meticulously"), even when they do not appear in the provided context. In our experiment, we try mitigating this issue by prompting the model to answer "based only on the provided passage" for both RAG and LC. It remains an open question how to address the data leakage issue in LLM evaluation.
Interesting point, and why I don't trust most evaluation benchmarks: they don't properly reflect real-life use cases. Most of the datasets used are either very cleanly formatted already or drawn from internet data the models were trained on, while in reality most clients, enterprises and small businesses alike, have very "dirty" data. We could benefit from more realistic evaluation metrics.
Concretely, our method consists of two steps: a RAG-and-Route step and a long-context prediction step. In the first step, we provide the query and the retrieved chunks to the LLM, and prompt it to predict whether the query is answerable and, if so, generate the answer. This is similar to standard RAG, with one key difference: the LLM is given the option to decline answering with the prompt "Write unanswerable if the query can not be answered based on the provided text". For the queries deemed answerable, we accept the RAG prediction as the final answer. For the queries deemed unanswerable, we proceed to the second step, providing the full context to the long-context LLMs to obtain the final prediction (i.e., LC).
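Paraphrasing that two-step routing as code (my own sketch, not the paper's implementation; `llm` is any prompt-in, text-out callable):

```python
UNANSWERABLE = "unanswerable"

def rag_and_route(llm, query: str, chunks: list[str], full_document: str) -> str:
    # Step 1: standard RAG, but the model is allowed to decline.
    rag_prompt = (
        "\n\n".join(chunks)
        + "\n\nAnswer the query based only on the provided text. "
        + f"Write {UNANSWERABLE} if the query can not be answered based on the provided text."
        + f"\nQuery: {query}"
    )
    answer = llm(rag_prompt)
    if UNANSWERABLE not in answer.lower():
        return answer  # cheap path: the RAG prediction is accepted as final

    # Step 2: only queries routed as unanswerable pay the long-context cost.
    lc_prompt = f"{full_document}\n\nAnswer based only on the provided passage.\nQuery: {query}"
    return llm(lc_prompt)
```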
Interesting method but I propose 2 different ones, inspired by Lance:
  • Use parent-document retrieval: attach metadata to each chunk, do a vector search over that metadata to fetch all topic-related chunks, and have those chunks point back to their parent document (rough sketch after these bullets).
This means you would not need to ingest the complete document into the context; you could e.g. ingest only the relevant chapter(s) based on the query.
  • There are multiple ways to implement this, but the bottleneck right now seems to be a proper way to create reliable metadata for the data.
From my early testing there is no real solution yet for accurate metadata labeling that is both affordable and usable.
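For what it's worth, here is a bare-bones sketch of the parent-document idea without any framework. The `topic`/`parent_id` metadata fields and the toy `embed` are placeholders for illustration; LangChain's ParentDocumentRetriever is a ready-made take on the same pattern:

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashed bag-of-words, normalized."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Child chunks carry metadata pointing back to their parent chapter.
chunks = [
    {"topic": "pricing and latency of long-context calls", "parent_id": "chapter-3"},
    {"topic": "multi-needle retrieval results", "parent_id": "chapter-5"},
]
parents = {
    "chapter-3": "...full text of chapter 3...",
    "chapter-5": "...full text of chapter 5...",
}

def retrieve_parent_docs(query: str, k: int = 1) -> list[str]:
    """Vector-search the chunk metadata, then return whole parent documents
    (e.g. the relevant chapter) instead of the raw chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: -float(q @ embed(c["topic"])))
    parent_ids: list[str] = []
    for c in ranked:
        if c["parent_id"] not in parent_ids:
            parent_ids.append(c["parent_id"])
        if len(parent_ids) == k:
            break
    return [parents[pid] for pid in parent_ids]

# e.g. retrieve_parent_docs("how expensive is a 128k-token prompt?") -> [chapter 3 text]
```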
Ya, I think for one-pager use cases LC is simply faster and better

But after a certain size LC doesn’t work cuz it has too much content?

I think currently under 30k tokens is my guesstimate as of Aug 2024, but that goalpost is constantly moving.


I don't think the 128k LC performs as well, but that could change next week 😭
Similar to when instructing an LLM: if you put the ask at the beginning and/or the end, it follows the instructions much better than if you put it in the middle.

Similarly, retrieval has the same problem…