🌟 With the continuous expansion of context windows, is RAG useful?
👉 With Claude 3, Gemini 1.5, and GPT-4 posting near-perfect scores on the needle-in-the-haystack benchmark, it raises an obvious question: do we still need RAG, or can we simply stuff all our information directly into the LLM's context?
The short answer: stuffing the context is not enough, and RAG is far from obsolete. Here's why.
🔍📚 Let's look at what the needle-in-the-haystack benchmark actually entails. It measures how reliably an LLM can identify a single piece of information (the needle) buried in a vast expanse of unrelated text (the haystack). The setup is rather artificial, however, and doesn't reflect how LLMs are used in real production systems.
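To make the setup concrete, here is a minimal, hypothetical sketch of a single-needle test. The needle, filler text, model name, and pass/fail check below are illustrative placeholders, not the benchmark's actual code.

```python
# Minimal sketch of a single-needle test (hypothetical values throughout).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

NEEDLE = "The secret code for the vault is 7421."
QUESTION = "What is the secret code for the vault?"

def build_haystack(filler: str, needle: str, depth: float, target_chars: int) -> str:
    """Repeat filler text to ~target_chars and insert the needle at a relative depth (0.0-1.0)."""
    haystack = (filler * (target_chars // len(filler) + 1))[:target_chars]
    cut = int(len(haystack) * depth)
    return haystack[:cut] + " " + needle + " " + haystack[cut:]

filler = "Paul Graham writes essays about startups, programming, and life. "
context = build_haystack(filler, NEEDLE, depth=0.5, target_chars=50_000)

response = client.chat.completions.create(
    model="gpt-4-0125-preview",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"{context}\n\nQuestion: {QUESTION}"},
    ],
)
answer = response.choices[0].message.content
print("Needle recovered:", "7421" in answer)  # crude pass/fail score
```

The benchmark then sweeps the needle's depth and the context length and reports a grid of pass/fail scores.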
Despite LLMs excelling at these benchmarks, they often overlook crucial facts when tasked with synthesizing multiple pieces of information into a cohesive answer.
📈🧩 Adding more content to the context window only exacerbates the issue. By surrounding relevant facts with irrelevant ones, the likelihood of missing important information increases significantly.
🌧️💡 Fortunately, the team at LangChain has revised the benchmark to assess a more realistic scenario, revealing some concerning results. The outlook is indeed grim...
🔄🏆 As LLM developers continue to improve their models, we can anticipate them mastering this new benchmark in future iterations. However, as soon as one benchmark is conquered, new ones will emerge that highlight different weaknesses. This iterative process of overcoming challenges is an ongoing cycle.
🧂💭 The takeaway here is to approach benchmarks with a healthy dose of skepticism. They are far from providing a comprehensive understanding of the complexities and nuances of real-world applications.
❓ Is RAG Really Dead? Testing Multi-Fact Retrieval & Reasoning in GPT-4-128k ⬇
One of the most popular benchmarks for long context LLM retrieval is Gregory Kamradt's Needle in A Haystack.
We extended Greg's repo so that multiple needles can be placed in the context, and used it to test GPT-4-128k.
📹 Short video (more detail below):
https://lnkd.in/g4GYuN6Y
---
Most Needle in A Haystack analyses to date have evaluated only a single needle. But RAG is often used to retrieve multiple facts and reason about them.
To replace RAG, long context LLMs need to retrieve & reason about multiple facts in the prompt.
To test this, we recently updated Greg's repo to support multiple needles and to use LangSmith for evaluation.
We tested GPT-4-128k on retrieval of 1, 3, and 10 needles in a single turn across 1k to 120k token context windows.
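For intuition, here is a hedged sketch of what a multi-needle run looks like. The needles, questions, filler, and model name are illustrative placeholders, not the repo's actual test set or harness.

```python
# Illustrative multi-needle run (hypothetical needles/questions, not the repo's code).
from openai import OpenAI

client = OpenAI()

NEEDLES = [
    "Secret ingredient #1 for the perfect pizza is figs.",
    "Secret ingredient #2 for the perfect pizza is prosciutto.",
    "Secret ingredient #3 for the perfect pizza is goat cheese.",
]
RETRIEVAL_Q = "What are the secret ingredients for the perfect pizza?"
REASONING_Q = "What is the first letter of each secret pizza ingredient?"

def insert_needles(haystack: str, needles: list[str]) -> str:
    """Spread the needles at evenly spaced depths through the haystack."""
    out = haystack
    for i, needle in enumerate(needles, start=1):
        cut = int(len(out) * i / (len(needles) + 1))
        out = out[:cut] + " " + needle + " " + out[cut:]
    return out

filler = "Startups live and die by growth, and growth compounds quietly. " * 2_000
context = insert_needles(filler, NEEDLES)

for question in (RETRIEVAL_Q, REASONING_Q):
    resp = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    print(question, "->", resp.choices[0].message.content)
```

In the actual setup, LangSmith handles the grading; here you would simply check whether each needle's fact shows up in the answer, and whether the reasoning question is answered correctly rather than just echoed back.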
We found that performance degrades:
1/ As you ask LLMs to retrieve more facts
2/ As the context window increases
3/ If facts are placed in the first half of the context
4/ When the LLM has to reason about retrieved facts
All code is open source:
https://lnkd.in/g-atxKuq
All runs can be seen here w/ public traces:
https://lnkd.in/gJi37_Tc
Write-up:
https://lnkd.in/gPwK5SEu
Short video explainer:
https://lnkd.in/gU4-_GdT