
Ryght At The Forefront: Is Retrieval-Augmented Generation (RAG) Facing Extinction?

Written by Kenneth (Kirby) Bloom | Apr 16, 2024 6:13:26 AM


When you start judging a white paper's age in months versus years, it is a pretty good sign that things are moving at warp speed. Never in my life did I think I'd discount the value of reading something written over four months ago. Case in point, I now have a collection of documents on retrieval-augmented generation (RAG) patterns that are on the edge of extinction.

At Ryght, we often find ourselves in conversations with teams that are contemplating “Build vs Buy” in the generative AI space. As part of that, we get to talk to a lot of really smart, capable folks that have kicked the tires on a few things here and there. Maybe they've spun up a chatbot or even read a few white papers. Honestly, we love talking to those folks for various reasons, but inevitably you get the person who wants to find flaws in your offering because they secretly want to build what you are pitching. Lately, at the point where we highlight our RAG capabilities, we've been getting the question, "With larger context windows becoming more accessible, isn't RAG going to be dead?"

The short answer has been, “No”. At the core of it, RAG is the ability to augment a model with data it was not trained on. Filling up a context window with a really large document is still RAG, because you still have to retrieve that document and you are still augmenting the model’s response with it. The real question sits a little lower down the stack: how efficiently do you need to chunk your documents for RAG? Depending on the project requirements, the need to split your information into bite-sized fragments for processing might become less important than it once was.
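To make that concrete, here is a minimal sketch of the retrieve-then-augment loop. The keyword-overlap scorer and the stubbed-out model call are illustrative placeholders, not our production stack:

```python
# Minimal retrieve-then-augment sketch. The scorer and call_llm() stub are
# placeholders; in practice you'd use embeddings and a real model API.

def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query terms that appear in the chunk."""
    terms = set(query.lower().split())
    return sum(t in chunk.lower() for t in terms) / max(len(terms), 1)

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Pick the top-k chunks by score -- the 'retrieval' in RAG."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def augment(query: str, context: list[str]) -> str:
    """Build a prompt grounded in retrieved context -- the 'augmentation'."""
    joined = "\n\n".join(context)
    return f"Answer using only the context below.\n\nContext:\n{joined}\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model or API you actually call."""
    return "<model response>"

chunks = [
    "Trial NCT-001 enrolled 120 patients.",
    "The assay uses qPCR.",
    "Dosing was twice daily.",
]
question = "How many patients were enrolled?"
print(call_llm(augment(question, retrieve(question, chunks))))
```

Whether each retrieved "chunk" is a paragraph or an entire document, the pattern is the same; only the chunking granularity changes.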

That said, I am still a RAG advocate for folks who have a good chunking and retrieval strategy in place.  Let's hit a few points:

Cost & Latency

Recently I came across this post that nails the cost component of the equation. If you are confident in your ability to pull out the relevant portions of larger documents, you might simply be wasting money while adding more noise for the model to throw out.
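To put rough numbers on it, here's a hypothetical back-of-the-envelope. The per-token price and token counts below are made-up placeholders, not any vendor's actual rates:

```python
# Hypothetical back-of-the-envelope: cost of stuffing whole documents into the
# context window vs. sending only the retrieved, relevant chunks.
PRICE_PER_1K_INPUT_TOKENS = 0.01   # placeholder rate, not a real price list

full_document_tokens = 150_000      # e.g., a long protocol or filing
retrieved_chunk_tokens = 4_000      # only the passages the retriever surfaced
queries_per_day = 500

def daily_cost(tokens_per_query: int) -> float:
    """Input-token spend per day for a given prompt size."""
    return tokens_per_query / 1_000 * PRICE_PER_1K_INPUT_TOKENS * queries_per_day

print(f"Stuff everything: ${daily_cost(full_document_tokens):,.2f}/day")
print(f"Retrieve chunks:  ${daily_cost(retrieved_chunk_tokens):,.2f}/day")
```

The same gap shows up in latency, since prompt-processing time grows with input length.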

If anything, consider what that bandwidth could allow you to do with chat history or other sources to improve responses. For those with experience in genomics, I made the analogy to Alex Dickinson: Why do a whole-genome sequence when a targeted panel or an array will get you the data you need?

Accuracy 

While models such as Gemini 1.5, Claude 2.1, and GPT-4 Turbo boast larger context windows than other models available today, they all achieve around 60% average recall (shown below) as you stretch the context window. That means there is a significant chance crucial information gets missed, depending on where it sits in the window.

The evaluations above focus on finding a single piece of information within a limited context. However, real-world applications of RAG architectures involve extracting multiple relevant points from various sources to inform a model's response. To better assess a model's capabilities, we should evaluate its ability to find multiple "needles" within a larger context window. Google's report demonstrates this concept by showcasing Gemini 1.5's ability to extract 100 such needles in a single turn. It, too, hovers around a 60% recall rate.
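If you want to run that kind of check yourself, a sketch of a multi-needle evaluation might look like the following. The filler text, needle format, and stubbed model call are all illustrative assumptions, not the methodology used in the published reports:

```python
import random

def call_llm(prompt: str) -> str:
    """Placeholder for the long-context model under test."""
    return ""

def multi_needle_recall(num_needles: int = 100, filler_paragraphs: int = 5_000) -> float:
    """Plant `num_needles` facts at random positions in a long filler context,
    ask the model to list them back, and report the fraction recovered."""
    needles = [f"The secret code for item {i} is {random.randint(1000, 9999)}."
               for i in range(num_needles)]
    haystack = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit."] * filler_paragraphs
    for needle in needles:
        haystack.insert(random.randrange(len(haystack)), needle)
    prompt = "\n".join(haystack) + "\n\nList every secret code mentioned above."
    answer = call_llm(prompt)
    found = sum(1 for n in needles if n.split()[-1].rstrip(".") in answer)
    return found / num_needles

print(f"recall = {multi_needle_recall():.0%}")
```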

Understanding this risk for tasks where completeness is critical - like many in healthcare & life sciences - is paramount to utilizing these models correctly. The ability to employ other strategies to ensure the most relevant content is leveraged becomes crucial for success. Many RAG-based strategies can help increase the chances that all the data you provide is actually considered by the model with the ultimate goal of maximizing accuracy and reducing the risk of overlooking significant details. 

Modularity
Building on the accuracy risks above, it is important to have a system in place that gives you more knobs to twist and levers to pull along the way. An orchestration layer acts as a conductor, managing the flow of data and tasks within the LLM system. Within that layer, RAG specifically lets you retrieve relevant information from external sources during the generation process.

If you just stuff everything into the context window, you are robbing yourself of the opportunity to tune along the way. Instead, breaking your workflow down into more composable tasks using RAG can help you isolate the right information at the right time, giving you more control over the whole process.

LLM applications are already difficult to troubleshoot. Taking out these control points will only make things exponentially harder to debug when model responses are not hitting the mark. 
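To illustrate, here's a sketch of what keeping those control points can look like. The stages below are placeholders, not our actual orchestration layer:

```python
# Composable pipeline sketch: each stage is a separate, swappable step you can
# log, evaluate, and tune independently -- the "knobs and levers" above.
from typing import Callable

def retrieve(query: str) -> list[str]:
    return ["chunk A", "chunk B", "chunk C"]             # placeholder retriever

def rerank(query: str, chunks: list[str]) -> list[str]:
    return chunks[:2]                                     # placeholder reranker

def generate(query: str, chunks: list[str]) -> str:
    return f"<answer grounded in {len(chunks)} chunks>"   # placeholder LLM call

Stage = Callable[..., object]

def run_pipeline(query: str, stages: dict[str, Stage]) -> str:
    chunks = stages["retrieve"](query)
    print(f"[trace] retrieved {len(chunks)} chunks")      # a control point you can inspect
    chunks = stages["rerank"](query, chunks)
    print(f"[trace] kept {len(chunks)} after rerank")     # another knob to tune
    return stages["generate"](query, chunks)

print(run_pipeline("What were the trial endpoints?",
                   {"retrieve": retrieve, "rerank": rerank, "generate": generate}))
```

Stuffing the whole corpus into one prompt collapses all of this into a single, opaque step.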

Data Governance

There's one more critical aspect to consider in support of the RAG-based approach: fine-grained permissions on data sources. Imagine you're building an LLM system within an organization. Traditionally, granting access to data sources often involves a complex web of permissions, making it difficult to manage who can see what.

This is where fine-grained permissions come in. By adopting a more modular approach in retrieval, we can break down access controls to individual pieces of information. This allows us to leverage authorization tools (AuthZ) in combination with RAG to precisely control what data feeds into the LLM for each user.

The benefit? Tighter governance. Instead of granting access to an entire document, which might expose sensitive information, we can build a permission-aware system that controls access at a finer granularity. This ensures the right people have access to the right data points in your documents, ultimately improving the quality of information fed to the LLM while upholding governance.
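As a minimal sketch of what permission-aware retrieval can look like, assuming a hypothetical `is_authorized` helper standing in for whatever AuthZ system you actually run:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source_id: str          # the document or record the chunk came from
    acl: set[str]           # roles allowed to read this chunk

def is_authorized(user_roles: set[str], chunk: Chunk) -> bool:
    """Hypothetical AuthZ check -- in practice this calls your policy engine."""
    return bool(user_roles & chunk.acl)

def permission_aware_retrieve(query: str, chunks: list[Chunk],
                              user_roles: set[str], k: int = 3) -> list[Chunk]:
    """Filter to chunks the user may see *before* anything reaches the prompt."""
    allowed = [c for c in chunks if is_authorized(user_roles, c)]
    # ...rank `allowed` by relevance to `query` here...
    return allowed[:k]

chunks = [
    Chunk("Adverse event rates by site.", "study-042", {"clin-ops", "safety"}),
    Chunk("Patient-level identifiers.", "study-042", {"privacy-office"}),
]
for c in permission_aware_retrieve("adverse events", chunks, {"clin-ops"}):
    print(c.source_id, "->", c.text)
```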

To close this out, I’ll admit one can argue that many of these points could be nullified by a race to the bottom on cost and latency. To some degree, you might take the same stance on accuracy, on the assumption that these models will only get better, cheaper, and faster. On that front, we are just as excited as everyone else to use these larger-context models.

That said, I still believe in a more granular approach over a brute-force, “Let’s throw the kitchen sink at it,” mentality, especially when working in the enterprise, where fine-grained permissions on data are paramount and data sources are abundant.

If you choose to run these larger context windows at scale and just stuff them with content, when problems arise, you might start to think… “Mo’ tokens, mo’ problems”? 🤔

Also - In case you missed it, read the previous article in our “Ryght at the Forefront” thought-leadership series by Ryght’s own Founder/CEO and serial entrepreneur, Simon Arkell OLY, about “The GenAI Enterprise Unlock (and wooly mammoths)”.