Make room for RAG: How Gen AI’s balance of power is shifting

Much of the interest surrounding artificial intelligence (AI) is caught up with the battle of competing AI models on benchmark tests or new so-called multi-modal capabilities. 

OpenAI announces a video capability, Sora, that stuns the world, Google responds with Gemini’s ability to pick out a frame of video, and the open-source software community quickly unveils novel approaches that speed past the dominant commercial programs with greater efficiency. 

But users of Gen AI’s large language models, especially enterprises, may care more about a balanced approach that produces valid answers speedily. 


A growing body of work suggests the technology of retrieval-augmented generation, or RAG, could be pivotal in shaping the battle between large language models (LLMs).

RAG is the practice of having an LLM respond to a prompt by first sending a request to an external data source, such as a “vector database”, and retrieving authoritative data. The most common use of RAG is to reduce the propensity of LLMs to produce “hallucinations”, where a model asserts falsehoods confidently.  
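The retrieval step described above can be sketched in a few lines. This is a minimal, illustrative sketch: the bag-of-words embedding and the toy document list stand in for a real embedding model and vector database, and no actual LLM is called.

```python
# Minimal sketch of the RAG flow: embed a query, retrieve the closest
# document from a toy "vector database", and prepend it to the prompt
# that would be sent to the LLM.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str]) -> str:
    # Return the document most similar to the query.
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

def build_prompt(query: str, docs: list[str]) -> str:
    context = retrieve(query, docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The Model X-1 supports a maximum payload of 25 kg.",
    "Warranty claims must be filed within 90 days of purchase.",
]
print(build_prompt("What payload does the Model X-1 support?", docs))
```

The grounding effect comes from the last step: the model is instructed to answer from the retrieved context rather than from whatever it memorized in training.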

Commercial software vendors, such as search software maker Elastic, and “vector” database vendor Pinecone, are rushing to sell programs that let companies hook up to databases and retrieve authoritative answers grounded in, for example, a company’s product data. 


What’s retrieved can take many forms, including documents from a document database, images from a picture file or video, or pieces of code from a software development code repository. 

What’s already clear is the retrieval paradigm will spread far and wide to all LLMs, both for commercial and consumer use cases. Every generative AI program will have hooks into external sources of information. 

Today, that process can be achieved with function calling, which OpenAI and Anthropic offer for their GPT and Claude programs respectively. Those simple mechanisms provide limited access to data for limited queries, such as getting the current weather in a city. 

Function calling will probably have to meld with, or be supplanted by, RAG at some point to extend what LLMs can offer in response. 

That shift implies RAG will become commonplace in how most AI models perform.

And that prominence raises issues. In this admittedly early phase of RAG’s development, different LLMs perform differently when using RAG, doing a better or worse job of handling the information that the RAG software sends back to the LLM from the database. That difference means that RAG becomes a new factor in the accuracy and utility of LLMs. 

RAG, even as early as the initial training phase of AI models, could start to affect the design considerations for LLMs. Until now, AI models have been developed in a vacuum, built as pristine scientific experiments that have little connection to the rest of data science. 


There may be a much closer relationship in the future between the building and training of neural nets for generative AI and the downstream tools of RAG that will play a role in performance and accuracy.


Pitfalls of LLMs with retrieval 

Simply applying RAG has been shown to increase the accuracy of LLMs, but it can also produce new problems. 

For example, what comes out of a database can lead LLMs into conflicts that are then resolved by further hallucinations. 

In a report in March, researchers at the University of Maryland found that GPT-3.5 can fail even after retrieving data via RAG. 

“The RAG system may still struggle to provide accurate information to users in cases where the context provided falls beyond the scope of the model’s training data,” they write. The LLM would at times “generate credible hallucinations by interpolating between factual content.”

Scientists are finding that the design choices of LLMs can affect how they perform with retrieval, including the quality of the answers they return. 

A study this month by scholars at Peking University noted that “the introduction of retrieval unavoidably increases the system complexity and the number of hyper-parameters to tune,” where hyper-parameters are choices made about how to train the LLM.

For example, when a model chooses among several possible “tokens”, including tokens drawn from the RAG data, one can dial up or down how broadly it searches, meaning how large or narrow a pool of tokens it chooses from. 

Restricting the choice to a small group, known as “top-k sampling”, was found by the Peking scholars to “improve attribution but harm fluency”, so that what the user gets back involves trade-offs in quality, relevance, and more. 
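The top-k knob the Peking study tunes can be sketched directly. In this illustrative example, the token probabilities are invented: the point is only that a smaller k narrows the pool the model samples from, while a larger k widens it.

```python
# Sketch of top-k sampling: keep only the k highest-probability tokens,
# renormalize their probabilities, and sample from that reduced pool.
import random

def top_k_sample(probs: dict[str, float], k: int, rng: random.Random) -> str:
    # Keep the k most probable tokens.
    pool = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in pool)
    # Renormalize and sample proportionally.
    r = rng.random() * total
    for token, p in pool:
        r -= p
        if r <= 0:
            return token
    return pool[-1][0]

probs = {"Paris": 0.55, "France": 0.25, "Lyon": 0.12, "banana": 0.08}
rng = random.Random(0)
# With k=1 the choice is deterministic: always the top token.
print(top_k_sample(probs, k=1, rng=rng))  # Paris
```

With k=1 the model always picks its single best guess (strong attribution, stilted prose); with k=4 even the implausible “banana” is occasionally in play, which is the fluency-versus-attribution trade-off the scholars describe.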

Because RAG can dramatically expand the so-called context window, the total number of tokens an LLM must handle at once, using RAG can make a model’s context window a bigger issue than it otherwise would be. 

Some LLMs can handle many more tokens, on the order of a million for Gemini, while others handle far fewer. That fact alone could make some LLMs better at handling RAG than others. 

Both examples, hyper-parameters and context length affecting results, stem from the broader fact that, as the Peking scholars observe, RAG and LLMs each have “distinct objectives”. They weren’t built together; they’re being bolted together. 

It may be that RAG will evolve more “advanced” techniques to align with LLMs better, or, it may be the case that LLM design has to start to incorporate choices that accommodate RAG earlier in the development of the model.

Trying to make LLMs smarter about RAG

Scholars are spending a lot of time these days studying in detail failure cases of RAG-enabled LLMs, in part to ask a fundamental question: what’s lacking in the LLM itself that is tripping things up?


Scientists at WeChat, the Chinese messaging service owned by Tencent, described in a research paper in February how LLMs don’t always know how to handle the data they retrieve from the database. A model might spit back incomplete information given to it by RAG.

“The key reason is that the training of LLMs does not clearly make LLMs learn how to utilize input retrieved texts with varied quality,” write Shicheng Xu and colleagues.

To deal with that issue, they propose a special training method for AI models they call “an information refinement training method” named INFO-RAG, which they show can improve the accuracy of LLMs that use RAG data. 

The idea of INFO-RAG is to use data retrieved with RAG upfront, as the training method for the LLM itself. A new dataset is culled from Wikipedia entries, broken apart into sentence pieces, and the model is trained to predict the latter part of a sentence fetched from RAG by being given the first part.
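The data construction the paper describes can be sketched simply. This is an illustrative reduction, not the paper's exact recipe: sentences are split into a prefix, treated as retrieved text, and a continuation the model is trained to predict.

```python
# Sketch of INFO-RAG-style training pairs: the first half of a sentence
# plays the role of retrieved text, the second half is the prediction target.
def make_training_pairs(sentences: list[str]) -> list[tuple[str, str]]:
    pairs = []
    for s in sentences:
        words = s.split()
        mid = len(words) // 2
        prefix = " ".join(words[:mid])   # given to the model as "retrieved" text
        target = " ".join(words[mid:])   # continuation the model must predict
        pairs.append((prefix, target))
    return pairs

sentences = ["Marie Curie won the Nobel Prize in Physics in 1903."]
for prefix, target in make_training_pairs(sentences):
    print(f"input: {prefix!r} -> predict: {target!r}")
```

Training on such pairs pushes the model to complete and refine retrieved text rather than ignore it, which is the behavior the researchers found missing in standard LLM training.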

Therefore, INFO-RAG is an example of training an LLM with RAG in mind. More training methods will probably incorporate RAG from the outset, seeing that, in many contexts, using RAG is what one wants LLMs to do.

More subtle aspects of the RAG and LLM interaction are starting to emerge. Researchers at software maker ServiceNow described in April how they could use RAG to rely on smaller LLMs, which runs counter to the notion that the larger a large language model, the better.

“A well-trained retriever can reduce the size of the accompanying LLM at no loss in performance, thereby making deployments of LLM-based systems less resource-intensive,” write Patrice Béchard and Orlando Marquez Ayala.

If RAG substantially enables size reduction for many use cases, it could conceivably tilt the focus of LLM development away from the size-at-all-cost paradigm of today’s increasingly large models.

There are alternatives, with issues

The most prominent alternative is fine-tuning, where the AI model is retrained, after its initial training, by using a more focused training data set. That training can impart new capabilities to the AI model. That approach has the benefit of producing a model that could use specific knowledge encoded in its neural weights without relying on access to a database via RAG.

But there are issues particular to fine-tuning as well. Google scientists described this month problematic phenomena in fine-tuning, such as the “perplexity curse”, in which the AI model cannot recall the necessary information if it’s buried too deeply in a training document. 

That issue is a technical aspect of how LLMs are initially trained and requires special work to overcome. There can also be performance issues with fine-tuned AI models that degrade how well they perform relative to a plain vanilla LLM. 


Fine-tuning also implies having access to the model’s weights to retrain it, which is a problem for those who don’t have such access, such as the clients of OpenAI or another commercial vendor.

As mentioned earlier, function calling today provides a simple way for GPT or Claude LLMs to answer simple questions. The LLM converts a natural-language query such as “What’s the weather in New York City?” into a structured call with a function name and parameters, such as the city. 

Those parameters are passed to a helper app designated by the programmer, and the helper app responds with the exact information, which the LLM then formats into a natural-language reply, such as: “It’s currently 76 degrees Fahrenheit in New York City.”
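The round trip described in the last two paragraphs can be sketched end to end. Everything here is a stand-in: a real deployment registers a JSON schema with the vendor's API and the LLM itself emits the structured call, whereas below both the parsing step and the weather lookup are stubbed so the control flow is visible.

```python
# Sketch of the function-calling round trip: natural language in,
# structured call out, helper app result back, natural language reply.
def llm_parse_to_call(query: str) -> dict:
    # Stand-in for the LLM turning natural language into a structured call.
    city = query.rsplit(" in ", 1)[-1].rstrip("?")
    return {"name": "get_current_weather", "arguments": {"city": city}}

def get_current_weather(city: str) -> dict:
    # Stand-in helper app; a real one would query a weather service.
    fake_data = {"New York City": 76}
    return {"city": city, "temp_f": fake_data.get(city)}

def answer(query: str) -> str:
    call = llm_parse_to_call(query)
    result = get_current_weather(**call["arguments"])
    # Stand-in for the LLM formatting the result into natural language.
    return f"It's currently {result['temp_f']} degrees Fahrenheit in {result['city']}."

print(answer("What's the weather in New York City?"))
```

The rigidity is visible in the sketch: the helper only answers the one question its parameters encode, which is exactly the limitation the next paragraph describes.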

But that structured query limits what a user can do or what an LLM can be made to absorb as an example in the prompt. The real power of an LLM should be to field any query in natural language and use it to extract the right information from a database.

A simpler approach than either fine-tuning or function calling is known as in-context learning, which most LLMs do anyway. In-context learning involves presenting prompts with examples that give the model a demonstration that enhances what the model can do subsequently. 
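In-context learning is, mechanically, just prompt construction: demonstrations are placed ahead of the real query so the model can imitate the pattern. A minimal sketch, with illustrative examples and formatting:

```python
# Sketch of a few-shot (in-context learning) prompt: worked examples
# precede the real query, and the model continues the pattern.
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

examples = [
    ("Capital of France?", "Paris"),
    ("Capital of Japan?", "Tokyo"),
]
print(few_shot_prompt(examples, "Capital of Italy?"))
```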

The in-context learning approach has been expanded to something called in-context knowledge editing (IKE), where prompting via demonstrations seeks to nudge the language model to retain a particular fact, such as, “Joe Biden”, in the context of a query, such as, “Who is the president of the US?”

The IKE approach, however, still may entail some RAG usage, as it has to draw facts from somewhere. Relying on the prompt can make IKE somewhat fragile, as there’s no guarantee the new facts will remain within the retained information of the LLM.
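An IKE-style prompt extends the same idea: demonstrations show the model how to prefer a new fact stated in context over whatever it memorized in training. The demonstration wording below is illustrative, not taken from the IKE paper.

```python
# Sketch of an in-context knowledge editing (IKE) prompt: demonstrations
# teach the model to answer from the "new fact" supplied in context.
def ike_prompt(new_fact: str, demos: list[tuple[str, str, str]], query: str) -> str:
    parts = []
    for fact, q, a in demos:
        parts.append(f"New fact: {fact}\nQ: {q}\nA: {a}")
    parts.append(f"New fact: {new_fact}\nQ: {query}\nA:")
    return "\n\n".join(parts)

demos = [("The Eiffel Tower is in Rome.", "Where is the Eiffel Tower?", "Rome")]
print(ike_prompt("Joe Biden is the president of the US.",
                 demos, "Who is the president of the US?"))
```

The fragility the paragraph above notes is also visible here: the edited fact lives only in the prompt, so it vanishes the moment the prompt does.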

The road ahead 

The apparent miracle of ChatGPT’s arrival in November of 2022 was only the beginning of a long engineering process. A machine that can accept natural-language requests and respond in natural language still needs to be fitted with a way to produce accurate and authoritative responses. 

Performing such integration raises fundamental questions about the fitness of LLMs and how well they cooperate with RAG programs — and vice versa. 

The result could be an emerging sub-field of RAG-aware LLMs, built from the ground up to incorporate RAG-based knowledge. That shift has large implications. If RAG knowledge is specific to a field or a company, then RAG-aware LLMs could be built much closer to the end user, rather than being created as generalist programs inside the largest AI firms, such as OpenAI and Google. 

It seems safe to say RAG is here to stay, and the status quo will have to adapt to accommodate it, perhaps in many different ways.
