In a famous line over 60 years ago, early AI pioneer Norbert Wiener summed up one of the core challenges that humanity faces in building artificial intelligence: “If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively…we had better be quite sure that the purpose put into the machine is the purpose which we really desire.”
Put simply: as AI gets more powerful, how can we make sure that it reliably acts the way that we want it to?
Questions of AI safety and AI alignment can quickly become both philosophical and political. Look no further than the brewing AI culture war between the libertarian-minded “effective accelerationists” and the more safety-oriented voices in AI.
But AI researchers today are, out of necessity, grappling with these issues in a concrete and immediate sense.
Why does ChatGPT have a helpful personality? Why is it so easy to talk to? Why doesn’t it like to share information that it thinks could cause harm to humans?
These characteristics do not automatically emerge from the model’s vast training data corpus, nor from the massive amounts of compute used to train it, nor from the sophisticated transformer architecture on which it is based.
The answer is a technology known as reinforcement learning from human feedback (RLHF).
RLHF has become the dominant method by which human developers control and steer the behavior of AI models, especially language models. It impacts how millions of people around the world experience artificial intelligence today. It is impossible to understand how today’s most advanced AI systems work without understanding RLHF.
At the same time, newer methods are quickly emerging that seek to improve upon and displace RLHF in the AI development process. The technological, commercial and societal implications are profound: at stake is how humans shape the way that AI behaves. Few areas of AI research are more active or important today.
RLHF: A Brief Overview
While the technical details are complex, the core concept behind reinforcement learning from human feedback is simple: to fine-tune an AI model such that it acts in accordance with a particular set of human-provided preferences, norms and values.
Which preferences, norms and values?
A widely referenced goal for RLHF, coined by Anthropic researchers, is to make AI models “helpful, honest and harmless.”
This can include, for instance, discouraging models from making racist comments or from helping users to break the law.
But RLHF can be used more broadly to shape models’ behavior. It can give models different personalities: genuine or sarcastic, flirtatious or rude, pensive or cocky.
It can also be used to reorient models’ end goals: for instance, RLHF can turn a neutral language model into an AI that seeks to sell a particular product or to convert its audience to a particular political view.
RLHF, in its modern form, was invented in 2017 by a team of researchers from OpenAI and DeepMind. (As a side note, given today’s competitive and closed research environment, it is remarkable to remember that OpenAI and DeepMind used to conduct and publish foundational research together.)
The original RLHF work focused not on language models but on robotics and Atari games. But over the past few years, OpenAI has pioneered the use of RLHF as a method to better align large language models with human preferences.
In early 2022, for instance, OpenAI applied RLHF to its base GPT-3 model to create an updated model named InstructGPT. Human evaluators preferred the outputs of a 1.3-billion-parameter InstructGPT model to those of the 175-billion-parameter base GPT-3 model, even though the InstructGPT model was over 100 times smaller.
RLHF’s true coming-out party, however, was ChatGPT.
ChatGPT took the world by storm when it was released in November 2022, quickly becoming the fastest-growing consumer application in history. The underlying technology was not new: the language model on which ChatGPT was based had been publicly available in OpenAI’s playground for months. What made ChatGPT such a runaway success was that it was approachable, easy to talk to, helpful and good at following directions. The technology that made this possible was RLHF.
Since then, RLHF has become an essential ingredient in building cutting-edge language models, from Anthropic’s Claude to Google’s Bard to Meta’s Llama 2.
In the Llama 2 paper, Meta’s researchers described the importance of RLHF in no uncertain terms: “We posit that the superior writing abilities of LLMs, as manifested in surpassing human annotators in certain tasks, are fundamentally driven by RLHF.”
How, exactly, does reinforcement learning from human feedback work?
Let’s zoom out for a second to set the proper context. Training a large language model typically happens in three phases: pretraining, then supervised fine-tuning, then RLHF.
Pretraining entails exposing a model to a large corpus of text (more or less the entire internet, in many cases) and training it to predict the next word. This is by far the most time-consuming and compute-intensive step in the overall process; it is the core of building an LLM. With InstructGPT, for instance, pretraining accounted for 98% of the total resources used to develop the model, with the other two phases taking only 2%.
Next, in supervised fine-tuning, the raw pretrained model is fine-tuned on smaller amounts of higher-quality data (for example, carefully human-crafted text).
The RLHF process begins after these two phases have been completed.
(As a side note, it is worth asking why both supervised fine-tuning and RLHF are necessary. Technically, it is possible to do RLHF on a pretrained model without supervised fine-tuning, just as it is possible to do supervised fine-tuning on a model and skip the RLHF. But supervised data is particularly costly to acquire; RLHF preference data can more easily be collected at scale. And empirically, the best results are obtained when these two phases are combined.)
RLHF can be broken down into two major steps.
The first step involves building a second model, known as the reward model, whose purpose is to rate how good the main model’s output is.
How does the reward model know how to rate the main model’s outputs? This is the magic and the beating heart of RLHF.
To train the reward model, researchers collect “preference data” from human participants. Specifically, humans are asked to consider two responses to a given prompt and to select which one of the two responses they prefer. (The mechanics here can vary: for instance, participants may be asked to rank four responses from best to worst rather than two; these rankings can then be decomposed into several ranked pairs.) To get a concrete sense of what this preference data looks like, take a few minutes to peruse Anthropic’s open-source RLHF dataset.
When trained on this pairwise preference data at sufficient scale, the reward model is able to learn to produce a numerical rating of how desirable or undesirable any given output from the main model is.
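To make the two ideas above concrete, here is a minimal sketch in Python. It shows how a best-to-worst ranking can be decomposed into ranked pairs, and one common way to score a reward model on a single pair: a Bradley-Terry-style loss that shrinks as the preferred response's score pulls ahead of the rejected one. The function names and the example scores are illustrative assumptions, not taken from any particular lab's codebase.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() preserves input order, so the first element of every
    pair is always the higher-ranked response.
    """
    return list(combinations(ranked_responses, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Compute -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it gets the
    ordering wrong. Written in a numerically stable form.
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A ranking of four responses yields six ranked pairs.
pairs = ranking_to_pairs(["A", "B", "C", "D"])

# Hypothetical reward-model scores for one pair.
loss_when_correct = pairwise_loss(2.0, 0.0)  # model agrees with the human
loss_when_wrong = pairwise_loss(0.0, 2.0)    # model disagrees
```

In a real system the scores would come from a neural network and the loss would be averaged over many pairs and minimized with gradient descent. The sketch only shows the shape of the objective.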
Now that we have trained the reward model, the second step in RLHF is to fine-tune the main model to generate responses that the reward model scores as highly as possible. This step is accomplished using reinforcement learning—hence the “RL” in RLHF. The dominant reinforcement learning algorithm used for RLHF, invented at OpenAI in 2017, is Proximal Policy Optimization (PPO).
One important constraint often included in this final step: as the main model learns to maximize the reward model’s score, it is prohibited from straying too far from what the pre-RLHF model would have produced, thus ensuring that the RLHF process doesn’t lead the model down too strange or unexpected a path.
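One common way to express this constraint is to subtract a penalty from the reward model's score whenever the model being trained assigns a response a very different probability than the pre-RLHF model did. The sketch below assumes that form. The coefficient `beta` and the example numbers are illustrative assumptions, and production systems typically apply the penalty token by token rather than once per response.

```python
def penalized_reward(reward_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference.

    logp_policy and logp_reference are the total log-probabilities that
    the model being trained and the frozen pre-RLHF model assign to the
    same sampled response. Their difference is a single-sample estimate
    of the KL divergence between the two models.
    """
    drift_estimate = logp_policy - logp_reference
    return reward_score - beta * drift_estimate


# No drift: the reward passes through unchanged.
unchanged = penalized_reward(1.0, logp_policy=-10.0, logp_reference=-10.0)

# The policy now finds this response far more likely than the reference
# did, so part of the reward is taken away.
reduced = penalized_reward(1.0, logp_policy=-5.0, logp_reference=-10.0)
```

A larger `beta` keeps the model closer to its starting point, while a smaller one lets it chase the reward model's score more aggressively.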
And there you have it! The end result of this process is a model that has been calibrated, via RLHF, to behave in alignment with human preferences and values as reflected in the human-generated preference data.
If this process seems convoluted and circuitous, that’s because it is. RLHF with PPO is notoriously tricky to get to work. It requires training an entire separate model just to improve the initial model. Only the world’s most sophisticated AI research groups have the requisite expertise to implement it.
PPO-based RLHF has proven wildly successful; for evidence, look no further than ChatGPT. But the fact that it is so challenging to get right has opened the door for alternative approaches. These newer methods may end up redefining what it means to align an AI with human values.
The Rise of Direct Preference Optimization (DPO)
“No RL Needed”
In one of the most influential AI papers of 2023, a team of Stanford researchers introduced a new technique that they argued represented a significant improvement over PPO-based RLHF. They named it Direct Preference Optimization, or DPO.
DPO has rapidly spread through the AI research community in recent months, with “PPO versus DPO” becoming a meme and a regular discussion topic.
What exactly is DPO, and why is it so promising?
DPO’s defining attribute and advantage is its elegant simplicity. It eliminates both the need for reinforcement learning and the need to train a separate reward model.
DPO uses the same basic kind of data as PPO to infer human preferences and norms: pairwise preference data collected from humans at scale (e.g., output A is better than output B, output C is better than output D, and so forth).
But through some very clever math, the DPO researchers developed a method to tune a language model directly on this preference data, rather than training a separate reward model and then using reinforcement learning to transfer the reward model’s knowledge to the main model.
The researchers accomplished this, in short, by figuring out how to make the main language model “do double duty” and act as its own reward model. (Hence the paper’s title: “Your Language Model is Secretly a Reward Model.”)
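For readers who want to see what "acting as its own reward model" looks like, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python. The implicit reward of a response is how much more likely the model being trained finds it than a frozen reference copy of the model does, scaled by a coefficient `beta`. The log-probability values and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the total log-probability a model assigns to a
    response given the prompt. No separate reward model is involved.
    """
    # Implicit rewards: how far the policy has moved toward each response
    # relative to the frozen reference model.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before any training the policy equals the reference, so the margin is 0.
loss_at_start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# After some training the policy favors the chosen response: lower loss.
loss_after = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Because the loss depends only on log-probabilities that the language model already produces, it can be minimized with ordinary gradient descent, the same machinery used for supervised fine-tuning.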
In the researchers’ experiments, DPO performs as well as or better than PPO at model alignment.
In the words of AI leader Andrew Ng: “It is only rarely that, after reading a research paper, I feel like giving the authors a standing ovation. But I felt that way after finishing Direct Preference Optimization (DPO).”
Ng went on to say: “RLHF is a key building block of the most advanced LLMs. It’s fantastic that these Stanford authors — through clever thinking and mathematical insight — seem to have replaced it with something simpler and more elegant. While it’s easy to get excited about a piece of research before it has stood the test of time, I am cautiously optimistic that DPO will have a huge impact on LLMs and beyond in the next few years.”
DPO is already being used in place of RLHF to train some of the world’s most advanced AI models, including Mistral’s popular Mixtral model.
DPO is high-performing, simple to implement and computationally efficient. So—has RLHF become obsolete? Is DPO destined to altogether replace it?
It’s not that simple.
For one thing, it is not yet clear how effectively DPO scales. In the original DPO paper, the largest model that the researchers trained was 6 billion parameters. DPO performs well at this scale—but today’s state-of-the-art models are several orders of magnitude larger than this. Will DPO continue to match or outperform PPO at the scale of GPT-4, or GPT-5?
An anecdotal belief persists among many top AI researchers that, while DPO is simpler and more accessible, PPO—as challenging as it is to get right—still represents the gold standard for the most advanced models and the most complex training situations.
And from a practical perspective, many of the world’s leading AI research labs have well-established infrastructure and workflows built around PPO and RLHF. This makes it unlikely that they will switch to DPO overnight.
These issues point to a basic and important fact: no one has yet published a rigorous evaluation of DPO compared to PPO/RLHF that establishes, in a scientific and comprehensive manner, which one outperforms the other and under which circumstances.
This space is moving so quickly that, at present, AI practitioners rely largely on empirical results and anecdotal evidence—what might be thought of as the “dark arts of LLM building”—to incorporate these methods into their work.
This is an active area of investigation today. In the coming months, expect to see more research emerge that provides more definitive evidence about the relative performance and capabilities of these two methods.
In the meantime, let the PPO versus DPO debates rage on.
(Reinforcement) Learning from AI Feedback
“No H Needed”
DPO showed us that the “RL” in RLHF isn’t necessary. What if the “H” isn’t, either?
Might it be possible to use AI to automatically supervise and steer the behavior of other AI? This is an exciting and fraught area of research that may well represent the future of AI alignment.
Using AI feedback in place of human feedback is attractive for a few different reasons.
To start, collecting preference data from humans at scale is expensive, time-consuming and tedious. Whether it’s PPO or DPO, aligning language models requires massive preference datasets—often hundreds of thousands or even millions of examples. Automating the creation of this preference data could make the process of AI model alignment vastly cheaper and easier.
There is a deeper reason why it may make sense for us to use feedback data from AI rather than from humans. This is the simple reality that artificial intelligence is fast becoming more capable than we humans are. On many dimensions, it has already far surpassed us. In order to understand, steer and control superhuman AI, we may have no choice but to make use of superhuman AI.
Anthropic first introduced the concept of reinforcement learning from AI feedback, or RLAIF, in its 2022 paper on Constitutional AI.
In this work, Anthropic’s researchers built a language model that taught itself not to give harmful responses without the use of any human-labeled data.
All the model was provided as a starting point was a list of 16 simple principles to guide its behavior—in other words, a “constitution.” One example: “Please rewrite the response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.”
Using these constitutional principles as a north star, the model iteratively bootstrapped itself to be more harmless through self-critique and self-improvement. It did so via both supervised fine-tuning (by generating its own datasets with examples of more-harmful and less-harmful responses) and RLAIF (using AI-generated preference data).
This foundational work demonstrated that AI-generated preference data could match or exceed human-generated preference data in tuning an AI model to be less harmful.
It is important to note that, while Anthropic introduced the concepts of Constitutional AI and RLAIF at the same time, Constitutional AI is not the only way to do RLAIF; the former can be thought of as a subset of the latter.
More recent work has further validated the exciting potential of RLAIF. Setting aside the constitutional approach, the core idea behind RLAIF is to use a state-of-the-art language model in place of humans to generate preference data. Once the preference data has been created, the standard RLHF process can be followed.
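The substitution described here can be sketched in a few lines of Python. Everything in the sketch is hypothetical: `toy_judge` is a stand-in that simply prefers the longer response, where a real pipeline would send the prompt and both responses to a strong language model and parse its verdict.

```python
from typing import Callable, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


def label_with_ai(prompt: str, response_a: str, response_b: str,
                  judge: Callable[[str, str, str], str]) -> PreferencePair:
    """Ask an AI judge which response is better and record the result
    in the same (chosen, rejected) format a human labeler would produce."""
    verdict = judge(prompt, response_a, response_b)
    if verdict == "A":
        return PreferencePair(prompt, response_a, response_b)
    if verdict == "B":
        return PreferencePair(prompt, response_b, response_a)
    raise ValueError(f"Unexpected verdict from judge: {verdict!r}")


def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Placeholder judge (an assumption for this sketch): prefers the
    longer response. A real judge would be a call to a language model."""
    return "A" if len(response_a) >= len(response_b) else "B"


pair = label_with_ai(
    "How do I reset my password?",
    "No idea.",
    "Open Settings, choose Security, then select Reset Password.",
    judge=toy_judge,
)
```

The output has exactly the same shape as human-labeled preference data, which is why the rest of the RLHF pipeline can run unchanged.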
A few months ago, for instance, a team from Berkeley released Starling, a 7-billion-parameter language model trained using RLAIF. In order to train Starling, the Berkeley team created (and open-sourced) a dataset of 3.8 million preference pairs generated not from human participants but rather from GPT-4. The researchers claim that, on some benchmarks, Starling outperforms every model in existence other than GPT-4.
All of these research efforts still operate within the general RLHF paradigm; they simply replace human-generated preference data with AI-generated preference data. This by itself is a meaningful advance. But as the rapid rise of DPO has shown, reinforcement learning may be an unnecessarily cumbersome way to align AI models.
Is it possible to combine approaches like DPO with AI-generated feedback to create new, more effective ways to control the behavior of AI models?
Exciting work along these lines is just now beginning to emerge.
One new research effort out of Meta, named “Self-Rewarding Language Models,” has attracted a lot of attention since its publication last month.
Like the RLAIF models mentioned above, Self-Rewarding Language Models generate their own preference data rather than relying on humans to produce it. But instead of training a separate reward model and using that to fine-tune the main model via reinforcement learning—as both RLHF and RLAIF do—Meta’s new model uses a method called “LLM-as-a-Judge” to enable its main model to not only generate new training examples, but also to evaluate them itself. Using DPO, the model can then iteratively improve by training successive versions of itself based on feedback from the previous versions, with each successive model improving upon the last.
In other words, the entire alignment process emanates from and is contained within a single model, which can generate, evaluate and incorporate feedback reflecting the values that the model should adhere to, iteratively improving itself.
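One step of such a loop might look like the sketch below, written under stated assumptions. The functions `generate` and `judge_score` stand in for two calls to the same underlying model, one that writes a response and one that grades it. The rule of pairing the highest-scored candidate against the lowest-scored one is a simplification for illustration. All names and the canned responses are hypothetical.

```python
def build_self_preference_pair(prompt, generate, judge_score, num_candidates=4):
    """Sample several responses, have the model grade each one, and pair
    the best against the worst. Returns None when every candidate gets
    the same score, since that gives no preference signal."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    scored = [(judge_score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda item: item[0])
    worst_score, worst = min(scored, key=lambda item: item[0])
    if best_score == worst_score:
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}


# Deterministic stand-ins so the sketch runs without a real model.
canned = iter(["ok", "a longer answer", "mid one", "ok"])
pair = build_self_preference_pair(
    "Explain RLHF in one sentence.",
    generate=lambda prompt: next(canned),
    judge_score=lambda prompt, response: len(response),  # toy grading rule
)
```

The resulting pairs would then be fed to a DPO training step, and the improved model would take over both the generating and the judging roles in the next round.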
(It is worth noting that, while the concepts laid out in “Self-Rewarding Language Models” are intriguing, the actual results reported in the paper are far from conclusive. As the authors themselves acknowledge, “This is only a preliminary study.”)
One final point is worth making here. A basic conceptual hangup may be lingering in the back of your mind as you reflect on these innovative new techniques. We are using AI to do more and more of the heavy lifting in getting our AI models to behave the way that we want them to. But how does the AI that is doing the aligning know which preferences and values to steer the model toward in the first place? From where does the AI’s “sense of right and wrong” originate?
The answer, at least for now, is that this initial understanding must be provided by humans.
This is the role played by the “constitution” in Anthropic’s RLAIF work. In the case of RLAIF-based models like Starling that do not use a constitution, this “sense of right and wrong” comes from GPT-4, which itself has already been extensively fine-tuned with human preferences. (Recall that Starling uses GPT-4 to create its preference data.) And Meta’s Self-Rewarding Language Models are provided with an initial set of human-authored instruction data, which serves as a “seed” to get the entire iterative learning flywheel going.
As our methods for steering AI models become more automated, a real possibility exists that we humans could gradually lose visibility into and control over the principles and details of the alignment process. As this field races ahead, the AI community must remain vigilant about this risk.
Where Are The Startup Opportunities?
RLHF and related alignment methods have quickly become a critical part of the AI technology stack. Are there opportunities for startups in this important, fast-growing space?
The most obvious market need for startups to tackle is the collection of human preference data at scale.
Despite the exciting recent advances in methods like RLAIF, most AI alignment work today still depends on preference data generated by humans. Even techniques for synthetically generating preference data typically still require plenty of human-provided data as a starting point. As one illustrative example, the training dataset for Anthropic’s original RLAIF work consisted of 177,792 human-generated preference pairs and 140,335 AI-generated preference pairs.
Whether it’s PPO or DPO, aligning today’s most advanced AI models can require datasets of up to millions of preference pairs. Collecting this data is a heavy lift, requiring a lot of humans to do a lot of tedious manual work. Many AI research organizations would prefer not to manage this in-house and are happy to pay an outside provider to collect and supply this data for them.
The leading startup that provides human preference data for RLHF is Scale AI. Founded in 2016, Scale started out as a provider of data labeling services for autonomous vehicles and other computer vision applications. Following the explosive growth of LLMs, Scale has more recently focused on RLHF as an end market. It is a logical evolution for the company, since both image labeling and RLHF data collection entail recruiting and managing armies of contract workers around the world to produce training data for AI.
A cohort of younger startups has also emerged to provide RLHF data, including Surge AI, Prolific and Invisible Technologies.
Some of these companies have seen remarkable top-line growth recently, since AI model developers have insatiable demand for this type of data right now. Surge AI’s long list of customers, for example, includes Anthropic, Cohere, Hugging Face, Character.ai, Adept, Google, Microsoft, Amazon and Nvidia.
These startups’ margin profiles can be challenging, however, given the heavy services element of the business.
Other startups are looking to provide other types of “picks and shovels” for the booming RLHF gold rush.
Buzzy Paris-based startup Adaptive ML, for instance, has a vision that goes beyond handling the grunt work of manually collecting human preference data.
Adaptive, which raised a $20 million seed round from Index Ventures a few months ago, provides tools to make RLHF and similar methods easier for organizations of all sizes to implement at scale.
“We saw the potential of RLHF and other preference tuning methods and felt that there was tremendous value to unlock in bringing them out of frontier labs,” said Adaptive CEO/cofounder Julien Launay. “That’s what we are building at Adaptive: a way for companies to capture the magic of preference tuning without convoluted engineering and expensive data annotation contracts.”
Another promising startup providing tools for RLHF is Spain-based Argilla, with a focus on open-source offerings.
One important question facing startups in this space—a question that applies to many machine learning infrastructure startups—is this: is there a path to building a massive standalone company here? Is tooling for RLHF a large enough market opportunity? Or will these tools inevitably end up as features of broader platforms (say, AWS or Databricks), and/or be built in-house by would-be customers?
Recent history is littered with examples of venture-backed “MLOps” startups that proved to be building too-narrow point solutions without a path to breakout commercial scale. From SigOpt to Gradio, from Algorithmia to Determined AI, many of these companies built excellent technology but had to settle for small acquisitions by larger platforms (Intel, Hugging Face, DataRobot and Hewlett Packard Enterprise, respectively).
On the other hand, a similar critique could have been (and often was) made about Scale AI in the company’s early days as a data labeling provider. Today, Scale has a valuation of $7.3 billion and brings in several hundred million dollars in annual revenue.
What Comes Next?
RLHF and similar alignment techniques have exploded in importance in recent years, becoming an essential part of building advanced AI.
The field is evolving at lightning speed today. Twelve or even six months from now, the frontiers of this technology will have advanced by leaps and bounds.
So: what’s next? What game-changing developments are around the corner?
We will briefly mention two trends to keep an eye on.
The first: new methodologies that improve upon RLHF and DPO by making it possible to align models using data that already exists, rather than preference data collected specifically for the purpose.
The most expensive and time-consuming part of RLHF and DPO today is the need to collect large amounts of preference data—for instance, by paying teams of humans to read two model responses and pick which one they prefer, over and over again, many thousands of times.
What if we could use data that humans are already generating—say, clicks or views or purchase decisions—to derive the necessary signal about human preferences?
This is the core insight behind a new approach to align language models with human feedback known as Kahneman-Tversky Optimization, or KTO. Named after two famous researchers who revolutionized the field of behavioral economics, KTO was unveiled late last year by Bay Area startup Contextual AI and has generated considerable buzz since then.
Like DPO, KTO does not require reinforcement learning. But unlike DPO, KTO also does not require pairwise preference data. The only data needed for KTO is raw examples labeled as either desirable or undesirable.
This type of data is abundant in the world. Every organization has customer interaction data that can be classified as positive (ended in a purchase) or negative (did not end in a purchase). Each of us leaves a rich trail of data in our digital lives—ads that we clicked or did not, posts that we “liked” or did not, videos that we watched or did not—that can be turned into human preference data by KTO.
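The difference in data requirements is easy to see in code. The sketch below converts a hypothetical log of customer interactions into the kind of unpaired, binary-labeled examples described above. The field names (`prompt`, `response`, `purchased`, `desirable`) are assumptions made for illustration, not the schema of any particular library.

```python
def to_unpaired_examples(interactions):
    """Convert raw interaction records into single examples that are each
    labeled desirable or undesirable. No response is ever compared with
    another, so no pairing step is needed."""
    return [
        {
            "prompt": record["prompt"],
            "completion": record["response"],
            "desirable": bool(record["purchased"]),
        }
        for record in interactions
    ]


# Hypothetical interaction log: each row is one conversation outcome.
interaction_log = [
    {"prompt": "Do you ship overseas?",
     "response": "Yes, to over 40 countries.",
     "purchased": True},
    {"prompt": "Is this in stock?",
     "response": "I cannot help with that.",
     "purchased": False},
]

examples = to_unpaired_examples(interaction_log)
```

Contrast this with DPO, where every training example must contain two responses to the same prompt along with a judgment about which is better.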
“It has always been difficult to bridge the gap from general purpose chatbots to goal-oriented conversation,” said Contextual AI CEO/cofounder Douwe Kiela. “RLHF is great, but it has important downsides: it requires a big reward model and lots of expensive paired preference data. At Contextual we saw an opportunity to train more directly on natural human feedback signals. KTO makes it possible to specialize and align LLMs for enterprise use cases with faster and tighter feedback loops.”
Expect to see much more innovation in this direction in the months ahead. This line of work will make model alignment cheaper, faster and easier. It may also threaten the long-term business prospects of companies like Scale and Surge by reducing the need for made-to-order human pairwise preference data.
The second important development on the horizon: AI alignment techniques are going to become increasingly multimodal.
Nearly all work in model alignment at present, from RLHF to DPO to RLAIF and beyond, focuses on language models. But language models are not the only kind of AI that can benefit from fine-tuning on human preferences.
Think about a text-to-image product like Midjourney or a text-to-video product like Runway. It would be tremendously valuable to be able to fine-tune models like these on human preferences, the way that ChatGPT has been.
One of the defining trends in artificial intelligence today is the rise of multimodal AI. Tomorrow’s state-of-the-art AI models will incorporate some combination of text, images, 3-D, audio, video, music, physical action and beyond. We will want to be able to tune models across all of these data modalities, not just text, according to human preferences.
A team from Stanford and Salesforce recently published novel research showing that DPO can meaningfully improve the quality of images generated by text-to-image diffusion models like Stable Diffusion.
In 2024, expect to see rapid advances in preference tuning for new data modalities, from image to video to audio.
Let’s end by taking a step back.
The importance of RLHF and related AI alignment methods can perhaps best be understood by analogy to an all-too-human activity: parenting.
As Brian Christian, author of The Alignment Problem, put it: “The story of human civilization has always been about how to instill values in strange, alien, human-level intelligences who will inevitably inherit the reins of society from us—namely, our kids.”
RLHF, like parenting, is an art rather than a science. In both cases, values and norms must be transmitted via example and observation; rules alone cannot suffice. In both cases, we cannot predict or control exactly what lessons the next generation will learn from us. But in both cases, we must take our duty very seriously: at stake is how an entire new generation of intelligent beings will behave in the world.
Let us hope that we can parent our AIs well.
Eugen Boglaru is an AI aficionado covering the fascinating and rapidly advancing field of Artificial Intelligence. From machine learning breakthroughs to ethical considerations, Eugen provides readers with a deep dive into the world of AI, demystifying complex concepts and exploring the transformative impact of intelligent technologies.