Optimizing AI for Human Preference: RLHF, DPO, and Soft Preference Labels
Author(s): Devashish Datt Mamgain
With the release of o1, OpenAI touted the advantages of reinforcement learning in its research. It used RL to refine the intermediate steps in the model's chain of thought and obtained better answers.
However, RL is not a silver bullet, and it especially struggles when there is no objective answer to a question. Its most famous accomplishment is AlphaGo, where there was a definite way to determine the winner of a game.
So, how do you map human preference without clear winners? RLHF and soft preference labels may help.
RLHF (Reinforcement Learning from Human Feedback)
The principle for RLHF is reasonably intuitive and straightforward, and according to recent research from Anthropic, it can create more helpful LLMs. A typical RLHF process will undertake the following steps while training an LLM:
1. Collect Demonstration Data for Fine-Tuning
While pre-training on large amounts of unlabelled data, LLMs don't learn to answer prompts or questions. This behavior has to be deliberately taught by experts through QA pairs that fine-tune the model and instruct it on how to answer a particular prompt.
2. Grade the Responses from the Fine-Tuned Model
Once you've instruction-tuned a model, you give it prompts and ask labelers to rank the responses it produces. These rankings then train a reward model that can score LLM responses (a simplified version of this setup is sketched after this list).
3. Update the Policy (LLM)
Every LLM has a set of parameters that determines how it produces answers; collectively, these define its policy. OpenAI updates this policy using PPO (Proximal Policy Optimization) to incrementally change the parameters (within a "safe" range) and shift the answers the model generates. Each update is guided by a scalar score from the reward model, which can now evaluate the quality of the answers.
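As a rough illustration of steps 2 and 3 (a sketch in plain PyTorch, not OpenAI's actual implementation), the snippet below shows the pairwise loss commonly used to train a reward model and PPO's clipped surrogate objective used to nudge the policy; the function names and random tensors are placeholders.

```python
import torch
import torch.nn.functional as F


def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Step 2: pairwise (Bradley-Terry style) loss for the reward model.

    Pushes the scalar score of the labeler-preferred response above the
    score of the rejected one.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def ppo_clipped_objective(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_eps: float = 0.2) -> torch.Tensor:
    """Step 3: PPO's clipped surrogate loss.

    `advantages` would be derived from the reward model's scalar scores
    (minus a baseline); the clip keeps each update inside a "safe" range.
    """
    ratio = torch.exp(logp_new - logp_old)  # how far the new policy has moved
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negate: maximize the objective


# Toy usage with random tensors standing in for real model outputs.
rm_loss = reward_ranking_loss(torch.randn(8, requires_grad=True), torch.randn(8))
ppo_loss = ppo_clipped_objective(torch.randn(8, requires_grad=True),
                                 torch.randn(8), torch.randn(8))
(rm_loss + ppo_loss).backward()
```

In practice, the policy update also carries a KL penalty against the original model, which leads directly into the objective discussed in the next section.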
There have been multiple approaches to performing RLHF (including the brilliant RLOO paper from Cohere, which I discuss briefly in my article about multilingualism in LLMs); however, the relative complexity of RLHF has always been a barrier.
RLHF can be computationally intensive and requires significant human capital (or an engaged user base).
An alternative method that has been proposed is Direct Preference Optimization (DPO), which removes the separately trained reward model and the RL loop (while still relying on offline human preference data).
Direct Preference Optimization (DPO)
In RLHF, we collect the preferences of human labelers and set up a function to train the reward model: we are looking for the reward model whose scores come closest to matching the human-labeled preferences.
So, we can set up an optimization like this:
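Written out (in the notation of the DPO paper), with π_θ as the policy being trained, π_ref the frozen reference policy, r_φ the learned reward model, and β controlling the strength of the KL penalty:

$$\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\big\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$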
In this equation, we are trying to find the policy that maximizes the expected reward. The reward model is denoted by r_φ, and the βD_KL term acts as a penalty that limits how far the optimized policy can diverge from the reference policy while chasing that reward.
In DPO, we do something similar. However, instead of training a separate reward model on human preferences, we treat the language model itself as an implicit reward model, expressed in terms of an optimal policy (the one that generates the most-favored answers) and a reference policy (the baseline we start from).
This process works as follows:
- An offline database of human preferences is collected during the fine-tuning process.
- We optimize the language model directly toward the policy that makes the preferred completions more likely than the rejected ones, given a particular reference policy and KL constraint.
- We assume that the reference model is the supervised fine-tuned (SFT) model; when no completions from the SFT model are available, the reference is initialized by maximizing the likelihood of the preferred completions.
In essence, DPO folds the reward model into the policy itself: fitting the human preference data becomes a simple classification-style loss over log-probability ratios between the optimized policy and the reference policy, with the KL constraint absorbed into the β term.
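To make this concrete, here is a minimal sketch of the DPO loss in plain PyTorch (not the authors' code), assuming you already have per-sequence log-probabilities of the chosen and rejected completions under both the policy being trained and the frozen reference model; the tensor names are placeholders.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from per-sequence log-probabilities of the chosen and
    rejected completions under the trained policy and the frozen reference."""
    # The implicit rewards are the beta-scaled log-ratios against the reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Classification-style objective: make the chosen completion more likely.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# Toy usage: fake log-probabilities for a batch of 4 preference pairs.
p_c = torch.randn(4, requires_grad=True)
p_r = torch.randn(4, requires_grad=True)
loss = dpo_loss(p_c, p_r, torch.randn(4), torch.randn(4))
loss.backward()
```

Note that no explicit reward model or on-policy sampling appears: everything is computed from the log-probabilities of the offline preference pairs.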
However, it's essential to understand that human preference isn't binary. In RLHF and DPO, we typically pick a winner between a pair of completions, but that doesn't capture the human experience, which rarely reduces to a binary choice. This brings us to a recent publication that describes soft preference labels.
Soft Preference Labels
Let's take an intuitive example: you might have chosen between chocolate and vanilla flavors of ice cream. This subjective preference differs from person to person. Yet if you're the only human labeler for this particular prompt (one that asks for a favorite ice cream), the LLM starts taking your answer as gospel.
We need to account for this difference in peopleβs preferences and use it to better train LLMs for human use.
This is a paradigm that can be captured through soft preference labels.
Essentially, the authors suggest the following in terms of soft preference labels:
- Each label is a probabilistic score between 0.5 (equally preferable answers) and 1 (a distinct advantage to one answer).
- You can use majority voting or other methods to capture these probabilities (a minimal voting sketch follows this list).
- Pairs of answers that are nearly equally favored contribute less during learning, while pairs with a distinct preference contribute more. Practically, if the score for a pair of answers is 0.5, the LLM learns little from it; the closer the score is to 1, the more it learns.
- A geometric average of the two responses' likelihoods, weighted by the preference score, enters the loss so the LLM can develop some nuance about human preference for one answer over another.
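For illustration only (the paper may aggregate annotations differently), here is one simple way to turn several annotators' binary votes into a soft label in [0.5, 1] via majority-fraction voting; `soft_label` is a hypothetical helper, not code from the paper.

```python
from collections import Counter


def soft_label(votes):
    """Turn binary votes ('A' or 'B') from several annotators into a
    (winner, loser, p_hat) triple with p_hat in [0.5, 1.0]."""
    counts = Counter(votes)
    winner, winner_votes = counts.most_common(1)[0]
    loser = "B" if winner == "A" else "A"
    p_hat = winner_votes / len(votes)  # fraction of annotators preferring the winner
    return winner, loser, p_hat


# 7 of 10 annotators prefer response A -> soft label of 0.7 for (A over B).
print(soft_label(["A"] * 7 + ["B"] * 3))  # ('A', 'B', 0.7)
```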
The authors take these labels and then use Geometric Direct Preference Optimization (GDPO) as their preference-learning method. This means:
- GDPO keeps the traditional DPO equation but adds a new factor for the soft preference labels.
- These soft-preference labels act as discriminators and reduce the weight of responses with a preference closer to 0.5.
- The loss function now looks like the following:
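As I read the paper, the geometric averaging of the two completions' likelihoods, weighted by the soft label p̂ ∈ [0.5, 1], folds into the DPO margin as a (2p̂ − 1) factor, giving a loss of roughly this form:

$$\mathcal{L}_{\mathrm{GDPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l,\, \hat{p})}\Big[\log \sigma\Big((2\hat{p}-1)\,\beta\Big(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big)\Big]$$

When p̂ = 1 this reduces to standard DPO, and when p̂ = 0.5 the pair contributes essentially nothing, which is exactly the down-weighting behavior described above.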
This proposition is exciting because it increases the nuance in LLMs and provides better outputs overall.
In the paper, the authors demonstrate that including soft preference labels raises the models' alignment scores and provides a better way to perform preference-based RL.
Future Direction and Some Thoughts
The launch of o1-preview shows that RL is one of the core techniques powering today's LLMs. In this context, it's essential to realize that current RL algorithms can be limited when viewed through the lens of human preference.
Companies have struggled to develop reward-optimization mechanisms that improve responses when there is no objectively correct answer. Yet clear answers are a rarity in human thought (which explains why o1 shows improvement in math and science but struggles with linguistic and philosophical problems).
As we move towards industry use cases where LLMs will be used for more human conversations (I am a fan of Advanced Voice Mode in ChatGPT), we will need language models that develop more nuance. This is where I think soft preference labels are the right step.
Whether it will be implemented at a large scale is still a difficult question to answer, but we now have a slightly more nuanced view of human preferences that doesn't cost a computational arm and a leg.
References
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model, arXiv preprint arXiv:2305.18290, 2023.
H. Furuta, K.-H. Lee, S. S. Gu, Y. Matsuo, A. Faust, H. Zen, and I. Gur, Geometric-Averaged Preference Optimization for Soft Preference Labels, arXiv preprint arXiv:2409.06691, 2024.
Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv preprint arXiv:2204.05862, 2022.