Optimizing AI for Human Preference: RLHF, DPO, and Soft Preference Labels
Author(s): Devashish Datt Mamgain
With the release of o1, OpenAI touted the advantages of reinforcement learning in its research. It used RL to refine the intermediate steps in the model's chain of thought and obtained better answers.
However, RL is not a silver bullet, and it especially struggles when there is no objective answer to a question. Its most famous accomplishment is AlphaGo, where there was a definite way to determine the winner of a game.
So, how do you map human preference without clear winners? RLHF and soft preference labels may help.
RLHF (Reinforcement Learning from Human Feedback)
The principle for RLHF is reasonably intuitive and straightforward, and according to recent research from Anthropic, it can create more helpful LLMs. A typical RLHF process will undertake the following steps while training an LLM:
1. Collect Demonstration Data for Fine-Tuning
While pre-training on large amounts of unlabelled data, LLMs don't learn to answer prompts or questions. This behavior has to be deliberately taught by experts through QA pairs that fine-tune the model and instruct it on how to answer a particular prompt.
2. Grade the Responses from the Fine-Tuned Model
Once you've instruction-tuned a model, you give it prompts and ask labelers to rank the responses it produces. These rankings then train a reward model that can score LLM responses (a simplified version of this setup is sketched after this list).
3. Update the Policy (LLM)
Every LLM has a set of parameters that determines how it produces answers; collectively, these define its policy. OpenAI updates this policy using PPO (Proximal Policy Optimization) to incrementally change the parameters (within a "safe" range) and shift the answers the model generates. Each update is guided by a scalar score from the reward model, which can now evaluate the quality of the answers.
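As a rough illustration of steps 2 and 3 (a sketch in plain PyTorch, not OpenAI's actual implementation), the snippet below shows the pairwise loss commonly used to train a reward model and PPO's clipped surrogate objective used to nudge the policy; the function names and random tensors are placeholders.

```python
import torch
import torch.nn.functional as F


def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Step 2: pairwise (Bradley-Terry style) loss for the reward model.

    Pushes the scalar score of the labeler-preferred response above the
    score of the rejected one.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def ppo_clipped_objective(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_eps: float = 0.2) -> torch.Tensor:
    """Step 3: PPO's clipped surrogate loss.

    `advantages` would be derived from the reward model's scalar scores
    (minus a baseline); the clip keeps each update inside a "safe" range.
    """
    ratio = torch.exp(logp_new - logp_old)  # how far the new policy has moved
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # negate: maximize the objective


# Toy usage with random tensors standing in for real model outputs.
rm_loss = reward_ranking_loss(torch.randn(8, requires_grad=True), torch.randn(8))
ppo_loss = ppo_clipped_objective(torch.randn(8, requires_grad=True),
                                 torch.randn(8), torch.randn(8))
(rm_loss + ppo_loss).backward()
```

In practice, the policy update also carries a KL penalty against the original model, which leads directly into the objective discussed in the next section.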
There have been multiple approaches to performing RLHF (including the brilliant RLOO paper from Cohere, which I discuss briefly in my article about multilingualism in LLMs); however, the relative complexity of RLHF has always been a barrier.
RLHF can be computationally intensive and requires significant human capital (or an engaged user base).
An alternative method that has been proposed is Direct Preference Optimization (DPO), which removes the separately trained reward model and the RL loop (while still relying on offline human preference data).
Direct Preference Optimization (DPO)
In RLHF, we collect the preferences of human labelers and set up a function to train the reward model: we are looking for the reward model whose scores come closest to matching the human-labeled preferences.
So, we can set up an optimization like this:
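Written out (in the notation of the DPO paper), with π_θ as the policy being trained, π_ref the frozen reference policy, r_φ the learned reward model, and β controlling the strength of the KL penalty:

$$\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\big\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$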
In this equation, we are trying to find the policy that maximizes the expected reward. The reward model is denoted by r_φ, and the βD_KL term acts as a penalty that limits how far the optimized policy can diverge from the reference policy while chasing that reward.
In DPO, we do something similar. However, instead of training a separate reward model on human preferences, we treat the language model itself as an implicit reward model, expressed in terms of an optimal policy (the one that generates the most-favored answers) and a reference policy (the baseline we start from).
This process works as follows:
- An offline database of human preferences is collected during the fine-tuning process.
- We optimize the language model directly toward the policy that makes the preferred completions more likely than the rejected ones, given a particular reference policy and KL constraint.
- We assume that the reference model is the supervised fine-tuned (SFT) model; when no completions from the SFT model are available, the reference is initialized by maximizing the likelihood of the preferred completions.
In essence, DPO folds the reward model into the policy itself: fitting the human preference data becomes a simple classification-style loss over log-probability ratios between the optimized policy and the reference policy, with the KL constraint absorbed into the β term.
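To make this concrete, here is a minimal sketch of the DPO loss in plain PyTorch (not the authors' code), assuming you already have per-sequence log-probabilities of the chosen and rejected completions under both the policy being trained and the frozen reference model; the tensor names are placeholders.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from per-sequence log-probabilities of the chosen and
    rejected completions under the trained policy and the frozen reference."""
    # The implicit rewards are the beta-scaled log-ratios against the reference.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Classification-style objective: make the chosen completion more likely.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# Toy usage: fake log-probabilities for a batch of 4 preference pairs.
p_c = torch.randn(4, requires_grad=True)
p_r = torch.randn(4, requires_grad=True)
loss = dpo_loss(p_c, p_r, torch.randn(4), torch.randn(4))
loss.backward()
```

Note that no explicit reward model or on-policy sampling appears: everything is computed from the log-probabilities of the offline preference pairs.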
However, it's essential to understand that human preference isn't binary. In RLHF and DPO, we typically pick a winner between a pair of completions, but that doesn't capture the human experience, which rarely reduces to a binary choice. This brings us to a recent publication that describes soft preference labels.
Soft Preference Labels
Let's take an intuitive example: you might have chosen between chocolate and vanilla flavors of ice cream. This subjective preference differs from person to person. Yet if you're the only human labeler for this particular prompt (one that asks for a favorite ice cream), the LLM starts taking your answer as gospel.
We need to account for this difference in peopleβs preferences and use it to better train LLMs for human use.
This is a paradigm that can be captured through soft preference labels.
Essentially, the authors suggest the following in terms of soft preference labels:
- Each label is a probabilistic score between 0.5 (equally preferable answers) and 1 (a distinct advantage to one answer).
- You can use majority voting or other methods to capture these probabilities (a minimal voting sketch follows this list).
- Pairs of answers that are nearly equally favored contribute less during learning, while pairs with a distinct preference contribute more. Practically, if the score for a pair of answers is 0.5, the LLM learns little from it; the closer the score is to 1, the more it learns.
- A geometric average of the two responses' likelihoods, weighted by the preference score, enters the loss so the LLM can develop some nuance about human preference for one answer over another.
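For illustration only (the paper may aggregate annotations differently), here is one simple way to turn several annotators' binary votes into a soft label in [0.5, 1] via majority-fraction voting; `soft_label` is a hypothetical helper, not code from the paper.

```python
from collections import Counter


def soft_label(votes):
    """Turn binary votes ('A' or 'B') from several annotators into a
    (winner, loser, p_hat) triple with p_hat in [0.5, 1.0]."""
    counts = Counter(votes)
    winner, winner_votes = counts.most_common(1)[0]
    loser = "B" if winner == "A" else "A"
    p_hat = winner_votes / len(votes)  # fraction of annotators preferring the winner
    return winner, loser, p_hat


# 7 of 10 annotators prefer response A -> soft label of 0.7 for (A over B).
print(soft_label(["A"] * 7 + ["B"] * 3))  # ('A', 'B', 0.7)
```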
The authors take these labels and then use Geometric Direct Preference Optimization (GDPO) as their preference-learning method. This means:
- GDPO keeps the traditional DPO equation but adds a new factor for the soft preference labels.
- These soft-preference labels act as discriminators and reduce the weight of responses with a preference closer to 0.5.
- The loss function now looks like the following:
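As I read the paper, the geometric averaging of the two completions' likelihoods, weighted by the soft label p̂ ∈ [0.5, 1], folds into the DPO margin as a (2p̂ − 1) factor, giving a loss of roughly this form:

$$\mathcal{L}_{\mathrm{GDPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l,\, \hat{p})}\Big[\log \sigma\Big((2\hat{p}-1)\,\beta\Big(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big)\Big]$$

When p̂ = 1 this reduces to standard DPO, and when p̂ = 0.5 the pair contributes essentially nothing, which is exactly the down-weighting behavior described above.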
This proposition is exciting because it increases the nuance in LLMs and provides better outputs overall.
In the paper, the authors demonstrate that including soft preference labels raises the models' alignment scores and provides a better way to perform preference-based RL.
Future Direction and Some Thoughts
The launch of o1-preview shows that RL is one of the core techniques powering today's LLMs. In this context, it's essential to realize that current RL algorithms can be limited when viewed through the lens of human preference.
Companies have struggled to develop reward-optimization mechanisms that improve responses when there is no objectively correct answer. Yet clear answers are a rarity in human thought (which explains why o1 shows improvement in math and science but struggles with linguistic and philosophical problems).
As we move towards industry use cases where LLMs will be used for more human conversations (I am a fan of Advanced Voice Mode in ChatGPT), we will need language models that develop more nuance. This is where I think soft preference labels are the right step.
Whether it will be implemented at a large scale is still a difficult question to answer, but we now have a slightly more nuanced view of human preferences that doesn't cost a computational arm and a leg.
References
R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model, arXiv preprint arXiv:2305.18290, 2023.
H. Furuta, K.-H. Lee, S. S. Gu, Y. Matsuo, A. Faust, H. Zen, and I. Gur, Geometric-Averaged Preference Optimization for Soft Preference Labels, arXiv preprint arXiv:2409.06691, 2024.
Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv preprint arXiv:2204.05862, 2022.