
Optimizing AI for Human Preference: RLHF, DPO, and Soft Preference Labels

Last Updated on November 3, 2024 by Editorial Team

Author(s): Devashish Datt Mamgain

Originally published on Towards AI.

Image from author, inside image generated by Flux LoRA

With their latest release of o1, OpenAI touted the advantages of reinforcement learning in their research. They used RL to improve the intermediate steps in the model’s chain of thought and obtained better answers.

However, RL is not a silver bullet, and it especially struggles when a question has no objectively correct answer. RL’s most famous accomplishment has been AlphaGo, where there was a definite way to determine the winner of a game.

So, how do you map human preference without clear winners? RLHF and soft preference labels may help.

RLHF (Reinforcement Learning from Human Feedback)

The principle for RLHF is reasonably intuitive and straightforward, and according to recent research from Anthropic, it can create more helpful LLMs. A typical RLHF process will undertake the following steps while training an LLM:

  1. Collect Demonstration Data for Fine-Tuning

During pre-training on large amounts of unlabelled data, LLMs don’t learn to answer prompts or questions. This behavior has to be deliberately taught through expert-written QA pairs that fine-tune the model and instruct it on how to answer a particular prompt.

2. Grade the Responses from the Fine-Tuned Model

Once you’ve instruction-tuned a model, you prompt it for several completions and ask human labelers to rank those completions. These rankings are then used to train a reward model that can score LLM responses (a minimal sketch of this pairwise ranking loss follows step 3 below).

3. Update the Policy (LLM)

Every LLM generates answers according to a set of parameters; this parameterized behavior is its policy. OpenAI updates this policy using PPO (Proximal Policy Optimization) to incrementally change the parameters (within a “safe” range) and thereby change the answers the model produces. The size of this update is a scalar derived from the reward model, which can now evaluate the quality of the answers.
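To make step 2 a little more concrete, here is a minimal sketch of the pairwise (Bradley–Terry style) ranking loss commonly used to train reward models in RLHF pipelines. The function and variable names are illustrative assumptions, not taken from any specific library:

```python
import torch
import torch.nn.functional as F

def reward_model_pairwise_loss(chosen_rewards: torch.Tensor,
                               rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss for reward-model training.

    chosen_rewards / rejected_rewards: shape (batch,), the scalar scores the
    reward model assigns to the human-preferred and rejected responses.
    Minimizing -log sigmoid(r_chosen - r_rejected) pushes the preferred
    response's score above the rejected one's.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with dummy scores (in practice these come from a
# reward-model head on top of the instruction-tuned LLM).
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.5, 1.0])
print(reward_model_pairwise_loss(chosen, rejected).item())
```

The trained reward model then supplies the scalar signal that PPO uses in step 3 to nudge the policy’s parameters.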

There have been multiple approaches to performing RLHF (including the brilliant RLOO paper from Cohere, which I discuss briefly in my article about multilingualism in LLMs); however, the relative complexity of RLHF has always been a barrier.

RLHF can be computationally intensive and requires significant human capital (or an engaged user base).

An alternative method that has been proposed is Direct Preference Optimization (DPO), which eliminates the separately trained reward model and the reinforcement-learning loop.

Direct Preference Optimization (DPO)

In RLHF, we seek out the preferences of human labelers and fit a reward model to them: we’re looking for the reward model that brings the policy’s answers closest to the human-preferred ones.

So, we can set up an optimization like this:

Image from Author

In this equation, we’re searching for the policy that maximizes the expected reward. The reward model is denoted by r, and the β·D_KL term acts as a penalty that limits how far the optimized policy can diverge from the reference policy.
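For readers without the image, the standard KL-regularized RLHF objective, as written in the DPO paper cited in the references, is:

\[
\max_{\pi_\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
\]

Here \(\pi_\theta\) is the policy being optimized, \(\pi_{\mathrm{ref}}\) is the reference (typically SFT) policy, \(r_\phi\) is the learned reward model, and \(\beta\) controls the strength of the KL penalty.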

In DPO, we do something similar. However, instead of training a separate reward model from human preferences, we use the language model itself as an implicit reward model, expressed in terms of the optimal policy (the one that generates the most favored answers) and the reference policy (the baseline we start from).

This process works as follows:

  1. An offline dataset of human preferences is collected during the fine-tuning process.
  2. We optimize the language model directly toward the policy that best explains these preferences, given a particular reference policy and KL-divergence constraint.
  3. We assume the reference model is the same as the supervised fine-tuned (SFT) model; when no completions from the SFT model are available, we instead use the policy that maximizes the likelihood of the preferred completions as the reference.

In essence, DPO collapses the reward-modeling and RL steps of RLHF into a single classification-style loss on the preference data, with the KL divergence between the optimized policy and the reference policy acting as an implicit constraint.
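A minimal sketch of the resulting DPO loss (from the Rafailov et al. paper in the references), assuming you have already computed the summed log-probabilities of the chosen and rejected completions under both the policy being trained and the frozen reference model; the variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss.

    All inputs have shape (batch,) and are log-probabilities of whole
    completions given their prompts. The implicit reward of a completion is
    its log-probability ratio against the reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): a larger margin for the preferred
    # completion means lower loss, with no explicit reward model or RL loop.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

The beta hyperparameter plays the same role as the KL-penalty coefficient in the RLHF objective above: larger values keep the optimized policy closer to the reference model.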

However, it’s essential to understand that human preference isn’t binary. In RLHF and DPO, we typically pick a single winner between a pair of completions, but this doesn’t capture the human experience, which is rarely so clear-cut. This brings us to a recent publication that describes soft preference labels.

Soft Preference Labels

Let’s take an intuitive example: you might have chosen between chocolate and vanilla flavors of ice cream. This subjective preference differs from person to person. However, if you’re the only human labeler for this particular prompt (one that asks for a favorite ice cream flavor), the LLM starts taking your answer as gospel.

We need to account for this difference in people’s preferences and use it to better train LLMs for human use.

This is a paradigm that can be captured through soft preference labels.

Essentially, the authors suggest the following in terms of soft preference labels:

  1. They are a probabilistic score between 0.5 (the two answers are equally preferable) and 1 (one answer has a distinct advantage).
  2. You can use majority voting among annotators, or other methods, to estimate these probabilities (a simple sketch of the majority-voting case follows this list).
  3. Pairs of roughly equally favored answers are down-weighted during learning, while pairs with a distinct preference are weighted more heavily.
    Practically, if the score of two answers is 0.5, the LLM will learn little from that pair, and if it is closer to 1, it will learn more.
  4. A weighted (geometric) average based on these preference scores is taken so the LLM can develop some nuance about human preference for one answer over another.
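As a hypothetical illustration of point 2, here is one simple way to turn annotator votes into a soft preference label in [0.5, 1]; the helper name and interface are assumptions for illustration, not code from the paper:

```python
def soft_preference_label(votes_for_a: int, votes_for_b: int) -> tuple[str, float]:
    """Return the preferred response ('A' or 'B') and a soft label in [0.5, 1].

    The label is the fraction of annotators who picked the winning response:
    0.5 means the annotators were split evenly, 1.0 means unanimous preference.
    """
    total = votes_for_a + votes_for_b
    if total == 0:
        raise ValueError("Need at least one vote")
    p_a = votes_for_a / total
    return ("A", p_a) if p_a >= 0.5 else ("B", 1 - p_a)

# Example: 7 of 10 annotators prefer response A -> soft label 0.7
print(soft_preference_label(7, 3))  # ('A', 0.7)
```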

The authors take these labels and then use Geometric Direct Preference Optimization (GDPO) to get to their solution for RL. This means:

  1. Aside from the terms in the traditional DPO equation, the GDPO loss includes a new factor for the soft preference labels.
  2. These soft preference labels act as discriminators, reducing the weight of response pairs whose preference score is close to 0.5.
  3. The loss function now looks like the following:
Image from Author
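Since the loss function above is shown only as an image, the sketch below gives my reading of how a soft label p̂ can be folded into the DPO loss: weighting the two sides of the margin by p̂ and 1 − p̂ is equivalent to scaling the usual DPO margin by (2p̂ − 1), so pairs labeled near 0.5 contribute almost nothing while near-unanimous pairs contribute fully. Treat this as an illustrative reconstruction consistent with the behavior the authors describe, not necessarily the paper’s exact formulation:

```python
import torch
import torch.nn.functional as F

def gdpo_loss(policy_chosen_logps: torch.Tensor,
              policy_rejected_logps: torch.Tensor,
              ref_chosen_logps: torch.Tensor,
              ref_rejected_logps: torch.Tensor,
              soft_labels: torch.Tensor,   # p_hat in [0.5, 1] for each pair
              beta: float = 0.1) -> torch.Tensor:
    """Soft-label (geometric-averaged) variant of the DPO loss, as a sketch.

    Scaling the margin by (2 * p_hat - 1) means p_hat = 0.5 yields zero
    gradient for that pair, while p_hat = 1 recovers the standard DPO loss.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = chosen_logratio - rejected_logratio
    scaled_margin = (2.0 * soft_labels - 1.0) * margin
    return -F.logsigmoid(beta * scaled_margin).mean()
```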

This proposition is exciting because it increases the nuance in LLMs and provides better outputs overall.

In the paper, the authors demonstrate that including soft preference labels increases the alignment scores of the resulting LLMs and provides a better way to perform RL.

Future Direction and Some Thoughts

The launch of o1-preview shows that RL is one of the core concepts powering today’s LLMs. In this context, it’s essential to realize that current RL algorithms can be limited when viewed through the lens of human preference.

Companies have struggled to develop reward-optimization mechanisms that improve answers when there is no objective answer. Yet clear answers are a rarity in human thought (which explains why o1 shows improvement in math and science but struggles with linguistic and philosophical problems).

As we move towards industry use cases where LLMs will be used for more human conversations (I am a fan of Advanced Voice Mode in ChatGPT), we will need language models that develop more nuance. This is where I think soft preference labels are the right step.

Whether it will be implemented on a large scale is still a rather difficult question to answer, but we have a slightly more nuanced view of human preferences that doesn’t cost a computational arm and leg.

References

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, Direct Preference Optimization: Your Language Model is Secretly a Reward Model, arXiv preprint arXiv:2305.18290, 2023.

H. Furuta, K.-H. Lee, S. S. Gu, Y. Matsuo, A. Faust, H. Zen, and I. Gur, Geometric-Averaged Preference Optimization for Soft Preference Labels, arXiv preprint arXiv:2409.06691, 2024.

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan, Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv preprint arXiv:2204.05862, 2022.


Published via Towards AI
