Learn Reinforcement Learning from Human Feedback (RLHF): Your 9-Hour Study Plan

Last Updated on January 31, 2025 by Editorial Team

Author(s): Peyman Kor

Originally published on Towards AI.

A visual explanation of RLHF (Image Source: Author)

Introduction

This blog post is for people who have heard the term Reinforcement Learning from Human Feedback (RLHF) and are curious about this method. In this blog, you will learn how RLHF works, why it matters, and how you can master the fundamentals in just 9 hours.

In short, RLHF was developed to transform Large Language Models (LLMs) from raw AI models into AI assistants. That is the core idea. It means that AI companies (OpenAI, Google, Anthropic, etc.) can build AI products that are more helpful and useful for their customers. Indeed, that is a crucial aspect of any product development strategy: ultimately, you want to build products that are useful for your human customers.

Background on Foundation Models

Now let's see how modern Large Language Models (LLMs) are trained and developed. The training process of these models typically involves three phases:

Phase 1: Pre-training

In this phase, the model learns from vast amounts of internet data (typically trillions of tokens) to predict the next word in a sequence. Think of it as learning the statistical patterns of language. For example, given the phrase "My favorite football team in the UK is the ___", the model learns that "Manchester United" or "Liverpool" are more likely completions (higher probability) than "car" or "pizza". This phase helps the model develop a basic understanding of language structure and meaning. To get an intuition about how the probability of the next word is predicted, I encourage you to go and play with the Transformer Explainer website (https://poloclub.github.io/transformer-explainer/). You add a sentence at the top, and on the far right side, you can see the probability of the next word.

How transformers work (Phase 1): A visual guide (Image Source: Author, adapted from the public website https://poloclub.github.io/transformer-explainer/)

If you look at the far right part of the above figure, you can see the likelihood of each word completing the sentence:

"My favorite football team in the UK is the ___"

As you can see, the probabilities for completing the sentence are:

  • Manchester: around 6%
  • Rams: around 5%
  • and so on
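If you want to poke at these next-word probabilities yourself outside the browser demo, here is a minimal sketch using the Hugging Face transformers library. GPT-2 is used only as a convenient stand-in model (the Transformer Explainer runs its own GPT-2 variant), so the exact numbers will differ from the figure.

```python
# Minimal sketch: inspect next-token probabilities with a small causal LM.
# Assumes `transformers` and `torch` are installed; GPT-2 is only a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "My favorite football team in the UK is the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]            # logits for the next position
probs = torch.softmax(next_token_logits, dim=-1)

# Show the five most likely continuations and their probabilities.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>15s}  {p.item():.3f}")
```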

Phase 2: Supervised Fine-tuning (SFT)

Language is inherently flexible: there are many valid ways to respond to any prompt. For example, if asked "How do I make pasta?", valid responses could include:

A) Suggesting you learn by watching YouTube videos

B) Listing the ingredients, explaining the cooking steps, and suggesting serving sizes.

And of course, as a user, you prefer option B. In the Supervised Fine-tuning stage (Phase 2), we show the model good examples of questions and answers to teach it how to respond well. The data used in this phase is typically collected from highly skilled human labelers, so it is high-quality data. Here is an example from a supervised fine-tuning dataset:

An example of supervised fine-tuning (SFT) data (Image Source: Author, originally from the OpenAI paper [1])
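To make the training objective concrete, here is a minimal sketch of a single SFT step with a Hugging Face causal LM: the prompt and the desired answer are concatenated, and the cross-entropy loss is computed only on the answer tokens (prompt positions are masked with -100 in the labels). The pasta example and the use of GPT-2 are placeholders for illustration, not the setup from the OpenAI paper.

```python
# Minimal sketch of one supervised fine-tuning (SFT) step on a single
# prompt/response pair. GPT-2 and the example pair are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "How do I make pasta?\n"
response = "Boil salted water, cook the pasta for 8-10 minutes, then drain and add sauce."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
response_ids = tokenizer(response + tokenizer.eos_token, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Labels: -100 on prompt tokens so the loss is only taken on the response.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss            # cross-entropy over the response tokens only
loss.backward()                # in practice this runs inside an optimizer loop
print(f"SFT loss: {loss.item():.3f}")
```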

Phase 3: Reinforcement Learning from Human Feedback (RLHF)

Even after Phase 2, some responses are better than others from a human perspective. In Phase 3, we show humans several of the model's responses and ask them to rate them. Essentially, we would like to optimize the model so that its outputs get higher ratings from humans. This stage, which brings human feedback into the loop, is called Reinforcement Learning from Human Feedback (RLHF).

Here is a simple example. We give the prompt "Explain the moon landing to a 6-year-old in a few sentences" to the LLM. The LLM gives two possible answers, and we ask humans which one they prefer. This preference then becomes a "signal" for the LLM, so that it learns to produce more of the responses humans would like to see.

Human preference (Image Source: Adapted from a public presentation, available here)
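In practice, these human comparisons are usually distilled into a reward model: a network that scores a (prompt, response) pair so that the preferred answer receives a higher score than the rejected one; the policy is then optimized (typically with PPO) to produce responses the reward model scores highly. Below is a minimal sketch of that pairwise preference loss in plain PyTorch; the toy linear "reward model" and random features are invented purely for illustration and are not any lab's actual architecture.

```python
# Minimal sketch of the pairwise preference loss used to train a reward model.
# The "reward model" is a toy linear layer over made-up feature vectors;
# real systems score (prompt, response) pairs with a fine-tuned LLM head.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
reward_model = torch.nn.Linear(16, 1)   # toy stand-in for an LLM + scalar head

# Pretend features for the human-preferred ("chosen") and rejected responses.
chosen_features = torch.randn(4, 16)    # batch of 4 comparisons
rejected_features = torch.randn(4, 16)

r_chosen = reward_model(chosen_features).squeeze(-1)
r_rejected = reward_model(rejected_features).squeeze(-1)

# Bradley-Terry style loss: push the chosen reward above the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(f"preference loss: {loss.item():.3f}")
```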

Now with this background, let's see how we can learn about RLHF.

Study Roadmap

To help you navigate this topic, I've organized learning resources into five main categories.

  1. Introductory Tutorials
  2. Key Research Papers
  3. Hands-on Code
  4. Online Courses
  5. RLHF Book

Let's get started:

1. Introductory Tutorials

Here, the goal is to get a high-level understanding of RLHF. I suggest you start with Chip Huyen's blog post on RLHF; it is a great introduction to the topic. Then you can continue with the rest. I should note that Nathan Lambert is quite active on Twitter about this topic, and I suggest following him if you want to stay up to date with RLHF.

Resources:

  1. 📚 20m: Introduction to RLHF by Chip Huyen [Read here]
  2. 📺 40m: RLHF: From Zero to ChatGPT by Nathan Lambert [Watch here]
  3. 📺 60m: Lecture on RLHF by Hyung Won Chung [See here]

2. Key Research Papers

Here are some important research papers that I think are worth reading to get a deeper understanding of RLHF.

Resources:

– 📚 30m: InstructGPT Paper: Applying RLHF to a general language model [Read here]
– 📚 30m: DPO Paper by Rafael Rafailov [Read here] (the core DPO loss is sketched after this list)
– 📚 30m: Artificial Intelligence, Values, and Alignment [Read here]
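Since the DPO paper is the most math-heavy item on this list, here is a minimal sketch of its core loss in plain PyTorch, assuming you already have summed log-probabilities of each response under the policy and under a frozen reference model. The tensors below are random placeholders that only show the shape of the computation.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss.
# logp_* are summed log-probabilities of whole responses; here they are
# random placeholders standing in for values computed from real models.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
beta = 0.1                                   # strength of the KL-like penalty

# Policy log-probs for the preferred (chosen) and rejected responses.
policy_logp_chosen = torch.randn(4)
policy_logp_rejected = torch.randn(4)

# Frozen reference-model log-probs for the same responses.
ref_logp_chosen = torch.randn(4)
ref_logp_rejected = torch.randn(4)

# Implicit "rewards" are log-ratios between the policy and the reference.
chosen_logratio = policy_logp_chosen - ref_logp_chosen
rejected_logratio = policy_logp_rejected - ref_logp_rejected

# DPO loss: prefer the chosen response without a separate reward model.
loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
print(f"DPO loss: {loss.item():.3f}")
```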

3. Hands-on Code

Here I am adding some code that you can run on your local machine (though unfortunately only if you have a lot of GPUs :)) to get hands-on experience with RLHF. I would like to note that TRL seems to be a nice library that has been widely used for RLHF, and it may well be the go-to library when it comes to RLHF implementation (a minimal usage sketch follows the list below).

– 📚 2h: Detoxifying a Language Model using PPO [Read here]
– 📚 2h: RLHF with DPO & Hugging Face [Read here]
– 📚 1h: TRL Library for Fast Implementation: Minimal and efficient [Read here]
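As a taste of what working with TRL looks like, here is a heavily simplified sketch of preference fine-tuning with TRL's DPOTrainer. The three-example dataset is invented, GPT-2 is only a stand-in model, and argument names have changed across TRL versions (older releases pass the tokenizer via tokenizer= instead of processing_class=), so treat this as a sketch rather than a recipe.

```python
# Minimal sketch of preference fine-tuning with TRL's DPOTrainer.
# The tiny dataset is made up; real runs use thousands of human
# preference pairs and a much larger base model.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# DPO expects prompt / chosen / rejected columns.
train_dataset = Dataset.from_dict({
    "prompt": ["Explain the moon landing to a 6-year-old in a few sentences."] * 3,
    "chosen": ["People went to the moon in a big rocket and walked around."] * 3,
    "rejected": ["The moon landing involved complex orbital mechanics and telemetry."] * 3,
})

args = DPOConfig(output_dir="dpo-sketch", per_device_train_batch_size=1, num_train_epochs=1)

trainer = DPOTrainer(
    model=model,                 # a frozen reference copy is created internally
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer=
)
trainer.train()
```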

4. Online Course

If you are looking for a more structured learning experience, I recommend taking an online course. Here is a course that covers RLHF:

  • 📚 1h: Reinforcement Learning from Human Feedback by Nikita Namjoshi [Course link]

5. RLHF Book

Finally, if you want to dive deep into RLHF, I recommend reading the book A Little Bit of Reinforcement Learning from Human Feedback (https://rlhfbook.com/book.pdf) by Nathan Lambert. The book is a work in progress, and you can read it online for free.

Conclusion

I hope this study guide helps you get started with Reinforcement Learning from Human Feedback (RLHF). It is becoming more evident that reinforcement learning is going to play an ever-increasing role in making LLMs better, especially with the release of DeepSeek, which uses a pure RL framework for training.

I hope you liked the blog, and remember that the key to learning any topic is consistency and curiosity!

References:

[1]: Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730–27744.


Published via Towards AI
