Build Your Own RLHF LLM — Forget Human Labelers!
Author(s): Tim Cvetko
Originally published on Towards AI.
You know, that thing OpenAI used to turn GPT-3.5 into ChatGPT? You can do the same without asking strangers to rank statements.
I would never have guessed that the next big revolution in AI would happen on the text front. As an early adopter of the BERT models in 2018, I wasn't exactly convinced computers could interpret human language with the same granularity and contextuality as people do. Since then, three major breakthroughs have shaped the Textual Revolution:
- Self-attention: the ability to learn contextual representations of sentences (a minimal sketch follows this list).
- Large Transformer models (GPTs): the ability to learn from massive corpora of data and build conversational awareness.
- Reinforcement Learning from Human Feedback (RLHF): the ability to improve LLM outputs using human preferences. However, this method is not easily replicable because it requires large numbers of human labelers.
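The first of those breakthroughs is easy to show in a few lines. Below is a minimal sketch of single-head scaled dot-product self-attention in PyTorch; the weight matrices, dimensions, and toy inputs are my own illustration, not code from this article's walkthrough:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    # x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise token affinities
    weights = F.softmax(scores, dim=-1)       # each token attends over all tokens
    return weights @ v                        # contextualized token representations

# Toy usage: 5 tokens, 16-dim embeddings, one 8-dim attention head
d_model, d_head, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 8])
```

Each output row is a mixture of every token's value vector, weighted by how relevant the other tokens are to it; that is the "contextual representation" the list above refers to.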
Forget Human Labelers!
Image by Author

In this article:
- How GPT-3.5 used RLHF to reinforce the LLM and make it ChatGPT
- Complete Code Walkthrough: Train Your Own RLHF Model
- Complete Code Walkthrough: How to Make the LLM Reinforce Itself Without Human Labelers, i.e., Self-Play LLMs
Reinforcement learning from human feedback (RLHF) refers to using human preference labels to train a reward model, which the LLM is then optimized against. Here's how people act as judges:
Suppose we have a post from Reddit: "The cat is flying through the air":
- Two summaries are selected for evaluation.
- A human judge decides which is the better summary.

… Read the full blog for free on Medium.
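The full walkthrough is on Medium, but the heart of the idea, turning those pairwise judgments into a reward signal, can be sketched briefly. The snippet below trains a reward model on preference pairs with the standard pairwise (Bradley-Terry-style) ranking loss used in RLHF; the precomputed embeddings, the linear scoring head, and all dimensions are simplifications I've assumed for illustration, since a real reward model shares the LLM's backbone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Hypothetical reward model: scores a summary embedding with one scalar."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.score = nn.Linear(emb_dim, 1)

    def forward(self, emb):
        return self.score(emb).squeeze(-1)  # scalar reward per summary

reward_model = RewardModel()
optim = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Toy batch of preference pairs: emb_chosen is the summary the judge preferred.
emb_chosen = torch.randn(32, 128)
emb_rejected = torch.randn(32, 128)

# Pairwise ranking loss: push the chosen summary's reward above the rejected one's.
r_chosen = reward_model(emb_chosen)
r_rejected = reward_model(emb_rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()

optim.zero_grad()
loss.backward()
optim.step()
print(f"pairwise ranking loss: {loss.item():.3f}")
```

Once the reward model agrees with the judges, the LLM is fine-tuned (typically with PPO) to maximize that reward; in the self-play variant this article builds toward, the judging step itself is handed to an LLM instead of a human.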