
Let's break down Reinforcement Learning! — How it connects with the latest LLM reasoning models such as DeepSeek R1
Last Updated on February 10, 2025 by Editorial Team
Author(s): Ramendra Singla
Originally published on Towards AI.
Around two years ago, I made the tough decision to drop out of the OMSCS program at Georgia Tech, where I was pursuing an M.S. in Computer Science (Machine Learning). The reason? I deliberately chose Reinforcement Learning as my first subject — not because I had to, but because I wanted to challenge myself with something entirely new. At the time, it felt like a great decision. Boy, was I wrong!
Despite having a solid foundation in AI and ML and working as a Machine Learning Engineer, the sheer depth of the math in RL was overwhelming. Fast forward to 2025, and RL is now a key player in the race toward Artificial General Intelligence.
This blog is my attempt to break down its complex math, making it more accessible for future students and enthusiasts.
Intuition
Mastering the Art of Making Coffee ☕
As a coffee person, I've always been fascinated by the search for that PERFECT coffee recipe. Now imagine you don't have a fixed recipe — you have to experiment with different ingredients and techniques to find what works best.
1. States (S) — “Where am I in the coffee-making process?”
At every step, you are in a specific state:
- No coffee made yet ☕
- Coffee beans ground but not brewed
- Coffee brewed but too bitter
2. Actions (A) — “What can I do next?”
You have different actions available:
- Add more coffee grounds
- Use a different brewing method (French press, espresso, pour-over)
- Adjust the water temperature
- Add sugar or milk
3. Rewards (R) — “Did my coffee taste good or bad?”
Each action has consequences:
- Too weak coffee? → -5 points (Bad experience 🤢)
- Perfectly balanced coffee? → +10 points (Delicious! ☕😍)
- Too bitter? → -3 points (Might need sugar or milk!)
The goal is to maximize rewards — making the best-tasting coffee.
4. Exploration vs. Exploitation — “Should I try new methods or stick to what I know?”
- At first, you explore different coffee-making techniques: trying various beans, brewing times, and temperatures.
- Over time, you exploit the best techniques (FYI, using 90°C water and freshly ground beans makes the best coffee 😊).
5. The Goal — “Becoming a Coffee Master!”
As you make more cups of coffee, you refine your process, learning which actions lead to the best outcome. Eventually, you develop an optimal policy — a strategy that consistently results in a great cup of coffee.
How This Relates to AI & RL Algorithms
- You (RL agent) learn from experience by trying different coffee-making techniques.
- The taste of the coffee provides feedback (rewards), guiding your future decisions.
- Over time, you optimize your actions to make the perfect cup every time, essentially developing an optimized policy that guides your choices.
This is exactly how AI agents learn tasks like playing chess, optimizing supply chains, or training robotic arms — through repeated trials, feedback, and gradual improvement.
In essence, RL is about experimenting and refining a process — just like learning to make the perfect cup of coffee.☕🔥
Markov Decision Process
A Reinforcement Learning problem is typically modeled as a Markov Decision Process (MDP), defined by:
- States (S): The different stages in coffee-making (as discussed above)
- Actions (A): The choices available at each step (as discussed above)
- Rewards (R): The feedback received after making a decision (as discussed above)
- Transition Probabilities (P) — “The likelihood of moving from one state to another after taking an action”
If you brew with water at 90°C, there is a high probability of a good brew. If you use boiling water (100°C), there is a chance of over-extraction, making the coffee bitter. If you brew for too short a time, you might end up with under-extracted coffee.
- Discount Factor (γ) — “How much do I care about future rewards?”
The discount factor γ (0 < γ ≤ 1) determines how much future rewards influence your decision.
– If γ = 0, you only care about immediate rewards (i.e., you just want coffee fast).
– If γ is close to 1, you care about long-term results (i.e., you experiment more to make the best coffee in the long run).
A high γ means you’re willing to invest time in learning a better brewing process rather than rushing.
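Putting these pieces together, here is a minimal Python sketch of a toy coffee-making MDP. All the state names, probabilities, and rewards are illustrative assumptions, not values from any real system:

```python
# A toy coffee-making MDP: states, actions, transition probabilities,
# rewards, and a discount factor. All numbers are illustrative guesses.

# S: where we are in the coffee-making process
states = ["no_coffee", "beans_ground", "brewed_good", "brewed_bitter"]

# A: what we can do next
actions = ["grind_beans", "brew_90C", "brew_100C"]

# P: transition probabilities, P[state][action] -> {next_state: probability}
transitions = {
    "no_coffee":    {"grind_beans": {"beans_ground": 1.0}},
    "beans_ground": {"brew_90C":  {"brewed_good": 0.8, "brewed_bitter": 0.2},
                     "brew_100C": {"brewed_good": 0.3, "brewed_bitter": 0.7}},
}

# R: feedback (taste test) for landing in each state
rewards = {"no_coffee": 0, "beans_ground": 0, "brewed_good": 10, "brewed_bitter": -3}

# gamma: how much we care about future rewards
gamma = 0.9
```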
Bellman Equation
Let's try interconnecting all the parameters defined above.
The value function estimates how good a given state is in terms of future rewards.
State-Value Function (V(s)): "How much reward can I expect if I start from this state and make optimal choices?"
Mathematically,
V(s) = E[R_t + γ · V(s')]
which means:
- The value of a state s is the expected reward R_t plus the discounted value of the next state s’.
- This helps estimate whether it’s better to grind beans first or boil water first.
Action-Value Function (Q(s, a)): “How much reward can I expect if I take action a in state s?”
Q(s,a) estimates how good an action a is in a given state s by considering both immediate rewards and future rewards;
Example: If you choose espresso over French press, Q(s,a) will reflect whether this leads to a better coffee experience in the long run.
It helps compare actions by considering their impact on future states and rewards, guiding optimal decision-making.
Example: Using coarse grind vs. fine grind — if fine grind consistently leads to richer coffee, Q(s,a) for “fine grind” will be higher, making it the preferred action.
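To see how V(s) and Q(s, a) fit together through the Bellman equation, here is a small value-iteration sketch on a stripped-down version of the coffee MDP above. The transition probabilities and rewards are the same illustrative assumptions, not real measurements:

```python
# Value iteration on a tiny coffee MDP: repeatedly apply Bellman backups
# until V(s) stops changing. All probabilities/rewards are illustrative.
gamma = 0.9

# P[s][a] = list of (next_state, probability); R[s] = reward for being in s
P = {
    "beans_ground": {"brew_90C":  [("brewed_good", 0.8), ("brewed_bitter", 0.2)],
                     "brew_100C": [("brewed_good", 0.3), ("brewed_bitter", 0.7)]},
    "brewed_good": {},      # terminal state
    "brewed_bitter": {},    # terminal state
}
R = {"beans_ground": 0, "brewed_good": 10, "brewed_bitter": -3}

V = {s: 0.0 for s in P}
for _ in range(50):
    for s in P:
        if not P[s]:                 # terminal: value is just its reward
            V[s] = R[s]
            continue
        # Q(s, a) = R(s) + gamma * sum over s' of P(s' | s, a) * V(s')
        Q = {a: R[s] + gamma * sum(p * V[s2] for s2, p in P[s][a]) for a in P[s]}
        V[s] = max(Q.values())       # V(s) = max over a of Q(s, a)

print(V)   # V("beans_ground") ends up reflecting the better action, brew_90C
```

Running this, Q("beans_ground", "brew_90C") ≈ 6.66 beats Q("beans_ground", "brew_100C") ≈ 0.81, so the 90°C brew is the action an optimal policy would pick.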
Exploration vs. Exploitation
At the start, you have no idea what makes the best coffee.
- Exploration: Trying different methods (experimenting with grind size, brewing time).
- Exploitation: Once you find a good recipe, you use it consistently.
A common RL technique is the ε-greedy strategy (see the sketch after this list), where:
- With probability ε, you try something new (exploration).
- With probability 1 − ε, you choose the best-known option (exploitation).
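A minimal ε-greedy sketch in Python, assuming we keep a running estimate of the average taste score for each brewing method (the method names and scores are made up for illustration):

```python
import random

# Estimated average taste score per brewing method (illustrative values)
q_estimates = {"french_press": 7.2, "espresso": 8.1, "pour_over": 6.5}
epsilon = 0.1   # explore 10% of the time

def choose_action(q, eps):
    """Epsilon-greedy: explore with probability eps, otherwise exploit."""
    if random.random() < eps:
        return random.choice(list(q))   # exploration: try any method
    return max(q, key=q.get)            # exploitation: best-known method

print("Brewing with:", choose_action(q_estimates, epsilon))
```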
Policy Optimization — “Finding the perfect recipe…”
A policy (π) is your strategy for making coffee, mapping each state to the best action.
- A bad policy: Always using boiling water (100°C), leading to bitter coffee.
- A good policy: Choosing the right grind, temperature, and brewing method for great results.
Policy optimization methods such as PPO and GRPO (along with value-based methods such as Q-Learning) help refine this process, as sketched after this list:
- Sample different brewing strategies.
- Evaluate each one based on taste (reward).
- Adjust the approach to maximize future rewards.
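To make the sample → evaluate → adjust loop concrete, here is a rough REINFORCE-style policy-gradient sketch with a softmax policy over brewing methods. The "true taste" values and hyperparameters are invented for illustration, and real methods like PPO and GRPO add clipping, batching, and other machinery on top of this basic idea:

```python
import numpy as np

rng = np.random.default_rng(0)
actions = ["french_press", "espresso", "pour_over"]
theta = np.zeros(len(actions))             # policy parameters: one logit per method
true_taste = np.array([7.0, 8.0, 6.0])     # hidden average taste (made-up numbers)
lr, baseline = 0.05, 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(len(actions), p=probs)            # 1. sample a brewing strategy
    reward = true_taste[a] + rng.normal(0.0, 1.0)    # 2. taste it (noisy reward)
    baseline += 0.05 * (reward - baseline)           #    running average as a baseline
    grad_log_pi = -probs                             # 3. gradient of log pi(a | theta)
    grad_log_pi[a] += 1.0
    theta += lr * (reward - baseline) * grad_log_pi  #    nudge toward tastier actions

print(dict(zip(actions, softmax(theta).round(3))))   # espresso should end up favored
```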
Group Relative Policy Optimization (GRPO)
When making the perfect cup of coffee, the best approach isn’t always about following a rigid recipe — it’s about comparing different brews and choosing the best one. This is exactly how Group Relative Policy Optimization (GRPO) works in reinforcement learning. Instead of relying on an absolute scoring system, GRPO refines decision-making by ranking actions relative to one another within a sampled group. By focusing on relative comparisons rather than estimating a separate value function, GRPO efficiently optimizes strategies — whether it’s perfecting coffee recipes or training large-scale AI models like DeepSeek.
1. Define the State:
The current state encompasses all variables involved in brewing, such as water temperature, grind size, brewing method, and time.
2. Sample a Group of Actions:
From the current state, select a group of different brewing actions to test. For example:
Action A: Use a medium grind with a French press at 95°C for 4 minutes.
Action B: Use a fine grind with an espresso machine at 93°C for 30 seconds.
Action C: Use a coarse grind with a pour-over at 90°C for 3 minutes.
3. Evaluate the Actions:
Brew coffee using each action and assign a reward based on the taste quality. For instance:
Action A: Reward = 7/10
Action B: Reward = 8/10
Action C: Reward = 6/10
4. Calculate Relative Advantages:
Compute the mean (μ) and standard deviation (σ) of the rewards:
μ = (7 + 8 + 6) / 3 = 7
σ = 1 (using the sample standard deviation)
Determine the advantage of each action relative to the group (a code sketch of this calculation follows the list below):
Advantage(A) = (7 − 7) / 1 = 0
Advantage(B) = (8 − 7) / 1 = 1
Advantage(C) = (6 − 7) / 1 = −1
5. Update the Policy:
Adjust your brewing strategy to favor actions with higher relative advantages. In this case, Action B (fine grind with an espresso machine) has the highest advantage and should be preferred in future brewing attempts.
6. Iterate the Process:
Repeat the sampling, evaluation, and policy update steps to continually refine your brewing technique, progressively moving towards the optimal coffee recipe.
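Here is the relative-advantage calculation from steps 3-4 as a short Python sketch. It uses the sample standard deviation (which is exactly 1 for these rewards); GRPO implementations normalize rewards within each sampled group in the same spirit:

```python
import statistics

# Taste scores for the sampled group of brewing actions (from step 3)
group_rewards = {"A_french_press": 7, "B_espresso": 8, "C_pour_over": 6}

mu = statistics.mean(group_rewards.values())       # 7
sigma = statistics.stdev(group_rewards.values())   # 1 (sample standard deviation)

# Group-relative advantage: how much better or worse each action is than its peers
advantages = {a: (r - mu) / sigma for a, r in group_rewards.items()}
print(advantages)   # {'A_french_press': 0.0, 'B_espresso': 1.0, 'C_pour_over': -1.0}

# Step 5 would then nudge the policy toward positive-advantage actions (B_espresso)
# and away from negative-advantage ones (C_pour_over).
```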
Beyond coffee, GRPO has proven highly effective in training Large Language Models (LLMs) for reasoning tasks by removing reliance on a value function. Traditional RL methods like PPO depend on critic-based value estimation, which can introduce high variance and instability due to bias in reward scaling and error propagation from imperfect value functions. GRPO eliminates these issues by ranking sampled responses within a batch rather than estimating absolute values, ensuring gradient updates are more stable and less sensitive to noisy reward signals. This makes GRPO particularly effective for training models like DeepSeekMath, where precise step-by-step reasoning is required, as it prevents reward hacking (optimizing for misleading absolute scores) and instead encourages models to consistently improve relative performance across sampled outputs. By focusing on comparative optimization, GRPO leads to more robust convergence, improved sample efficiency, and better generalization in LLM reasoning tasks.
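For readers who want the underlying math, a condensed form of the GRPO objective (simplified from the DeepSeekMath-style formulation, so treat it as a sketch rather than the exact equation) pairs the group-normalized advantage with a PPO-like clipped ratio and a KL penalty toward a reference policy:

```latex
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}, \qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\text{old}}(o_i \mid q)}

\mathcal{J}_\text{GRPO}(\theta) \approx
\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\Big)
\;-\; \beta\, D_\text{KL}\!\big(\pi_\theta \,\|\, \pi_\text{ref}\big)
```

The key difference from PPO is that the advantage Â_i comes from comparing the G sampled responses against each other, not from a learned critic.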
For a more detailed explanation of GRPO, you can refer to this resource: Group Relative Policy Optimization (GRPO) Illustrated Breakdown & Explanation. Thanks to Ebrahim Pichka for this simplified explanation of GRPO and how it's being used for LLM reasoning training.
Conclusion
Reinforcement Learning, at its core, mirrors how we learn from experience — whether it’s brewing the perfect cup of coffee, learning to drive, or mastering any skill through trial and feedback. By breaking down RL into real-world analogies, we can demystify its complexities and make key concepts like states, actions, rewards, and policy optimization more intuitive.
The journey of understanding RL doesn’t end here — it’s an ever-evolving field with exciting applications ahead. The next time you experiment with something new, remember: you’re training your own RL model, one experience at a time.🚀