
Why Is the Bellman Equation So Powerful in RL?
Author(s): Rem E
Originally published on Towards AI.
Go grab a coffee, because what’s coming next might give you a mini headache!

I know it looks scary, but don’t worry, I’ll guide you through it step by step. By the end, it will all make perfect sense!
If you’re not familiar with value functions, make sure to check out this first: Our Neat Value Function.
🧮 Bellman Equations
Our dear friend Bellman took one look at our value function and said, ‘Nah, this isn’t perfect.’ He wanted to enhance and simplify it! So, grab your paper and pen (or your iPad), and let’s have some fun with a little math!

Here we are, starting with our neat value function! Now, let’s break it down a little:

Easy, we just substitute Rₜ with its definition. We're using ∞ here because it covers the general continuing case; if the task is episodic, you can replace it with the finite horizon we discussed earlier.
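Written out (a LaTeX sketch of this step, in the notation from the value-function tutorial):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\big[ R_t \mid s_t = s \big]
           = \mathbb{E}_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\Big|\, s_t = s \Big]
```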

Here, we take the first term of the return summation, rₜ₊₁, out of the sum.
Now the remaining summation runs over rₜ₊₂, rₜ₊₃, …, so the reward index inside becomes (t+2)+k instead of (t+1)+k. But since k still starts at 0, each remaining term is now short exactly one power of the discount factor, and we fix this by factoring a single γ out in front of the whole summation. If this feels unclear, write out the first few terms by hand; it clicks!
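In symbols, this step looks roughly like:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+2} \,\Big|\, s_t = s \Big]
```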
-But wait, why can’t we just start k from 1 instead of 0 and avoid all this?
You’re absolutely right! We could. But this is where the math trickery comes in. Writing it this way will do the magic later, when we build up to the Bellman equation. Trust the process!
Now we will break down the expectation. Our value function is the expectation of the random variable return. This expectation accounts for all sources of randomness, as the return depends on the probabilities of actions, states, and rewards (remember, we are considering the general case where the environment is stochastic).

Just dropping it here in case you need a refresher.
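As a refresher sketch, the expectation of a discrete random variable is just a probability-weighted average, and "breaking down the expectation" means unrolling this weighted average once for each source of randomness:

```latex
\mathbb{E}[X] = \sum_{x} x \, P(X = x)
```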

First, we handle action randomness. To do this, we expand the expectation over the probability of actions. And what provides us with these action probabilities? The policy function π!
So, to account for the randomness in action selection, we average over all possible actions, weighted by their probability under the policy π. That’s why we sum (loop) over all actions using the policy function.
Finally, since actions are explicitly considered in this expansion, we also need to condition on them; hence, we include aₜ=a in the expectation.
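Sketched in LaTeX, the expansion over actions reads:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, \mathbb{E}_{\pi}\Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+2} \,\Big|\, s_t = s,\ a_t = a \Big]
```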

Second, we handle state randomness. To do this, we expand the expectation over the probability of transitions. And yes, we obtain this from our transition function, smarty! Remember, it was an essential element of an MDP, originally written as P(s′∣s, a). Here, we are using the shorthand notation Pᵃₛₛ′, which represents the probability of transitioning from state s to state s′ when taking action a.
So, to account for the randomness in state transitions, we average over all possible next states, weighted by their transition probabilities for a given action a. That’s why we sum (loop) over all transitions. Notice that this results in a double summation: for each action (from the previous step), we also loop over all possible next states.
Finally, since transitions are explicitly considered in this expansion, we also need to condition on them; hence, we include sₜ₊₁=s′ in the expectation.
The second equation is identical but uses a shorthand for the conditions to make it cleaner.
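With both expansions in place, the sketch becomes a double summation:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'}\, \mathbb{E}_{\pi}\Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+2} \,\Big|\, s_t = s,\ a_t = a,\ s_{t+1} = s' \Big]
```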

Now we’re done handling action and state randomness, leaving only rewards. We won’t expand rewards further here; you’ll see why later.
In the first line, we simply distribute the expectation over the two terms using the linearity of expectation.
If you look closely at the first term, you'll notice it matches the reward element from the MDP definition: E[rₜ₊₁|sₜ, aₜ, sₜ₊₁]. In the second line, we use the shorthand notation Rᵃₛₛ′, which represents the expected reward received when transitioning from state s to state s′ after taking action a.
Finally, in the second term, we take out the discount factor γ from the expectation using the scalar multiplication rule.
If the final line doesn’t click right away, take a moment to review it again, or drop a question in the comments. We still have to add one last touch!
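At this point, the whole expression looks roughly like this (a sketch in the same notation):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'} \Big( R^{a}_{ss'} + \gamma\, \mathbb{E}_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+2} \,\Big|\, s_{t+1} = s' \Big] \Big)
```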

Our last step: see the part inside the yellow box? Does it look familiar? Yes, it’s almost identical to the value function we started with, but now it’s for the next state sₜ₊₁. It’s like we’re calling the value function on the subsequent state s′. This is exactly what we call a recursive relation, where the function is defined in terms of itself.
And voila! That’s the Bellman equation!
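Collecting everything, the Bellman equation for the state-value function reads (a sketch in the notation above):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'} \big( R^{a}_{ss'} + \gamma\, V^{\pi}(s') \big)
```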
In Qπ(s, a), there’s only one difference: we’ve already fixed the first action a. Now, from the next state s′, we continue following the policy:

The policy is inside because after reaching s′, we still need to average over the next action a′ chosen by the policy.
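As a sketch in the same notation:

```latex
Q^{\pi}(s, a) = \sum_{s'} P^{a}_{ss'} \Big( R^{a}_{ss'} + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \Big)
```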
The Bellman equation says that the value of a state is equal to:
- The immediate reward we expect to get now, plus
- The discounted value of the next state we will end up in.
In other words, it breaks down the long-term return into “reward now” + “future rewards later”, where the future rewards are themselves computed using the same value function recursively. This recursive nature is what makes it so powerful: instead of looking infinitely ahead, we can express value in terms of one step ahead plus recursion, which simplifies learning and computation in reinforcement learning.
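To make the recursion concrete, here is a minimal Python sketch of iterative policy evaluation on a made-up two-state MDP; every name (states, actions, P, R, policy, gamma) and all the numbers are illustrative, not something defined in this series:

```python
gamma = 0.9

states = ["s0", "s1"]
actions = ["left", "right"]

# P[s][a] is a list of (next_state, probability); R[s][a][s'] is the expected reward.
P = {
    "s0": {"left": [("s0", 1.0)], "right": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"left": [("s0", 1.0)], "right": [("s1", 1.0)]},
}
R = {
    "s0": {"left": {"s0": 0.0}, "right": {"s1": 1.0, "s0": 0.0}},
    "s1": {"left": {"s0": 0.0}, "right": {"s1": 2.0}},
}
policy = {  # pi(a|s): a uniform random policy, just for illustration
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}

def bellman_backup(V, s):
    """One Bellman backup: sum_a pi(a|s) sum_s' P^a_{ss'} [R^a_{ss'} + gamma * V(s')]."""
    return sum(
        policy[s][a] * sum(p * (R[s][a][s_next] + gamma * V[s_next])
                           for s_next, p in P[s][a])
        for a in actions
    )

# Iterative policy evaluation: apply the backup to every state until values settle.
V = {s: 0.0 for s in states}
for _ in range(200):
    V = {s: bellman_backup(V, s) for s in states}
print(V)
```

Each sweep looks only one step ahead, yet repeating it propagates reward information backward until the values stop changing, which is exactly the "reward now + discounted value of the next state" idea above.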
📈🧮 Optimal Bellman Equations
Yes, we’ve made it to the Bellman equation, but we’re not done yet! There’s another form called the optimal Bellman equation, which we use to determine the optimal policy.
-How do we compare policies?
The policy that yields the greatest return (value function) is considered the optimal policy. There’s always at least one policy that is better than or equal to all others. We denote this optimal policy as π* and its corresponding optimal value function as V*(s), defined as:

- The optimal state-value function V*(s) is the maximum value achievable over all possible policies π, for every state s in the state space S.
- The optimal action-value function Q*(s, a) is the maximum value achievable over all possible policies π, for every state s in the state space S and every action a in the action space A.
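In symbols (a sketch of the two definitions):

```latex
V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \forall s \in S,
\qquad
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) \quad \forall s \in S,\ a \in A
```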
For the Bellman equation, there’s only one tweak we need to make to turn it into the optimal Bellman equation:

The optimal value of state s is equal to the maximum expected return achievable by choosing the best action and following it thereafter.
So, instead of averaging over actions with a policy π, we now pick the single best action that gives the maximum expected return. Here, we only care about the expected return for that best action, rather than considering all actions. Simple and straightforward, right?
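Putting that tweak into symbols (a sketch in the same notation):

```latex
V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \big( R^{a}_{ss'} + \gamma\, V^{*}(s') \big)
```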
For Q*(s, a), the summation over action probabilities is also replaced with a maximization over actions:
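Sketched out:

```latex
Q^{*}(s, a) = \sum_{s'} P^{a}_{ss'} \Big( R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \Big)
```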

The beauty of the optimal Bellman equation is that it naturally leads to the optimal policy. By taking the max over actions, it directly tells us the best action in each state without needing to search through all policies.
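As a tiny illustration (the Q-values below are made up, not computed from anything in this series), extracting the greedy policy from Q* is just an argmax per state:

```python
# Made-up optimal Q-values for a toy problem with two states and two actions.
Q_star = {
    ("s0", "left"): 1.2, ("s0", "right"): 3.4,
    ("s1", "left"): 0.7, ("s1", "right"): 5.0,
}
states, actions = ["s0", "s1"], ["left", "right"]

# The optimal policy simply picks the action with the highest Q* in each state.
pi_star = {s: max(actions, key=lambda a: Q_star[(s, a)]) for s in states}
print(pi_star)  # {'s0': 'right', 's1': 'right'}
```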
📏🆚✏️ Optimality vs Approximation
The optimal Bellman equation defines the true optimal value function by assuming we can perfectly compute the maximum expected return over all actions and states. However, in practice, solving it exactly is often infeasible for large or continuous state spaces. This is where approximation comes in: instead of computing exact values, we estimate them. While this introduces some error, it allows us to efficiently learn near-optimal policies even in complex environments where exact solutions are impossible.
✅ What We’ve Learned…
This was a lot to take in, but look at what you’ve accomplished!
- You learned how to express value functions using expectations over actions, states, and rewards in stochastic environments.
- We broke down the math step by step to derive the Bellman expectation equations.
- We then moved to the optimal Bellman equations, which naturally lead us to the optimal policy π*.
And of course, you’ve now seen all four Bellman equations, the foundation of many reinforcement learning algorithms!

If you’ve followed along and feel confident, give yourself some credit. This was a deep dive into the core mathematics of RL. Next up, we’ll put these equations to use and finally bring our agent to life!
Thank you all for taking the time to read this, and see you in the next tutorial!
👉Next up: See it all in action!
The Clever Way to Calculate Values, Bellman’s “Secret”
Tutorial-6: This time, we’ll update our values as the agent moves through the maze, using Bellman’s so-called “secret”
👉 Then, take it a step further and see how RL problems are solved!
Dynamic Programming in Reinforcement Learning
Our First Approach to Solving Reinforcement Learning Problems!
✨ As always… stay curious, stay coding, and stay tuned!
📚References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.