
Why Is the Bellman Equation So Powerful in RL?
Author(s): Rem E
Originally published on Towards AI.
Go grab a coffee, because what’s coming next might give you a mini headache!

I know it looks scary, but don’t worry, I’ll guide you through it step by step. By the end, it will all make perfect sense!
If you’re not familiar with value functions, make sure to check out this first: Our Neat Value Function.
🧮 Bellman Equations
Our dear friend Bellman took one look at our value function and said, ‘Nah, this isn’t perfect.’ He wanted to enhance and simplify it! So, grab your paper and pen (or your iPad), and let’s have some fun with a little math!

Here we are, starting with our neat value function! Now, let’s break it down a little:

Easy, we just substitute Rₜ with its definition. We're using ∞ here because it covers the general continuing case; if the task is episodic, you can replace it with the finite horizon we discussed earlier.
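Written out (a LaTeX sketch of this step, in the notation from the value-function tutorial):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\big[ R_t \mid s_t = s \big]
           = \mathbb{E}_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\Big|\, s_t = s \Big]
```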

Here, we take the first term of the return summation, rₜ₊₁, out of the sum.
Now the remaining summation runs over rₜ₊₂, rₜ₊₃, …, so the reward index inside becomes (t+2)+k instead of (t+1)+k. But since k still starts at 0, each remaining term is now short exactly one power of the discount factor, and we fix this by factoring a single γ out in front of the whole summation. If this feels unclear, write out the first few terms by hand; it clicks!
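In symbols, this step looks roughly like:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+2} \,\Big|\, s_t = s \Big]
```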
-But wait, why can’t we just start k from 1 instead of 0 and avoid all this?
You’re absolutely right! We could. But this is where the math trickery comes in. Writing it this way will do the magic later, when we build up to the Bellman equation. Trust the process!
Now we will break down the expectation. Our value function is the expectation of the random variable return. This expectation accounts for all sources of randomness, as the return depends on the probabilities of actions, states, and rewards (remember, we are considering the general case where the environment is stochastic).

Just dropping it here in case you need a refresher.
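As a refresher sketch, the expectation of a discrete random variable is just a probability-weighted average, and "breaking down the expectation" means unrolling this weighted average once for each source of randomness:

```latex
\mathbb{E}[X] = \sum_{x} x \, P(X = x)
```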

First, we handle action randomness. To do this, we expand the expectation over the probability of actions. And what provides us with these action probabilities? The policy function π!
So, to account for the randomness in action selection, we average over all possible actions, weighted by their probability under the policy π. That’s why we sum (loop) over all actions using the policy function.
Finally, since actions are explicitly considered in this expansion, we also need to condition on them; hence, we include aₜ=a in the expectation.
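Sketched in LaTeX, the expansion over actions reads:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, \mathbb{E}_{\pi}\Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+2} \,\Big|\, s_t = s,\ a_t = a \Big]
```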

Second, we handle state randomness. To do this, we expand the expectation over the probability of transitions. And yes, we obtain this from our transition function, smarty! Remember, it was an essential element of an MDP, originally written as P(s′∣s, a). Here, we are using the shorthand notation Pᵃₛₛ′, which represents the probability of transitioning from state s to state s′ when taking action a.
So, to account for the randomness in state transitions, we average over all possible next states, weighted by their transition probabilities for a given action a. That’s why we sum (loop) over all transitions. Notice that this results in a double summation: for each action (from the previous step), we also loop over all possible next states.
Finally, since transitions are explicitly considered in this expansion, we also need to condition on them; hence, we include sₜ₊₁=s′ in the expectation.
The second equation is identical but uses a shorthand for the conditions to make it cleaner.
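With both expansions in place, the sketch becomes a double summation:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'}\, \mathbb{E}_{\pi}\Big[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+2} \,\Big|\, s_t = s,\ a_t = a,\ s_{t+1} = s' \Big]
```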

Now we’re done handling action and state randomness, leaving only rewards. We won’t expand rewards further here; you’ll see why later.
In the first line, we simply distribute the expectation over the two terms using the linearity of expectation.
If you look closely at the first term, you'll notice it matches the reward element from the MDP definition: E[rₜ₊₁|sₜ, aₜ, sₜ₊₁]. In the second line, we use the shorthand notation Rᵃₛₛ′, which represents the expected reward received when transitioning from state s to state s′ after taking action a.
Finally, in the second term, we take out the discount factor γ from the expectation using the scalar multiplication rule.
If the final line doesn’t click right away, take a moment to review it again, or drop a question in the comments. We still have to add one last touch!
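At this point, the whole expression looks roughly like this (a sketch in the same notation):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'} \Big( R^{a}_{ss'} + \gamma\, \mathbb{E}_{\pi}\Big[ \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+2} \,\Big|\, s_{t+1} = s' \Big] \Big)
```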

Our last step: see the part inside the yellow box? Does it look familiar? Yes, it’s almost identical to the value function we started with, but now it’s for the next state sₜ₊₁. It’s like we’re calling the value function on the subsequent state s′. This is exactly what we call a recursive relation, where the function is defined in terms of itself.
And voila! That’s the Bellman equation!
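Collecting everything, the Bellman equation for the state-value function reads (a sketch in the notation above):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'} \big( R^{a}_{ss'} + \gamma\, V^{\pi}(s') \big)
```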
In Qπ(s, a), there’s only one difference: we’ve already fixed the first action a. Now, from the next state s′, we continue following the policy:

The policy is inside because after reaching s′, we still need to average over the next action a′ chosen by the policy.
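As a sketch in the same notation:

```latex
Q^{\pi}(s, a) = \sum_{s'} P^{a}_{ss'} \Big( R^{a}_{ss'} + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \Big)
```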
The Bellman equation says that the value of a state is equal to:
- The immediate reward we expect to get now, plus
- The discounted value of the next state we will end up in.
In other words, it breaks down the long-term return into “reward now” + “future rewards later”, where the future rewards are themselves computed using the same value function recursively. This recursive nature is what makes it so powerful: instead of looking infinitely ahead, we can express value in terms of one step ahead plus recursion, which simplifies learning and computation in reinforcement learning.
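To make the recursion concrete, here is a minimal Python sketch of iterative policy evaluation on a made-up two-state MDP; every name (states, actions, P, R, policy, gamma) and all the numbers are illustrative, not something defined in this series:

```python
gamma = 0.9

states = ["s0", "s1"]
actions = ["left", "right"]

# P[s][a] is a list of (next_state, probability); R[s][a][s'] is the expected reward.
P = {
    "s0": {"left": [("s0", 1.0)], "right": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"left": [("s0", 1.0)], "right": [("s1", 1.0)]},
}
R = {
    "s0": {"left": {"s0": 0.0}, "right": {"s1": 1.0, "s0": 0.0}},
    "s1": {"left": {"s0": 0.0}, "right": {"s1": 2.0}},
}
policy = {  # pi(a|s): a uniform random policy, just for illustration
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.5, "right": 0.5},
}

def bellman_backup(V, s):
    """One Bellman backup: sum_a pi(a|s) sum_s' P^a_{ss'} [R^a_{ss'} + gamma * V(s')]."""
    return sum(
        policy[s][a] * sum(p * (R[s][a][s_next] + gamma * V[s_next])
                           for s_next, p in P[s][a])
        for a in actions
    )

# Iterative policy evaluation: apply the backup to every state until values settle.
V = {s: 0.0 for s in states}
for _ in range(200):
    V = {s: bellman_backup(V, s) for s in states}
print(V)
```

Each sweep looks only one step ahead, yet repeating it propagates reward information backward until the values stop changing, which is exactly the "reward now + discounted value of the next state" idea above.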
📈🧮 Optimal Bellman Equations
Yes, we’ve made it to the Bellman equation, but we’re not done yet! There’s another form called the optimal Bellman equation, which we use to determine the optimal policy.
-How do we compare policies?
The policy that yields the greatest return (value function) is considered the optimal policy. There’s always at least one policy that is better than or equal to all others. We denote this optimal policy as π* and its corresponding optimal value function as V*(s), defined as:

- The optimal state-value function V*(s) is the maximum value achievable over all possible policies π, for every state s in the state space S.
- The optimal action-value function Q*(s, a) is the maximum value achievable over all possible policies π, for every state s in the state space S and every action a in the action space A.
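In symbols (a sketch of the two definitions):

```latex
V^{*}(s) = \max_{\pi} V^{\pi}(s) \quad \forall s \in S,
\qquad
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) \quad \forall s \in S,\ a \in A
```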
For the Bellman equation, there’s only one tweak we need to make to turn it into the optimal Bellman equation:

The optimal value of state s is equal to the maximum expected return achievable by choosing the best action and following it thereafter.
So, instead of averaging over actions with a policy π, we now pick the single best action that gives the maximum expected return. Here, we only care about the expected return for that best action, rather than considering all actions. Simple and straightforward, right?
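Putting that tweak into symbols (a sketch in the same notation):

```latex
V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \big( R^{a}_{ss'} + \gamma\, V^{*}(s') \big)
```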
For Q*(s, a), the summation over action probabilities is also replaced with a maximization over actions:
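Sketched out:

```latex
Q^{*}(s, a) = \sum_{s'} P^{a}_{ss'} \Big( R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \Big)
```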

The beauty of the optimal Bellman equation is that it naturally leads to the optimal policy. By taking the max over actions, it directly tells us the best action in each state without needing to search through all policies.
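As a tiny illustration (the Q-values below are made up, not computed from anything in this series), extracting the greedy policy from Q* is just an argmax per state:

```python
# Made-up optimal Q-values for a toy problem with two states and two actions.
Q_star = {
    ("s0", "left"): 1.2, ("s0", "right"): 3.4,
    ("s1", "left"): 0.7, ("s1", "right"): 5.0,
}
states, actions = ["s0", "s1"], ["left", "right"]

# The optimal policy simply picks the action with the highest Q* in each state.
pi_star = {s: max(actions, key=lambda a: Q_star[(s, a)]) for s in states}
print(pi_star)  # {'s0': 'right', 's1': 'right'}
```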
📏🆚✏️ Optimality vs Approximation
The optimal Bellman equation defines the true optimal value function by assuming we can perfectly compute the maximum expected return over all actions and states. However, in practice, solving it exactly is often infeasible for large or continuous state spaces. This is where approximation comes in: instead of computing exact values, we estimate them. While this introduces some error, it allows us to efficiently learn near-optimal policies even in complex environments where exact solutions are impossible.
✅ What We’ve Learned…
This was a lot to take in, but look at what you’ve accomplished!
- You learned how to express value functions using expectations over actions, states, and rewards in stochastic environments.
- We broke down the math step by step to derive the Bellman expectation equations.
- We then moved to the optimal Bellman equations, which naturally lead us to the optimal policy π*.
And of course, you’ve now seen all four Bellman equations, the foundation of many reinforcement learning algorithms!

If you’ve followed along and feel confident, give yourself some credit. This was a deep dive into the core mathematics of RL. Next up, we’ll put these equations to use and finally bring our agent to life!
Thank you all for taking the time to read this, and see you in the next tutorial!
👉Next up: See it all in action!
The Clever Way to Calculate Values, Bellman’s “Secret”
Tutorial-6: This time, we’ll update our values as the agent moves through the maze, using Bellman’s so-called “secret”
👉 Then, take it a step further and see how RL problems are solved!
Dynamic Programming in Reinforcement Learning
Our First Approach to Solving Reinforcement Learning Problems!
✨ As always… stay curious, stay coding, and stay tuned!
📚References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.