
Why Is the Bellman Equation So Powerful in RL?

Author(s): Rem E

Originally published on Towards AI.

Go grab a coffee, because what’s coming next might give you a mini headache!

Our Agent Still Struggling, Source: Generated by ChatGPT

I know it looks scary, but don’t worry, I’ll guide you through it step by step. By the end, it will all make perfect sense!
If you’re not familiar with value functions, make sure to check out this first: Our Neat Value Function.

🧮Bellman Equations

Our dear friend Bellman took one look at our value function and said, ‘Nah, this isn’t perfect.’ He wanted to enhance and simplify it! So, grab your paper and pen (or your iPad), and let’s have some fun with a little math!

Value Function using Return (step 1), Source: Image by the author

Here we are, starting with our neat value function! Now, let’s break it down a little:

Expanded Value Function (step 2), Source: Image by the author

Easy, we just substitute Rₜ with its definition. We're using the infinite-horizon sum here because it's the most common case, but if the task is episodic, you can replace it with the finite limit we discussed earlier.
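
If you'd like the symbols in plain text, here is a LaTeX sketch of what steps 1 and 2 most likely show, using the standard notation the article follows (Vπ is the value of state s under policy π; the images remain the authoritative version):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s \right]
           = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+1+k} \,\middle|\, s_t = s \right]
```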

Expanded Value Function (step 3), Source: Image by the author

Here, we take the first term of the return summation out: rₜ₊₁.
Now, instead of indexing rewards from (t+1)+k, the remaining summation indexes them from (t+2)+k, skipping the first term. But because k still starts at 0, every remaining reward would lose one power of the discount factor, so we fix this by multiplying the entire summation by γ. If this feels unclear, try expanding a few terms by hand; it will click!
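
As a plain-text sketch of step 3 (same notation assumptions as above):

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+2+k} \,\middle|\, s_t = s \right]
           = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma R_{t+1} \mid s_t = s \right]
```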

-But wait, why can’t we just start k from 1 instead of 0 and avoid all this?

You’re absolutely right! We could. But this is where the math trickery comes in. Writing it this way will do the magic later, when we build up to the Bellman equation. Trust the process!

Now we will break down the expectation. Our value function is the expectation of the return, which is a random variable. This expectation accounts for all sources of randomness, since the return depends on the probabilities of actions, states, and rewards (remember, we are considering the general case where the environment is stochastic).

Expectation Definition, Source: Image by the author

Just dropping it here in case you need a refresher.

Expanded Value Function (step 4), Source: Image by the author

First, we handle action randomness. To do this, we expand the expectation over the probability of actions. And what provides us with these action probabilities? The policy function π!
So, to account for the randomness in action selection, we average over all possible actions, weighted by their probability under the policy π. That’s why we sum (loop) over all actions using the policy function.
Finally, since actions are explicitly considered in this expansion, we also need to condition on them; hence, we include aₜ=a in the expectation.
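
In symbols, step 4 likely reads (a LaTeX sketch, not the original image):

```latex
V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, \mathbb{E}\left[ r_{t+1} + \gamma R_{t+1} \mid s_t = s,\, a_t = a \right]
```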

Expanded Value Function (step 5), Source: Image by the author

Second, we handle state randomness. To do this, we expand the expectation over the probability of transitions. And yes, we obtain this from our transition function, smarty! Remember, it was an essential element of an MDP, originally written as P(s′ | s, a). Here, we are using the shorthand notation Pᵃₛₛ′, which represents the probability of transitioning from state s to state s′ when taking action a.
So, to account for the randomness in state transitions, we average over all possible next states, weighted by their transition probabilities for a given action a. That’s why we sum (loop) over all transitions. Notice that this results in a double summation: for each action (from the previous step), we also loop over all possible next states.

Finally, since transitions are explicitly considered in this expansion, we also need to condition on them; hence, we include sₜ₊₁=s′ in the expectation.
The second equation is identical but uses a shorthand for the conditions to make it cleaner.
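
A LaTeX sketch of step 5, with the double summation made explicit:

```latex
V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \sum_{s' \in S} P^{a}_{ss'}\, \mathbb{E}\left[ r_{t+1} + \gamma R_{t+1} \mid s_t = s,\, a_t = a,\, s_{t+1} = s' \right]
```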

Expanded Value Function (step 6), Source: Image by the author

Now we’re done handling action and state randomness, leaving only rewards. We won’t expand rewards further here; you’ll see why later.

In the first line, we simply distribute the expectation over the two terms using the linearity of expectation.
If you look closely at the first term, you'll notice it matches the reward element from the MDP definition: E[rₜ₊₁ | sₜ, aₜ, sₜ₊₁]. In the second line, we use the shorthand notation Rᵃₛₛ′, which represents the expected reward when transitioning from state s to state s′ after taking action a.
Finally, in the second term, we take out the discount factor γ from the expectation using the scalar multiplication rule.
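
Putting those moves together, step 6 should look roughly like this in LaTeX:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'} \left( R^{a}_{ss'} + \gamma\, \mathbb{E}\left[ R_{t+1} \mid s_{t+1} = s' \right] \right)
```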

If the final line doesn’t click right away, take a moment to review it again, or drop a question in the comments. We still have to add one last touch!

Expanded Value Function (step 7), Source: Image by the author

Our last step: see the part inside the yellow box? Does it look familiar? Yes, it’s almost identical to the value function we started with, but now it’s for the next state sₜ₊₁. It’s like we’re calling the value function on the subsequent state s′. This is exactly what we call a recursive relation, where the function is defined in terms of itself.
And voila! That’s the Bellman equation!
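
Written out, the Bellman expectation equation for the state-value function is (sketched in the article's notation):

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'} \left( R^{a}_{ss'} + \gamma\, V^{\pi}(s') \right)
```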

For the action-value function Q(s, a), there's only one difference: we've already fixed the first action a. Now, from the next state s′, we continue following the policy:

Bellman Equation for Action-value Function, Source: Image by the author

The policy appears inside the summation because, after reaching s′, we still need to average over the next action a′ chosen by the policy.
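
In LaTeX form, the action-value version likely reads:

```latex
Q^{\pi}(s, a) = \sum_{s'} P^{a}_{ss'} \left( R^{a}_{ss'} + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a') \right)
```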

The Bellman equation says that the value of a state is equal to:

  • The immediate reward we expect to get now, plus
  • The discounted value of the next state we will end up in.

In other words, it breaks down the long-term return into “reward now” + “future rewards later”, where the future rewards are themselves computed using the same value function recursively. This recursive nature is what makes it so powerful: instead of looking infinitely ahead, we can express value in terms of one step ahead plus recursion, which simplifies learning and computation in reinforcement learning.

📈🧮Optimal Bellman Equations

Yes, we’ve made it to the Bellman equation, but we’re not done yet! There’s another form called the optimal Bellman equation, which we use to determine the optimal policy.

-How do we compare policies?

The policy that yields the greatest value function (expected return) in every state is considered the optimal policy, and there is always at least one policy that is better than or equal to all others. We denote this optimal policy as π* and its corresponding optimal value function as V*(s), defined as:

Optimal Value Function, Source: Image by the author

– The optimal state-value function V*(s) is the maximum value achievable over all possible policies π, for every state s in the state space S.
– The optimal action-value function Q*(s, a) is the maximum value achievable over all possible policies π, for every state s in the state space S and every action a in the action space A.
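
In symbols, these two definitions are simply (a LaTeX sketch of the statements above):

```latex
V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)
```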

For the Bellman equation, there’s only one tweak we need to make to turn it into the optimal Bellman equation:

Optimal Bellman Equation, Source: Image by the author

The optimal value of state s is equal to the maximum expected return achievable by choosing the best action and following it thereafter.

So, instead of averaging over actions with a policy π, we now pick the single best action that gives the maximum expected return. Here, we only care about the expected return for that best action, rather than considering all actions. Simple and straightforward, right?
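
So the optimal Bellman equation for the state-value function, sketched in LaTeX, is:

```latex
V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \left( R^{a}_{ss'} + \gamma\, V^{*}(s') \right)
```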

For Q*(s, a), the summation over the policy's next-action probabilities is likewise replaced with a maximization over the next action a′:

Optimal Bellman Equation for Action-value Function, Source: Image by the author

The beauty of the optimal Bellman equation is that it naturally leads to the optimal policy. By taking the max over actions, it directly tells us the best action in each state without needing to search through all policies.
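
A LaTeX sketch of the action-value version, together with the greedy policy it yields (this last line just restates "pick the best action in each state"):

```latex
Q^{*}(s, a) = \sum_{s'} P^{a}_{ss'} \left( R^{a}_{ss'} + \gamma \max_{a'} Q^{*}(s', a') \right), \qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```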

📏🆚✏️Optimality vs Approximation

The optimal Bellman equation defines the true optimal value function by assuming we can perfectly compute the maximum expected return over all actions and states. However, in practice, solving it exactly is often infeasible for large or continuous state spaces. This is where approximation comes in: instead of computing exact values, we estimate them. While this introduces some error, it allows us to efficiently learn near-optimal policies even in complex environments where exact solutions are impossible.

✅ What We’ve Learned…

This was a lot to take in, but look at what you’ve accomplished!

  • You learned how to express value functions using expectations over actions, states, and rewards in stochastic environments.
  • We broke down the math step by step to derive the Bellman expectation equations.
  • We then moved to the optimal Bellman equations, which naturally lead us to the optimal policy π*.

And of course, you’ve now seen all four Bellman equations, the foundation of many reinforcement learning algorithms!

The Four Bellman Equations, Source: Image by the author
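
To make the four equations concrete, here is a minimal, hypothetical Python sketch (not from the article or the upcoming tutorial) that implements each one-step backup on a tiny made-up two-state MDP; the dynamics P, rewards R, and policy pi are invented purely for illustration:

```python
# A minimal, hypothetical sketch: the four one-step Bellman backups on a tiny
# hand-made MDP with two states and two actions. P[s][a][s2] is the transition
# probability and R[s][a][s2] the expected reward for s --a--> s2; both are
# invented only for illustration.
gamma = 0.9
states, actions = [0, 1], [0, 1]

P = {0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}}
R = {0: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}},
     1: {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 0.5}}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # a fixed stochastic policy

def bellman_v(V, s):
    """Bellman expectation backup for V_pi(s): average over actions, then next states."""
    return sum(pi[s][a] * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                              for s2 in states)
               for a in actions)

def bellman_q(Q, s, a):
    """Bellman expectation backup for Q_pi(s, a): the policy averages the next action."""
    return sum(P[s][a][s2] * (R[s][a][s2] + gamma * sum(pi[s2][a2] * Q[s2][a2]
                                                        for a2 in actions))
               for s2 in states)

def bellman_v_opt(V, s):
    """Optimal Bellman backup for V*(s): max over actions instead of averaging."""
    return max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
               for a in actions)

def bellman_q_opt(Q, s, a):
    """Optimal Bellman backup for Q*(s, a): max over the next action a2."""
    return sum(P[s][a][s2] * (R[s][a][s2] + gamma * max(Q[s2][a2] for a2 in actions))
               for s2 in states)

# Repeatedly applying a backup to every entry converges to the corresponding
# value function, e.g. value iteration with the optimal state-value backup:
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: bellman_v_opt(V, s) for s in states}
print(V)  # approximate V*(s) for this toy MDP
```

Iterating the expectation backups evaluates a fixed policy, while iterating the optimal backups is the value-iteration idea explored in the dynamic programming tutorial linked below.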

If you’ve followed along and feel confident, give yourself some credit. This was a deep dive into the core mathematics of RL. Next up, we’ll put these equations to use and finally bring our agent to life!

Thank you all for taking the time to read this, and see you in the next tutorial!

👉Next up: See it all in action!

The Clever Way to Calculate Values, Bellman’s “Secret”

Tutorial-6: This time, we’ll update our values as the agent moves through the maze, using Bellman’s so-called “secret”

pub.towardsai.net

👉 Then, take it a step further and see how RL problems are solved!

Dynamic Programming in Reinforcement Learning

Our First Approach to Solving Reinforcement Learning Problems!

pub.towardsai.net

As always… stay curious, stay coding, and stay tuned!

📚References:

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
