
The Clever Way to Calculate Values, Bellman’s “Secret”
Author(s): Rem E
Originally published on Towards AI.
Tutorial 6: This time, we’ll update our values as the agent moves through the maze, using Bellman’s so-called “secret”
I know the Bellman equations aren’t really a secret, but few people truly know how to use them. Do you want to be one of them?
This tutorial builds directly on Tutorial 5, so check that out first if you haven’t already!
And if you’re new to value functions or the Bellman equation, be sure to read Why Is the Bellman Equation So Powerful in RL? before diving in.
🔍 What You’ll Learn
We’re moving on from the old, boring Monte Carlo style and listening to Bellman’s advice on how to update our values as the agent moves.
In this tutorial, we’ll modify the agent code so it updates values step-by-step during the agent’s journey through the maze, just like a real RL agent interacting with its environment.
Since we haven’t reached the learning algorithms yet (again), we’ll still use the manual trajectories we defined earlier, but improve how the value and update functions work to reflect this online updating process.
🛠️Project Setup
The code for this tutorial is available in this GitHub repository.
If you haven’t already, follow the instructions in the README.md file to get started.
If you’ve already cloned the repo, make sure to pull the latest changes to access the new tutorial (tutorial-6).
Once everything is set up, you’ll notice it follows the same folder structure as tutorial-5:

🌊Before You Dive In…
In the theoretical part of this tutorial (check the intro!), we explained the Bellman equation for the general case with stochastic environments.
But you know our maze problem is deterministic in its transitions, actions, and rewards.
So in this tutorial, we’re going to use a simplified Bellman formula that fits deterministic cases perfectly:
V(s) = r + γ · V(s')
-What, you deleted everything we learned?
Yeah, unfortunately, our example is too simple for the fancy full Bellman equation.
Since actions and transitions are deterministic, we don’t need those summations over actions and next states anymore. We just add the immediate reward to the value of the next state multiplied by gamma.
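If you like seeing that reduction written out, here is a quick sketch in standard notation (your intro article may write it slightly differently; π is the policy and p the transition function, and both collapse to single deterministic choices in our maze):

$$
V_\pi(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V_\pi(s') \,\bigr]
\quad\xrightarrow{\ \text{deterministic}\ }\quad
V(s) \;=\; r + \gamma\, V(s')
$$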
Yes, we’re going to use gamma here!
But don’t be sad, we’re still using the heart of Bellman’s equation: recursion!
And I promise you, once our poor robot learns to navigate on its own, we’ll explore a stochastic example to fully capture the beauty of Bellman equations and related concepts in stochastic environments.
That’s a promise!
🤖Agent Class Implementation
class Agent:
    def __init__(self, env: gym.Env):
        self.env = env
        self.gamma = self.env.gamma  # discount factor comes from the environment
        self.i = 0
        self.j = 0
        self.values = np.full((env.size, env.size), np.nan, dtype=float)  # NaN means "not visited yet"
That’s all we need for now! We get the discount factor gamma from the environment, so we can use it here.
    def _V(self, s):
        v = self.values[s[0], s[1]]
        if np.isnan(v):
            return 0.0  # unvisited state: fall back to a value of 0
        return v
The _V() function is a simple way to access the value for a state. If it’s the first time visiting that state (value == NaN), we return 0 so the calculations can work smoothly.
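To see the constructor and _V() together, here’s a tiny sketch, assuming the Agent class above is importable (with numpy and gym already imported, as in the repo). SimpleNamespace is just a hypothetical stand-in for the repo’s maze environment, exposing only the two attributes the constructor reads, and the size 5 is an arbitrary pick:

from types import SimpleNamespace
import numpy as np

# Hypothetical stand-in env: only `size` and `gamma` are needed here.
dummy_env = SimpleNamespace(size=5, gamma=0.9)

agent = Agent(dummy_env)
print(agent.values.shape)        # (5, 5), all NaN at the start
print(agent._V((0, 0)))          # 0.0 -> unvisited states fall back to zero

agent.values[0, 0] = -1.5        # pretend this state has been updated once
print(agent._V((0, 0)))          # -1.5 -> stored values are returned as-is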
    def update(self, s, a, r, nxt_s, over):
        # simplified Bellman update: V(s) = r + gamma * V(next state)
        self.values[s[0], s[1]] = r + self.gamma * self._V(nxt_s)
Here’s where we actually use the Bellman equation!
Notice that the update() function is called after each step. This line simply sets the value of the current state to the immediate reward r plus the discounted value of the next state.
This matches exactly the simplified Bellman formula we introduced earlier (check Before You Dive In).
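If you’re curious what “called after each step” looks like in the driver code, here is a rough sketch of the kind of loop the repo runs. This is an assumption on my part, modeled on a Gymnasium-style step API and the manual trajectories from Tutorial 5, not a copy of the repo’s code (names like manual_trajectory are hypothetical):

# Hypothetical driver loop: the real one lives in the repo (unchanged
# since Tutorial 5); this just illustrates *when* update() is called.
state, _ = env.reset()
for action in manual_trajectory:                      # hand-written action list
    next_state, reward, terminated, truncated, _ = env.step(action)
    over = terminated or truncated
    agent.update(state, action, reward, next_state, over)  # Bellman backup right away
    state = next_state
    if over:
        break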
And that’s it! We’re done. The rest of the code stays the same as in the previous tutorial.
Now, head over to the tutorial directory, run the code, and watch how the values update as the agent moves!
Notice how the reward propagates more slowly here compared to the Monte Carlo method. That’s because this style updates values step-by-step as the agent moves, rather than waiting until the end of an episode.
It’s more realistic since it mimics the way real agents (and even animals!) learn from experience, updating their understanding continuously as they go, not just after the whole journey finishes.
This on-the-fly updating is a key building block for many powerful RL algorithms to come!
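If you like seeing that difference as formulas, here it is in standard notation (just for reference, nothing new is introduced here):

$$
V(s_t) \leftarrow G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots \qquad \text{(Monte Carlo: target known only at the end of the episode)}
$$

$$
V(s_t) \leftarrow r_{t+1} + \gamma\, V(s_{t+1}) \qquad \text{(one-step Bellman backup: target available after every single step)}
$$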
✅ What We’ve Learned…
A tiny but powerful tutorial, all thanks to the Bellman equation, making our lives easier!
We learned how to improve our agent’s code so it updates state values step-by-step using the Bellman equation. And yeah, that’s it for today, see you next time!
👉 In the next tutorial, we’ll finally apply our very first method to solve an RL problem: Dynamic Programming!
Watch Our Agent Learn: Tutorial 7, Implementing Dynamic Programming for our maze problem (pub.towardsai.net)
✨ As always… stay curious, stay coding, and stay tuned!