
Implementing the Value Function the Monte Carlo Way
Author(s): Rem E
Originally published on Towards AI.
Tutorial 5: In this tutorial, we’ll see in action how returns and state values are calculated using the Monte Carlo style
This tutorial builds directly on Tutorial 4, so make sure to check that out first if you haven’t already!
And if you’re new to Reinforcement Learning or Value Functions, be sure to read Our Neat Value Function before diving in.
🔍What You’ll Learn
We’ll continue working on our maze problem using the RL framework we introduced earlier. This time, the focus shifts to the agent’s side; you’ll see exactly how returns and state values are calculated in code using the Monte Carlo approach!
🛠️Project Setup
The code for this tutorial is available in this GitHub repository.
If you haven’t already, follow the instructions in the README.md file to get started.
If you’ve already cloned the repo, make sure to pull the latest changes to access the new tutorial (tutorial-5).
Once everything is set up, you’ll notice it follows the same folder structure as tutorial-4:

We’ve re-added the images because we’ll be running the code in this tutorial.
🌊Before You Dive In…
In this tutorial, we’re going to calculate and update returns and state values after the episode finishes. This approach, where values are only updated at the end of an episode, is known as the Monte Carlo style. (We’ll dive even deeper into Monte Carlo methods later on!)
But just a heads-up: in real-world reinforcement learning, this isn’t the most commonly used method. Instead, values are typically updated continuously during the episode, a topic we’ll explore in upcoming tutorials.
So hold on to that curiosity for later. For now, let’s focus on bringing our value function to life!
Last Note: Our environment is deterministic, so instead of using the expected return, we can directly use the average return from actual episodes.
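To make that concrete: in standard notation, the value of a state is the expected return from that state, and the Monte Carlo estimate we’ll implement below is simply the average of the returns actually observed from it,

$$V(s) = \mathbb{E}\left[\, G_t \mid S_t = s \,\right] \approx \frac{1}{N(s)} \sum_{k=1}^{N(s)} G_k(s),$$

where $G_k(s)$ is the return observed after the k-th visit to state s and N(s) is the number of times s was visited. The returns, counts, and values matrices we set up below track exactly these quantities (the summed returns, the visit counts, and their ratio).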
🤖Agent Class Implementation
Go to the Agent class and let’s get started!
# Assumed imports (the repo presumably uses these aliases):
import gymnasium as gym
import numpy as np

class Agent:
    def __init__(self, env: gym.Env):
        self.env = env
        self.i = 0
        self.j = 0
        self.episode = []
        self.returns = np.full((env.size, env.size), 0, dtype=float)
        self.counts = np.full((env.size, env.size), 0, dtype=float)
        self.values = np.full((env.size, env.size), np.nan, dtype=float)
In the __init__ function, we initialize all the variables we’ll need to calculate returns and values the Monte Carlo way:
- env: The environment is injected here so we can access its attributes.
- i: A time-step counter used during each episode.
- j: An episode counter, useful for tracking which trajectory is currently running.
- episode: A list that stores state–reward pairs for the current episode. Once the episode ends, we’ll use this to go backward and calculate returns.
- returns: A 2D matrix (same size as the maze) initialized to 0. It accumulates the total return for each state.
- counts: Also a 2D matrix, tracking how many times each state has been visited. This will help us average the returns properly.
- values: A 2D matrix initialized with np.nan. It holds the actual estimated state values, which we’ll compute as the average of returns.
def _policy(self, _s):
    trajectories = [
        [2, 2, 2, 2, 0, 0, 0, 0, 3, 3, 0, 0, 0, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 0, 0, 3, 3, 0, 0, 3, 3, 0, 0, 2, 2, 2, 2, 2, 0],
        [2, 2, 2, 2, 0, 0, 3, 3, 3, 0, 0, 2, 0, 0, 0, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 1, 1, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3, 3, 3, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0],
        [3, 3],
    ]
    action = trajectories[self.j][self.i]
    if self.i < len(trajectories[self.j]) - 1:
        self.i += 1
    else:
        self.i = 0
        self.j += 1
    return action

def get_action(self, s):
    return self._policy(s)
In this tutorial, we won’t use a random policy because it might run indefinitely, and honestly, we don’t have the time to wait for that.
We won’t use greedy or softmax policies either, since they typically depend on value updates during the episode; but in the Monte Carlo style, value updates only happen after the episode ends. We’re still not doing any learning yet.
In the _policy() function, we define 4 fixed paths (trajectories) manually. The agent will follow these predefined trajectories to reach the goal, so we can calculate the value of each visited state. trajectories is a 2D list, where each row represents one full episode (a trajectory). Each number inside a row represents the action the agent should take at that step to reach the goal. I manually chose these action sequences to ensure the agent reaches the goal efficiently.
We use two counters:
- i (step counter): incremented at every step and reset when the episode ends.
- j (episode counter): incremented after every episode.
So, in get_action(), we simply call _policy() to return the next action based on the current trajectory j and step i.
Note: The last trajectory, [3, 3], is just a dummy. It’s only there to give us an extra frame in which to observe the final updates; the agent won’t actually move, because it tries to go right into a wall.
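If the two-counter bookkeeping feels abstract, here is a minimal, standalone sketch of the same lookup pattern. The trajectories in it are made up and much shorter than the real ones; only the way i and j walk through the list matters:

```python
# Standalone sketch of the two-counter lookup used in _policy().
# These trajectories are hypothetical and much shorter than the real ones.
trajectories = [
    [2, 2, 0],  # toy "episode" 0: three example actions
    [3, 3],     # toy "episode" 1: two example actions
]

i, j = 0, 0  # step counter, episode counter
for _ in range(5):  # 5 steps walk through both toy episodes exactly
    action = trajectories[j][i]
    print(f"episode {j}, step {i} -> action {action}")
    if i < len(trajectories[j]) - 1:
        i += 1
    else:        # last step of the current trajectory:
        i = 0    # reset the step counter
        j += 1   # move on to the next trajectory
```

In the real Agent, get_action() performs exactly this lookup once per environment step.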
def update(self, s, a, r, nxt_s, over):
    self.episode.append((s, r))
    if over:
        self._V()
In the update() function, we’re not updating at every step; instead, we’re collecting the experience so we can update after the episode finishes.
At each step, we store the state s and the reward r in the episode list. This will allow us to go backward through the episode later to compute the returns. When the episode ends (over == True), we call self._V() to calculate the returns and values for all the visited states in the episode.
This design is essential to Monte Carlo methods: value updates are only performed after the full episode is complete; no updates happen during the episode.
def get_values(self):
    return self.values
Although the values are stored inside the Agent, our render function lives inside the Environment.
To solve this, we inject the values into the environment when needed, using this function. So, this function just gives external access to the learned values for rendering purposes only.
def _V(self, _s=None):
    R = 0
    for s, r in reversed(self.episode):
        x, y = s
        R += r
        self.returns[x][y] += R
        self.counts[x][y] += 1
        self.values[x][y] = self.returns[x][y] / self.counts[x][y]
    self.episode = []
Now we reach the heart of this tutorial: the _V() method, where we actually calculate state values. Let’s break it down:
R = 0: This is our accumulated return, initialized to zero before we start walking backward through the episode.
We loop backward through the recorded episode with reversed(episode). For each (state, reward) pair:
- Accumulate the reward into R.
- Add R to the returns matrix for that state.
- Increment the visit count for that state.
- Average all returns so far to get the current estimated value of that state.
Finally, we reset the episode memory.
If this algorithm still feels unclear, check the animation below for a visual walkthrough example of how values are updated over time!

Here, the trajectory represents state visits, and each update corresponds to the reward received when walking backwards from the end of the episode.
- The final state in the trajectory is state 0 (the goal), which gives us a return of 24, so its value is also 24.
- Moving backwards to state 1, we accumulate a reward of -1, so the return becomes 23. Since this is the first visit to state 1, the count is 1, so the value becomes 23 / 1 = 23.
- Then we go to state 2, again with a reward of -1, so the return is now 22.
- Finally, we go back to state 1 again. Here’s where it gets interesting:
- The state was visited again, so its count increases to 2.
- The return is now 21, calculated from the previous return for state 2 (22 − 1), so the value of state 1 becomes the average of its two returns: (23 + 21) / 2 = 22.
This makes sense: if you trace the full return from the end, 24 − 1 − 1 − 1 = 21, you arrive at the same return value.
But here, we’re using a single variable R to accumulate rewards step-by-step in reverse, updating the return for each visited state as we move backward.
This is a great example of how multiple experiences of the same state can affect its estimated value. Even though this example uses only one episode, it demonstrates how Monte Carlo methods allow agents to average over different experiences to form better value estimates.
At first glance, it might seem noisy, but when the agent experiences thousands of episodes, these averages begin to truly reflect the value of each state.
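To make the arithmetic concrete, here is a small, self-contained sketch that replays that same walkthrough episode (states 1 → 2 → 1 → goal state 0, a step reward of -1, and a goal reward of 24) using the same backward pass as _V(), just with plain integers for states instead of maze coordinates:

```python
# Replaying the walkthrough above with the same backward pass as _V().
# States are plain integers here (0 is the goal), not maze coordinates.
episode = [(1, -1), (2, -1), (1, -1), (0, 24)]  # (state, reward) in visit order

returns, counts, values = {}, {}, {}
R = 0
for s, r in reversed(episode):
    R += r                               # accumulated return
    returns[s] = returns.get(s, 0) + R   # total return collected for this state
    counts[s] = counts.get(s, 0) + 1     # number of visits
    values[s] = returns[s] / counts[s]   # average return = value estimate

print(values)  # {0: 24.0, 1: 22.0, 2: 22.0}
```

State 1 was visited twice, so its final value is the average of its two returns, (23 + 21) / 2 = 22, matching the walkthrough.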
▶️Run Code
env = gymnasium.make("env/Maze-v0", render_mode="human", reg_r=-1)
agent = Agent(env.unwrapped)
values = agent.get_values()
env.unwrapped.set_values(values)
total_episodes = 5
for ep in range(total_episodes):
    state, _ = env.reset(seed=1221)
    episode_over = False
    t = 0
    while not episode_over:
        action = agent.get_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode_over = terminated or truncated
        agent.update(state['agent'], action, reward, next_state['agent'], episode_over)
        state = next_state
        t += 1
    print(f"Episode {ep + 1} finished")
env.close()
Same run as before, but this time we seeded the environment with seed=1221 to ensure the maze is identical on each run. We also set the regular step reward to -1.

As you observe these 4 trajectories, pay attention to how the state values evolve as each episode completes. If it’s hard to spot the changes, try slowing down the FPS (in the Env class); you’ll notice how the values get updated.
In the last trajectory, our poor robot got a bit lost, which caused the returns (and hence the values) of the visited states to drop significantly.
But that’s the point; with thousands of episodes, these values begin to stabilize, and the averages start to reflect the true long-term value of each state!
🎯 Initial Values Matter!
One last thing: try changing the reg_r (regular reward) to 0 and observe how the values change.

What happened? Why are all the values 24?
Well, here’s why:
When we set reg_r = -1, we’re essentially penalizing every step. It’s like telling the agent:
“You’re losing points for every move unless you reach the goal!”
That forces the agent to find the shortest path, minimizing the penalty.
But when we set reg_r = 0, we’re saying:
“Take as many steps as you want, it doesn’t matter; just reach the goal!”
So every state that eventually leads to the goal ends up with the same return: 24 (the reward at the goal), since all paths lead to the goal!
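Here’s a quick back-of-the-envelope check, using the same bookkeeping convention as the walkthrough earlier (the goal contributes 24 and every earlier step contributes reg_r): a state visited k steps before the end of the episode receives the return G = 24 + k × reg_r. With reg_r = -1 this gives 24, 23, 22, … as we move away from the goal, so states on shorter paths score higher; with reg_r = 0 every visited state gets exactly 24, which is what you see above.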
Isn’t it cool how math calculations directly reflect these real-world scenarios?
This is why defining your RL problem carefully is super important; those initial settings (like rewards) shape how your agent learns and behaves.
So next time you’re designing a reward system, think about what behavior you’re encouraging. It might surprise you!
✅ What We’ve Learned…
This tutorial was a bit long, but pretty straightforward (hopefully!)
We learned how to calculate returns and state values in our maze problem by setting manual trajectories for the agent to follow. After each episode finished, we applied Monte Carlo-style updates to show how the values evolve.
We also explored a visual example of how value calculations happen step by step, and finally saw how initial values and reward settings can significantly affect the agent’s behavior. This highlights just how important the problem definition is in reinforcement learning. We didn’t include gamma just yet, to keep the code simple, but next time, we’ll explore a clever way to introduce the discount factor and more!
Check out the new tutorial on how we can improve the value function even more:
The Clever Way to Calculate Values, Bellman’s “Secret”
Tutorial-6: This time, we’ll update our values as the agent moves through the maze, using Bellman’s so-called “secret”
pub.towardsai.net
✨ As always… stay curious, stay coding, and stay tuned!
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.