
Implementing the Value Function the Monte Carlo Way
Author(s): Rem E
Originally published on Towards AI.
Tutorial 5: In this tutorial, we’ll see in action how returns and state values are calculated using the Monte Carlo style
This tutorial builds directly on Tutorial 4, so make sure to check that out first if you haven’t already!
And if you’re new to Reinforcement Learning or Value Functions, be sure to read Our Neat Value Function before diving in.
🔍What You’ll Learn
We’ll continue working on our maze problem using the RL framework we introduced earlier. This time, the focus shifts to the agent’s side; you’ll see exactly how returns and state values are calculated in code using the Monte Carlo approach!
🛠️Project Setup
The code for this tutorial is available in this GitHub repository.
If you haven’t already, follow the instructions in the README.md file to get started.
If you’ve already cloned the repo, make sure to pull the latest changes to access the new tutorial (tutorial-5).
Once everything is set up, you’ll notice it follows the same folder structure as tutorial-4:

We’ve re-added the images because we’ll be running the code in this tutorial.
🌊Before You Dive In…
In this tutorial, we’re going to calculate and update returns and state values after the episode finishes. This approach, where values are only updated at the end of an episode, is known as the Monte Carlo style. (We’ll dive even deeper into Monte Carlo methods later on!)
But just a heads-up: in real-world reinforcement learning, this isn’t the most commonly used method. Instead, values are typically updated continuously during the episode, a topic we’ll explore in upcoming tutorials.
So hold on to that curiosity for later. For now, let’s focus on bringing our value function to life!
Last Note: Our environment is deterministic, so instead of using the expected return, we can directly use the average return from actual episodes.
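To make that concrete: in standard notation, the value of a state is the expected return from that state, and the Monte Carlo estimate we’ll implement below is simply the average of the returns actually observed from it,

$$V(s) = \mathbb{E}\left[\, G_t \mid S_t = s \,\right] \approx \frac{1}{N(s)} \sum_{k=1}^{N(s)} G_k(s),$$

where $G_k(s)$ is the return observed after the k-th visit to state s and N(s) is the number of times s was visited. The returns, counts, and values matrices we set up below track exactly these quantities (the summed returns, the visit counts, and their ratio).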
🤖Agent Class Implementation
Go to the Agent class and let’s get started!
# Assumed imports (the repo presumably uses these aliases):
import gymnasium as gym
import numpy as np

class Agent:
    def __init__(self, env: gym.Env):
        self.env = env
        self.i = 0
        self.j = 0
        self.episode = []
        self.returns = np.full((env.size, env.size), 0, dtype=float)
        self.counts = np.full((env.size, env.size), 0, dtype=float)
        self.values = np.full((env.size, env.size), np.nan, dtype=float)
In the __init__ function, we initialize all the variables we’ll need to calculate returns and values the Monte Carlo way:
- env: The environment is injected here so we can access its attributes.
- i: A time-step counter used during each episode.
- j: An episode counter, useful for tracking which trajectory is currently running.
- episode: A list that stores state–reward pairs for the current episode. Once the episode ends, we’ll use this to go backward and calculate returns.
- returns: A 2D matrix (same size as the maze) initialized to 0. It accumulates the total return for each state.
- counts: Also a 2D matrix, tracking how many times each state has been visited. This will help us average the returns properly.
- values: A 2D matrix initialized with np.nan. It holds the actual estimated state values, which we’ll compute as the average of returns.
def _policy(self, _s):
    trajectories = [
        [2, 2, 2, 2, 0, 0, 0, 0, 3, 3, 0, 0, 0, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2, 2, 0, 0, 3, 3, 0, 0, 3, 3, 0, 0, 2, 2, 2, 2, 2, 0],
        [2, 2, 2, 2, 0, 0, 3, 3, 3, 0, 0, 2, 0, 0, 0, 2, 2, 2, 2, 2],
        [2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 3, 3, 3, 1, 1, 2, 2, 2, 0, 0, 3, 3, 3, 3, 3, 3, 3, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0],
        [3, 3],
    ]
    action = trajectories[self.j][self.i]
    if self.i < len(trajectories[self.j]) - 1:
        self.i += 1
    else:
        self.i = 0
        self.j += 1
    return action

def get_action(self, s):
    return self._policy(s)
In this tutorial, we won’t use a random policy because it might run indefinitely, and honestly, we don’t have the time to wait for that.
We won’t use greedy or softmax policies either, since they typically depend on value updates during the episode; but in the Monte Carlo style, value updates only happen after the episode ends. We’re still not doing any learning yet.
In the _policy() function, we define 4 fixed paths (trajectories) manually. The agent will follow these predefined trajectories to reach the goal, so we can calculate the value of each visited state. trajectories is a 2D list, where each row represents one full episode (a trajectory). Each number inside a row represents the action the agent should take at that step to reach the goal. I manually chose these action sequences to ensure the agent reaches the goal efficiently.
We use two counters:
- i (step counter): incremented at every step and reset when the episode ends.
- j (episode counter): incremented after every episode.
So, in get_action(), we simply call _policy() to return the next action based on the current trajectory j and step i.
Note: The last trajectory, [3, 3], is just a dummy. It’s only there to give us an extra frame in which to observe the final updates; the agent won’t actually move, because it tries to go right into a wall.
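If the two-counter bookkeeping feels abstract, here is a minimal, standalone sketch of the same lookup pattern. The trajectories in it are made up and much shorter than the real ones; only the way i and j walk through the list matters:

```python
# Standalone sketch of the two-counter lookup used in _policy().
# These trajectories are hypothetical and much shorter than the real ones.
trajectories = [
    [2, 2, 0],  # toy "episode" 0: three example actions
    [3, 3],     # toy "episode" 1: two example actions
]

i, j = 0, 0  # step counter, episode counter
for _ in range(5):  # 5 steps walk through both toy episodes exactly
    action = trajectories[j][i]
    print(f"episode {j}, step {i} -> action {action}")
    if i < len(trajectories[j]) - 1:
        i += 1
    else:        # last step of the current trajectory:
        i = 0    # reset the step counter
        j += 1   # move on to the next trajectory
```

In the real Agent, get_action() performs exactly this lookup once per environment step.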
def update(self, s, a, r, nxt_s, over):
    self.episode.append((s, r))
    if over:
        self._V()
In the update() function, we’re not updating at every step; instead, we’re collecting the experience so we can update after the episode finishes.
At each step, we store the state s and the reward r in the episode list. This will allow us to go backward through the episode later to compute the returns. When the episode ends (over == True), we call self._V() to calculate the returns and values for all the visited states in the episode.
This design is essential to Monte Carlo methods: value updates are only performed after the full episode is complete; no updates happen during the episode.
def get_values(self):
    return self.values
Although the values are stored inside the Agent, our render function lives inside the Environment.
To solve this, we inject the values into the environment when needed, using this function. So, this function just gives external access to the learned values for rendering purposes only.
def _V(self, _s=None):
    R = 0
    for s, r in reversed(self.episode):
        x, y = s
        R += r
        self.returns[x][y] += R
        self.counts[x][y] += 1
        self.values[x][y] = self.returns[x][y] / self.counts[x][y]
    self.episode = []
Now we reach the heart of this tutorial: the _V() method, where we actually calculate state values. Let’s break it down:
R = 0: This is our accumulated return, initialized to zero before we start walking backward through the episode.
We loop backward through the recorded episode with reversed(episode). For each (state, reward) pair:
- Accumulate the reward into R.
- Add R to the returns matrix for that state.
- Increment the visit count for that state.
- Average all returns so far to get the current estimated value of that state.
Finally, we reset the episode memory.
If this algorithm still feels unclear, check the animation below for a visual walkthrough example of how values are updated over time!

Here, the trajectory represents state visits, and each update corresponds to the reward received when walking backwards from the end of the episode.
- The final state in the trajectory is state 0 (the goal), which gives us a return of 24, so its value is also 24.
- Moving backwards to state 1, we accumulate a reward of -1, so the return becomes 23. Since this is the first visit to state 1, the count is 1, so the value becomes 23 / 1 = 23.
- Then we go to state 2, again with a reward of -1, so the return is now 22.
- Finally, we go back to state 1 again. Here’s where it gets interesting:
- The state was visited again, so its count increases to 2.
- The return is now 21, calculated from the previous return for state 2 (22 − 1), so the value of state 1 becomes the average of its two returns: (23 + 21) / 2 = 22.
This makes sense: if you trace the full return from the end, 24 − 1 − 1 − 1 = 21, you arrive at the same return value.
But here, we’re using a single variable R to accumulate rewards step-by-step in reverse, updating the return for each visited state as we move backward.
This is a great example of how multiple experiences of the same state can affect its estimated value. Even though this example uses only one episode, it demonstrates how Monte Carlo methods allow agents to average over different experiences to form better value estimates.
At first glance, it might seem noisy, but when the agent experiences thousands of episodes, these averages begin to truly reflect the value of each state.
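To make the arithmetic concrete, here is a small, self-contained sketch that replays that same walkthrough episode (states 1 → 2 → 1 → goal state 0, a step reward of -1, and a goal reward of 24) using the same backward pass as _V(), just with plain integers for states instead of maze coordinates:

```python
# Replaying the walkthrough above with the same backward pass as _V().
# States are plain integers here (0 is the goal), not maze coordinates.
episode = [(1, -1), (2, -1), (1, -1), (0, 24)]  # (state, reward) in visit order

returns, counts, values = {}, {}, {}
R = 0
for s, r in reversed(episode):
    R += r                               # accumulated return
    returns[s] = returns.get(s, 0) + R   # total return collected for this state
    counts[s] = counts.get(s, 0) + 1     # number of visits
    values[s] = returns[s] / counts[s]   # average return = value estimate

print(values)  # {0: 24.0, 1: 22.0, 2: 22.0}
```

State 1 was visited twice, so its final value is the average of its two returns, (23 + 21) / 2 = 22, matching the walkthrough.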
▶️Run Code
env = gymnasium.make("env/Maze-v0", render_mode="human", reg_r=-1)
agent = Agent(env.unwrapped)
values = agent.get_values()
env.unwrapped.set_values(values)
total_episodes = 5
for ep in range(total_episodes):
    state, _ = env.reset(seed=1221)
    episode_over = False
    t = 0
    while not episode_over:
        action = agent.get_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode_over = terminated or truncated
        agent.update(state['agent'], action, reward, next_state['agent'], episode_over)
        state = next_state
        t += 1
    print(f"Episode {ep + 1} finished")
env.close()
Same run as before, but this time we seeded the environment with seed=1221 to ensure the maze is identical on each run. We also set the regular step reward to -1.

As you observe these 4 trajectories, pay attention to how the state values evolve as each episode completes. If it’s hard to spot the changes, try slowing down the FPS (in the Env class); you’ll notice how the values get updated.
In the last trajectory, our poor robot got a bit lost, which caused the returns (and hence the values) of the visited states to drop significantly.
But that’s the point; with thousands of episodes, these values begin to stabilize, and the averages start to reflect the true long-term value of each state!
🎯 Initial Values Matter!
One last thing: try changing the reg_r (regular reward) to 0 and observe how the values change.

What happened? Why are all the values 24?
Well, here’s why:
When we set reg_r = -1, we’re essentially penalizing every step. It’s like telling the agent:
“You’re losing points for every move unless you reach the goal!”
That forces the agent to find the shortest path, minimizing the penalty.
But when we set reg_r = 0, we’re saying:
“Take as many steps as you want, it doesn’t matter; just reach the goal!”
So every state that eventually leads to the goal ends up with the same return: 24 (the reward at the goal), since all paths lead to the goal!
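Here’s a quick back-of-the-envelope check, using the same bookkeeping convention as the walkthrough earlier (the goal contributes 24 and every earlier step contributes reg_r): a state visited k steps before the end of the episode receives the return G = 24 + k × reg_r. With reg_r = -1 this gives 24, 23, 22, … as we move away from the goal, so states on shorter paths score higher; with reg_r = 0 every visited state gets exactly 24, which is what you see above.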
Isn’t it cool how math calculations directly reflect these real-world scenarios?
This is why defining your RL problem carefully is super important; those initial settings (like rewards) shape how your agent learns and behaves.
So next time you’re designing a reward system, think about what behavior you’re encouraging. It might surprise you!
✅ What We’ve Learned…
This tutorial was a bit long, but pretty straightforward (hopefully!)
We learned how to calculate returns and state values in our maze problem by setting manual trajectories for the agent to follow. After each episode finished, we applied Monte Carlo-style updates to show how the values evolve.
We also explored a visual example of how value calculations happen step by step, and finally saw how initial values and reward settings can significantly affect the agent’s behavior. This highlights just how important the problem definition is in reinforcement learning. We didn’t include gamma just yet, to keep the code simple, but next time, we’ll explore a clever way to introduce the discount factor and more!
Check out the new tutorial on how we can improve the value function even more:
The Clever Way to Calculate Values, Bellman’s “Secret”
Tutorial-6: This time, we’ll update our values as the agent moves through the maze, using Bellman’s so-called “secret”
pub.towardsai.net
✨ As always… stay curious, stay coding, and stay tuned!
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.