
The Whole Story of MDP in RL
Author(s): Rem E
Originally published on Towards AI.
I’ve mentioned MDP (Markov Decision Process) several times, and it frequently appears in RL. But what exactly is an MDP, and why is it so important in RL?
We’ll explore that together in this article! But first, if you’re new to RL and want to understand its basics, check out The Building Blocks of RL.
🕰 Markov Property
The Markov Property is the assumption that the prediction of the next state and reward depends only on the current state and action, not on the full history of past states and actions. This is also known as independence of path: the entire history (path) does not influence the transition and reward probabilities; only the present matters. To formalize this assumption, we use conditional probability to show that the prediction remains the same whether or not we include previous states and actions. If the new state and reward are independent of the history given the current state and action, their conditional probabilities remain unchanged.

The probability of the next state sₜ₊₁ and next reward rₜ₊₁ given the current state sₜ and current action aₜ is the same as the probability given the entire history of states and actions up to time t.
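In symbols, this is the condition (the standard formulation, as in Sutton & Barto, 2018):

p(sₜ₊₁, rₜ₊₁ | sₜ, aₜ) = p(sₜ₊₁, rₜ₊₁ | s₀, a₀, s₁, a₁, …, sₜ, aₜ)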
This simply means that including the history of previous states and actions won't change the probability; the prediction is independent of the past. In other words:
The future is independent of the past, given the present.
To satisfy the Markov property, the state must retain all relevant information needed to make decisions. In other words, the state should be a complete summary of the environment’s status.
But do we always have to follow this assumption strictly?
Not necessarily. In many real-world situations, the environment is partially observable, and the current state may not fully capture all the information needed for optimal decisions. Still, it’s common to approximate the Markov property when designing reinforcement learning solutions. This lets us model the problem as a Markov Decision Process (MDP), which is mathematically convenient and widely supported.
However, when the Markov assumption isn’t appropriate, other techniques can help the agent make better decisions.
🗺️Markov Decision Process
An RL task that satisfies the Markov property is called a Markov Decision Process (MDP). Most of the core ideas behind MDPs are things we’ve already talked about; it’s simply a formal way to model an RL problem.
You can think of it as:
- Environment (Problem)
- Agent (Solution)
We’ve already discussed this informally in the RL framework, but now let’s define it formally!

An MDP is a five-tuple: <S, A, P, R, γ>
S = State Space.
A = Action Space.
P = The transition probabilities: the probability of moving to the next state sₜ₊₁ given the current state sₜ and the current action aₜ.
R = The reward function: the expected immediate reward rₜ₊₁ given the current state sₜ, the current action aₜ, and the next state sₜ₊₁.
γ = The discount factor.
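To make the five-tuple concrete, here’s a minimal sketch of how it could be represented as a container in Python (the names and types are purely illustrative, not part of the tutorial code):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List

State = Hashable   # e.g., a grid position
Action = Hashable  # e.g., "up", "down", "left", "right"

@dataclass
class MDP:
    states: List[State]                               # S: the state space
    actions: List[Action]                             # A: the action space
    P: Callable[[State, Action], Dict[State, float]]  # P: maps (s, a) to a distribution over next states
    R: Callable[[State, Action, State], float]        # R: expected immediate reward for (s, a, s')
    gamma: float                                      # γ: the discount factor, usually in [0, 1]
```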
The state space and the action space are concepts we’ve already covered; refer to the first part of How Does an RL Agent Actually Decide What to Do? One important note: actions are considered part of the agent, not the environment. Although an MDP is a formal model of the environment, it defines which actions can be performed within that environment; it’s the agent that is responsible for choosing and executing those actions.
Remember when we mentioned that the environment has transition dynamics? This refers to how transitions between states occur. Sometimes, even if we perform the same action in the same state, the environment might transition us to different next states. Such an environment is called stochastic. Our maze example’s environment is deterministic because every next state is fully determined by the current state and action.
Deterministic environment: The same action in the same state always leads to the same next state and reward.
Stochastic environment: The same action in the same state can lead to different next states or rewards, based on probabilities.
The same concept applies to rewards: they can also be stochastic. This means that performing the same action in the same state can produce different rewards.
Why do we use probabilities for transitions but expected values for rewards, even though both are considered stochastic? Simply because we are answering two different questions:
- For transitions: Where will I land next?
- For rewards: How much reward can I expect to get on average if this transition happens (given the current state and action)?
Transitions are discrete events, so we describe each possible outcome with a probability; rewards are real-valued numbers, so we summarize them with their average, the expected value.
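For example, using numbers we’ll reuse in the vacuum example below: if a transition pays +10 with probability 0.7 and +8 with probability 0.3, the reward function stores the single expected value 0.7 × 10 + 0.3 × 8 = 9.4, while the transition function keeps the full probability of each possible next state.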
Lastly, the discount factor γ is the only addition we make to the MDP definition in this framework. It determines how future rewards are valued relative to immediate rewards and will be used later when we discuss the solution.
📊MDP Example:
🌀Maze Problem:
Our maze problem is already an MDP, where:
- States: The agent’s positions on the grid.
- Actions: The agent’s movements (left, right, up, down).
- Transition dynamics: (Deterministic) The next state is computed by adding the movement direction (action) to the agent’s position (state).
- Rewards: (Deterministic) 0 for all states and 1 for reaching the target.
- Discount factor γ: not used yet; it will come into play when we discuss the solution.
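As a rough sketch (not the actual tutorial code), the deterministic dynamics could look like this, assuming states are (row, col) grid positions, a hypothetical 4×4 grid, and an illustrative GOAL cell:

```python
# Illustrative maze dynamics: deterministic transitions and rewards.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GRID_SIZE = 4      # assumed 4x4 grid for illustration
GOAL = (3, 3)      # assumed target cell

def transition(state, action):
    """Deterministic: the same (state, action) always gives the same next state."""
    dr, dc = ACTIONS[action]
    row, col = state[0] + dr, state[1] + dc
    # Stay in place if the move would leave the grid (walls are ignored here).
    if not (0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE):
        return state
    return (row, col)

def reward(state, action, next_state):
    """Deterministic: 1 for reaching the target, 0 everywhere else."""
    return 1.0 if next_state == GOAL else 0.0
```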
However, this is a fully deterministic MDP. We need another example: a stochastic MDP!
🧹Vacuum Cleaner Problem:

This is a state machine diagram, commonly used to visualize MDPs. The circles represent states, and the arrows represent actions along with their probabilities.
- States: {Clean room, Dirty room}
- Actions: {Clean (try to clean the room), Move (move to another room)}
- Transition dynamics:
– From Dirty room, action Clean:
80% chance the room becomes Clean (successful cleaning)
20% chance the room stays Dirty (cleaning failed or interrupted)
– From Clean room, action Move:
90% chance the robot moves to a Dirty room
10% chance it stays in the Clean room (got stuck or error)
- Rewards:
– Cleaning a dirty room successfully: +10 with 70% chance, or +8 with 30% chance (cleaning quality varies)
– Attempting to clean a clean room: -1 (deterministic wasted effort)
– Moving between rooms: -0.5 with 80% chance, or -1.0 with 20% chance (varying energy cost)
- Discount factor: 0.95
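Here’s a minimal sketch of how these stochastic dynamics could be encoded and sampled in Python, with the probabilities taken straight from the description above (the reward for a failed cleaning attempt isn’t specified, so 0 is assumed):

```python
import random

# Transition dynamics: (state, action) -> list of (next_state, probability)
P = {
    ("Dirty", "Clean"): [("Clean", 0.8), ("Dirty", 0.2)],
    ("Clean", "Move"):  [("Dirty", 0.9), ("Clean", 0.1)],
}

# Reward dynamics: (state, action, next_state) -> list of (reward, probability)
R = {
    ("Dirty", "Clean", "Clean"): [(10.0, 0.7), (8.0, 0.3)],  # successful cleaning
    ("Dirty", "Clean", "Dirty"): [(0.0, 1.0)],               # failed attempt (assumed 0, not given above)
    ("Clean", "Move", "Dirty"):  [(-0.5, 0.8), (-1.0, 0.2)], # varying energy cost
    ("Clean", "Move", "Clean"):  [(-0.5, 0.8), (-1.0, 0.2)], # got stuck; move cost assumed to apply anyway
}

def step(state, action):
    """Sample a next state and a reward from the stochastic dynamics."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    rewards, reward_probs = zip(*R[(state, action, next_state)])
    reward = random.choices(rewards, weights=reward_probs)[0]
    return next_state, reward

print(step("Dirty", "Clean"))  # e.g., ('Clean', 10.0) or ('Dirty', 0.0)
```

Calling step repeatedly with the same state and action can return different results, which is exactly what makes this MDP stochastic.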
Hopefully, the difference between deterministic and stochastic MDPs is clear now, as we will build on these concepts in the rest of the theory!
✅ What We’ve Learned…
We explored the foundations of Reinforcement Learning through the lens of Markov Decision Processes (MDPs). We saw that an MDP models an environment where the Markov property holds. We defined MDPs formally as a tuple of states, actions, transition probabilities, rewards, and a discount factor γ. Using two examples, we illustrated how transitions and rewards can be either fixed or probabilistic, highlighting the difference between deterministic and stochastic environments.
So far, all our learning has been about defining the problem, but what about the solution? I think we’re now ready to explore how RL agents smartly solve it. See you next time!
👉Next up: See it all in action!
Implementing MDP for the Maze Problem
Tutorial 4: We’ll modify our previous maze code to fit the RL framework and match the definition of an MDP.
👉 Then, level it up:
Our Neat Value Function
Let’s take it a step further and refine our value function to better guide the agent toward its goal!
✨ As always… stay curious, stay coding, and stay tuned!
📚References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Wikipedia contributors. Markov decision process. Wikipedia. From https://en.wikipedia.org/wiki/Markov_decision_process
- Spiceworks Staff. What is a Markov Decision Process? Spiceworks. From https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-markov-decision-process/
Published via Towards AI