
The Whole Story of MDP in RL
Author(s): Rem E
Originally published on Towards AI.
I’ve mentioned MDP (Markov Decision Process) several times, and it frequently appears in RL. But what exactly is an MDP, and why is it so important in RL?
We’ll explore that together in this article! But first, if you’re new to RL and want to understand its basics, check out The Building Blocks of RL.
🕰 Markov Property
The Markov Property is the assumption that the prediction of the next state and reward depends only on the current state and action, not on the full history of past states and actions. This is also known as independence of path: the entire history (path) does not influence the transition and reward probabilities; only the present matters. To formalize this assumption, we use conditional probability to show that the prediction remains the same whether or not we include previous states and actions. If the new state and reward are independent of the history given the current state and action, their conditional probabilities remain unchanged.

The probability of the next state sₜ₊₁ and next reward rₜ₊₁ given the current state sₜ and current action aₜ is the same as the probability given the entire history of states and actions up to time t.
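In symbols, this is the condition (the standard formulation, as in Sutton & Barto, 2018):

p(sₜ₊₁, rₜ₊₁ | sₜ, aₜ) = p(sₜ₊₁, rₜ₊₁ | s₀, a₀, s₁, a₁, …, sₜ, aₜ)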
This simply means that including the history of previous states and actions won't change the probability; the prediction is independent of the past. In other words:
The future is independent of the past, given the present.
To satisfy the Markov property, the state must retain all relevant information needed to make decisions. In other words, the state should be a complete summary of the environment’s status.
But do we always have to follow this assumption strictly?
Not necessarily. In many real-world situations, the environment is partially observable, and the current state may not fully capture all the information needed for optimal decisions. Still, it’s common to approximate the Markov property when designing reinforcement learning solutions. This lets us model the problem as a Markov Decision Process (MDP), which is mathematically convenient and widely supported.
However, when the Markov assumption isn’t appropriate, other techniques can help the agent make better decisions.
🗺️Markov Decision Process
An RL task that satisfies the Markov property is called a Markov Decision Process (MDP). Most of the core ideas behind MDPs are things we’ve already talked about; it’s simply a formal way to model an RL problem.
You can think of it as:
- Environment (Problem)
- Agent (Solution)
We’ve already discussed this informally in the RL framework, but now let’s define it formally!

An MDP is a five-tuple: <S, A, P, R, γ>
S = State Space.
A = Action Space.
P = The transition probabilities: the probability of moving to the next state sₜ₊₁ given the current state sₜ and the current action aₜ.
R = The reward function: the expected immediate reward rₜ₊₁ given the current state sₜ, the current action aₜ, and the next state sₜ₊₁.
γ = The discount factor.
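To make the five-tuple concrete, here’s a minimal sketch of how it could be represented as a container in Python (the names and types are purely illustrative, not part of the tutorial code):

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List

State = Hashable   # e.g., a grid position
Action = Hashable  # e.g., "up", "down", "left", "right"

@dataclass
class MDP:
    states: List[State]                               # S: the state space
    actions: List[Action]                             # A: the action space
    P: Callable[[State, Action], Dict[State, float]]  # P: maps (s, a) to a distribution over next states
    R: Callable[[State, Action, State], float]        # R: expected immediate reward for (s, a, s')
    gamma: float                                      # γ: the discount factor, usually in [0, 1]
```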
The state space and the action space are concepts we’ve already covered; refer to the first part of How Does an RL Agent Actually Decide What to Do? One important note: actions are considered part of the agent, not the environment. Although an MDP is a formal model of the environment, it defines which actions can be performed within that environment; it’s the agent that is responsible for choosing and executing those actions.
Remember when we mentioned that the environment has transition dynamics? This refers to how transitions between states occur. Sometimes, even if we perform the same action in the same state, the environment might transition us to different next states. Such an environment is called stochastic. Our maze example’s environment is deterministic because every next state is fully determined by the current state and action.
Deterministic environment: The same action in the same state always leads to the same next state and reward.
Stochastic environment: The same action in the same state can lead to different next states or rewards, based on probabilities.
The same concept applies to rewards: they can also be stochastic. This means that performing the same action in the same state can produce different rewards.
Why do we use probabilities for transitions but expected values for rewards, even though both are considered stochastic? Simply because we are answering two different questions:
- For transitions: Where will I land next?
- For rewards: How much reward can I expect to get on average if this transition happens (given the current state and action)?
Transitions are discrete events, so we describe each possible outcome with a probability; rewards are real-valued numbers, so we summarize them with their average, the expected value.
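For example, using numbers we’ll reuse in the vacuum example below: if a transition pays +10 with probability 0.7 and +8 with probability 0.3, the reward function stores the single expected value 0.7 × 10 + 0.3 × 8 = 9.4, while the transition function keeps the full probability of each possible next state.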
Lastly, the discount factor γ is the only addition we make to the MDP definition in this framework. It determines how future rewards are valued relative to immediate rewards and will be used later when we discuss the solution.
📊MDP Example:
🌀Maze Problem:
Our maze problem is already an MDP, where:
- States: The agent’s positions on the grid.
- Actions: The agent’s movements (left, right, up, down).
- Transition dynamics: (Deterministic) The next state is computed by adding the movement direction (action) to the agent’s position (state).
- Rewards: (Deterministic) 0 for all states and 1 for reaching the target.
- Discount factor γ: not used yet; it will come into play when we discuss the solution.
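As a rough sketch (not the actual tutorial code), the deterministic dynamics could look like this, assuming states are (row, col) grid positions, a hypothetical 4×4 grid, and an illustrative GOAL cell:

```python
# Illustrative maze dynamics: deterministic transitions and rewards.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GRID_SIZE = 4      # assumed 4x4 grid for illustration
GOAL = (3, 3)      # assumed target cell

def transition(state, action):
    """Deterministic: the same (state, action) always gives the same next state."""
    dr, dc = ACTIONS[action]
    row, col = state[0] + dr, state[1] + dc
    # Stay in place if the move would leave the grid (walls are ignored here).
    if not (0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE):
        return state
    return (row, col)

def reward(state, action, next_state):
    """Deterministic: 1 for reaching the target, 0 everywhere else."""
    return 1.0 if next_state == GOAL else 0.0
```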
However, this is a fully deterministic MDP. We need another example: a stochastic MDP!
🧹Vacuum Cleaner Problem:

This is a state machine diagram, commonly used to visualize MDPs. The circles represent states, and the arrows represent actions along with their probabilities.
- States: {Clean room, Dirty room}
- Actions: {Clean (try to clean the room), Move (move to another room)}
- Transition dynamics:
– From Dirty room, action Clean:
80% chance the room becomes Clean (successful cleaning)
20% chance the room stays Dirty (cleaning failed or interrupted)
– From Clean room, action Move:
90% chance the robot moves to a Dirty room
10% chance it stays in the Clean room (got stuck or error)
- Rewards:
– Cleaning a dirty room successfully: +10 with 70% chance, or +8 with 30% chance (cleaning quality varies)
– Attempting to clean a clean room: -1 (deterministic wasted effort)
– Moving between rooms: -0.5 with 80% chance, or -1.0 with 20% chance (varying energy cost)
- Discount factor: 0.95
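Here’s a minimal sketch of how these stochastic dynamics could be encoded and sampled in Python, with the probabilities taken straight from the description above (the reward for a failed cleaning attempt isn’t specified, so 0 is assumed):

```python
import random

# Transition dynamics: (state, action) -> list of (next_state, probability)
P = {
    ("Dirty", "Clean"): [("Clean", 0.8), ("Dirty", 0.2)],
    ("Clean", "Move"):  [("Dirty", 0.9), ("Clean", 0.1)],
}

# Reward dynamics: (state, action, next_state) -> list of (reward, probability)
R = {
    ("Dirty", "Clean", "Clean"): [(10.0, 0.7), (8.0, 0.3)],  # successful cleaning
    ("Dirty", "Clean", "Dirty"): [(0.0, 1.0)],               # failed attempt (assumed 0, not given above)
    ("Clean", "Move", "Dirty"):  [(-0.5, 0.8), (-1.0, 0.2)], # varying energy cost
    ("Clean", "Move", "Clean"):  [(-0.5, 0.8), (-1.0, 0.2)], # got stuck; move cost assumed to apply anyway
}

def step(state, action):
    """Sample a next state and a reward from the stochastic dynamics."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    rewards, reward_probs = zip(*R[(state, action, next_state)])
    reward = random.choices(rewards, weights=reward_probs)[0]
    return next_state, reward

print(step("Dirty", "Clean"))  # e.g., ('Clean', 10.0) or ('Dirty', 0.0)
```

Calling step repeatedly with the same state and action can return different results, which is exactly what makes this MDP stochastic.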
Hopefully, the difference between deterministic and stochastic MDPs is clear now, as we will build on these concepts in the rest of the theory!
✅ What We’ve Learned…
We explored the foundations of Reinforcement Learning through the lens of Markov Decision Processes (MDPs). We saw that an MDP models an environment where the Markov property holds. We defined MDPs formally as a tuple of states, actions, transition probabilities, rewards, and a discount factor γ. Using two examples, we illustrated how transitions and rewards can be either fixed or probabilistic, highlighting the difference between deterministic and stochastic environments.
So far, all our learning has been about defining the problem, but what about the solution? I think we’re now ready to explore how RL agents smartly solve it. See you next time!
👉Next up: See it all in action!
Implementing MDP for the Maze Problem
Tutorial 4: We’ll modify our previous maze code to fit the RL framework and match the definition of an MDP.
👉 Then, level it up:
Our Neat Value Function
Let’s take it a step further and refine our value function to better guide the agent toward its goal!
✨ As always… stay curious, stay coding, and stay tuned!
📚References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Wikipedia contributors. Markov decision process. Wikipedia. From https://en.wikipedia.org/wiki/Markov_decision_process
- Spiceworks Staff. What is a Markov Decision Process? Spiceworks. From https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-markov-decision-process/
Published via Towards AI