
The Whole Story of MDP in RL

Author(s): Rem E

Originally published on Towards AI.

I’ve mentioned MDP (Markov Decision Process) several times, and it frequently appears in RL. But what exactly is an MDP, and why is it so important in RL?
We’ll explore that together in this article! But first, if you’re new to RL and want to understand its basics, check out The Building Blocks of RL.

🕰 Markov Property

The Markov Property is an assumption that the prediction of the next state and reward depends only on the current state and action, not on the full history of past states and actions. This is also known as independence of path, meaning the entire history (path) does not influence the transition and reward probabilities; only the present matters. To formalize this assumption, we use conditional probability to show that the prediction remains the same whether or not we include previous states and actions. If the new state and reward are independent of the history given the current state and action, their conditional probabilities remain unchanged.

Markov Property, Source: Image by the author

Predicting next state sₜ₊₁ and next reward rₜ₊₁ given current state sₜ and current action aₜ is equal to predicting next state sₜ₊₁ and next reward rₜ₊₁ given the entire history of states and actions up to t.
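
Written out in the article's notation, this is essentially the equation shown in the figure:

P(sₜ₊₁, rₜ₊₁ | sₜ, aₜ) = P(sₜ₊₁, rₜ₊₁ | s₀, a₀, s₁, a₁, …, sₜ, aₜ)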

This simply means that having the history of previous states and actions won’t affect the probability; it’s independent of it. In other words:

The future is independent of the past, given the present.

To satisfy the Markov property, the state must retain all relevant information needed to make decisions. In other words, the state should be a complete summary of the environment’s status.
But do we always have to follow this assumption strictly?
Not necessarily. In many real-world situations, the environment is partially observable, and the current state may not fully capture all the information needed for optimal decisions. Still, it’s common to approximate the Markov property when designing reinforcement learning solutions. This lets us model the problem as a Markov Decision Process (MDP), which is mathematically convenient and widely supported.
However, when the Markov assumption isn’t appropriate, other techniques can help the agent make better decisions.

🗺️Markov Decision Process

An RL task that satisfies the Markov property is called a Markov Decision Process (MDP). Most of the core ideas behind MDPs are things we’ve already talked about; it’s simply a formal way to model an RL problem.
You can think of it as:

  • Environment (Problem)
  • Agent (Solution)

We’ve already discussed this informally in the RL framework, but now let’s define it formally!

MDP Formal Definition, Source: Image by the author

An MDP is a five-tuple <S, A, P, R, γ>, where:

  • S = the state space.
  • A = the action space.
  • P = the probability of transitioning to the next state sₜ₊₁ given the current state sₜ and the current action aₜ.
  • R = the expected immediate reward rₜ₊₁ given the current state sₜ, the current action aₜ, and the next state sₜ₊₁.
  • γ = the discount factor.

State space and actions are closely related concepts already explained earlier; refer to the first part of How Does an RL Agent Actually Decide What to Do? One important note: actions are considered part of the agent, not the environment. Although an MDP is a formal model of the environment, it defines what actions can be performed within that environment. However, it’s the agent that is responsible for choosing and executing those actions.
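
To make the five-tuple concrete, here is a minimal Python sketch (my own illustrative encoding, not code from the maze tutorials): S and A are plain sets, P maps a (state, action) pair to a distribution over next states, R maps a transition to its expected reward, and γ is just a number.

```python
from typing import Dict, FrozenSet, NamedTuple, Tuple

State = str    # illustrative; any hashable type would do
Action = str

class MDP(NamedTuple):
    states: FrozenSet[State]                            # S: state space
    actions: FrozenSet[Action]                          # A: action space
    P: Dict[Tuple[State, Action], Dict[State, float]]   # P[(s, a)][s'] = probability of landing in s'
    R: Dict[Tuple[State, Action, State], float]         # R[(s, a, s')] = expected immediate reward
    gamma: float                                        # γ: discount factor
```

An agent interacting with such an MDP would consult P[(s, a)] to see where it can land and R[(s, a, s')] for the average reward of that move.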

Remember when we mentioned that the environment has transition dynamics? This refers to how transitions between states occur. Sometimes, even if we perform the same action in the same state, the environment might transition us to different next states. Such an environment is called stochastic. Our maze example’s environment is deterministic because every next state is fully determined by the current state and action.

Deterministic environment: The same action in the same state always leads to the same next state and reward.
Stochastic environment: The same action in the same state can lead to different next states or rewards, based on probabilities.

The same concept applies to rewards: they can also be stochastic. This means that performing the same action in the same state can produce different rewards.
Why do we use probabilities for transitions but expected values for rewards, even though both are considered stochastic? Simply because we are answering two different questions:

  • For transitions: Where will I land next?
  • For rewards: How much reward can I expect to get on average if this transition happens (given the current state and action)?

Transitions are discrete events, so we describe them with a probability distribution over next states; rewards are real-valued numbers, so we summarize them with a single expected value.
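
A tiny illustration of the two questions (the numbers are hypothetical, just for intuition):

```python
import random

# Transition question: "Where will I land next?" -> sample an event from P(s' | s, a)
next_state = random.choices(["s1", "s2"], weights=[0.9, 0.1], k=1)[0]

# Reward question: "How much do I get on average?" -> a single real number, the expectation
expected_reward = 0.9 * 1.0 + 0.1 * (-1.0)   # two possible outcomes averaged into 0.8
```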

Lastly, the discount factor γ is the only addition we make to the MDP definition in this framework. It determines how future rewards are valued relative to immediate rewards and will be used later when we discuss the solution.

📊MDP Example:

🌀Maze Problem:

Our maze problem is already an MDP (a minimal code sketch follows the list), where:

  • States: The agent’s positions on the grid.
  • Actions: The agent’s movements (left, right, up, down).
  • Transition dynamics: (Deterministic) The next state is simply computed by adding the agent’s position (state) to the movement direction (action).
  • Rewards: (Deterministic) 0 for all states and 1 for reaching the target.
  • Discount factor γ: not specified yet; it will come up later when we discuss the solution.
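
Here is a minimal sketch of this MDP in code (illustrative only; the full implementation is the subject of the follow-up tutorial). Wall and boundary checks are omitted for brevity, and the target cell is a made-up placeholder:

```python
# Deterministic maze MDP: next state and reward follow directly from (state, action).
ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}
TARGET = (3, 3)   # hypothetical goal position

def step(state, action):
    """Return (next_state, reward); the same input always gives the same output."""
    dr, dc = ACTIONS[action]
    next_state = (state[0] + dr, state[1] + dc)   # position + movement direction
    reward = 1.0 if next_state == TARGET else 0.0
    return next_state, reward
```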

However, this is a fully deterministic MDP. We need another example: a stochastic MDP!

🧹Vacuum Cleaner Problem:

Vacuum Cleaner MDP, Source: Image by the author

This is a state machine diagram, commonly used to visualize MDPs. The circles represent states, and the arrows represent actions along with their probabilities. A minimal code sketch of this MDP follows the list below.

  • States: {Clean room, Dirty room}
  • Actions: {Clean (try to clean the room), Move (move to another room)}
  • Transition dynamics:
    – From Dirty room, action Clean:
      80% chance the room becomes Clean (successful cleaning)
      20% chance the room stays Dirty (cleaning failed or interrupted)
    – From Clean room, action Move:
      90% chance the robot moves to a Dirty room
      10% chance it stays in the Clean room (got stuck or error)
  • Rewards:
    – Cleaning a dirty room successfully: +10 with 70% chance, or +8 with 30% chance (cleaning quality varies)
    – Attempting to clean a clean room: -1 (deterministic wasted effort)
    – Moving between rooms: -0.5 with 80% chance, or -1.0 with 20% chance (varying energy cost)
  • Discount factor: 0.95
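
A minimal Python sketch of this stochastic MDP, encoding the probabilities above and sampling one step. A few assumptions are mine: the reward for a failed cleaning attempt isn’t given in the list, so it’s set to 0; the movement cost is assumed independent of whether the move succeeds; and a clean room is assumed to stay clean when the robot tries to clean it.

```python
import random

# (state, action) -> list of (next_state, reward, probability); only the pairs described above.
VACUUM = {
    ("Dirty", "Clean"): [("Clean", 10.0, 0.8 * 0.7),   # cleaned, high-quality result
                         ("Clean",  8.0, 0.8 * 0.3),   # cleaned, lower-quality result
                         ("Dirty",  0.0, 0.2)],        # cleaning failed; reward assumed 0
    ("Clean", "Move"):  [("Dirty", -0.5, 0.9 * 0.8), ("Dirty", -1.0, 0.9 * 0.2),
                         ("Clean", -0.5, 0.1 * 0.8), ("Clean", -1.0, 0.1 * 0.2)],
    ("Clean", "Clean"): [("Clean", -1.0, 1.0)],        # wasted effort; assumed to stay clean
}
GAMMA = 0.95

def vacuum_step(state, action):
    """Sample (next_state, reward) according to the probabilities above."""
    outcomes = VACUUM[(state, action)]
    next_state, reward, _ = random.choices(outcomes, weights=[p for *_, p in outcomes], k=1)[0]
    return next_state, reward

print(vacuum_step("Dirty", "Clean"))   # e.g. ('Clean', 10.0)
```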

Hopefully, the difference between deterministic and stochastic MDPs is clear now, as we will build on these concepts in the rest of the theory!

✅ What We’ve Learned…

We explored the foundations of Reinforcement Learning through the lens of Markov Decision Processes (MDPs). We saw that an MDP models an environment where the Markov property holds. We defined MDPs formally as a tuple of states, actions, transition probabilities, rewards, and a discount factor γ. Using two examples, we illustrated how transitions and rewards can be either fixed or probabilistic, highlighting the difference between deterministic and stochastic environments.

So far, all our learning has been about defining the problem, but what about the solution? I think we’re now ready to explore how RL agents smartly solve it. See you next time!

👉 Next up: See it all in action!

Implementing MDP for the Maze Problem (Tutorial 4): We’ll modify our previous maze code to fit the RL framework and match the definition of an MDP.

👉 Then, level it up:

Our Neat Value Function: Let’s take it a step further and refine our value function to better guide the agent toward its goal!

As always… stay curious, stay coding, and stay tuned!
