PPO Explained and Its Constraints: Introducing PDPPO as an Alternative
Author(s): Leonardo Kanashiro Felizardo
Originally published on Towards AI.
What is PPO, and Why is it Popular?
Proximal Policy Optimization (PPO) has rapidly emerged as a leading model-free reinforcement learning (RL) method due to its simplicity and strong performance across various domains. PPO keeps the trust-region idea of constraining how far each update can move the policy, but enforces it with a simple clipped objective, ensuring stable and efficient policy updates.
Explanation of PPO
PPO addresses the limitations of previous RL methods like vanilla policy gradient and TRPO (Trust Region Policy Optimization) by balancing exploration and exploitation through controlled policy updates. PPO specifically aims to stabilize training by preventing overly large policy updates, which could lead to catastrophic forgetting or divergence.
Actor-Critic and the Role of Advantage Estimation
PPO belongs to the family of actor-critic algorithms, where two models work together:
- The actor updates the policy π_θ(a|s), selecting actions based on states.
- The critic evaluates the actor's decisions by estimating the value function V^π(s).
This architecture was first formalized by Konda and Tsitsiklis in their seminal work Actor-Critic Algorithms [1], which demonstrated convergence properties and laid the mathematical foundation for combining policy gradient methods with value function estimation.
The advantage function is a critical concept in this setting, defined as:
A(s, a) = Q(s, a) − V(s)
where Q(s, a) is the expected return of taking action a in state s and then following the policy, and V(s) is the expected return of the policy from state s. A positive advantage indicates that the action is better than the policy's average behavior in that state.
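In practice, A(s, a) is not known exactly and must be estimated from sampled trajectories. A common choice is Generalized Advantage Estimation (GAE); the sketch below is a minimal illustration, assuming rewards, value predictions (with one bootstrap value appended), and episode-termination flags are available as tensors, and the γ and λ values are illustrative:
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: float tensors of length T (dones is 1.0 where an episode ended)
    # values: float tensor of length T + 1, including a bootstrap value for the
    #         state reached after the final step
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages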
This is a minimal and clean example of how to implement an Actor-Critic architecture in PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.actor = nn.Linear(128, action_dim)   # policy head (action logits)
        self.critic = nn.Linear(128, 1)           # value head

    def forward(self, x):
        x = self.shared(x)
        return self.actor(x), self.critic(x)

# Example usage
state_dim = 4
action_dim = 2
model = ActorCritic(state_dim, action_dim)
optimizer = optim.Adam(model.parameters(), lr=3e-4)

state = torch.rand((1, state_dim))
logits, value = model(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
log_prob = dist.log_prob(action)

# Mock advantage and return
advantage = torch.tensor([1.0])
return_ = torch.tensor([[1.5]])

# Actor-Critic loss
actor_loss = (-log_prob * advantage).mean()
critic_loss = (value - return_).pow(2).mean()
loss = actor_loss + critic_loss

# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
PPO Objective and Mathematics
The core idea behind PPO is the optimization of the policy network through a clipped objective function:
L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip( r_t(θ), 1 − ε, 1 + ε ) A_t ) ]
Here:
- θ represents the parameters of the policy.
- ε is a small hyperparameter (typically around 0.2) controlling how much the policy can change at each step.
- A_t is the advantage function, indicating the relative improvement of taking a specific action compared to the average action.
The probability ratio is defined as:
r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)
This ratio quantifies how much the probability of selecting an action has changed from the old policy to the new one.
PyTorch Code Example: PPO Core
import torch
import torch.nn as nn
import torch.optim as optim

# Assume we already have: states, actions, old_log_probs, returns, values,
# a model with .actor and .critic modules, and an optimizer over its parameters
clip_epsilon = 0.2
gamma = 0.99  # discount factor (used when computing the returns above)

# Compute and normalize advantages
advantages = returns - values
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Get new log probabilities under the current policy
log_probs = model.actor.get_log_probs(states, actions)
ratios = torch.exp(log_probs - old_log_probs.detach())

# Clipped surrogate objective
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
policy_loss = -torch.min(surr1, surr2).mean()

# Critic loss (value function)
value_estimates = model.critic(states)
critic_loss = nn.MSELoss()(value_estimates, returns)

# Total loss
total_loss = policy_loss + 0.5 * critic_loss

# Backpropagation
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
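The snippet above performs a single update. In most PPO implementations, the same collected batch is reused for several epochs of shuffled minibatch updates before new data is gathered. Below is a rough sketch of that outer loop, reusing the tensors assumed above; the epoch count and minibatch size are illustrative:
import torch

# Illustrative hyperparameters; tune per problem
num_epochs = 4
minibatch_size = 64
batch_size = states.shape[0]

for _ in range(num_epochs):
    permutation = torch.randperm(batch_size)
    for start in range(0, batch_size, minibatch_size):
        idx = permutation[start:start + minibatch_size]
        mb_states, mb_actions = states[idx], actions[idx]
        mb_old_log_probs = old_log_probs[idx]
        mb_advantages, mb_returns = advantages[idx], returns[idx]
        # ...recompute log-probs, ratios, and the clipped loss on the minibatch,
        # then take an optimizer step exactly as in the snippet above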
PPO's Advantages and Popularity
PPO's popularity stems from its:
- Simplicity: Easier to implement and tune compared to other sophisticated methods like TRPO.
- Efficiency: Faster convergence due to the clipped surrogate objective, reducing the need for careful hyperparameter tuning.
- Versatility: Robust performance across a wide range of tasks including robotics, games, and operational management problems.
Flaws and Limitations of PPO
Despite PPO's successes, it faces several limitations:
- High Variance and Instability: PPO's reliance on sample-based estimates can cause significant variance in policy updates, especially in environments with sparse rewards or long horizons.
- Exploration Inefficiency: PPO typically relies on Gaussian noise for exploration, which can lead to insufficient exploration, especially in complex, high-dimensional state spaces.
- Sensitivity to Initialization: PPO's effectiveness can vary greatly depending on initial conditions, causing inconsistent results across training runs.
Enter PDPPO: A Novel Improvement
To overcome these limitations, Post-Decision Proximal Policy Optimization (PDPPO) introduces a novel approach using dual critic networks and post-decision states.
Understanding Post-Decision States
Post-decision states, introduced by Warren B. Powell [2], provide a powerful abstraction in reinforcement learning. A post-decision state represents the environment immediately after an agent has taken an action but before the environment's stochastic response occurs.
This allows the learning algorithm to decompose the transition dynamics into two parts:
- Deterministic step (decision):
This represents the state immediately after the deterministic effects of the action take place:
sˣ = f(s, a)
- Stochastic step (nature's response):
Once the deterministic effects are observed, the stochastic variables that complete the transition are applied:
s′ = g(sˣ, η)
Where:
- f represents the deterministic function mapping the current state and action to the post-decision state sˣ.
- η is a random variable capturing the environment's stochasticity.
- g defines how this stochastic component affects the next state.
- s′ is the next state.
Example: Frozen Lake
Imagine the Frozen Lake environment. The agent chooses to move right from a given tile. The decision itself is deterministic: the intention to move right is clear. This gives us the post-decision state sˣ: "attempted to move right."
However, because the ice is slippery, the agent may not land on the intended tile. It might slide right, down, or stay in place, each with a certain probability. That final position, determined after the slippage, is the true next state s′.
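To make the decomposition concrete, here is a toy sketch of such a slippery-grid step split into its deterministic part f and stochastic part g. The move table, slip probability, and function names are illustrative and are not the actual Gym FrozenLake implementation:
import random

MOVES = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}

def deterministic_step(state, action):
    # f(s, a): the intended move, before any slippage (post-decision state s^x)
    row, col = state
    d_row, d_col = MOVES[action]
    return (row + d_row, col + d_col)

def stochastic_step(post_state, slip_prob=0.2):
    # g(s^x, eta): nature's response, an occasional random slip to a neighboring tile
    if random.random() < slip_prob:
        d_row, d_col = random.choice(list(MOVES.values()))
        return (post_state[0] + d_row, post_state[1] + d_col)
    return post_state  # next state s'

post_state = deterministic_step((1, 1), "right")  # s^x: "attempted to move right"
next_state = stochastic_step(post_state)          # s': where the agent actually lands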
This decomposition allows value functions to be better estimated:
- Pre-decision value function:
V(s) = E_{a∼π} [ r(s, a) + γ Vˣ(sˣ) ], where sˣ = f(s, a)
- Post-decision value function:
Vˣ(sˣ) = E_η [ V(s′) ], where s′ = g(sˣ, η)
This formulation helps decouple the decision from stochastic effects, reducing variance in value estimation and improving sample efficiency.
Post-Decision Advantage Calculation
Given both critics, PDPPO computes the advantage as:
A(s_t) = R_t − V(s_t)
Aˣ(sˣ_t) = Rˣ_t − Vˣ(sˣ_t)
where R_t and Rˣ_t are the returns computed along the pre- and post-decision trajectories, respectively.
And selects the most informative advantage at each step:
A_t = max( A(s_t), Aˣ(sˣ_t) )
This βmaximum advantageβ strategy allows the actor to favor the most promising value estimate during learning.
Updating the Critics and Policy
Critic loss functions:
L_critic = E_t [ ( V(s_t) − R_t )² ]
L_post-critic = E_t [ ( Vˣ(sˣ_t) − Rˣ_t )² ]
Combined actor-critic loss:
L_total = L_policy + c ( L_critic + L_post-critic )
where L_policy is the clipped surrogate objective and c is a weighting coefficient (0.5 in the code below).
This architecture, with separate value estimators for deterministic and stochastic effects, enables more stable learning in environments with complex uncertainty.
Dual Critic Networks
PDPPO employs two critics:
- State Critic: Estimates the value function based on pre-decision states.
- Post-Decision Critic: Estimates the value function based on post-decision states.
The dual-critic approach improves value estimation accuracy by capturing both deterministic and stochastic dynamics separately.
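The PDPPO snippet below assumes a model exposing an actor with a get_log_probs helper, a state critic, and a post-decision critic. One possible way to wire these together in PyTorch is sketched here; the layer sizes and the discrete-action Categorical head are assumptions, not the reference implementation from the paper:
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def get_log_probs(self, states, actions):
        # Log-probability of the taken actions under the current policy
        dist = torch.distributions.Categorical(logits=self.net(states))
        return dist.log_prob(actions)

class PDPPONetwork(nn.Module):
    def __init__(self, state_dim, post_state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.actor = Actor(state_dim, action_dim, hidden_dim)
        # Critic for pre-decision states s
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1),
        )
        # Separate critic for post-decision states s^x
        self.post_decision_critic = nn.Sequential(
            nn.Linear(post_state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1),
        )

# Example instantiation (dimensions are placeholders)
model = PDPPONetwork(state_dim=8, post_state_dim=8, action_dim=4)
If the post-decision state has the same shape as the regular state, post_state_dim can simply equal state_dim, as in the example.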
PyTorch Code Example: PDPPO Core
import torch
import torch.nn as nn
import torch.optim as optim
# Assume we already have: states, post_states, actions, old_log_probs,
# returns, post_returns, a model with actor, critic, and post_decision_critic
# modules, and an optimizer over its parameters
clip_epsilon = 0.2

# --- 1. Compute advantages from both critics ---
values = model.critic(states)
post_values = model.post_decision_critic(post_states)
adv_pre = returns - values
adv_post = post_returns - post_values

# Use the max advantage (PDPPO twist)
advantages = torch.max(adv_pre, adv_post)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# --- 2. Policy loss: same PPO-style clip ---
log_probs = model.actor.get_log_probs(states, actions)
ratios = torch.exp(log_probs - old_log_probs.detach())
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
policy_loss = -torch.min(surr1, surr2).mean()

# --- 3. Dual critic loss ---
critic_loss = nn.MSELoss()(values, returns)
post_critic_loss = nn.MSELoss()(post_values, post_returns)

# Total loss with dual critic
total_loss = policy_loss + 0.5 * (critic_loss + post_critic_loss)

# --- 4. Backpropagation ---
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
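One practical detail not shown above is that rollouts must record the post-decision state at each step. The sketch below assumes the environment exposes a problem-specific deterministic_step(state, action) hook that returns sˣ without advancing the environment; standard Gym environments do not provide such a hook, so it has to be implemented per problem:
def collect_rollout(env, select_action, rollout_length):
    # Hypothetical sketch: env.deterministic_step(state, action) is assumed to
    # return the post-decision state s^x WITHOUT advancing the environment.
    states, post_states, actions, rewards = [], [], [], []
    state = env.reset()
    for _ in range(rollout_length):
        action = select_action(state)                       # e.g. sample from the actor
        post_state = env.deterministic_step(state, action)  # s^x = f(s, a)
        next_state, reward, done, info = env.step(action)   # s' = g(s^x, eta)
        states.append(state)
        post_states.append(post_state)
        actions.append(action)
        rewards.append(reward)
        state = env.reset() if done else next_state
    return states, post_states, actions, rewards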
PDPPO vs PPO in Practice
Tests on environments such as Frozen Lake and stochastic lot-sizing highlight PDPPO's significant performance improvements, as reported in Felizardo et al. [3]:
- Improved Stability Across Seeds
PDPPO showed lower variance in both cumulative and maximum rewards across different random seeds, particularly in stochastic environments like Frozen Lake. This indicates greater robustness to initialization compared to PPO, which often suffers from unstable learning in such settings.
- Faster and Smoother Convergence
The learning curves of PDPPO are notably smoother and consistently trend upward, while PPO's often stagnate or oscillate. This suggests that PDPPO's dual-critic structure provides more accurate value estimates, enabling more reliable policy updates.
- Better Scaling with Dimensionality
In the stochastic lot-sizing tasks, PDPPO's performance gap widened as the problem dimensionality increased (e.g., 25 items and 15 machines). This demonstrates that PDPPO scales better in complex settings, benefiting from its decomposition of dynamics into deterministic and stochastic parts.
- More Informative Advantage Estimates
By using the maximum of pre- and post-decision advantages, PDPPO effectively captures the most optimistic learning signal at each step, leading to better exploitation of promising strategies without ignoring the stochastic nature of the environment.
- Better Sample Efficiency
Empirical results showed that PDPPO achieved higher rewards using fewer training episodes, making it more sample-efficient: an essential trait for real-world applications where data collection is expensive.
Empirical Comparison (20–30 Runs)
[Figure: learning curves for PPO vs. PDPPO, aggregated over 20–30 runs per environment]
The comparison shows:
- Faster convergence,
- Higher peak performance, and
- Tighter variance bands for PDPPO.
A Few Other Alternatives
A few other alternatives that address the limitations of PPO include:
- Intrinsic Exploration Module (IEM)
Proposed by Zhang et al. [8], this approach enhances exploration by incorporating uncertainty estimation into PPO. It addresses PPO's weak exploration signal by rewarding novelty, which is especially useful in sparse-reward settings.
- Uncertainty-Aware TRPO (UA-TRPO)
Introduced by Queeney et al. [7], UA-TRPO aims to stabilize policy updates in the presence of finite-sample estimation errors by accounting for uncertainty in the policy gradients, offering a more robust learning process than standard PPO.
- Dual-Critic Variants
Earlier methods, like SAC [4] and TD3 [5], use dual critics mainly for continuous action spaces to reduce overestimation bias. However, they typically do not incorporate post-decision states, nor are they designed for environments with both deterministic and stochastic dynamics.
- Post-Decision Architectures in OR
Earlier work in operations research (e.g., Powell [2], Hull [6]) used post-decision states to manage the curse of dimensionality in approximate dynamic programming. PDPPO brings this insight into deep RL by using post-decision value functions directly in the learning process.
Each of these methods has its trade-offs. PDPPO stands out by directly tackling the challenge of stochastic transitions via decomposition and dual critics, making it particularly effective in noisy, real-world-like settings.
Citation
[1] Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-Critic Algorithms. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, Vol. 12. MIT Press.
[2] Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality (2nd ed.). John Wiley & Sons.
[3] Felizardo, L. K., Fadda, E., Nascimento, M. C. V., Brandimarte, P., & Del-Moral-Hernandez, E. (2024). A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks. arXiv preprint arXiv:2504.05150. https://arxiv.org/pdf/2504.05150
[4] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[5] Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[6] Hull, I. (2015). Approximate Dynamic Programming with Post-Decision States as a Solution Method for Dynamic Economic Models. Journal of Economic Dynamics and Control, 55, 57β70.
[7] Queeney, J., Paschalidis, I. C., & Cassandras, C. G. (2021). Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach. In Proceedings of the AAAI Conference on Artificial Intelligence, 35(9), 9377β9385.
[8] Zhang, J., Zhang, Z., Han, S., & LΓΌ, S. (2022). Proximal Policy Optimization via Enhanced Exploration Efficiency. Information Sciences, 609, 750β765.