PPO Explained and Its Constraints: Introducing PDPPO as an Alternative
Author(s): Leonardo Kanashiro Felizardo
Originally published on Towards AI.
What is PPO, and Why is it Popular?
Proximal Policy Optimization (PPO) has rapidly emerged as a leading model-free reinforcement learning (RL) method due to its simplicity and strong performance across various domains. PPO keeps the trust-region idea of constraining how far each update can move the policy, but enforces it with a simple clipped objective, ensuring stable and efficient policy updates.
Explanation of PPO
PPO addresses the limitations of previous RL methods like vanilla policy gradient and TRPO (Trust Region Policy Optimization) by balancing exploration and exploitation through controlled policy updates. PPO specifically aims to stabilize training by preventing overly large policy updates, which could lead to catastrophic forgetting or divergence.
Actor-Critic and the Role of Advantage Estimation
PPO belongs to the family of actor-critic algorithms, where two models work together:
- The actor updates the policy π_θ(a|s), selecting actions based on states.
- The critic evaluates the actor's decisions by estimating the value function V^π(s).
This architecture was first formalized by Konda and Tsitsiklis in their seminal work Actor-Critic Algorithms [1], which demonstrated convergence properties and laid the mathematical foundation for combining policy gradient methods with value function estimation.
The advantage function is a critical concept in this setting, defined as:
A(s, a) = Q(s, a) − V(s)
where Q(s, a) is the expected return of taking action a in state s and then following the policy, and V(s) is the expected return of the policy from state s. A positive advantage indicates that the action is better than the policy's average behavior in that state.
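In practice, A(s, a) is not known exactly and must be estimated from sampled trajectories. A common choice is Generalized Advantage Estimation (GAE); the sketch below is a minimal illustration, assuming rewards, value predictions (with one bootstrap value appended), and episode-termination flags are available as tensors, and the γ and λ values are illustrative:
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: float tensors of length T (dones is 1.0 where an episode ended)
    # values: float tensor of length T + 1, including a bootstrap value for the
    #         state reached after the final step
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages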
This is a minimal and clean example of how to implement an Actor-Critic architecture in PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim

class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.actor = nn.Linear(128, action_dim)   # policy head (action logits)
        self.critic = nn.Linear(128, 1)           # value head

    def forward(self, x):
        x = self.shared(x)
        return self.actor(x), self.critic(x)

# Example usage
state_dim = 4
action_dim = 2
model = ActorCritic(state_dim, action_dim)
optimizer = optim.Adam(model.parameters(), lr=3e-4)

state = torch.rand((1, state_dim))
logits, value = model(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()
log_prob = dist.log_prob(action)

# Mock advantage and return
advantage = torch.tensor([1.0])
return_ = torch.tensor([[1.5]])

# Actor-Critic loss
actor_loss = (-log_prob * advantage).mean()
critic_loss = (value - return_).pow(2).mean()
loss = actor_loss + critic_loss

# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
PPO Objective and Mathematics
The core idea behind PPO is the optimization of the policy network through a clipped objective function:
L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip( r_t(θ), 1 − ε, 1 + ε ) A_t ) ]
Here:
- θ represents the parameters of the policy.
- ε is a small hyperparameter (typically around 0.2) controlling how much the policy can change at each step.
- A_t is the advantage function, indicating the relative improvement of taking a specific action compared to the average action.
The probability ratio is defined as:
r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)
This ratio quantifies how much the probability of selecting an action has changed from the old policy to the new one.
PyTorch Code Example: PPO Core
import torch
import torch.nn as nn
import torch.optim as optim

# Assume we already have: states, actions, old_log_probs, returns, values,
# a model with .actor and .critic modules, and an optimizer over its parameters
clip_epsilon = 0.2
gamma = 0.99  # discount factor (used when computing the returns above)

# Compute and normalize advantages
advantages = returns - values
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# Get new log probabilities under the current policy
log_probs = model.actor.get_log_probs(states, actions)
ratios = torch.exp(log_probs - old_log_probs.detach())

# Clipped surrogate objective
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
policy_loss = -torch.min(surr1, surr2).mean()

# Critic loss (value function)
value_estimates = model.critic(states)
critic_loss = nn.MSELoss()(value_estimates, returns)

# Total loss
total_loss = policy_loss + 0.5 * critic_loss

# Backpropagation
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
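The snippet above performs a single update. In most PPO implementations, the same collected batch is reused for several epochs of shuffled minibatch updates before new data is gathered. Below is a rough sketch of that outer loop, reusing the tensors assumed above; the epoch count and minibatch size are illustrative:
import torch

# Illustrative hyperparameters; tune per problem
num_epochs = 4
minibatch_size = 64
batch_size = states.shape[0]

for _ in range(num_epochs):
    permutation = torch.randperm(batch_size)
    for start in range(0, batch_size, minibatch_size):
        idx = permutation[start:start + minibatch_size]
        mb_states, mb_actions = states[idx], actions[idx]
        mb_old_log_probs = old_log_probs[idx]
        mb_advantages, mb_returns = advantages[idx], returns[idx]
        # ...recompute log-probs, ratios, and the clipped loss on the minibatch,
        # then take an optimizer step exactly as in the snippet above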
PPO's Advantages and Popularity
PPO's popularity stems from its:
- Simplicity: Easier to implement and tune compared to other sophisticated methods like TRPO.
- Efficiency: Faster convergence due to the clipped surrogate objective, reducing the need for careful hyperparameter tuning.
- Versatility: Robust performance across a wide range of tasks including robotics, games, and operational management problems.
Flaws and Limitations of PPO
Despite PPO's successes, it faces several limitations:
- High Variance and Instability: PPO's reliance on sample-based estimates can cause significant variance in policy updates, especially in environments with sparse rewards or long horizons.
- Exploration Inefficiency: PPO typically relies on Gaussian noise for exploration, which can lead to insufficient exploration, especially in complex, high-dimensional state spaces.
- Sensitivity to Initialization: PPO's effectiveness can vary greatly depending on initial conditions, causing inconsistent results across training runs.
Enter PDPPO: A Novel Improvement
To overcome these limitations, Post-Decision Proximal Policy Optimization (PDPPO) introduces a novel approach using dual critic networks and post-decision states.
Understanding Post-Decision States
Post-decision states, introduced by Warren B. Powell [2], provide a powerful abstraction in reinforcement learning. A post-decision state represents the environment immediately after an agent has taken an action but before the environment's stochastic response occurs.
This allows the learning algorithm to decompose the transition dynamics into two parts:
- Deterministic step (decision):
This represents the state immediately after the deterministic effects of the action take place:
sˣ = f(s, a)
- Stochastic step (nature's response):
Once the deterministic effects are observed, the stochastic variables that complete the transition are applied:
s′ = g(sˣ, η)
Where:
- f represents the deterministic function mapping the current state and action to the post-decision state sˣ.
- η is a random variable capturing the environment's stochasticity.
- g defines how this stochastic component affects the next state.
- s′ is the next state.
Example: Frozen Lake
Imagine the Frozen Lake environment. The agent chooses to move right from a given tile. The decision itself is deterministic: the intention to move right is clear. This gives us the post-decision state sˣ: "attempted to move right."
However, because the ice is slippery, the agent may not land on the intended tile. It might slide right, down, or stay in place, each with a certain probability. That final position, determined after the slippage, is the true next state s′.
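To make the decomposition concrete, here is a toy sketch of such a slippery-grid step split into its deterministic part f and stochastic part g. The move table, slip probability, and function names are illustrative and are not the actual Gym FrozenLake implementation:
import random

MOVES = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}

def deterministic_step(state, action):
    # f(s, a): the intended move, before any slippage (post-decision state s^x)
    row, col = state
    d_row, d_col = MOVES[action]
    return (row + d_row, col + d_col)

def stochastic_step(post_state, slip_prob=0.2):
    # g(s^x, eta): nature's response, an occasional random slip to a neighboring tile
    if random.random() < slip_prob:
        d_row, d_col = random.choice(list(MOVES.values()))
        return (post_state[0] + d_row, post_state[1] + d_col)
    return post_state  # next state s'

post_state = deterministic_step((1, 1), "right")  # s^x: "attempted to move right"
next_state = stochastic_step(post_state)          # s': where the agent actually lands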
This decomposition allows value functions to be better estimated:
- Pre-decision value function:
V(s) = E_{a∼π} [ r(s, a) + γ Vˣ(sˣ) ], where sˣ = f(s, a)
- Post-decision value function:
Vˣ(sˣ) = E_η [ V(s′) ], where s′ = g(sˣ, η)
This formulation helps decouple the decision from stochastic effects, reducing variance in value estimation and improving sample efficiency.
Post-Decision Advantage Calculation
Given both critics, PDPPO computes the advantage as:
A(s_t) = R_t − V(s_t)
Aˣ(sˣ_t) = Rˣ_t − Vˣ(sˣ_t)
where R_t and Rˣ_t are the returns computed along the pre- and post-decision trajectories, respectively.
And selects the most informative advantage at each step:
A_t = max( A(s_t), Aˣ(sˣ_t) )
This βmaximum advantageβ strategy allows the actor to favor the most promising value estimate during learning.
Updating the Critics and Policy
Critic loss functions:
L_critic = E_t [ ( V(s_t) − R_t )² ]
L_post-critic = E_t [ ( Vˣ(sˣ_t) − Rˣ_t )² ]
Combined actor-critic loss:
L_total = L_policy + c ( L_critic + L_post-critic )
where L_policy is the clipped surrogate objective and c is a weighting coefficient (0.5 in the code below).
This architecture, with separate value estimators for deterministic and stochastic effects, enables more stable learning in environments with complex uncertainty.
Dual Critic Networks
PDPPO employs two critics:
- State Critic: Estimates the value function based on pre-decision states.
- Post-Decision Critic: Estimates the value function based on post-decision states.
The dual-critic approach improves value estimation accuracy by capturing both deterministic and stochastic dynamics separately.
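The PDPPO snippet below assumes a model exposing an actor with a get_log_probs helper, a state critic, and a post-decision critic. One possible way to wire these together in PyTorch is sketched here; the layer sizes and the discrete-action Categorical head are assumptions, not the reference implementation from the paper:
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def get_log_probs(self, states, actions):
        # Log-probability of the taken actions under the current policy
        dist = torch.distributions.Categorical(logits=self.net(states))
        return dist.log_prob(actions)

class PDPPONetwork(nn.Module):
    def __init__(self, state_dim, post_state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.actor = Actor(state_dim, action_dim, hidden_dim)
        # Critic for pre-decision states s
        self.critic = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1),
        )
        # Separate critic for post-decision states s^x
        self.post_decision_critic = nn.Sequential(
            nn.Linear(post_state_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1),
        )

# Example instantiation (dimensions are placeholders)
model = PDPPONetwork(state_dim=8, post_state_dim=8, action_dim=4)
If the post-decision state has the same shape as the regular state, post_state_dim can simply equal state_dim, as in the example.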
PyTorch Code Example: PDPPO Core
import torch
import torch.nn as nn
import torch.optim as optim
# Assume we already have: states, post_states, actions, old_log_probs,
# returns, post_returns, a model with actor, critic, and post_decision_critic
# modules, and an optimizer over its parameters
clip_epsilon = 0.2

# --- 1. Compute advantages from both critics ---
values = model.critic(states)
post_values = model.post_decision_critic(post_states)
adv_pre = returns - values
adv_post = post_returns - post_values

# Use the max advantage (PDPPO twist)
advantages = torch.max(adv_pre, adv_post)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# --- 2. Policy loss: same PPO-style clip ---
log_probs = model.actor.get_log_probs(states, actions)
ratios = torch.exp(log_probs - old_log_probs.detach())
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1.0 - clip_epsilon, 1.0 + clip_epsilon) * advantages
policy_loss = -torch.min(surr1, surr2).mean()

# --- 3. Dual critic loss ---
critic_loss = nn.MSELoss()(values, returns)
post_critic_loss = nn.MSELoss()(post_values, post_returns)

# Total loss with dual critic
total_loss = policy_loss + 0.5 * (critic_loss + post_critic_loss)

# --- 4. Backpropagation ---
optimizer.zero_grad()
total_loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
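One practical detail not shown above is that rollouts must record the post-decision state at each step. The sketch below assumes the environment exposes a problem-specific deterministic_step(state, action) hook that returns sˣ without advancing the environment; standard Gym environments do not provide such a hook, so it has to be implemented per problem:
def collect_rollout(env, select_action, rollout_length):
    # Hypothetical sketch: env.deterministic_step(state, action) is assumed to
    # return the post-decision state s^x WITHOUT advancing the environment.
    states, post_states, actions, rewards = [], [], [], []
    state = env.reset()
    for _ in range(rollout_length):
        action = select_action(state)                       # e.g. sample from the actor
        post_state = env.deterministic_step(state, action)  # s^x = f(s, a)
        next_state, reward, done, info = env.step(action)   # s' = g(s^x, eta)
        states.append(state)
        post_states.append(post_state)
        actions.append(action)
        rewards.append(reward)
        state = env.reset() if done else next_state
    return states, post_states, actions, rewards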
PDPPO vs PPO in Practice
Tests on environments such as Frozen Lake and stochastic lot-sizing highlight PDPPO's significant performance improvements, as reported in Felizardo et al. [3]:
- Improved Stability Across Seeds
PDPPO showed lower variance in both cumulative and maximum rewards across different random seeds, particularly in stochastic environments like Frozen Lake. This indicates greater robustness to initialization compared to PPO, which often suffers from unstable learning in such settings.
- Faster and Smoother Convergence
The learning curves of PDPPO are notably smoother and consistently trend upward, while PPO's often stagnate or oscillate. This suggests that PDPPO's dual-critic structure provides more accurate value estimates, enabling more reliable policy updates.
- Better Scaling with Dimensionality
In the stochastic lot-sizing tasks, PDPPO's performance gap widened as the problem dimensionality increased (e.g., 25 items and 15 machines). This demonstrates that PDPPO scales better in complex settings, benefiting from its decomposition of dynamics into deterministic and stochastic parts.
- More Informative Advantage Estimates
By using the maximum of pre- and post-decision advantages, PDPPO effectively captures the most optimistic learning signal at each step, leading to better exploitation of promising strategies without ignoring the stochastic nature of the environment.
- Better Sample Efficiency
Empirical results showed that PDPPO achieved higher rewards using fewer training episodes, making it more sample-efficient: an essential trait for real-world applications where data collection is expensive.
Empirical Comparison (20–30 Runs)
[Figure: learning curves for PPO vs. PDPPO, aggregated over 20–30 runs per environment]
The comparison shows:
- Faster convergence,
- Higher peak performance, and
- Tighter variance bands for PDPPO.
A Few Other Alternatives
A few other alternatives that address the limitations of PPO include:
- Intrinsic Exploration Module (IEM)
Proposed by Zhang et al. [8], this approach enhances exploration by incorporating uncertainty estimation into PPO. It addresses PPO's weak exploration signal by rewarding novelty, which is especially useful in sparse-reward settings.
- Uncertainty-Aware TRPO (UA-TRPO)
Introduced by Queeney et al. [7], UA-TRPO aims to stabilize policy updates in the presence of finite-sample estimation errors by accounting for uncertainty in the policy gradients, offering a more robust learning process than standard PPO.
- Dual-Critic Variants
Earlier methods, like SAC [4] and TD3 [5], use dual critics mainly for continuous action spaces to reduce overestimation bias. However, they typically do not incorporate post-decision states, nor are they designed for environments with both deterministic and stochastic dynamics.
- Post-Decision Architectures in OR
Earlier work in operations research (e.g., Powell [2], Hull [6]) used post-decision states to manage the curse of dimensionality in approximate dynamic programming. PDPPO brings this insight into deep RL by using post-decision value functions directly in the learning process.
Each of these methods has its trade-offs. PDPPO stands out by directly tackling the challenge of stochastic transitions via decomposition and dual critics, making it particularly effective in noisy, real-world-like settings.
Citation
[1] Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-Critic Algorithms. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, Vol. 12. MIT Press.
[2] Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality (2nd ed.). John Wiley & Sons.
[3] Felizardo, L. K., Fadda, E., Nascimento, M. C. V., Brandimarte, P., & Del-Moral-Hernandez, E. (2024). A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks. arXiv preprint arXiv:2504.05150. https://arxiv.org/pdf/2504.05150
[4] Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[5] Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[6] Hull, I. (2015). Approximate Dynamic Programming with Post-Decision States as a Solution Method for Dynamic Economic Models. Journal of Economic Dynamics and Control, 55, 57β70.
[7] Queeney, J., Paschalidis, I. C., & Cassandras, C. G. (2021). Uncertainty-Aware Policy Optimization: A Robust, Adaptive Trust Region Approach. In Proceedings of the AAAI Conference on Artificial Intelligence, 35(9), 9377β9385.
[8] Zhang, J., Zhang, Z., Han, S., & LΓΌ, S. (2022). Proximal Policy Optimization via Enhanced Exploration Efficiency. Information Sciences, 609, 750β765.