PPO is the algorithm doing most of the heavy lifting in modern RL right now, including the RLHF that fine-tunes large language models. The code looks almost too simple: collect some experience, run a few epochs of minibatch SGD on a clipped objective, repeat. That’s it.
But the form of that objective, the specific reason it works, doesn’t fall out of nowhere. It’s the end of a story that starts with vanilla policy gradients, hits a wall, gets rescued by some elegant theory (TRPO), and then gets pragmatically simplified into something an undergrad can implement (PPO).
This post tells that story without burying you in subscripts. Intuition first, math when it actually pays off.
Why RL is harder than the supervised learning you already know
In supervised learning, your dataset sits still. Compute a loss, take a gradient step, repeat. Easy.
In RL, the data moves. The agent’s policy decides what actions get taken, which decides what states get visited, which decides what data shows up in the next batch. Update the policy and you’ve changed the dataset. The whole optimization problem is non-stationary by design.
That single fact is the source of basically every difficulty in RL. Most clever algorithms are really about one question: how fast can we let the policy change without the data distribution running away from us?
The vanilla policy gradient, and why it falls over
The classic policy gradient says: increase the probability of actions that worked out better than expected, decrease the probability of actions that worked out worse.
The advantage is the key quantity: how much better was this action than my policy’s average action in this state? Positive means we want more of it, negative means less. The estimator is unbiased and any deep learning framework can compute it. See Sutton et al. (2000) for the original derivation.
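The estimator is easy to state concretely. Here's a minimal numpy sketch (all names and numbers are illustrative, no real environment) of the per-sample gradient for a softmax policy over discrete actions: push up the log-probability of the taken action, scaled by its advantage.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(logits, action):
    # For a softmax policy, d log pi(action) / d logits = onehot(action) - pi.
    g = -softmax(logits)
    g[action] += 1.0
    return g

# One-sample policy gradient estimate: advantage-weighted score function.
logits = np.array([0.5, -0.2, 0.1])
action, advantage = 0, 2.0
grad = advantage * grad_log_pi(logits, action)
# Positive advantage: the gradient raises the taken action's logit
# and lowers the others.
```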
The gradient is unbiased, easy to compute, and a great idea. It also doesn’t work very well in practice. Here’s why.
The gradient lives in parameter space. What we care about is policy space. Two networks with nearly identical weights can produce drastically different action distributions if the softmax is anywhere near saturation. So a “small” step in $\theta$ can be a huge leap in $\pi_\theta$. The policy starts visiting states the value estimator has never seen, the advantage estimates become noise, and the next gradient step makes things worse. This is the failure mode anyone who has tried to train REINFORCE has seen firsthand.
The fix, conceptually, is to control how far the policy moves as a distribution, not just in parameter space. The natural gradient of Kakade (2001) was the first serious attempt; it shows up again later inside TRPO.
A surrogate we can actually optimize
Call the current policy $\pi_{\text{old}}$ and the candidate new one $\pi$. There’s a beautiful identity called the performance difference lemma, due to Kakade and Langford (2002):

$$J(\pi) - J(\pi_{\text{old}}) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi_{\text{old}}}(s_t, a_t)\right]$$

This is exact. The improvement from $\pi_{\text{old}}$ to $\pi$ equals the old policy’s advantage function, averaged over the states and actions the new policy visits.
The catch: we don’t have rollouts from $\pi$. We have rollouts from $\pi_{\text{old}}$. So we cheat. We swap in the old state distribution and use importance sampling on actions:

$$L_{\pi_{\text{old}}}(\pi) = J(\pi_{\text{old}}) + \mathbb{E}_{s \sim d^{\pi_{\text{old}}},\, a \sim \pi_{\text{old}}}\!\left[\frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}\, A^{\pi_{\text{old}}}(s, a)\right]$$
Everything in there can be computed from data we already have.
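As a toy numerical illustration (the rollout numbers below are made-up placeholders), the Monte Carlo estimate of the surrogate is just a mean of ratio-weighted advantages:

```python
import numpy as np

# Pretend rollout data collected under pi_old (illustrative values).
logp_old = np.array([-1.2, -0.7, -2.1, -0.4])   # log pi_old(a|s) per sample
logp_new = np.array([-1.0, -0.9, -2.0, -0.5])   # log pi(a|s) under candidate
advantages = np.array([0.5, -0.3, 1.2, -0.1])   # A_old(s, a) per sample

ratios = np.exp(logp_new - logp_old)            # importance weights pi / pi_old
surrogate = np.mean(ratios * advantages)        # sample estimate of L(pi)
```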
This matches the true objective in value and slope when $\pi = \pi_{\text{old}}$, but it diverges as the policies separate. So it’s a good local approximation that gets less reliable the farther you push.
Inside a small enough neighborhood of $\pi_{\text{old}}$, climbing the surrogate also climbs $J$. Outside, the surrogate becomes optimistic and you can wander off a cliff. The whole question of safe policy improvement boils down to: how far is too far?
The bound that justifies everything downstream
Schulman et al., building on Kakade and Langford, showed in the TRPO paper:

$$J(\pi) \;\ge\; L_{\pi_{\text{old}}}(\pi) \;-\; C \max_s D_{\mathrm{KL}}\!\big(\pi_{\text{old}}(\cdot \mid s)\,\big\|\,\pi(\cdot \mid s)\big), \qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^2},\; \epsilon = \max_{s,a}\big|A^{\pi_{\text{old}}}(s,a)\big|$$
Plain English: the surrogate, minus a penalty proportional to how much the new policy disagrees with the old one, is a lower bound on the true objective.
If you maximize that penalized surrogate at every step, you’re guaranteed to monotonically improve the real objective. No oscillations, no collapse.
In theory. The constant $C$ is enormous, the resulting steps are microscopic, and nobody runs the algorithm exactly as stated. But the shape of the recipe is right: maximize $L_{\pi_{\text{old}}}(\pi)$ subject to a constraint on the KL divergence between old and new.
That recipe, with sane constants, is TRPO.
TRPO: smart but heavy
TRPO solves:

$$\max_{\theta}\; L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad \mathbb{E}_{s \sim d^{\pi_{\theta_{\text{old}}}}}\!\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big)\big] \le \delta$$
To make this tractable, it linearizes the objective and takes a quadratic approximation of the KL constraint. The optimal step ends up being in the direction $F^{-1} g$, where $g$ is the policy gradient and $F$ is the Fisher information matrix of the policy. That direction has a name: the natural gradient. It’s the policy gradient corrected for the geometry of distribution space. The full step is

$$\theta_{\text{new}} = \theta_{\text{old}} + \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\; F^{-1} g,$$

i.e. the largest natural-gradient step that respects the (linearized) trust region.
In practice TRPO never builds $F$ explicitly. It uses conjugate gradients with Hessian-vector products to compute $F^{-1} g$, then runs a backtracking line search to confirm the actual KL stays under $\delta$ and the surrogate genuinely improved.
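The conjugate gradient piece is small enough to sketch. Below, $F$ is a toy explicit SPD matrix standing in for the Fisher matrix; real TRPO only ever supplies the matrix-vector product $v \mapsto Fv$ through a Hessian-vector product against the KL. Everything here is an illustration, not TRPO’s actual implementation:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    # Solve F x = g using only Fisher-vector products fvp(v) = F @ v.
    x = np.zeros_like(g)
    r = g.copy()               # residual g - F x (with x = 0)
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

F = np.array([[2.0, 0.3], [0.3, 1.0]])      # toy SPD "Fisher" matrix
g = np.array([1.0, -1.0])                   # toy policy gradient
x = conjugate_gradient(lambda v: F @ v, g)  # x approximates F^{-1} g
```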
It works. On continuous control benchmarks it gave clean, stable learning curves. But it has real costs. The conjugate gradient inner loop and the line search make every update slow. Hessian-vector products against the policy KL don’t compose well with shared policy / value networks, since the Fisher matrix only sees the policy. And it’s awkward with dropout, weight sharing, and most modern deep learning conveniences.
So the natural question: can we keep the spirit of the trust region while throwing out all the second-order machinery?
PPO: the lazy clever version
Yes. The trick is to enforce the trust region not through a constrained optimization, but through the shape of the loss function itself. This is what Schulman et al. (2017) did with PPO.
Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

So $r_t(\theta_{\text{old}}) = 1$ at the start of every update. The TRPO-style surrogate is just $\mathbb{E}_t\big[r_t(\theta)\, \hat{A}_t\big]$, where $\hat{A}_t$ is your favorite advantage estimate (in practice, GAE).
PPO’s clipped objective is:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;\mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big]$$

Two terms inside the $\min$. The unclipped surrogate $r_t(\theta)\,\hat{A}_t$. And a clipped version where $r_t$ is pinned to $[1-\epsilon,\, 1+\epsilon]$. We take whichever is smaller, which makes the whole thing a pessimistic version of the unclipped objective.
What does that pessimism do? Walk through it:
Good action ($\hat{A}_t > 0$). The gradient wants to push $r_t$ up, making the action more likely. Once $r_t$ goes past $1 + \epsilon$, the clip caps the surrogate. The curve goes flat. The gradient zeros out. We’ve moved enough on this sample, stop pushing.
Bad action ($\hat{A}_t < 0$). The gradient wants to push $r_t$ down. Once $r_t < 1 - \epsilon$, same story. Clip kicks in, gradient dies, we stop.
The clip is asymmetric on purpose. If a good action got less likely under the new policy ($\hat{A}_t > 0$ but $r_t < 1$), the clip does not kick in. We want to undo that mistake. The gradient flows freely back toward $r_t = 1$.
So the clip is one-sided: it kills gradient only in the direction that would push the policy further from $\pi_{\text{old}}$, and only when we’d be over-exploiting the surrogate. Pulling back toward the old policy is always allowed.
This is a per-sample, soft trust region. It’s much cheaper than TRPO’s conjugate gradient solve, and it survives multiple epochs of SGD on the same batch, which is where PPO gets its sample efficiency.
A useful sanity check: at $\theta = \theta_{\text{old}}$, $r_t = 1$ sits in the interior of the clip range, so the gradient of $L^{\mathrm{CLIP}}$ is exactly the policy gradient. PPO and vanilla PG agree on infinitesimal steps. They differ only on finite ones, which is the whole point.
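The per-sample behavior is easy to check numerically. A minimal sketch of the clipped term (function name and test values are illustrative):

```python
import numpy as np

def clip_term(ratio, adv, eps=0.2):
    # Pessimistic min of the unclipped and clipped surrogates.
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

good_overshoot = clip_term(1.5, 1.0)    # capped at (1+eps)*adv = 1.2, flat in ratio
good_undershoot = clip_term(0.7, 1.0)   # unclipped 0.7 wins, gradient still flows
bad_overshoot = clip_term(0.5, -1.0)    # pessimistic -0.8, flat in ratio
```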
What you actually run
In practice the full PPO loss combines the clipped surrogate, a value function loss, and an entropy bonus:

$$L(\theta) = \mathbb{E}_t\Big[L^{\mathrm{CLIP}}_t(\theta) \;-\; c_1 \big(V_\theta(s_t) - V^{\text{targ}}_t\big)^2 \;+\; c_2\, \mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big]\Big]$$
The loop is short:
- Roll out the current policy for $T$ steps across $N$ parallel actors.
- Compute advantages with GAE.
- Run $K$ epochs of minibatch SGD on $L(\theta)$.
- Set $\theta_{\text{old}} \leftarrow \theta$ and go back to step 1.
The reason you can run multiple epochs on the same batch without blowing up the policy is exactly the clip: the trust region is enforced at the per-sample level, not at the optimization-step level.
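One way to see the three loss terms together is a toy numpy sketch. The coefficients, function name, and data below are illustrative placeholders, not a reference implementation:

```python
import numpy as np

def ppo_objective(ratio, adv, v_pred, v_targ, probs, eps=0.2, c1=0.5, c2=0.01):
    # Clipped surrogate (maximize), squared value error (minimize),
    # entropy bonus (maximize); returns the scalar objective to ascend.
    clip_term = np.minimum(ratio * adv,
                           np.clip(ratio, 1 - eps, 1 + eps) * adv)
    value_loss = (v_pred - v_targ) ** 2
    entropy = -np.sum(probs * np.log(probs), axis=-1)
    return np.mean(clip_term - c1 * value_loss + c2 * entropy)

# Two toy samples with a 3-action policy.
ratio = np.array([1.1, 0.9])
adv = np.array([0.5, -0.2])
v_pred = np.array([1.0, 0.3])
v_targ = np.array([0.8, 0.5])
probs = np.array([[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]])
obj = ppo_objective(ratio, adv, v_pred, v_targ, probs)
# In training you'd minimize -obj with minibatch SGD or Adam.
```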
Why PPO won
No theorem says PPO improves monotonically. The clipped surrogate is a smart heuristic, not a proven lower bound. But empirically it matched or beat TRPO across the standard benchmarks, and it fits in a few hundred lines of code.
The full pitch: first-order optimizer, no Fisher matrix, plays nicely with everything in the modern deep learning stack, reasonably robust to hyperparameter choice. That combination is why PPO became the default for everything from MuJoCo to RLHF.
One honest caveat. The paper What Matters in On-Policy Reinforcement Learning? (Andrychowicz et al., 2020) ablates PPO carefully and finds a lot of its reputation comes from small implementation choices: advantage normalization, value clipping, learning rate annealing, orthogonal initialization. The clipped surrogate is the headline, but the supporting details matter more than people usually admit. Worth reading before you trust your own ablations.
References
- Sutton, McAllester, Singh, Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. NeurIPS 1999.
- Kakade. A Natural Policy Gradient. NeurIPS 2001.
- Kakade and Langford. Approximately Optimal Approximate Reinforcement Learning. ICML 2002.
- Schulman, Levine, Moritz, Jordan, Abbeel. Trust Region Policy Optimization. ICML 2015. arXiv:1502.05477
- Schulman, Moritz, Levine, Jordan, Abbeel. High-Dimensional Continuous Control Using Generalized Advantage Estimation. ICLR 2016. arXiv:1506.02438
- Schulman, Wolski, Dhariwal, Radford, Klimov. Proximal Policy Optimization Algorithms. 2017. arXiv:1707.06347
- Andrychowicz et al. What Matters in On-Policy Reinforcement Learning? A Large-Scale Empirical Study. 2020. arXiv:2006.05990