
How PPO Actually Works


PPO is the algorithm doing most of the heavy lifting in modern RL right now, including the RLHF that fine-tunes large language models. The code looks almost too simple: collect some experience, run a few epochs of minibatch SGD on a clipped objective, repeat. That’s it.

But the form of that objective, the specific reason it works, doesn’t fall out of nowhere. It’s the end of a story that starts with vanilla policy gradients, hits a wall, gets rescued by some elegant theory (TRPO), and then gets pragmatically simplified into something an undergrad can implement (PPO).

This post tells that story without burying you in subscripts. Intuition first, math when it actually pays off.


Why RL is harder than the supervised learning you already know

In supervised learning, your dataset sits still. Compute a loss, take a gradient step, repeat. Easy.

In RL, the data moves. The agent’s policy decides what actions get taken, which decides what states get visited, which decides what data shows up in the next batch. Update the policy and you’ve changed the dataset. The whole optimization problem is non-stationary by design.

[Figure: the agent-environment loop. The agent's policy $\pi(a \mid s)$ emits an action $a_t$; the environment dynamics $P(s' \mid s, a)$ return the next state $s_{t+1}$ and a reward $r_t$.]

That single fact is the source of basically every difficulty in RL. Most clever algorithms are really about one question: how fast can we let the policy change without the data distribution running away from us?

The vanilla policy gradient, and why it falls over

The classic policy gradient says: increase the probability of actions that worked out better than expected, decrease the probability of actions that worked out worse.

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right].
$$

The advantage $A^\pi(s, a)$ is the key quantity: how much better was this action than my policy’s average action in this state? Positive means we want more of it, negative means less. The estimator is unbiased and any deep learning framework can compute it. See Sutton et al. (2000) for the original derivation.
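Concretely, as a loss a minimizer can work with. A minimal PyTorch sketch; `logits`, `actions`, and `advantages` are assumed to come from your own rollout code:

```python
import torch
from torch.distributions import Categorical

def vanilla_pg_loss(logits, actions, advantages):
    """REINFORCE-style loss: minimizing it ascends E[grad log pi(a|s) * A(s,a)].

    logits:     (batch, num_actions) policy network outputs for visited states
    actions:    (batch,) actions actually taken in the rollout
    advantages: (batch,) advantage estimates, treated as constants
    """
    log_probs = Categorical(logits=logits).log_prob(actions)
    return -(log_probs * advantages.detach()).mean()
```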

The gradient is unbiased, easy to compute, and a great idea. It also doesn’t work very well in practice. Here’s why.

The gradient lives in parameter space. What we care about is policy space. Two networks with nearly identical weights can produce drastically different action distributions if the softmax is anywhere near saturation. So a “small” step in $\theta$ can be a huge leap in $\pi_\theta$. The policy starts visiting states the value estimator has never seen, the advantage estimates become noise, and the next gradient step makes things worse. This is the failure mode anyone who has tried to train REINFORCE has seen firsthand.
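A toy illustration of that mismatch, with numbers chosen only to make the point: one weight feeding a large input, so a parameter step of 0.05 takes the policy from a coin flip to nearly deterministic.

```python
import torch
import torch.nn.functional as F

# Two-action policy with logits = [w * x, 0]. The input feature is large,
# so tiny moves in the weight are huge moves in the action distribution.
x = 100.0
for w in (0.00, 0.05):
    probs = F.softmax(torch.tensor([w * x, 0.0]), dim=0)
    print(f"w = {w:.2f}  ->  pi = {[round(p, 4) for p in probs.tolist()]}")
# w = 0.00  ->  pi = [0.5, 0.5]
# w = 0.05  ->  pi = [0.9933, 0.0067]
```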

The fix, conceptually, is to control how far the policy moves as a distribution, not just in parameter space. The natural gradient of Kakade (2001) was the first serious attempt; it shows up again later inside TRPO.

A surrogate we can actually optimize

Call the current policy $\pi$ and the candidate new one $\tilde\pi$. There’s a beautiful identity called the performance difference lemma, due to Kakade and Langford (2002):

$$
J(\tilde\pi) - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\tilde\pi},\, a \sim \tilde\pi}\left[A^\pi(s, a)\right].
$$

This is exact. The improvement from $\pi$ to $\tilde\pi$ equals the expected advantage of the new policy under the new state distribution.

The catch: we don’t have rollouts from $\tilde\pi$. We have rollouts from $\pi$. So we cheat. We swap in the old state distribution and use importance sampling on actions:

$$
L_\pi(\tilde\pi) = J(\pi) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi,\, a \sim \pi}\left[\frac{\tilde\pi(a \mid s)}{\pi(a \mid s)}\, A^\pi(s, a)\right].
$$

Everything in there can be computed from data we already have.
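A sketch of that surrogate as code, assuming `old_log_probs` were stored when the rollout was collected under $\pi$:

```python
import torch

def surrogate(new_log_probs, old_log_probs, advantages):
    """Importance-sampled surrogate (dropping the constant J(pi) term).

    new_log_probs: log pi~(a|s) under the candidate policy (carries gradients)
    old_log_probs: log pi(a|s) recorded at rollout time (no gradients)
    advantages:    A^pi(s, a) estimated from the same rollouts
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return (ratio * advantages.detach()).mean()
```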

This $L_\pi$ matches the true objective $J$ in value and slope when $\tilde\pi = \pi$, but it diverges as the policies separate. So it’s a good local approximation that gets less reliable the farther you push.

[Figure: the surrogate $L_\pi$ and the true objective $J$ plotted against the new policy $\tilde\pi$'s distance from $\pi_{\text{old}}$; the two curves track each other inside a shaded trust region and separate outside it.]

Inside the shaded neighborhood, climbing the surrogate also climbs $J$. Outside, the surrogate becomes optimistic and you can wander off a cliff. The whole question of safe policy improvement boils down to: how far is too far?

The bound that justifies everything downstream

Schulman et al. (2015), building on Kakade and Langford, showed in the TRPO paper:

$$
J(\tilde\pi) \geq L_\pi(\tilde\pi) - C \cdot D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi).
$$

Plain English: the surrogate, minus a penalty proportional to how much the new policy disagrees with the old one, is a lower bound on the true objective.

If you maximize that penalized surrogate at every step, you’re guaranteed to monotonically improve the real objective. No oscillations, no collapse.
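The argument is two inequalities. Write $M_i(\tilde\pi) = L_{\pi_i}(\tilde\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi_i, \tilde\pi)$ for the penalized surrogate at iteration $i$, and let $\pi_{i+1}$ maximize it. Then

$$
J(\pi_{i+1}) \;\geq\; M_i(\pi_{i+1}) \;\geq\; M_i(\pi_i) \;=\; J(\pi_i).
$$

The first step is the bound above; the second holds because $\pi_{i+1}$ maximizes $M_i$; the last holds because the KL penalty vanishes at $\tilde\pi = \pi_i$ and $L_{\pi_i}(\pi_i) = J(\pi_i)$.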

In theory. The constant $C$ is enormous, the resulting steps are microscopic, and nobody runs the algorithm exactly as stated. But the shape of the recipe is right: maximize $L_\pi$ subject to a constraint on KL divergence between old and new.

That recipe, with sane constants, is TRPO.

TRPO: smart but heavy

TRPO solves:

$$
\max_\theta\, \mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\text{old}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\right)\right] \leq \delta.
$$

To make this tractable, it linearizes the objective and takes a quadratic approximation of the KL constraint. The optimal step ends up being in the direction $F^{-1} g$, where $g$ is the policy gradient and $F$ is the Fisher information matrix of the policy. That direction has a name: the natural gradient. It’s the policy gradient corrected for the geometry of distribution space. The full step is

$$
\theta - \theta_{\text{old}} = \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\, F^{-1} g,
$$

i.e. the largest natural-gradient step that respects the (linearized) trust region.

In practice TRPO never builds $F$ explicitly. It uses conjugate gradients with Hessian-vector products to compute $F^{-1} g$, then runs a backtracking line search to confirm the actual KL stays under $\delta$ and the surrogate genuinely improved.
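A sketch of the two pieces that make this affordable, with illustrative names rather than any particular library's API: a Fisher-vector product computed by double backprop through the mean KL, and plain conjugate gradients that only ever call it.

```python
import torch

def fisher_vector_product(mean_kl_fn, params, v, damping=0.1):
    """Compute (F + damping * I) @ v without ever materializing F.

    mean_kl_fn: closure returning mean KL(pi_old || pi_theta) as a scalar tensor
    params:     list of policy parameters (the theta being optimized)
    v:          flat vector with the same total number of elements as params
    """
    kl = mean_kl_fn()
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hvp = torch.autograd.grad((flat_grad * v).sum(), params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v  # damping keeps CG well conditioned

def conjugate_gradient(mvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b given only the product mvp(v) = F @ v."""
    x = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rs_old = r @ r
    for _ in range(iters):
        Ap = mvp(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x  # approximately F^{-1} b
```

The step direction is then `conjugate_gradient(lambda v: fisher_vector_product(kl_fn, params, v), g)`, rescaled by $\sqrt{2\delta / g^\top F^{-1} g}$ and vetted by the backtracking line search described above.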

It works. On continuous control benchmarks it gave clean, stable learning curves. But it has real costs. The conjugate gradient inner loop and the line search make every update slow. Hessian-vector products against the policy KL don’t compose well with shared policy / value networks, since the Fisher matrix only sees the policy. And it’s awkward with dropout, weight sharing, and most modern deep learning conveniences.

So the natural question: can we keep the spirit of the trust region while throwing out all the second-order machinery?

PPO: the lazy clever version

Yes. The trick is to enforce the trust region not through a constrained optimization, but through the shape of the loss function itself. This is what Schulman et al. (2017) did with PPO.

Define the probability ratio:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
$$

So $r_t = 1$ at the start of every update. The TRPO-style surrogate is just $\mathbb{E}_t[r_t \hat A_t]$, where $\hat A_t$ is your favorite advantage estimate (in practice, GAE).

PPO’s clipped objective is:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\!\left(r_t(\theta)\, \hat A_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat A_t\right)\right].
$$

Two terms inside the $\min$. The unclipped surrogate. And a clipped version where $r_t$ is pinned to $[1 - \epsilon,\, 1 + \epsilon]$. We take whichever is smaller, which makes the whole thing a pessimistic version of the unclipped objective.
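In code, that is a few lines. A sketch, negated because optimizers minimize, with $\epsilon$ as the clip range (0.2 in the original paper):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate L^CLIP: minimize this to maximize the objective."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                   # pessimistic min
```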

What does that pessimism do? Walk through it:

[Figure: $L^{\text{CLIP}}_t$ as a function of $r_t(\theta)$ for $\hat A_t > 0$ and $\hat A_t < 0$; in each case the curve goes flat once $r_t$ crosses $1 + \epsilon$ or $1 - \epsilon$ in the direction the clip guards against.]

Good action ($\hat A_t > 0$). The gradient wants to push $r_t$ up, making the action more likely. Once $r_t$ goes past $1 + \epsilon$, the clip caps the surrogate. The curve goes flat. The gradient zeros out. We’ve moved enough on this sample, stop pushing.

Bad action ($\hat A_t < 0$). The gradient wants to push $r_t$ down. Once $r_t < 1 - \epsilon$, same story. Clip kicks in, gradient dies, we stop.

The clip is asymmetric on purpose. If a good action got less likely under the new policy ($\hat A_t > 0$ but $r_t < 1$), the clip does not kick in. We want to undo that mistake. The gradient flows freely back toward $\pi_{\text{old}}$.

So the clip is one-sided: it kills gradient only in the direction that would push the policy further from $\pi_{\text{old}}$, and only when we’d be over-exploiting the surrogate. Pulling back toward the old policy is always allowed.

This is a per-sample, soft trust region. It’s much cheaper than TRPO’s conjugate gradient solve, and it survives multiple epochs of SGD on the same batch, which is where PPO gets its sample efficiency.

A useful sanity check: at $\theta = \theta_{\text{old}}$, $r_t = 1$ sits in the interior of the clip range, so the gradient of $L^{\text{CLIP}}$ is exactly the policy gradient. PPO and vanilla PG agree on infinitesimal steps. They differ only on finite ones, which is the whole point.

What you actually run

In practice the full PPO loss combines the clipped surrogate, a value function loss, and an entropy bonus:

$$
L_t^{\text{PPO}}(\theta) = L_t^{\text{CLIP}}(\theta) - c_1 \left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2 + c_2\, S\!\left[\pi_\theta\right](s_t).
$$
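As code, on top of the clipped loss above. A sketch; the coefficient values shown are common defaults, not canonical:

```python
import torch

def ppo_total_loss(new_log_probs, old_log_probs, advantages,
                   values, value_targets, entropy,
                   clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate + value regression - entropy bonus, signs flipped for a minimizer."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    policy_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    ).mean()
    value_loss = (values - value_targets.detach()).pow(2).mean()
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```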

The loop is short:

  1. Roll out $\pi_{\theta_{\text{old}}}$ for $T$ steps across $N$ parallel actors.
  2. Compute advantages with GAE.
  3. Run $K$ epochs of minibatch SGD on $L^{\text{PPO}}$.
  4. Set $\theta_{\text{old}} \leftarrow \theta$ and go back to step 1.

The reason you can run multiple epochs on the same batch without blowing up the policy is exactly the clip: the trust region is enforced at the per-sample level, not at the optimization-step level.
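For completeness, a sketch of step 2 and the inner loop of step 3. Everything named here (`buffer`, `optimizer`, the minibatch interface) is a stand-in, not a real API:

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, dones: (T,) tensors; values: (T+1,) including a bootstrap value.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

# Step 3, schematically: K epochs over the same rollout, minibatch by minibatch.
# for _ in range(K):
#     for batch in buffer.minibatches():
#         loss = ppo_total_loss(**batch)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```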

Why PPO won

No theorem says PPO improves monotonically. The clipped surrogate is a smart heuristic, not a proven lower bound. But empirically it matched or beat TRPO across the standard benchmarks, and it fits in a few hundred lines of code.

The full pitch: first-order optimizer, no Fisher matrix, plays nicely with everything in the modern deep learning stack, reasonably robust to hyperparameter choice. That combination is why PPO became the default for everything from MuJoCo to RLHF.

One honest caveat. The paper What Matters in On-Policy Reinforcement Learning? (Andrychowicz et al., 2020) ablates PPO carefully and finds a lot of its reputation comes from small implementation choices: advantage normalization, value clipping, learning rate annealing, orthogonal initialization. The clipped surrogate is the headline, but the supporting details matter more than people usually admit. Worth reading before you trust your own ablations.

References