
How PPO Actually Works


PPO is the algorithm doing most of the heavy lifting in modern RL right now, including the RLHF that fine-tunes large language models. The code looks almost too simple: collect some experience, run a few epochs of minibatch SGD on a clipped objective, repeat. That’s it.

But the form of that objective, the specific reason it works, doesn’t fall out of nowhere. It’s the end of a story that starts with vanilla policy gradients, hits a wall, gets rescued by some elegant theory (TRPO), and then gets pragmatically simplified into something an undergrad can implement (PPO).

This post tells that story without burying you in subscripts. Intuition first, math when it actually pays off.


Why RL is harder than the supervised learning you already know

In supervised learning, your dataset sits still. Compute a loss, take a gradient step, repeat. Easy.

In RL, the data moves. The agent’s policy decides what actions get taken, which decides what states get visited, which decides what data shows up in the next batch. Update the policy and you’ve changed the dataset. The whole optimization problem is non-stationary by design.

[Figure: the agent-environment loop. The agent's policy $\pi(a \mid s)$ emits an action $a_t$; the environment dynamics $P(s' \mid s, a)$ return the next state $s_{t+1}$ and a reward $r_t$.]

That single fact is the source of basically every difficulty in RL. Most clever algorithms are really about one question: how fast can we let the policy change without the data distribution running away from us?

The vanilla policy gradient, and why it falls over

The classic policy gradient says: increase the probability of actions that worked out better than expected, decrease the probability of actions that worked out worse.

$$
\nabla_\theta J(\pi_\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right].
$$

The advantage $A^\pi(s, a)$ is the key quantity: how much better was this action than my policy’s average action in this state? Positive means we want more of it, negative means less. The estimator is unbiased and any deep learning framework can compute it. See Sutton et al. (2000) for the original derivation.
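Concretely, as a loss a minimizer can work with. A minimal PyTorch sketch; `logits`, `actions`, and `advantages` are assumed to come from your own rollout code:

```python
import torch
from torch.distributions import Categorical

def vanilla_pg_loss(logits, actions, advantages):
    """REINFORCE-style loss: minimizing it ascends E[grad log pi(a|s) * A(s,a)].

    logits:     (batch, num_actions) policy network outputs for visited states
    actions:    (batch,) actions actually taken in the rollout
    advantages: (batch,) advantage estimates, treated as constants
    """
    log_probs = Categorical(logits=logits).log_prob(actions)
    return -(log_probs * advantages.detach()).mean()
```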

The gradient is unbiased, easy to compute, and a great idea. It also doesn’t work very well in practice. Here’s why.

The gradient lives in parameter space. What we care about is policy space. Two networks with nearly identical weights can produce drastically different action distributions if the softmax is anywhere near saturation. So a “small” step in $\theta$ can be a huge leap in $\pi_\theta$. The policy starts visiting states the value estimator has never seen, the advantage estimates become noise, and the next gradient step makes things worse. This is the failure mode anyone who has tried to train REINFORCE has seen firsthand.
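A toy illustration of that mismatch, with numbers chosen only to make the point: one weight feeding a large input, so a parameter step of 0.05 takes the policy from a coin flip to nearly deterministic.

```python
import torch
import torch.nn.functional as F

# Two-action policy with logits = [w * x, 0]. The input feature is large,
# so tiny moves in the weight are huge moves in the action distribution.
x = 100.0
for w in (0.00, 0.05):
    probs = F.softmax(torch.tensor([w * x, 0.0]), dim=0)
    print(f"w = {w:.2f}  ->  pi = {[round(p, 4) for p in probs.tolist()]}")
# w = 0.00  ->  pi = [0.5, 0.5]
# w = 0.05  ->  pi = [0.9933, 0.0067]
```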

The fix, conceptually, is to control how far the policy moves as a distribution, not just in parameter space. The natural gradient of Kakade (2001) was the first serious attempt; it shows up again later inside TRPO.

A surrogate we can actually optimize

Call the current policy $\pi$ and the candidate new one $\tilde\pi$. There’s a beautiful identity called the performance difference lemma, due to Kakade and Langford (2002):

$$
J(\tilde\pi) - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\tilde\pi},\, a \sim \tilde\pi}\left[A^\pi(s, a)\right].
$$

This is exact. The improvement from $\pi$ to $\tilde\pi$ equals the expected advantage of the new policy under the new state distribution.

The catch: we don’t have rollouts from $\tilde\pi$. We have rollouts from $\pi$. So we cheat. We swap in the old state distribution and use importance sampling on actions:

$$
L_\pi(\tilde\pi) = J(\pi) + \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^\pi,\, a \sim \pi}\left[\frac{\tilde\pi(a \mid s)}{\pi(a \mid s)}\, A^\pi(s, a)\right].
$$

Everything in there can be computed from data we already have.
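A sketch of that surrogate as code, assuming `old_log_probs` were stored when the rollout was collected under $\pi$:

```python
import torch

def surrogate(new_log_probs, old_log_probs, advantages):
    """Importance-sampled surrogate (dropping the constant J(pi) term).

    new_log_probs: log pi~(a|s) under the candidate policy (carries gradients)
    old_log_probs: log pi(a|s) recorded at rollout time (no gradients)
    advantages:    A^pi(s, a) estimated from the same rollouts
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return (ratio * advantages.detach()).mean()
```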

This $L_\pi$ matches the true objective $J$ in value and slope when $\tilde\pi = \pi$, but it diverges as the policies separate. So it’s a good local approximation that gets less reliable the farther you push.

[Figure: the surrogate $L_\pi$ and the true objective $J$ plotted against the new policy $\tilde\pi$'s distance from $\pi_{\text{old}}$; the two curves track each other inside a shaded trust region and separate outside it.]

Inside the shaded neighborhood, climbing the surrogate also climbs $J$. Outside, the surrogate becomes optimistic and you can wander off a cliff. The whole question of safe policy improvement boils down to: how far is too far?

The bound that justifies everything downstream

Schulman et al. (2015), building on Kakade and Langford, showed in the TRPO paper:

$$
J(\tilde\pi) \geq L_\pi(\tilde\pi) - C \cdot D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi).
$$

Plain English: the surrogate, minus a penalty proportional to how much the new policy disagrees with the old one, is a lower bound on the true objective.

If you maximize that penalized surrogate at every step, you’re guaranteed to monotonically improve the real objective. No oscillations, no collapse.
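The argument is two inequalities. Write $M_i(\tilde\pi) = L_{\pi_i}(\tilde\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi_i, \tilde\pi)$ for the penalized surrogate at iteration $i$, and let $\pi_{i+1}$ maximize it. Then

$$
J(\pi_{i+1}) \;\geq\; M_i(\pi_{i+1}) \;\geq\; M_i(\pi_i) \;=\; J(\pi_i).
$$

The first step is the bound above; the second holds because $\pi_{i+1}$ maximizes $M_i$; the last holds because the KL penalty vanishes at $\tilde\pi = \pi_i$ and $L_{\pi_i}(\pi_i) = J(\pi_i)$.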

In theory. The constant $C$ is enormous, the resulting steps are microscopic, and nobody runs the algorithm exactly as stated. But the shape of the recipe is right: maximize $L_\pi$ subject to a constraint on KL divergence between old and new.

That recipe, with sane constants, is TRPO.

TRPO: smart but heavy

TRPO solves:

$$
\max_\theta\, \mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A^{\pi_{\text{old}}}(s, a)\right] \quad \text{s.t.} \quad \mathbb{E}\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\right)\right] \leq \delta.
$$

To make this tractable, it linearizes the objective and takes a quadratic approximation of the KL constraint. The optimal step ends up being in the direction $F^{-1} g$, where $g$ is the policy gradient and $F$ is the Fisher information matrix of the policy. That direction has a name: the natural gradient. It’s the policy gradient corrected for the geometry of distribution space. The full step is

$$
\theta - \theta_{\text{old}} = \sqrt{\frac{2\delta}{g^\top F^{-1} g}}\, F^{-1} g,
$$

i.e. the largest natural-gradient step that respects the (linearized) trust region.

In practice TRPO never builds $F$ explicitly. It uses conjugate gradients with Hessian-vector products to compute $F^{-1} g$, then runs a backtracking line search to confirm the actual KL stays under $\delta$ and the surrogate genuinely improved.
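A sketch of the two pieces that make this affordable, with illustrative names rather than any particular library's API: a Fisher-vector product computed by double backprop through the mean KL, and plain conjugate gradients that only ever call it.

```python
import torch

def fisher_vector_product(mean_kl_fn, params, v, damping=0.1):
    """Compute (F + damping * I) @ v without ever materializing F.

    mean_kl_fn: closure returning mean KL(pi_old || pi_theta) as a scalar tensor
    params:     list of policy parameters (the theta being optimized)
    v:          flat vector with the same total number of elements as params
    """
    kl = mean_kl_fn()
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hvp = torch.autograd.grad((flat_grad * v).sum(), params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v  # damping keeps CG well conditioned

def conjugate_gradient(mvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b given only the product mvp(v) = F @ v."""
    x = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rs_old = r @ r
    for _ in range(iters):
        Ap = mvp(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x  # approximately F^{-1} b
```

The step direction is then `conjugate_gradient(lambda v: fisher_vector_product(kl_fn, params, v), g)`, rescaled by $\sqrt{2\delta / g^\top F^{-1} g}$ and vetted by the backtracking line search described above.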

It works. On continuous control benchmarks it gave clean, stable learning curves. But it has real costs. The conjugate gradient inner loop and the line search make every update slow. Hessian-vector products against the policy KL don’t compose well with shared policy / value networks, since the Fisher matrix only sees the policy. And it’s awkward with dropout, weight sharing, and most modern deep learning conveniences.

So the natural question: can we keep the spirit of the trust region while throwing out all the second-order machinery?

PPO: the lazy clever version

Yes. The trick is to enforce the trust region not through a constrained optimization, but through the shape of the loss function itself. This is what Schulman et al. (2017) did with PPO.

Define the probability ratio:

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}.
$$

So $r_t = 1$ at the start of every update. The TRPO-style surrogate is just $\mathbb{E}_t[r_t \hat A_t]$, where $\hat A_t$ is your favorite advantage estimate (in practice, GAE).

PPO’s clipped objective is:

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\!\left(r_t(\theta)\, \hat A_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat A_t\right)\right].
$$

Two terms inside the $\min$. The unclipped surrogate. And a clipped version where $r_t$ is pinned to $[1 - \epsilon,\, 1 + \epsilon]$. We take whichever is smaller, which makes the whole thing a pessimistic version of the unclipped objective.
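In code, that is a few lines. A sketch, negated because optimizers minimize, with $\epsilon$ as the clip range (0.2 in the original paper):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative clipped surrogate L^CLIP: minimize this to maximize the objective."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()                   # pessimistic min
```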

What does that pessimism do? Walk through it:

[Figure: $L^{\text{CLIP}}_t$ as a function of $r_t(\theta)$ for $\hat A_t > 0$ and $\hat A_t < 0$; in each case the curve goes flat once $r_t$ crosses $1 + \epsilon$ or $1 - \epsilon$ in the direction the clip guards against.]

Good action ($\hat A_t > 0$). The gradient wants to push $r_t$ up, making the action more likely. Once $r_t$ goes past $1 + \epsilon$, the clip caps the surrogate. The curve goes flat. The gradient zeros out. We’ve moved enough on this sample, stop pushing.

Bad action ($\hat A_t < 0$). The gradient wants to push $r_t$ down. Once $r_t < 1 - \epsilon$, same story. Clip kicks in, gradient dies, we stop.

The clip is asymmetric on purpose. If a good action got less likely under the new policy ($\hat A_t > 0$ but $r_t < 1$), the clip does not kick in. We want to undo that mistake. The gradient flows freely back toward $\pi_{\text{old}}$.

So the clip is one-sided: it kills gradient only in the direction that would push the policy further from $\pi_{\text{old}}$, and only when we’d be over-exploiting the surrogate. Pulling back toward the old policy is always allowed.

This is a per-sample, soft trust region. It’s much cheaper than TRPO’s conjugate gradient solve, and it survives multiple epochs of SGD on the same batch, which is where PPO gets its sample efficiency.

A useful sanity check: at $\theta = \theta_{\text{old}}$, $r_t = 1$ sits in the interior of the clip range, so the gradient of $L^{\text{CLIP}}$ is exactly the policy gradient. PPO and vanilla PG agree on infinitesimal steps. They differ only on finite ones, which is the whole point.

What you actually run

In practice the full PPO loss combines the clipped surrogate, a value function loss, and an entropy bonus:

$$
L_t^{\text{PPO}}(\theta) = L_t^{\text{CLIP}}(\theta) - c_1 \left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2 + c_2\, S\!\left[\pi_\theta\right](s_t).
$$
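As code, on top of the clipped loss above. A sketch; the coefficient values shown are common defaults, not canonical:

```python
import torch

def ppo_total_loss(new_log_probs, old_log_probs, advantages,
                   values, value_targets, entropy,
                   clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate + value regression - entropy bonus, signs flipped for a minimizer."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    policy_loss = -torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    ).mean()
    value_loss = (values - value_targets.detach()).pow(2).mean()
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```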

The loop is short:

  1. Roll out $\pi_{\theta_{\text{old}}}$ for $T$ steps across $N$ parallel actors.
  2. Compute advantages with GAE.
  3. Run $K$ epochs of minibatch SGD on $L^{\text{PPO}}$.
  4. Set $\theta_{\text{old}} \leftarrow \theta$ and go back to step 1.

The reason you can run multiple epochs on the same batch without blowing up the policy is exactly the clip: the trust region is enforced at the per-sample level, not at the optimization-step level.
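For completeness, a sketch of step 2 and the inner loop of step 3. Everything named here (`buffer`, `optimizer`, the minibatch interface) is a stand-in, not a real API:

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, dones: (T,) tensors; values: (T+1,) including a bootstrap value.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

# Step 3, schematically: K epochs over the same rollout, minibatch by minibatch.
# for _ in range(K):
#     for batch in buffer.minibatches():
#         loss = ppo_total_loss(**batch)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```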

Why PPO won

No theorem says PPO improves monotonically. The clipped surrogate is a smart heuristic, not a proven lower bound. But empirically it matched or beat TRPO across the standard benchmarks, and it fits in a few hundred lines of code.

The full pitch: first-order optimizer, no Fisher matrix, plays nicely with everything in the modern deep learning stack, reasonably robust to hyperparameter choice. That combination is why PPO became the default for everything from MuJoCo to RLHF.

One honest caveat. The paper What Matters in On-Policy Reinforcement Learning? (Andrychowicz et al., 2020) ablates PPO carefully and finds a lot of its reputation comes from small implementation choices: advantage normalization, value clipping, learning rate annealing, orthogonal initialization. The clipped surrogate is the headline, but the supporting details matter more than people usually admit. Worth reading before you trust your own ablations.

References