Intro to RL: Off-Policy Methods


So far in this series on reinforcement learning, we’ve covered classical methods and the foundations of continuous-control methods.

Let’s say you want to use these methods at scale. Ideally, we’d have a method to run as many actors as we want, alongside some way to consolidate the results.

Basic knowledge of PPO tells us that we can do this as long as the actors are using a policy that isn’t “too far” from the latest policy πcurrent. But that restriction inherently caps how many actors we can run concurrently. If an update drifts πcurrent too far, then all of the other actors’ work becomes much less valuable.

This naturally leads us to the “off-policy” methods: DDPG, TD3, and SAC.

One nice thing about Q-learning was that, in theory, you could fill out the state-action table in parallel. Work was never really wasted, because every visit to an (s, a) pair contributed information toward the optimal value function.

Of course, the argument for convergence of Q-learning relied on the contraction property of the Bellman update operator. That argument is harder to carry over once we leave the tabular setting, because we have to approximate Q-values with something like a neural network (as in DQN), and we can no longer guarantee that the policy strictly improves the way we could with a table. In fact, there is no clean convergence proof for DQN.

Regardless, the question remains: is there a reasonable method that never “throws away” old samples and can use every (s, a) collected? Note that the reason we couldn’t use Q-learning in a continuous action space is that we couldn’t solve the max over a′ in this update:

\(Q(s, a) \leftarrow (1 - α)\, Q(s, a) + α\left(r + γ \max_{a’} Q(s’, a’)\right)\)
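For contrast, in a small discrete problem this update is trivial to implement, because the max is just a lookup over a row of the Q-table. A minimal sketch (the 16-state, 4-action sizes are arbitrary placeholders):

```python
import numpy as np

# Toy discrete MDP: the Q-function fits in a table.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    # The max over next actions is just a lookup over a table row.
    target = r + gamma * Q[s_next].max()
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
```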

What if we train a network to output the maximizing action, arg maxₐ′ Q(s′, a′), directly? Is it possible to build an algorithm around this?

The deterministic policy gradient theorem is a way to differentiate our objective function by integrating over states rather than actions. Let’s define a “deterministic policy” μ as a direct mapping from states to actions,

\(μ_θ : \mathcal{S} \to \mathcal{A},\)

and its objective:

\(J(θ) := 𝔼_{s∼d^{μ_θ}}[Q^{μ_θ}(s, μ_θ(s))]\)

Our goal is to compute ∇θJ(θ) so we can perform gradient ascent. Starting from:

\( J(θ) = 𝔼\left[\sum_t γ^t r(s_t, μ_θ(s_t))\right] = \sum_s \left(\sum_t γ^t \Pr[s_t=s | μ_θ]\right) r(s, μ_θ(s))\)

Direct differentiation is messy because Pr[st = s] depends on θ. We’d like to express this in terms of value functions instead, and eliminate the Pr[st = s] term.

Intuitively, this discounted, probability-weighted sum of rewards over states is just the expected value of the start state, \(𝔼_{s_0}[V^{μ_θ}(s_0)]\). Let’s prove that equivalence formally.

We relate r and V via the Bellman equation:

\(V^{μ_θ}(s) := r(s, μ_θ(s)) + γ 𝔼_{s’}[V^{μ_θ}(s’)]\)

and define the “discounted state visitation”:

\( p_θ(s) := \sum_t γ^t \Pr[s_t=s]\)

Then:

\(\nabla J(θ) = \nabla \sum_s p_θ(s) r(s, μ_θ(s)) = \nabla \sum_s p_θ(s)[V^{μ_θ}(s) - γ 𝔼_{s’}[V^{μ_θ}(s’)]]\)

So we want an expression for \(\nabla \sum_s p_θ(s) V^{μ_θ}(s)\) and/or \(\nabla \sum_s p_θ(s)\, γ\, 𝔼_{s’}[V^{μ_θ}(s’)]\).

We expand transitions:

\(\Pr[s_{t+1}=s’] = \sum_s \Pr[s_t=s] \Pr[s’ | s, μ_θ(s)]\)

Sum over t:

\( \sum_t \Pr[s_{t+1}=s’] = \sum_t \sum_s \Pr[s_t=s] \Pr[s’ | s, μ_θ(s)]\)

Apply discounting by γ:

\(\sum_t γ^{t+1} \Pr[s_{t+1}=s’] = \sum_t γ^{t+1} \sum_s \Pr[s_t=s] \Pr[s’ | s, μ_θ(s)]\)

Define the initial-state distribution:

\(d_0(s’) := \Pr[s_0 = s’]\)

The left-hand side is pθ(s′), except that it omits the t = 0 term:

\(\begin{align} \sum_t \gamma^{t+1} \Pr[s_{t+1}=s'] &= \sum_{u=1}^\infty \gamma^u \Pr[s_u=s'] = p_\theta(s') - d_0(s') \\ p_\theta(s') &= d_0(s') + \gamma \sum_s p_\theta(s)\, \Pr[s' \mid s, \mu_\theta(s)] \end{align} \)

In words: Each state’s discounted occupancy pθ(s′) consists of two parts: the initial-state contribution d₀(s′) and the γ-discounted flow of visitation mass from predecessor states, weighted by their transition probabilities under μθ.

It’s starting to look pretty close to what we want above.

Multiply both sides by V(s′) and sum over s′:

\(\sum_{s’} p_θ(s’) V(s’) = \sum_{s’} d_0(s’) V(s’) + γ \sum_s p_θ(s)\sum_{s’} \Pr[s’ | s, μ_θ(s)] V(s’)\)

Recognize that the inner sum is an expectation:

\(\sum_{s’} \Pr[s’ | s, μ_θ(s)] V(s’) = 𝔼_{s’}[V(s’)]\)

So:

\(\sum_{s’} p_θ(s’) V(s’) = \sum_{s’} d_0(s’) V(s’) + γ \sum_s p_θ(s) 𝔼_{s’}[V(s’)]\)

Now recall our earlier expression for the gradient, and substitute our expression above back in:

\(\begin{align} \nabla J(\theta) &= \nabla \sum_s p_\theta(s)\left[V^{\mu_\theta}(s) - \gamma\, \mathbb{E}_{s'}[V^{\mu_\theta}(s')]\right] \\ \nabla J(\theta) &= \nabla \left(\sum_{s'} d_0(s')\, V^{\mu_\theta}(s')\right) \end{align} \)

Or equivalently,

\(\nabla \left(\sum_s p_θ(s)V(s) - γ\sum_s p_θ(s)𝔼[V(s’)]\right) = \nabla 𝔼_{s_0}[V(s_0)]\)

Now the expression is much friendlier. How do we differentiate V(s)? Since V(s) = Q(s, μθ(s)) for deterministic policies, we apply the multivariable chain rule:

\(\begin{align} \nabla_\theta \mathbb{E}_{s_0}[Q(s_0, \mu_\theta(s_0))] &= \sum_s \Pr[s_0 = s] \big[ \nabla_\theta Q(s, a)\big|_{a=\mu_\theta(s)} \\ &\quad + \nabla_a Q(s, a)\big|_{a=\mu_\theta(s)} \nabla_\theta \mu_\theta(s) \big] \end{align} \)

The first term, ∇θQ(s,a), is recursive with respect to the original gradient ∇J via the Bellman equation, except now it’s over the distribution of s1 rather than the starting distribution of s0:

\(\nabla_θ Q(s,a)_{|{a=μ_θ(s)}} = \nabla_θ[r(s,a) + γ 𝔼_{s’}V^{μ_θ}(s’)] = γ \nabla_θ 𝔼_{s’}[V^{μ_θ}(s’)]\)

Unrolling this recursion gives:

\(\begin{align*} \nabla_\theta J(\theta) &= \gamma\,\nabla_\theta \mathbb{E}_{s_1}[V^{\mu_\theta}(s_1)] + \nabla_a Q(s_0, a)\big|_{a=\mu_\theta(s_0)}\,\nabla_\theta \mu_\theta(s_0) \\[6pt] \nabla_\theta J(\theta) &= \gamma^2\,\nabla_\theta \mathbb{E}_{s_2}[V^{\mu_\theta}(s_2)] + \gamma\,\nabla_a Q(s_1, a)\big|_{a=\mu_\theta(s_1)}\,\nabla_\theta \mu_\theta(s_1) \\[-2pt] &\quad + \nabla_a Q(s_0, a)\big|_{a=\mu_\theta(s_0)}\,\nabla_\theta \mu_\theta(s_0) \end{align*} \)

Continuing indefinitely:

\(\nabla_\theta J(\theta) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{s_t} \!\left[ \nabla_a Q(s_t, a)\big|_{a=\mu_\theta(s_t)} \, \nabla_\theta \mu_\theta(s_t) \right]\)

Equivalently, using the discounted state-visitation form:

\( \nabla_θ J(θ) = 𝔼_{s∼p_θ}[\nabla_a Q(s,a)|_{a=μ_θ(s)} \nabla_θ μ_θ(s)]\)

This is the Deterministic Policy Gradient Theorem. The key difference from the traditional (stochastic) policy gradient theorem is that the expectation is taken over the state distribution rather than the action distribution.

That single shift, integrating over pθ(s) instead of π(a|s), makes deterministic continuous control methods tractable and forms the foundation for DDPG and its successors.

We can’t compute this expectation analytically, so we approximate Q and μ with neural networks.
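Concretely, the two approximators might look like the following PyTorch modules. This is a minimal sketch: the hidden sizes (256) and the tanh output squashing are common illustrative choices, not anything prescribed by the theorem.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy mu_theta: maps a state to one action."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )
        self.max_action = max_action

    def forward(self, state):
        # tanh keeps actions inside [-max_action, max_action]
        return self.max_action * torch.tanh(self.net(state))


class Critic(nn.Module):
    """Q_phi: maps a (state, action) pair to a scalar value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```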

The first relaxation that we need to make, in line with our effort to increase sample efficiency, is to allow ourselves to use samples that don’t necessarily come from the current state distribution pθ. The original DPG paper argues that, for a behavior distribution p with sufficient coverage of the state space, the gradient is well approximated by:

\(\nabla_θ J(θ) ≈ 𝔼_{s∼p}[\nabla_θ μ_θ(s)\, \nabla_a Q^{μ_θ}(s,a)|_{a=μ_θ(s)}]\)

Thus, we can reuse samples from a “replay buffer,” where we store all of the transitions we’ve observed, even after the policy has undergone many gradient updates. This is the key to off-policy learning.
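The buffer itself can be very simple. Here is a minimal sketch, a fixed-size deque with uniform sampling; real implementations usually preallocate arrays instead, but the interface is the same:

```python
import random
from collections import deque

import numpy as np
import torch

class ReplayBuffer:
    """Ring buffer of (s, a, r, s', done) transitions with uniform sampling."""
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return tuple(torch.as_tensor(np.stack(x), dtype=torch.float32)
                     for x in (states, actions, rewards, next_states, dones))

    def __len__(self):
        return len(self.buffer)
```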

Second, DDPG uses two networks: the actor μθ and the critic Qφ. During exploration, noise is added to μθ(s) (the original paper used an Ornstein–Uhlenbeck process; plain Gaussian noise is common in practice and works just as well). The critic is trained by minimizing:

\(L(φ) = \frac{1}{N} \sum (Q_φ(s_t,a_t) - y_t)^2\)

where

\(y_t = r_t + γ Q_{φ’}(s_{t+1}, μ_{θ’}(s_{t+1}))\)

and the actor loss is simply:

```python
actions = actor(states)
q_values = critic(states, actions)
actor_loss = -q_values.mean()
actor_loss.backward()
```

In practice, DDPG maintains frozen target networks Qφ’ and μθ’. These networks are updated “softly”:

\(φ’ ← τφ + (1-τ)φ’, \quad θ’ ← τθ + (1-τ)θ’\)

for small τ. In practice, this reduces the variance of the network updates. But buyer beware: DDPG is still known to be very unstable and sensitive to hyperparameters.
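To tie the pieces together (critic regression, actor ascent, and soft target updates), here is a minimal sketch of one DDPG training step. It assumes the actor/critic modules and replay buffer sketched above, adds a standard (1 − done) termination mask that the equations omit, and uses common defaults (τ = 0.005, batch size 256) rather than anything the algorithm requires:

```python
import torch
import torch.nn.functional as F

def ddpg_update(buffer, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005, batch_size=256):
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)

    # Critic: regress Q_phi toward the frozen-target Bellman estimate y_t.
    with torch.no_grad():
        target_q = critic_target(next_states, actor_target(next_states)).squeeze(-1)
        y = rewards + gamma * (1.0 - dones) * target_q
    critic_loss = F.mse_loss(critic(states, actions).squeeze(-1), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's value of the actor's own actions.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) target updates: phi' <- tau*phi + (1 - tau)*phi'.
    for net, target in ((critic, critic_target), (actor, actor_target)):
        for p, p_targ in zip(net.parameters(), target.parameters()):
            p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```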

TD3 is directly a response to DDPG. It makes three changes to DDPG, all starting with the letter D, two of which are fairly simple:

  1. Double critics: Instead of one critic network, now we have two, and we use min(Q_{φ′₁}, Q_{φ′₂}) when forming the target. The reasoning is that Q-networks can be spiky and randomly assign too high a value to some states. μ_θ(s) is trained to approximate arg maxₐ Q(s, a), so the target implicitly relies on maxₐ Q(s, a). This leads to systematic over-estimation of the true Q-value. By taking min(Q_{φ′₁}, Q_{φ′₂}), we counteract this overestimation bias and err toward underestimating the true objective function.

  2. Delayed actor updates: The heuristic reasoning for convergence (not formally proven) is structurally identical to the convergence argument for policy iteration. We want the state values to converge first, and then we update the policy. TD3 solves this by only updating the actor once every two times the critic network is updated.

The last change to DDPG, target policy smoothing (or “deterministic target smoothing,” to keep the D), is more nuanced.

The critical insight here is that (s, a) is continuous, and therefore it’s extremely unlikely that you’ll ever hit the same (s, a) twice. That means that if you have a randomly high Q(s′, a′) from initialization, that spike will permanently distort that part of the Q-network. Worse, it propagates downstream through the Bellman update:

\(Q(s, a) ← r + γ Q(s', a')\)

This contamination affects nearby states too, since Q-networks generalize over continuous space. The double critics mitigate but don’t eliminate this problem.

We solve it by smoothing out that local, spurious (s, a) pair with its neighbors. Before, the target was:

\(y_t = r_t + γ \min_i Q_{φ'_{i}}(s_{t+1}, μ_{θ'}(s_{t+1}))\)

Instead, TD3 uses a smoothed target:

\(y_t = r_t + γ 𝔼_{\epsilon}\left[\min_{i}Q_{\phi'_i}(s_{t+1}, μ_{\theta'}(s_{t+1}) + \epsilon)\right], \quad \epsilon ∼ 𝒩(0, \sigma^2)\)

Maybe a single (s, a) pair has a random spike, but on average, the neighborhood of points is likely to be reasonable. Computing that expectation is intractable, but we can approximate it via Monte Carlo. In practice, a single ε-sample per update is sufficient to get a good estimate.
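Here is a minimal sketch of that one-sample target. Note that the published TD3 implementation additionally clips the sampled noise and the perturbed action to valid ranges, which the sketch includes; σ = 0.2 and a noise clip of 0.5 are the paper's defaults:

```python
import torch

def td3_target(rewards, next_states, dones, actor_target,
               critic1_target, critic2_target,
               gamma=0.99, sigma=0.2, noise_clip=0.5, max_action=1.0):
    """One-sample Monte Carlo estimate of the smoothed, clipped-double-Q target."""
    with torch.no_grad():
        a_next = actor_target(next_states)
        noise = (torch.randn_like(a_next) * sigma).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-max_action, max_action)  # smoothed target action
        q1 = critic1_target(next_states, a_next).squeeze(-1)
        q2 = critic2_target(next_states, a_next).squeeze(-1)
        return rewards + gamma * (1.0 - dones) * torch.min(q1, q2)
```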

If we take a second-order Taylor expansion of Q around the mean action μθ′(st+1), we find that this expectation implicitly penalizes the Laplacian of Q with respect to its action input. That is, the added noise regularizes curvature, discouraging sharp spikes in Q-values that would otherwise destabilize learning.
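To make that concrete: writing μ for μ_{θ′}(s_{t+1}) and expanding to second order, with 𝔼[ε] = 0 and 𝔼[εεᵀ] = σ²I,

\(\begin{align} 𝔼_{\epsilon}\big[Q(s, \mu + \epsilon)\big] &\approx 𝔼_{\epsilon}\Big[Q(s,\mu) + \epsilon^\top \nabla_a Q(s,\mu) + \tfrac{1}{2}\epsilon^\top \nabla_a^2 Q(s,\mu)\,\epsilon\Big] \\ &= Q(s,\mu) + \tfrac{\sigma^2}{2}\operatorname{tr}\!\big(\nabla_a^2 Q(s,\mu)\big) \end{align}\)

The trace of the Hessian is the Laplacian of Q in the action variable. At a sharp spike (a local maximum with strongly negative curvature), this term pulls the target down, while in smooth regions it changes almost nothing.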

Together, these three changes to DDPG make TD3 a stable and popular alternative.

It would be tidy if SAC were simply a natural continuation of DDPG/TD3. That framing doesn’t do it justice, though, because SAC is philosophically and foundationally distinct from all of the methods we’ve covered so far, even in previous posts.

The thing about our original objective function (J(θ) := 𝔼τ[G(τ)]) is that it only cares about expected return, not diversity of exploration. All of the exploration we’ve done so far has been either by solving for the parameters of a stochastic policy (e.g. PPO) or by adding randomness to the actions (e.g. DDPG). The optimal policy is still allowed to collapse into a deterministic mapping.

The reality is that there are many possible policies that could explain our observations. Which one should we prefer?

The principle of maximum entropy tells us that, out of all the distributions consistent with our observations, we should pick the one with maximum entropy. This is the distribution that makes the fewest assumptions about the underlying data.

For example, if we have a variable x such that 𝔼p[f(x)] = c, we should solve:

\(\max_p H(p) \quad \text{s.t.} \quad 𝔼_p[f(x)] = c\)

The Lagrangian is:

\(L(p, λ, ν) = -\sum_x p(x)\log p(x) + λ\left(\sum_x p(x) f(x) - c\right) + ν\left(\sum_x p(x) - 1\right)\)

and setting its derivative with respect to p(x) to zero yields:

\(p^*(x) ∝ \exp(λ f(x))\)

This is the Boltzmann distribution, the unique maximum-entropy distribution that satisfies the constraint.

Applying this idea to RL, if we assume we want some level of return R̂, then we should pick (in discrete form):

\(\max_q H(q) = -\sum_{\tau} q(\tau)\log q(\tau)\)

subject to

\(\sum_{\tau} q(\tau) = 1, \qquad 𝔼_q[R(\tau)] = \hat{R}\)

Here q(τ) is a hypothetical “best” trajectory distribution consistent with our observations, and 𝔼q denotes expectation over trajectories induced by q. Of course, if we push R̂ up toward the maximum achievable return, the resulting entropy approaches 0.

Forming the Lagrangian and setting the derivative to 0:

\(L(q, λ, β) = -\sum_{τ} q(τ)\log q(τ) + λ\left(\sum_{τ} q(τ) - 1\right) + β\left(\sum_{τ} q(τ)R(τ) - \hat{R}\right)\)

Taking the derivative with respect to q(τ):

\(\frac{∂L}{∂q(τ)} = -(\log q(τ) + 1) + λ + β R(τ) = 0\)

which gives:

\(q(τ) = \exp(λ - 1 + β R(τ)) ∝ \exp(β R(τ))\)

Setting α := 1/β, we obtain the canonical maximum entropy form used in SAC:

\(q(τ) ∝ \exp\left(\tfrac{1}{α}\sum_t r(s_t, a_t)\right)\)

The last missing piece is that we’ve defined what the optimal trajectory distribution looks like, but the agent only controls the policy distribution. The environment dynamics also affect the likelihood of any trajectory:

\(p_0(τ) = p(s_0)\prod_t p(s_{t+1}|s_t, a_t)\)

Using Bayes’ rule (Pr[A | B] ∝ Pr[B | A] Pr[A], with exp(Σₜ rₜ/α) playing the role of a likelihood), we fold in this prior over trajectories:

\(q^*(τ) ∝ p_0(τ) \exp\left(\tfrac{1}{α}\sum_t r(s_t, a_t)\right)\)

So we’ve solved for the optimal trajectory distribution q*. But we can only control the policy πθ(a|s), which induces its own trajectory distribution through the environment dynamics:

\(p_π(τ) = p(s_0)\prod_t π(a_t|s_t)p(s_{t+1}|s_t,a_t)\)

The question becomes: which policy-induced trajectory distribution pπ(τ) is closest to q*(τ)? This leads naturally to minimizing the KL divergence:

\(π^* = \arg\min_{π} D_{KL}(p_π(τ) || q^*(τ))\)

Expanding:

\(\begin{align} \pi^* &= \arg\min_\pi \mathbb{E}_{p_\pi}\!\left[\log p_\pi(\tau) - \log q^*(\tau)\right] \\[4pt] &= \arg\min_\pi \mathbb{E}_{p_\pi}\!\left[ \log p_\pi(\tau) - \log p_0(\tau) - \tfrac{1}{\alpha} \sum_t r(s_t, a_t) \right] \end{align} \)

Since log p_π(τ) − log p₀(τ) = ∑ₜ log π(a_t|s_t), this simplifies to:

\(π^* = \arg\min_π 𝔼_{p_π}\sum_t [\log π(a_t|s_t) - \tfrac{1}{α}r(s_t,a_t)]\)

Rewriting as a maximization gives:

\(π^* = \arg\max_π 𝔼_{p_π}\sum_t [r(s_t,a_t) + α H(π(·|s_t))]\)

where we multiplied through by the constant α (which doesn’t change the arg max), flipped the sign, and used 𝔼_{a∼π}[−log π(a_t|s_t)] = H(π(·|s_t)). Thus, SAC explicitly maximizes both expected reward and policy entropy, balancing exploitation with exploration in a single unified framework.

To implement this, we’re going to need expressions for both state values and Q-values. The state value function is just the objective above, with discounting:

\(V^*(s) = \max_π 𝔼_{τ|s_0=s}\left[\sum_t γ^t(r(s_t,a_t) + α H(π(·|s_t)))\right]\)

By the Bellman equations:

\(V^*(s) = \max_π 𝔼_{a,s’}[r(s,a) + α H(π(·|s)) + γ V^*(s’)]\)

and the corresponding Q-function:

\(Q^*(s,a) = r(s,a) + γ 𝔼_{s’}[V^*(s’)]\)

We can rewrite V* in terms of Q*:

\(V^*(s) = \max_π 𝔼_{a}[Q^*(s,a) + α H(π(·|s))]\)

This form is friendlier because now we’re only taking an expectation over the action a. That expectation is equal to:

\(V^*(s) = \max_π \sum_a π(a|s)[Q^*(s,a) - α \log π(a|s)], \quad \text{s.t.}\ \sum_a π(a|s)=1\)

Let’s try to solve for that V*. We write the Lagrangian and set the derivative to 0:

\(L(π, λ) = \sum_a π(a|s)[Q^*(s,a) - α \log π(a|s)] + λ(\sum_a π(a|s) - 1)\)

gives:

\(Q^*(s,a) - α \log π^*(a|s) - α + λ = 0\)

Therefore:

\(\pi^*(a \mid s) \propto \exp\!\left(\frac{Q^*(s,a)}{\alpha}\right) \)

Normalizing yields the softmax policy:

\(\pi^*(a \mid s) = \frac{\exp\!\left(\tfrac{Q^*(s,a)}{\alpha}\right)} {\sum_{a'} \exp\!\left(\tfrac{Q^*(s,a')}{\alpha}\right)} \)

Now we can substitute back into V:

\(V^*(s) = \alpha \log \sum_a \exp\!\left(\frac{Q^*(s,a)}{\alpha}\right)\)
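As a quick sanity check in a toy discrete case (three arbitrary Q-values and α = 0.5, both chosen purely for illustration), the softmax policy and the log-sum-exp value agree with the definition V*(s) = Σₐ π*(a|s)[Q*(s,a) − α log π*(a|s)]:

```python
import torch

alpha = 0.5
q = torch.tensor([1.0, 2.0, 0.5])               # arbitrary Q*(s, a) for three actions

pi = torch.softmax(q / alpha, dim=0)             # pi*(a|s) proportional to exp(Q*/alpha)
v_logsumexp = alpha * torch.logsumexp(q / alpha, dim=0)
v_definition = (pi * (q - alpha * torch.log(pi))).sum()

print(v_logsumexp.item(), v_definition.item())   # the two values agree
```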

Finally, we need a way to computationally solve for the optimal policy π*. We could theoretically fit a network to the definition of the optimal π* above, but in continuous action spaces that normalizing denominator is intractable. Instead, we define a “soft policy improvement” operator:

\(π_{new} = \arg\max_π 𝔼_{s∼D, a∼π}[Q_φ(s,a) + α H(π(·|s))]\)

This operator has the same fixed point as the optimal policy above. (You can try substituting it in.) In practice, we minimize its negative:

\(J_π(θ) = 𝔼_{s∼D, a∼π_θ}[α \log π_θ(a|s) - Q_φ(s,a)]\)
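A minimal sketch of that actor loss with a reparameterized, tanh-squashed Gaussian policy. The `policy(states)` interface returning a mean and log-standard-deviation is an assumed convention, and the min over two critics anticipates the double-critic trick mentioned just below:

```python
import math
import torch

def sac_actor_loss(policy, critic1, critic2, states, alpha=0.2):
    mean, log_std = policy(states)              # assumed interface: Gaussian parameters
    std = log_std.exp()
    eps = torch.randn_like(mean)
    pre_tanh = mean + std * eps                 # reparameterization trick
    actions = torch.tanh(pre_tanh)              # squash actions into [-1, 1]
    # Log-prob of the squashed Gaussian (tanh change-of-variables correction).
    log_prob = (-0.5 * eps.pow(2) - log_std - 0.5 * math.log(2 * math.pi)).sum(-1)
    log_prob = log_prob - torch.log(1.0 - actions.pow(2) + 1e-6).sum(-1)
    q = torch.min(critic1(states, actions), critic2(states, actions)).squeeze(-1)
    # J_pi = E[alpha * log pi(a|s) - Q(s, a)], minimized by gradient descent.
    return (alpha * log_prob - q).mean()
```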

The critic still follows a soft Bellman target:

\(Q^*(s,a) = r(s,a) + γ 𝔼_{s’}[V^*(s’)]\)

with:

\(V^*(s) = \alpha \log \sum_a \exp\!\left(\frac{Q^*(s,a)}{\alpha}\right) \)
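In a continuous action space that log-sum-exp is again intractable, so in practice the soft value is estimated by sampling from the current policy: V(s′) ≈ minᵢ Qᵢ(s′, a′) − α log π(a′|s′) with a′ ∼ π(·|s′). A minimal sketch of the resulting critic target, where `sample_action` is a hypothetical helper returning an action and its log-probability (e.g. built from the reparameterized squashed Gaussian above):

```python
import torch

def sac_critic_target(rewards, next_states, dones, sample_action,
                      critic1_target, critic2_target, alpha=0.2, gamma=0.99):
    with torch.no_grad():
        # sample_action: hypothetical helper returning (a', log pi(a'|s')).
        next_actions, next_log_prob = sample_action(next_states)
        q_min = torch.min(critic1_target(next_states, next_actions),
                          critic2_target(next_states, next_actions)).squeeze(-1)
        soft_value = q_min - alpha * next_log_prob          # estimate of V(s')
        return rewards + gamma * (1.0 - dones) * soft_value
```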

Finally:

  • SAC uses double critics (like TD3) to mitigate bias, and

  • Updates the temperature α automatically to maintain a target entropy (a minimal sketch of this update follows below).
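The usual way to do the temperature update is to learn log α by gradient descent on a loss whose stationary point puts the policy's entropy at a chosen target (a common heuristic target is −|A|, minus the action dimension). A minimal sketch, where `log_prob` is the log-probability of the actions sampled during the actor update:

```python
import torch

action_dim = 6                                  # example: a 6-dimensional action space
target_entropy = -float(action_dim)             # common heuristic: -|A|
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def alpha_update(log_prob):
    # Pushes alpha up when the policy's entropy falls below target, down otherwise.
    alpha_loss = -(log_alpha.exp() * (log_prob + target_entropy).detach()).mean()
    alpha_opt.zero_grad(); alpha_loss.backward(); alpha_opt.step()
    return log_alpha.exp().detach()             # current temperature for the other losses
```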

SAC’s soft Bellman operator is a γ-contraction in the tabular case, ensuring convergence under idealized assumptions.

That concludes our discussion of the off-policy methods. In the next and final post of the series, I’ll cover incorporating human feedback, which is relevant to post-training LLMs: DPO and GRPO.
