
An Introduction to Reinforcement Learning

A comprehensive guide to modern Reinforcement Learning, from MDPs, through classic algorithms and Policy Gradients to Offline RL.



Goal

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

Modeling and Terminology

Basic reinforcement learning is modelled as a Markov decision process (MDP), consisting of a set of states, a set of actions, state-transition probabilities, and a reward function (formalised below).

The agent's action selection is modelled as a map called a policy function \pi: A \times S \rightarrow [0,1] where \pi(a,s) = \Pr(a_t = a \mid s_t = s). The policy map gives the probability of taking action a in state s. The policy can alternatively be a deterministic map from s to a.

The goal of a reinforcement learning agent is to learn an optimal policy \pi^* which maximises the expected cumulative reward.

The value function V_\pi(s) (or state-value function) is defined as the expected return starting with state s and successively following policy \pi. Hence, roughly speaking, the value function estimates "how good" it is to be in a given state.

V_\pi(s) = \mathbb{E}[R \mid s_0 = s] = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\Big]

The value function quantifies the long-term desirability of states. It sums up all future rewards the agent can expect to receive when starting from state s and following policy \pi, with future rewards discounted by \gamma to prioritize immediate rewards.

The value V_\pi of a trajectory or policy \pi at a state s is the expected total reward R from the state to the final destination; here \gamma is the discount rate of future rewards.

The total reward is R = \sum_{t=0}^{\infty} \gamma^t r_t where \gamma \in [0,1).

Given a state s, action a and policy \pi, the action-value function of the pair (s,a) under policy \pi is the Q-function defined as

Q_\pi(s,a) = \mathbb{E}[R \mid s, a, \pi]

It is the expected total reward if we take action a in state s and follow policy \pi thereafter.

MDP Formulation

An MDP provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker. The "Markov" property means that the future state depends only on the current state and action, not on the history of previous states.

RL problems are formulated as an infinite-horizon discounted Markov decision process (MDP), defined by the tuple (S, A, P, r, \rho_0, \gamma), where S is a finite set of states, A is a finite set of actions, P: S \times A \times S \rightarrow \mathbb{R} is the transition probability distribution, r: S \rightarrow \mathbb{R} is the reward function, \rho_0: S \rightarrow \mathbb{R} is the distribution of the initial state s_0, and \gamma \in (0,1) is the discount factor.

Let \pi denote a stochastic policy \pi: S \times A \rightarrow [0,1], and let \eta(\pi) denote its expected discounted reward.

Goal (Criterion of optimality)

The optimal policy \pi^* is defined in terms of the state-value function as

V^*(s) = \max_\pi V_\pi(s)

i.e. the optimal policy maximises the total reward possible from state s.

MDP Algorithms

Assuming full knowledge of the MDP, the two basic approaches to compute the optimal action-value function are value iteration and policy iteration. Both algorithms compute a sequence of functions Q_k (k = 0, 1, 2, \ldots) that converge to Q^*.

Computing these functions involves computing expectations over the whole state-space, which is impractical for all but the smallest (finite) MDPs. Instead, expectations are approximated by averaging over samples and using function approximation techniques to cope with the need to represent value functions over large state-action spaces.

Both algorithms are based on the Bellman optimality equation.

Bellman Optimality

While this equation may look intimidating, it's essentially saying something intuitive: the best value we can achieve from a state equals the reward we get immediately plus the best value we can get from wherever we end up next. This recursive relationship is what makes dynamic programming solutions possible.

The Bellman optimality equation expresses the relationship between the value of a state and the values of its successor states under optimal behavior. It states that the value of a state under an optimal policy must equal the expected return for the best action from that state.

For the state-value function, the Bellman optimality equation is:

V^*(s) = \max_a \sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma V^*(s')\big]

And for the action-value function:

Q^*(s,a) = \sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma \max_{a'} Q^*(s',a')\big]

The Bellman optimality equation is particularly important because its solution gives us the optimal value function, from which we can derive the optimal policy. Once we have V^*, the optimal policy is to select actions that maximize the expected value:

\pi^*(s) = \argmax_a \sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma V^*(s')\big]

Policy Iteration

Policy iteration alternates between two steps:
(1) Policy evaluation - compute the value function for the current policy, and
(2) Policy improvement - update the policy to choose actions that maximize value based on the new value function.

In policy iteration (Howard, 1960), the state-value function V(s) is updated once, then the policy \pi(s, \cdot) is updated, and both steps are repeated until the policy converges.

The algorithm starts at i = 0 with a guess of the initial value and policy. The steps are repeated until the policy stops changing.

V_{i+1}(s) := \sum_{s'} P_{\pi_i(s)}(s,s')\big[R_{\pi_i(s)}(s,s') + \gamma V_i(s')\big]
\pi_{i+1}(s) := \argmax_a \Big[\sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma V_{i+1}(s')\big]\Big]
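As a concrete illustration, here is a minimal tabular sketch of policy iteration, assuming the dynamics are available as arrays P[a, s, s'] (transition probabilities) and R[a, s, s'] (rewards); the array layout and function name are illustrative, not taken from any particular library.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration. P[a, s, s'] are transition probabilities,
    R[a, s, s'] are rewards (illustrative array layout)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: one sweep of the V update for the current policy
        V = np.array([
            np.sum(P[policy[s], s] * (R[policy[s], s] + gamma * V))
            for s in range(n_states)
        ])
        # Policy improvement: greedy action w.r.t. the new value estimate
        Q = np.einsum('asn,asn->as', P, R + gamma * V)   # Q[a, s]
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):           # policy stopped changing
            return policy, V
        policy = new_policy
```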

Value Iteration

Value iteration directly computes the optimal value function by repeatedly applying the Bellman optimality equation. Unlike policy iteration, it does not maintain an explicit policy during the iterations; the optimal policy is derived from the converged value function afterwards.

In value iteration (Bellman, 1957), which is also called backward induction, the policy \pi is not used.

V_{i+1}(s) := \max_a \Big\{\sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma V_i(s')\big]\Big\}

Here, i is the iteration number. Value iteration starts at i = 0 with a guess V_0 and iteratively applies the equation until the left- and right-hand sides converge.
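A matching sketch of value iteration, under the same assumed P[a, s, s'] / R[a, s, s'] array layout as in the policy iteration example:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration. P[a, s, s'] are transition probabilities,
    R[a, s, s'] are rewards (illustrative array layout)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = np.einsum('asn,asn->as', P, R + gamma * V)   # Q[a, s]
        V_new = Q.max(axis=0)                            # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:              # LHS ≈ RHS: converged
            return Q.argmax(axis=0), V_new               # greedy policy and V*
        V = V_new
```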

Stochastic Model-free Algorithms

When the complete model (MDP) of the environment is not available, the agent can learn from experience (i.e. by interacting with the environment).

Monte Carlo Value Iteration

Monte Carlo methods are computational algorithms that rely on repeated random sampling to obtain numerical results. They're used to solve problems that might be deterministic in principle, by using randomness to approximate solutions, especially useful for complex systems with many coupled degrees of freedom.

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns: they sample and average returns for each state-action pair, much as bandit methods sample and average rewards for each action.

They are typically applied to episodic tasks. The policy is updated only at the end of an episode.

(Note: episodic tasks are ones where the task terminates after some point; a game is an example.)

s_i, a_i \sim \pi(\cdot,\cdot) \quad \text{for one episode}
R_{total}(s_i) = \sum_{j=i}^N R(s_i \rightarrow s_j, a_{ij})
V(s_i) = \text{avg}(R_{total}(s_i))
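A minimal every-visit Monte Carlo sketch of this idea; run_episode is an assumed helper that rolls out one episode with the given policy and returns its (state, action, reward) sequence.

```python
from collections import defaultdict

def mc_value_estimate(run_episode, policy, gamma=1.0, n_episodes=1000):
    """Every-visit Monte Carlo estimate of V(s). `run_episode(policy)` is an
    assumed helper that plays one episode and returns a list of (s, a, r)."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = run_episode(policy)
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    # V(s) is the average of the sampled returns from s
    return {s: sum(g) / len(g) for s, g in returns.items()}
```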

Temporal Difference Learning

TD methods only wait until the next time step to update the value estimates.

The update is done by transitioning from s_t \rightarrow s_{t+1}, collecting the reward r_{t+1} and then correcting V(s_t) using it.

The difference between the expected reward \hat{r}_{t+1} = V(s_t) - \gamma V(s_{t+1}) and the observed reward r_{t+1} is fed back to V(s_t) and is called the "temporal difference (TD) error".

V(s_t) \leftarrow V(s_t) + \alpha(r_{t+1} - \hat{r}_{t+1}) = V(s_t) + \alpha\big(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big)

After the optimal V(s) is learnt, the policy \pi(\cdot,\cdot) is determined either with a greedy policy, where the action taken at any state maximises the total reward, or with an ε-greedy policy, where the best action is taken with probability 1-ε and a random action with probability ε.
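A rough TD(0) sketch, assuming a gym-style environment with reset() and a step() that returns (next_state, reward, done); the interface is an assumption for illustration.

```python
from collections import defaultdict

def td0_value_estimate(env, policy, alpha=0.1, gamma=0.99, n_episodes=1000):
    """TD(0) state-value learning. `env` is assumed to follow a
    reset()/step(action) -> (next_state, reward, done) interface."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            td_target = r + gamma * V[s_next] * (not done)
            V[s] += alpha * (td_target - V[s])   # feed the TD error back into V(s)
            s = s_next
    return V
```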

Forms of TD Error

The temporal difference target can include many steps of observed rewards before using the estimated value V(s_{t+k}).

Q-Learning

Q-learning is similar to TD-learning, but the action-value (Q) function is updated instead of the state-value (V) function.

Q(s_t, a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t,a_t)\big)
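A tabular Q-learning sketch with an ε-greedy behaviour policy, under the same assumed environment interface as the TD(0) example.

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.99, eps=0.1, n_episodes=1000):
    """Tabular Q-learning. `env` is assumed to expose reset()/step() as above;
    `actions` is the list of available actions."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # ε-greedy: random action with probability ε, greedy otherwise
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            target = r + gamma * best_next * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```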

Deep Q-Learning

This is the same as Q-learning but the Q-function is approximated using a neural network Q(s,a,\theta) where \theta are the weights of the network.

The loss function is called “mean squared TD error” (Sutton & Barto, 2018, p. 199) and is computed by squaring the TD-error.

Loss = \big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a, \theta) - Q(s_t, a_t, \theta)\big]^2

(A stationary distribution is one that doesn't change over time - its statistical properties remain constant.)

This loss function causes divergence because states are correlated and rewards are non-stationary (e.g. the reward for the same state changes over time).

To deal with correlated states, the agent builds a dataset of experience and then takes random samples from the dataset, to break the correlation. This is called Experience Replay (Mnih et al., 2013).

To deal with non-stationary targets, the agent computes the TD target with a fixed copy of the parameters \theta^- and only periodically updates this copy to the current parameters (i.e. on every K-th iteration).

Loss = \big[r_{t+1} + \gamma \max_{a'} Q(s',a',\theta^-) - Q(s,a,\theta)\big]^2

This can be improved by prioritising the samples picked from the experience dataset, using two different networks for \theta and \theta^-, etc.
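A hedged PyTorch-style sketch of the resulting DQN loss on a replay minibatch with a separate target network; the network and batch layout are assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss for a DQN-style update. `q_net`/`target_net` map states to
    per-action Q-values; `batch` is an assumed tuple of tensors sampled
    from the experience-replay dataset."""
    s, a, r, s_next, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a, θ)
    with torch.no_grad():
        # Bootstrapped target uses the periodically-updated parameters θ⁻
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q, target)
```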

Regular experience replay can struggle with sparse rewards where successful experiences are rare. For example, a robot might try thousands of times to grab an object without succeeding, giving no useful learning signal.

Hindsight Experience Replay (Andrychowicz et al., 2017) addresses this by treating failures as successes for a different goal.

When the agent fails its intended goal, HER saves both the original experience and a modified version where the achieved outcome becomes the goal. This lets the agent learn from every attempt, even failed ones. The approach works by augmenting the experience tuple (s_t, a_t, r_t, s_{t+1}, g) with additional versions using achieved goals g' from the same episode.
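A possible sketch of the HER relabelling step; the episode tuple layout and the compute_reward helper are assumptions about how goals, achieved outcomes and rewards are represented in a given task.

```python
import random

def her_relabel(episode, compute_reward, k=4):
    """Hindsight relabelling sketch: for each transition, also store copies
    whose goal is an outcome actually achieved later in the same episode.
    `episode` holds (s, a, r, s_next, goal, achieved) tuples and
    `compute_reward(achieved, goal)` is an assumed task-specific helper."""
    relabelled = []
    for t, (s, a, r, s_next, goal, achieved) in enumerate(episode):
        relabelled.append((s, a, r, s_next, goal))             # original goal
        future = episode[t:]
        for _ in range(min(k, len(future))):
            _, _, _, _, _, new_goal = random.choice(future)    # an achieved outcome
            new_r = compute_reward(achieved, new_goal)
            relabelled.append((s, a, new_r, s_next, new_goal))
    return relabelled
```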

Policy Gradients

Here we learn the policy \pi(\cdot,\cdot) directly instead of learning V or Q functions and deriving a policy from them. The policy is approximated with a neural network \pi(s, a, \omega) and the parameters \omega are updated using policy gradients.

The expected cumulative reward for a policy with parameters \omega is given by J(\omega).

The best policy \pi^*(\cdot, \cdot) is found by doing gradient ascent on the cumulative reward, i.e. by finding the \omega resulting in the largest cumulative reward.

The notation \tau \sim \pi(\omega) means \tau is sampled from (or follows) the distribution defined by the policy \pi with parameters \omega. It's similar to how x \sim \mathcal N(\mu, \sigma^2) means x is sampled from a normal distribution with mean \mu and variance \sigma^2.

J(\omega) = \mathbb{E}_{\tau\sim\pi(\omega)}[R(\tau)] = \mathbb{E}_{\tau\sim\pi(\omega)}\Big[\sum_{i=0}^N \gamma^i r_{t+i}\Big] = \int \pi(\tau,\omega) R(\tau)\, d\tau

Here, \tau is a trajectory of (s,a) pairs, and \pi(\omega) is the distribution of trajectories defined by the policy \pi with parameters \omega.

The \omega parameters are then updated as \omega_{t+1} = \omega_t + \alpha \nabla_\omega J(\omega_t).

Since the gradient of J(\omega) involves differentiating products, we use logarithms to simplify. This is done using the policy gradient theorem, which uses the following identity.

\nabla_\omega f(x,\omega) = f(x,\omega) \frac{\nabla_\omega f(x,\omega)}{f(x,\omega)} = f(x,\omega)\nabla_\omega \log f(x,\omega)

(The policy gradient theorem assumes: 1) the policy is differentiable, 2) state transitions and rewards are Markovian, 3) episodes have finite length.)

With this, the J-function's gradient is computed as

\nabla_\omega J(\omega) = \int\nabla_\omega \pi(\tau,\omega)\, r(\tau) = \int \pi(\tau,\omega)\nabla_\omega \log \pi(\tau,\omega)\, r(\tau) = \mathbb{E}_{\tau\sim\pi(\omega)}\big[r(\tau)\nabla_\omega \log \pi(\tau,\omega)\big]

See Lilian Weng, 2018 for a detailed derivation.

The gradient of J(\omega) is represented as an expectation of a process, and we can sample from the process to approximate the expectation (i.e. Monte Carlo).

The log-policy \log \pi_\omega(\tau) can be expanded as (assuming a discrete infinite state space)

\pi(\tau,\omega) = \pi_\omega(s_1, a_1, s_2, ...) = p(s_1)\prod_{t=1}^\infty \pi_\omega(a_t|s_t)\, p(s_{t+1}|s_t,a_t)
\log \pi_\omega(\tau) = \log p(s_1) + \sum_{t=1}^\infty \log \pi_\omega(a_t|s_t) + \sum_{t=1}^\infty \log p(s_{t+1}|s_t,a_t)

The derivative of the log-policy is then

\nabla_\omega\log\pi_\omega(\tau) = \sum_{t=1}^\infty \nabla_\omega \log \pi_\omega(a_t|s_t)

Substituting the reward function as a sum, limiting the time horizon to T, and expanding the expectation as a mean over N sampled trajectories, the loss function becomes

Loss = g := \nabla_\omega J(\omega) = \frac{1}{N}\sum_{i=1}^N \Big[\sum_{t=1}^T \nabla_\omega \log \pi_\omega(a_{i,t}|s_{i,t})\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big]

The complete vanilla policy gradient algorithm is

  1. Sample N trajectories \{\tau_i\} from \pi_\omega(a_t|s_t), each with a maximum duration of T
  2. Compute the loss function with existing policy neural network
  3. Back-propagate the loss into the policy neural network with an optimiser

See Sutton et al, 2000.

Note: The total reward is now represented in an un-discounted way. This is discussed later.

The vanilla policy gradient algorithm has a major issue: it updates the policy based on the total reward of the trajectory, which means actions early in the trajectory are credited/blamed for rewards that came after them. This is inefficient and can lead to incorrect learning.

REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) solves this by using temporal causality - each action is only updated based on the rewards that came after it. The algorithm is:

  1. Sample trajectory \tau from policy \pi_\omega
  2. For each timestep t:
    • Compute return G_t = \sum_{k=t}^T \gamma^{k-t} r_k (only future rewards)
    • Update policy: \omega \leftarrow \omega + \alpha\gamma^t G_t \nabla_\omega\log\pi_\omega(a_t|s_t)

This gives a more focused learning signal for each action based only on its consequences.

See Sutton et al, 2000 and Williams, 1992 for more details.
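A minimal REINFORCE sketch of the update above, assuming policy(state) returns a torch.distributions.Categorical and the episode is a list of (state, action, reward) tuples; the interfaces are assumptions for illustration.

```python
import torch

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """One REINFORCE update from a single episode."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):            # reward-to-go G_t
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    loss = 0.0
    for t, ((s, a, _), G_t) in enumerate(zip(episode, returns)):
        log_prob = policy(s).log_prob(torch.as_tensor(a))
        loss = loss - (gamma ** t) * G_t * log_prob   # minimise -J for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```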

Benefit of log-policy over naive policy

Converting the policy to the log-policy prevents the loss from exploding or vanishing when the states are highly correlated (which would otherwise cause gradients to be very large or very small).

High Variance of Policy Gradients

The policy gradient algorithm suffers from high variance because:

  1. The stochastic policy \pi can generate different actions for the same states, leading to different trajectories
  2. These different trajectories can result in vastly different total rewards
  3. This variance in rewards creates noisy gradient estimates, making it difficult for the algorithm to converge to an optimum

While Monte Carlo sampling in policy gradients is unbiased (meaning it gives the correct gradient in expectation), its high variance means individual gradient estimates can point in conflicting directions.

Solutions to reduce variance:

  1. Larger batch sizes: Averaging over more trajectories reduces variance in the gradient estimate. However, this comes at the cost of reduced sample efficiency since more environment interactions are needed per update.

  2. Baseline subtraction: Introduce a baseline (typically the state-value function V(s)) to center the rewards. This changes the objective from tracking absolute rewards to tracking relative advantages:

    • Original: \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R_t\big]
    • With baseline: \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, (R_t - V(s_t))\big]

The baseline reduces variance by normalizing the scale of rewards. To understand why:

  1. Consider two actions in a state:
    • Action A: Usually gives rewards of either 90 or 110
    • Action B: Usually gives rewards of either -10 or 10
  2. Without a baseline:
    • For action A, the policy gradient will use large values (90 or 110)
    • For action B, the policy gradient will use smaller values (-10 or 10)
    • These different scales make learning difficult and increase variance
  3. With a baseline V(s) = 100 for action A's state and V(s) = 0 for action B's state:
    • Action A's advantages become: -10 or +10 (from 90-100 or 110-100)
    • Action B's advantages remain: -10 or +10
    • Now both actions use similar scales, reducing variance

The baseline doesn't change which action is better (their relative ordering stays the same), but it makes the learning signal more consistent by removing the effect of different reward scales across states.

The baseline reduces variance without introducing bias, since \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a_t|s_t)\, V(s_t)] = 0.

Advantage of an action in a state over a baseline

We can add or subtract a term in the optimisation problem as long as the term is not related to the parameters \omega.

g = \frac{1}{N}\sum_{i=1}^N \Big[\sum_{t=1}^T \nabla_\omega \log \pi_\omega(a_t|s_t)\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t}) - b(s_{i,t})\Big]

The first term of the reward sum is the action-value function Q_\pi(s,a) and the second term is a baseline term. Their difference is called the advantage A_\pi(s,a) of an action a in state s, when using policy \pi, over the baseline b.

g = \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\omega \log \pi_\omega(a_t|s_t)\; A_\pi(s_i, a_i)

The most common baseline is the state-value function V_\pi(s), which represents the total reward expected at state s when following policy \pi. So A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s).

With a baseline, the updated algorithm is

  1. Sample N trajectories \{\tau_i\} from \pi_\omega(a_t|s_t), each with a maximum duration of T
    1. At each time-step in each trajectory, compute the total reward R_t = \sum_{k=t}^T r_k and the advantage estimate A_t = R_t - b(s_t)
    2. The total reward can also be discounted: R_t = \sum_{k=t}^T \gamma^{k-t} r_k
  2. Re-fit the baseline by minimising the error ||b(s_t) - R_t||^2 over all trajectories and time steps
  3. Compute the new loss function
  4. Back-propagate the loss into the policy neural network with an optimiser

See Schulman et al, 2016 for more details.
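A rough sketch of this variant with a learned state-value baseline; the policy, value network and optimizers are assumed to be ordinary PyTorch modules, and the trajectory layout is an assumption.

```python
import torch

def pg_with_baseline_update(policy, value_fn, pi_opt, v_opt, trajectories, gamma=0.99):
    """Policy-gradient update with a learned baseline. `policy(s)` returns a
    torch.distributions.Categorical; `value_fn(s)` returns a scalar estimate."""
    policy_loss, value_loss = 0.0, 0.0
    for episode in trajectories:                  # each: list of (s, a, r) tuples
        G, returns = 0.0, []
        for _, _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, a, _), R_t in zip(episode, returns):
            b = value_fn(s)
            advantage = R_t - b.detach()          # A_t = R_t - b(s_t)
            policy_loss -= policy(s).log_prob(torch.as_tensor(a)) * advantage
            value_loss += (b - R_t) ** 2          # re-fit the baseline to R_t
    for opt, loss in ((pi_opt, policy_loss), (v_opt, value_loss)):
        opt.zero_grad()
        loss.backward()
        opt.step()
```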

Reward variations

  1. Causal Rewards: The loss function can be updated to make the reward calculation causal, by computing rewards from t \rightarrow T instead of from 1 \rightarrow T
  2. Discounted Rewards: The reward discount (\gamma) reduces variance by reducing the impact of distant actions. This is also strictly necessary to interpret the loss function as a true gradient.

General Policy Gradient Loss

Schulman et al, 2016 explains that policy gradient methods maximise the expected total reward by repeatedly estimating the gradient g := \nabla_\theta \mathbb{E}\big[\sum_{t=0}^\infty r_t\big]. There are several different related expressions for the policy gradient, which have the form

g = \mathbb{E}_{\tau\sim \pi}\Big[\sum_{t=0}^\infty \Psi_t \nabla_\theta \log \pi_\theta(a_t | s_t)\Big]

Where \Psi_t may be one of the following

  1. \sum_{t=0}^\infty r_t : total reward of the trajectory
  2. \sum_{t=T}^\infty r_t : total reward following action a_T
  3. \sum_{t=T}^\infty r_t - b(s_T) : total reward following (s_T, a_T) w.r.t. a baseline
  4. Q_\pi(s_t,a_t) : state-action value function
  5. A_\pi(s_t,a_t) : advantage function
  6. r_{t+1} + \gamma V_\pi(s_{t+1}) - V_\pi(s_t) : one-step TD residual

Problems with Policy Gradients

An analogy might help understand these challenges:

Imagine teaching someone to cook by giving them recipe feedback. Policy Gradients is like:

  1. Having to throw away all previous cooking notes after each attempt (sample efficiency)
  2. Tiny adjustments to ingredient amounts sometimes completely ruining the dish (parameter space mismatch)
  3. Not properly considering how early cooking steps affect the final taste (gradient estimation)

This is why modern RL often uses more sophisticated methods that build guardrails around these issues, similar to how cooking schools provide structured learning environments rather than pure trial and error.

There are three main problems with policy gradients:

  1. Poor Sample Efficiency

    • Each policy update invalidates previously collected data since the policy has changed
    • Unlike Deep Q-Networks where old experiences remain useful, PG methods require fresh data for each update
    • This is because PG uses on-policy expectations - we can only use data collected from the current policy
  2. Parameter-Policy Space Mismatch

    • The parameter space (\omega) and policy space (\pi) are different manifolds with different metrics, meaning distances/lengths in one space don't correspond to equivalent distances in the other
    • This means tiny parameter updates can cause unexpectedly large policy changes, making optimization unstable, and making it difficult to control how much the policy actually changes during updates
    • The disconnect between parameter updates and policy behavior makes finding appropriate learning rates challenging
  3. Gradient Estimation Issues

    • The policy gradient estimator g(\cdot,\theta) doesn't properly incorporate discounted rewards
    • This leads to biased gradient estimates that can harm convergence
    • The bias comes from treating all time steps equally instead of properly weighting future rewards

See the callout for more details on these issues.

Let's break down the math behind issues 1 and 2:

1. Sample Efficiency Issue

When we collect data using policy π₁ and then update to π₂, we need to compare how likely each trajectory is under both policies. This is done using the probability ratio:

P(τ|π₂)/P(τ|π₁) = ∏ᵢ π₂(aᵢ|sᵢ)/π₁(aᵢ|sᵢ)

This ratio represents how much more/less likely the trajectory τ is under the new policy π₂ compared to the old policy π₁. Since it's a product of many terms (one for each timestep), even small differences between π₁ and π₂ get amplified exponentially, making old data unusable.

2. Parameter-Policy Space Mismatch

The parameter space θ and policy space π have fundamentally different distance metrics. In parameter space, we use Euclidean distance ||θ₁ - θ₂||, while in policy space we measure distances using KL divergence KL(πθ₁||πθ₂). These metrics don't align:

KL(πθ₁||πθ₂) ≠ c||θ₁ - θ₂||

where KL is Kullback-Leibler divergence and c is any constant. This means a small step in parameter space can cause a large change in policy space. The relationship is highly non-linear and makes optimization unstable.

Importance Sampling

Basics

Sample efficiency can be improved by taking trajectories from older policies and changing the loss function to account for the difference.

Importance sampling is a technique for estimating the expectation of a function f(x), x \sim P(x) using samples drawn from a different distribution Q(x).

\mathbb{E}_{x\sim P}[f(x)] = \mathbb{E}_{x\sim Q}\Big[f(x)\frac{P(x)}{Q(x)}\Big]

The ratio W(x) = P(x)/Q(x) is the importance sampling weight for x. Let \hat{\mu}_Q be our importance sampling estimator - the average of f(x)W(x) over N samples drawn from Q. The variance of this estimator is:

var(\hat{\mu}_Q) = \frac{1}{N} var\big(f(x)W(x)\big) = \frac{1}{N}\Big(\mathbb{E}_{x\sim Q}\big[(f(x)W(x))^2\big] - \big[\mathbb{E}_{x\sim Q} f(x)W(x)\big]^2\Big) = \frac{1}{N}\Big(\mathbb{E}_{x\sim P}\Big[\frac{P(x)}{Q(x)} f(x)^2\Big] - \big[\mathbb{E}_{x\sim P} f(x)\big]^2\Big)

The first term can be a major problem because it involves the ratio P(x)/Q(x). If Q(x) is small in regions where P(x)f(x)² is large, this ratio explodes and causes extremely high variance. This insight leads to several key developments in later sections:

  1. In policy gradients, this manifests as the ratio of policy probabilities becoming unstable when policies diverge too much
  2. Trust region methods like TRPO address this by explicitly constraining how much policies can change
  3. PPO simplifies this by clipping the ratio to a fixed range, providing a more practical solution

This variance issue is fundamental to why naive policy gradient methods can be unstable and why constraining policy updates becomes so important.

For a detailed treatment of importance sampling, see Owen, 2013 which covers both the theory and practical considerations.
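A small numerical illustration of the estimator and its weights, using two unit-variance Gaussians where the true expectation is known; the distributions and function are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{x~P}[f(x)] for P = N(1, 1) using samples drawn from Q = N(0, 1).
def f(x):
    return x ** 2

def log_density(x, mean):            # log of a unit-variance Gaussian density
    return -0.5 * (x - mean) ** 2    # the 1/sqrt(2π) factor cancels in the ratio

x = rng.normal(loc=0.0, scale=1.0, size=100_000)           # x ~ Q
w = np.exp(log_density(x, 1.0) - log_density(x, 0.0))      # W(x) = P(x)/Q(x)
estimate = np.mean(f(x) * w)

# Ground truth: for x ~ N(1, 1), E[x^2] = variance + mean^2 = 2
print(estimate)   # ≈ 2, but with higher variance than sampling from P directly
```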

Usage in Policy Gradients

Using the importance sampling technique described above, we can improve the efficiency of policy gradient methods. Instead of sampling trajectories from the current policy \omega being optimized, we can reuse trajectories from an older policy \omega' by applying importance sampling weights:

g = \nabla_\omega J(\omega) = \mathbb{E}_{\tau\sim\omega}\Big[\sum_{t}\gamma^t\nabla_\omega\log \pi_\omega(a_t|s_t)\, A_\omega^\pi(s_t,a_t)\Big] = \mathbb{E}_{\tau\sim\textcolor{red}{\omega'}}\Big[\sum_{t} \textcolor{red}{\frac{P(\tau_t|\omega)}{P(\tau_t|\omega')}}\, \gamma^t\nabla_\omega\log \pi_\omega(a_t|s_t)\, A_\omega^\pi(s_t,a_t)\Big]

However, this introduces a new challenge. Each term P(\tau|\cdot) is a product of state transition and policy probabilities: p(s_0)\prod p(s_{t+1}|s_t,a_t)\pi(a_t|s_t). Due to this product structure, even small differences between policies can cause the importance sampling weight to become extremely large or small, leading to the high variance problem discussed earlier. This variance issue will be addressed by Natural Policy Gradient (NPG) and subsequent methods.

Monotonic Improvement Algorithms

Monotonic Improvement Algorithms are a class of policy optimization methods that guarantee each policy update will improve (or at least not decrease) the expected reward. This directly addresses the variance issues we saw with importance sampling in policy gradients, where large policy changes could lead to unstable updates.

The key idea is to bound policy changes while still maximizing improvement. By carefully controlling how much policies can diverge from each other, these methods avoid the exploding importance sampling ratios we discussed earlier.

Relative Policy Performance Identity

For two policies \pi and \pi', the relative performance can be computed as (see Schulman et al, 2017 Appendix A for proof)

J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'}\Big[\sum_t \gamma^t A_\pi(s_t,a_t)\Big]

We can use this identity to sample from the old policy \pi and use it to improve the current one \pi'.

\max_{\pi'} J(\pi') = \max_{\pi'} \big[J(\pi') - J(\pi)\big] = \max_{\pi'} \mathbb{E}_{\tau\sim\pi'}\Big[\sum_t \gamma^t A_\pi(s_t,a_t)\Big]

This still requires trajectories sampled from \pi'. To fix this, first rewrite the expectation in terms of states and actions (see the Ergodic Theorem).

J(\pi') - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{a\sim\pi', s\sim d_{\pi'}}\Big[A_\pi(s,a)\Big] = \frac{1}{1-\gamma}\, \mathbb{E}_{a\sim\textcolor{red}{\pi}, s\sim d_{\pi'}}\Big[\textcolor{red}{\frac{\pi'(a|s)}{\pi(a|s)}}\, A_\pi(s,a)\Big]

where d_\pi(s) is the discounted future state distribution (normalized to 1) and is defined as

d_\pi(s) = (1-\gamma) \sum_{t=0}^\infty \gamma^t P(s_t=s|\pi) = (1-\gamma)\big[P(s_0=s) + \gamma P(s_1=s) + \gamma^2 P(s_2=s) + \ldots\big]

The discounted future state distribution dπ(s) represents the normalized discounted sum of probabilities of being in state s at each timestep t when following policy π. The (1-γ) term normalizes the infinite sum to create a proper probability distribution.

While similar in concept, this is distinct from the stationary distribution of the Markov chain induced by π. The stationary distribution, if it exists, is the limiting distribution as t→∞, while dπ(s) explicitly weights earlier timesteps more heavily through the discount factor.

The state visitation frequency ρπ(s) measures how often a policy visits each state in practice. For infinite horizon problems with γ=1, dπ(s) converges to ρπ(s), making it a useful theoretical bridge between discounted and average reward settings.

This equation still has states sampled from \pi', which can be fixed as

J(\pi') - J(\pi) \approx \frac{1}{1-\gamma}\, \mathbb{E}_{a\sim\pi, s\sim \textcolor{red}{d_{\pi}}}\Big[\frac{\pi'(a|s)}{\pi(a|s)}\, A_\pi(s,a)\Big] = \mathcal{L}_\pi(\pi')

where \mathcal L_\pi(\pi') is the lower bound or local approximation of the relative performance.

It can be proven that the residual error J(\pi') - J(\pi) - \mathcal L_\pi(\pi') has an upper bound related to the KL-divergence between the two policies. See Constrained Policy Optimization; Achiam et al, 2017 for proof. So if the divergence is very small, this approximation is very accurate.

J(\pi') - J(\pi) - \mathcal L_\pi(\pi') \le C \sqrt{\mathbb{E}_{s\sim d_\pi}\big[D_{KL}(\pi'||\pi)[s]\big]}

Majorize-Minimization

Since \mathcal L - C\sqrt{\mathbb{E}[D_{KL}(..)]} is a lower bound on the relative performance between \pi and \pi', we can use it as a surrogate objective, instead of the main objective. Maximising it will always maximise the true objective. See MM algorithm.

The Majorize-Minimization (MM) algorithm is a general optimization technique that works by iteratively constructing and optimizing surrogate functions.

At each step, it creates a surrogate function that "majorizes" (upper bounds) the objective function and touches it at the current point. By minimizing this surrogate, we're guaranteed to improve or maintain the original objective.

The key idea is that the surrogate should be easier to optimize than the original function while still preserving the essential properties that ensure convergence. This approach is particularly useful when the original objective is difficult to optimize directly.

Here, both \mathcal L and the KL-divergence can be estimated from samples of the policy \pi.

Note: The value of C provided in the paper can be high if \gamma is near 1. Their recommendation is to use a KL constraint instead of the KL penalty implied in the original equation.

Constraint Equation

\pi_{k+1} = \argmax_{\pi'} \mathcal L_{\pi_k}(\pi') \quad \text{such that } \mathbb{E}_{s\sim d_{\pi_k}}\big[D_{KL}(\pi',\pi_k)[s]\big] \le \delta

The region defined by the threshold \delta is called a trust region (see Trust region).

This algorithm addresses the first two failure modes discussed earlier: poor sample efficiency and the mismatch between distances in parameter space and policy space.

Natural Policy Gradient

While the policy gradient methods above work, they can be inefficient since they don’t account for the geometry of the policy space. Natural Policy Gradient (NPG) addresses this by using information geometry to find better update directions (see A Natural Policy Gradient).

Metric Tensor Basics

A metric tensor G defines how we measure distances in a space. In the context of policy optimization, this is crucial because it helps us measure how "different" two policies are.

Think of the metric tensor like a "ruler" that changes based on where you are in the space. Just like measuring distance on a curved surface requires knowing how the surface bends, measuring distance between policies requires knowing how the policy space curves.

The squared distance between nearby points \vec{v} and \vec{v}+\delta v is given by:

dist^2(\vec{v}, \vec{v}+\delta v) = \delta v^T G(v)\, \delta v

In the simplest case (orthogonal coordinates), G is just the identity matrix, giving us the familiar Euclidean distance:

dist^2(\vec{v}, \vec{v}+\delta v) = \delta v^T \delta v = \sum_i (\delta v_i)^2

When we change coordinate systems from \vec{v} to \vec{w}, the relationship between small changes in each system is:

\delta v_i = \sum_j \frac{\partial v_i}{\partial w_j} \delta w_j

We can write this more compactly using the Jacobian matrix A where A_{ij} = \partial v_i/\partial w_j:

\delta v = A\, \delta w

The metric tensors in the two coordinate systems (G_v and G_w) are related by:

G_w = A^T G_v A

Similarly, gradients transform as:

g_w = A^T g_v

The natural gradient directions in each coordinate system (\vec{dv} = G^{-1}_v g_v and \vec{dw} = G^{-1}_w g_w) are related by:

\vec{dw} = A^{-1}\vec{dv}

This last equation shows that G^{-1}g transforms properly under coordinate changes - it transforms like the coordinate displacements themselves (a contravariant vector), unlike the plain gradient. This is why natural gradients are "natural": they give us update directions that are invariant to how we parameterize our policy.

Metric Tensor for Probability Spaces

(see Amari, S. (1998))

When optimizing policies, we want to find the direction that improves our objective J(\theta) most efficiently. This is the direction of steepest descent - the vector d\theta that minimizes J(\theta + d\theta) while keeping the step size |d\theta|^2 constant.

If we measure distances using a metric tensor G, then the steepest descent direction is G^{-1}\nabla J(\theta). Crucially, this metric should depend on the underlying probability space our policy operates in, not on how we happened to parameterize it.

The Fisher Information Matrix measures how much a probability distribution changes when we adjust its parameters. It characterizes the local curvature of the probability space by measuring changes in the log-likelihood – regions where small parameter changes cause large changes in log-likelihood will have high Fisher Information.

For our policy optimization, Fs(θ) measures how sensitive our policy's action distribution is to parameter changes in state s. Small changes to parameters that dramatically alter the policy's log-likelihood in a state will result in high Fs(θ) values for that state.

The full metric G averages these state-specific sensitivities according to how often we visit each state. This means we're naturally more careful with parameters that affect frequently-visited states, helping prevent destabilizing changes to critical parts of our policy.

The Fisher Information Matrix provides the appropriate metric tensor for our optimization, as it captures the local geometry of the probability space. For each state s, our policy \pi(a;s,\theta) defines a probability distribution over actions, parameterized by \theta. The Fisher Information Matrix for this distribution is:

F_s(\theta) = \mathbb{E}_{\pi(a;s,\theta)}\Big[\frac{\partial \log \pi(a;s,\theta)}{\partial \theta_i} \frac{\partial \log \pi(a;s,\theta)}{\partial \theta_j}\Big]

Since our policy operates over many states, we average these matrices according to how often we visit each state, giving us our final metric tensor:

G \equiv F(\theta) = \mathbb{E}_{s\sim d_\pi}[F_s(\theta)]

where d_\pi is the stationary state distribution under policy \pi.

Changes needed

The update rule for the network parameters is \theta_{k+1} = \theta_k + \alpha g where g is the policy gradient estimator.

The first change is to use the gradient with a proper metric, i.e. g \rightarrow F^{-1}g.

The second change is to compute \alpha so as to satisfy the KL-divergence constraint specified in the monotonic improvement section (see the constraint equation).

Derivation of metric and coefficient

The lower-bound \mathcal L and the KL-divergence are expanded via Taylor series as

L_{\theta_k}(\theta) \approx L_{\theta_k}(\theta_k) + g\,\Delta\theta
D_{KL}(\theta||\theta_k) \approx D_{KL}(\theta_k||\theta_k) + \Delta\theta^T \nabla_\theta D_{KL}(\theta||\theta_k)\big|_{\theta=\theta_k} + \frac{1}{2} \Delta\theta^T \nabla^2_\theta D_{KL}(\theta||\theta_k)\big|_{\theta=\theta_k}\Delta\theta

where \Delta\theta = \theta - \theta_k is the proposed update of the parameters.

Here, the first term of both expansions is zero. The first-derivative term of the KL-divergence expansion is also zero, as shown in Slide 33 of Lecture 13: Natural Policy Gradients.

The second-derivative part simplifies to \nabla^2_\theta D_{KL}(..) = \mathbb{E}_{\theta_k}[\nabla_\theta\log\pi\cdot\nabla_\theta\log\pi^T], which is the Fisher information matrix.

F(\theta_k) = \nabla^2_\theta D_{KL}(\pi_\theta||\pi_{\theta_k})
D_{KL}(\pi_\theta||\pi_{\theta_k}) = \frac{1}{2}\Delta\theta^T F(\theta_k)\Delta\theta

The best \Delta\theta that we can get is found by solving the following optimization problem, which is equivalent to the constrained problem specified earlier.

\Delta\theta^* = \argmax_{\Delta\theta=\theta-\theta_k} \Big[L_{\theta_k}(\theta) - \lambda D_{KL}(\pi_\theta||\pi_{\theta_k})\Big] = \argmin_{\Delta\theta} \Big[-g\Delta\theta + \frac{1}{2}\lambda\Delta\theta^T F(\theta_k)\Delta\theta\Big]

(\argmax became \argmin by multiplying through by -1)

The term inside the \argmin can be differentiated w.r.t. \Delta\theta and set to 0 to find the minimum.

0 = -g + \lambda F(\theta_k)\Delta\theta^* \quad \text{or:} \quad \Delta\theta^* = \lambda' F(\theta_k)^{-1} g

The update rule is now \theta_{k+1} \leftarrow \theta_k + \alpha F(\theta_k)^{-1} g_k

To derive the learning rate \alpha, we simplify the KL-divergence expansion and substitute \Delta\theta = \alpha F^{-1}g. Since F is symmetric, F = F^T and (F^{-1})^T = F^{-1}.

D_{KL}(\theta||\theta_k) \approx \frac{1}{2}\Delta\theta^T F(\theta_k)\Delta\theta = \frac{1}{2}(\alpha F^{-1}g)^T F(\alpha F^{-1} g) \le \delta \Rightarrow \alpha \le \sqrt{\frac{2\delta}{g^T F^{-1} g}}

Algorithm

  1. Collect trajectories on the policy \pi_k = \pi(\theta_k)
  2. Estimate advantages A_t^{\pi_k} using any advantage estimation algorithm
  3. Form sample estimates for
    1. Policy gradient g_k using the advantage estimates
    2. Fisher information (or KL-divergence hessian) F_k
  4. Compute the parameter update as follows and backpropagate
\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g_k^T F_k^{-1} g_k}}\, F_k^{-1} g_k

Truncated NPG

While Natural Policy Gradient provides theoretically sound updates, it faces a practical computational challenge: the Fisher Information Matrix F grows quadratically with the number of parameters N. Storing it requires O(N^2) memory and inverting it requires O(N^3) computation, which is infeasible for modern neural network policies.

To make NPG practical, we can avoid explicitly computing F^{-1} by using the conjugate gradient algorithm. Instead of directly computing x = F^{-1}g, we reformulate the problem as solving the linear system:

Fx = g

The conjugate gradient method is an iterative algorithm for solving linear systems Ax=b where A is symmetric and positive definite. Rather than computing A⁻¹ directly, it:

  1. Starts with an initial guess x₀
  2. Iteratively improves the solution by moving along carefully chosen directions
  3. These directions are "conjugate" (A-orthogonal) to each other, ensuring efficient convergence
  4. Often finds good approximate solutions in far fewer than n iterations

Think of conjugate gradient as finding the valley floor of a quadratic bowl (Fx=g) by taking clever steps that are always perpendicular to previous steps with respect to the bowl's shape (F). This is much more efficient than trying to map out the entire bowl (computing F⁻¹).

The conjugate gradient method is particularly efficient because:

  1. We never need to store F explicitly - we only need to compute matrix-vector products Fv
  2. These products can be computed efficiently using automatic differentiation
  3. We can truncate the iteration early once we have a “good enough” solution
  4. The algorithm naturally exploits any sparsity or structure in F

This “truncated” version of NPG maintains most of the benefits of natural gradient descent while being computationally tractable for large neural networks. See Lecture 5: Advanced Policy Gradient Methods for more details.
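A sketch of the two ingredients, assuming a compute_mean_kl(policy, states) helper (a hypothetical name) that evaluates the average KL-divergence to the old policy; the Fisher-vector product uses double backprop rather than forming F explicitly.

```python
import torch

def fisher_vector_product(policy, states, v, damping=1e-2):
    """Compute F·v without forming F, using F = ∇²_θ D_KL(π_θ_old‖π_θ) at θ_old."""
    kl = compute_mean_kl(policy, states)                    # assumed helper
    grads = torch.autograd.grad(kl, policy.parameters(), create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_v, policy.parameters())  # Hessian-vector product
    return torch.cat([h.reshape(-1) for h in hvp]) + damping * v

def conjugate_gradient(Avp, b, iters=10, tol=1e-10):
    """Solve Ax = b (here F x = g) using only matrix-vector products Avp(v)."""
    x = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rs_old = r @ r
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:              # truncate early once the residual is small
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```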

Trust Region Policy Optimization

While NPG provides a theoretically sound way to compute update directions, there's still a practical challenge: choosing the right step size. The step derived from the trust region parameter \delta relies on linear and quadratic approximations of the objective and the KL-divergence, so in practice it can overshoot the trust region or fail to improve the surrogate objective.

A trust region is like a safety zone where we believe our approximations are reliable. If we step outside this zone, our estimates of improvement might be wrong. Line search helps us find the largest safe step we can take within this zone.

To address this, TRPO uses line search - an adaptive way to scale the update. The idea is to try progressively smaller steps until we find one that satisfies our constraints and improves the policy. See Schulman et al. (2015) for more details.

TRPO combines two key ideas:

  1. Natural gradient for computing good update directions
  2. Line search for finding the largest safe step size

This makes it more robust than basic policy gradient methods while still being computationally tractable.

Line Search Algorithm

  1. Compute the proposed step direction \Delta\theta_k (as we’ll see below)
  2. For j = 0 to L:
    • Try the update: \theta = \theta_k + \alpha^j \Delta\theta_k
    • If both conditions are met:
      • Policy improves: \mathcal L_{\theta_k}(\theta) \ge 0
      • Within trust region: D_{KL}(\theta\|\theta_k) \le \delta
    • Then accept: \theta_{k+1} = \theta_k + \alpha^j \Delta\theta_k
    • Otherwise continue with a smaller step (the step size \alpha^j decreases exponentially with j)
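A direct sketch of this backtracking loop; surrogate(θ) and kl(θ) are assumed helpers that evaluate \mathcal L_{\theta_k}(\theta) and D_{KL}(\theta\|\theta_k) on the sampled trajectories.

```python
def backtracking_line_search(theta_k, delta_theta, surrogate, kl, delta,
                             alpha=0.8, max_backtracks=10):
    """Backtracking line search for TRPO over exponentially shrinking steps."""
    for j in range(max_backtracks):
        theta = theta_k + (alpha ** j) * delta_theta
        # Accept the first step that both improves the surrogate and stays
        # inside the trust region
        if surrogate(theta) >= 0 and kl(theta) <= delta:
            return theta
    return theta_k    # no acceptable step found: keep the old parameters
```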

Complete TRPO Algorithm

  1. Sample Collection:

    • Collect trajectories using current policy \pi_k = \pi(\theta_k)
  2. Advantage Estimation:

    • Estimate advantages A_t^{\pi_k} for each state-action pair
  3. Gradient Computation:

    • Compute policy gradient g_k
    • Estimate Fisher information matrix F_k
  4. Natural Gradient:

    • Use conjugate gradient (with N_{cg} iterations) to compute x_k \approx F_k^{-1} g_k
  5. Step Sizing:

    • Compute the proposed step: \Delta\theta_k = \sqrt{\frac{2\delta}{x_k^T F_k x_k}}\, x_k
  6. Line Search:

    • Use backtracking line search to find final update: \theta_{k+1} = \theta_k + \alpha^j \Delta\theta_k

Proximal Policy Optimization

Instead of using the natural gradient F^{-1}g and line search to exactly enforce the KL-divergence constraint, PPO uses simpler methods to approximately enforce this constraint. This achieves similar benefits to TRPO's trust region optimization but with less computational overhead.

There are two variants

Adaptive KL Penalty

The first PPO variant uses an adaptive penalty coefficient to softly enforce the KL-divergence constraint. Instead of explicitly constraining how much the policy can change (as in TRPO), it adds a penalty term to the objective:

  1. The policy update solves an unconstrained optimization problem: \theta_{k+1} = \argmax_\theta \mathcal L_{\theta_k}(\theta) - \beta_k D_{KL}(\theta||\theta_k)

    Here, \beta_k controls how strongly we penalize changes to the policy - higher values of \beta_k mean smaller policy updates.

  2. After each update, we adjust \beta_k based on whether the actual KL-divergence was too large or too small:

    • If the policy changed too much (KL too high), we increase \beta_k to enforce smaller updates
    • If the policy changed too little (KL too low), we decrease \beta_k to allow larger updates

This adaptive scheme maintains the benefits of trust region optimization while being simpler to implement than TRPO’s exact constraint enforcement.

Algorithm

  1. Collect partial trajectories on the policy \pi_k = \pi(\theta_k)
  2. Estimate advantages A_t^{\pi_k} using any advantage estimation algorithm
  3. Compute policy update as
    1. \theta_{k+1} = \argmax_\theta \mathcal L_{\theta_k}(\theta) - \beta_k D_{KL}(\theta||\theta_k)
    2. by taking K steps of minibatch SGD (via Adam)
  4. if D_{KL}(\theta_{k+1}||\theta_k) \ge 1.5\delta then \beta_{k+1} = 2\beta_k
  5. if D_{KL}(\theta_{k+1}||\theta_k) \le \delta/1.5 then \beta_{k+1} = \beta_k/2
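The β adjustment in steps 4-5 amounts to a few lines; a minimal sketch:

```python
def update_kl_penalty(beta, measured_kl, delta):
    """Adaptive KL-penalty coefficient schedule from the algorithm above."""
    if measured_kl >= 1.5 * delta:
        return beta * 2.0      # policy moved too far: penalise harder
    if measured_kl <= delta / 1.5:
        return beta / 2.0      # policy barely moved: allow larger updates
    return beta
```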

Clipped objective function

The second PPO variant takes an even simpler approach to constraining policy updates. Rather than using an adaptive KL penalty, it directly clips the importance sampling ratio to prevent large policy changes.

Recall that in the previous approach, we used a KL penalty term to softly constrain how much the policy could change. Here, we instead modify the objective function \mathcal L_{\theta_k}(\theta) to clip the importance sampling ratio r_t(\theta) to stay within [1-\epsilon, 1+\epsilon], where:

r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}

The clipped objective function is then:

\mathcal L_{\theta_k}^{CLIP}(\theta) = \mathbb{E}_{\tau\sim\pi_k}\Big[\sum_{t=0}^T \min\Big(r_t(\theta)A^{\pi_k}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A^{\pi_k}_t\Big)\Big]

This clipping has an intuitive effect: when the advantage is positive, the objective gains nothing from pushing the ratio above 1+\epsilon, and when the advantage is negative, it gains nothing from pushing the ratio below 1-\epsilon. The policy therefore has no incentive to move far from the old policy in a single update.

Like the adaptive KL approach, the policy update is simply:

\theta_{k+1} = \argmax_\theta \mathcal L_{\theta_k}^{CLIP}(\theta)

This clipped objective provides similar benefits to TRPO’s trust region constraint and the adaptive KL penalty, but with even simpler implementation since we only need to clip a ratio rather than compute KL divergences or adjust penalty coefficients.

Algorithm

  1. Collect partial trajectories on the policy \pi_k = \pi(\theta_k)
  2. Estimate advantages A_t^{\pi_k} using any advantage estimation algorithm
  3. Compute policy update as
    1. \theta_{k+1} = \argmax_\theta \mathcal L_{\theta_k}^{CLIP}(\theta)
    2. by taking K steps of minibatch SGD using the Adam optimizer:
      • Sample a minibatch of transitions (s_t, a_t, A_t) from the collected trajectories
      • Compute the clipped objective \mathcal L_{\theta_k}^{CLIP}(\theta) for this minibatch
      • Update \theta using Adam to maximize \mathcal L_{\theta_k}^{CLIP}(\theta)
      • Repeat for K iterations
      • K is typically between 3-10 epochs over the entire dataset
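A sketch of the clipped loss used inside these minibatch updates, assuming a categorical policy and precomputed old log-probabilities and advantages (the tensor layout is an assumption).

```python
import torch

def ppo_clip_loss(policy, old_log_probs, states, actions, advantages, eps=0.2):
    """Clipped PPO objective, negated so it can be minimised with Adam.
    `policy(states)` is assumed to return a torch.distributions.Categorical."""
    log_probs = policy(states).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)              # r_t(θ)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    objective = torch.min(ratio * advantages, clipped * advantages)
    return -objective.mean()
```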

Modern Q-Learning Approaches

Modern Q-learning approaches have built upon the foundational DQN algorithm by incorporating key insights from policy gradient methods. These improvements address several limitations of vanilla DQN:

  1. Sample efficiency - DQN requires many environment interactions to learn
  2. Value overestimation - The max operator in Q-learning can lead to systematic overestimation
  3. Exploration - Simple ε-greedy exploration doesn’t scale well to complex environments
  4. Offline learning - DQN struggles to learn from static datasets without environment interaction

Recent algorithms like SAC (Soft Actor-Critic) and CQL (Conservative Q-Learning) have made significant progress on these challenges by regularising the learned values and policies - for example through entropy-based exploration in SAC and conservative value estimation in CQL.

Let’s examine some of these modern approaches and how they build upon DQN while incorporating lessons from policy optimization.

Conservative Q-Learning

Offline RL Basics

In offline RL techniques, we learn a policy \pi that maximizes its expected future discounted reward \eta(\pi), not by interacting with the environment and receiving rewards, but by learning from a static dataset \mathcal D of transitions (s, a, s', r) which were generated by a different behaviour policy \pi_\beta and its marginal state distribution d^{\pi_\beta}(s).

(this behaviour policy can be another NN, or an expert policy, a human demonstrator etc)

Since \mathcal D does not contain all possible transitions (s, a, s'), the algorithms use TD(1) methods (i.e. backing up a single sample) to learn Q_\theta(s,a).

TD(1) is used here because we can't compute expectations over all possible next states and actions - we only have the transitions that exist in our dataset. So instead of trying to estimate the true expected value, we just use the single observed next state and reward as our target. This makes the learning more stable given the limited data.

\hat Q_{k+1} = \argmin_Q \mathbb{E}_{s,a,s'\sim \mathcal D} \Big[r(s,a) + \gamma\mathbb{E}_{a'\sim\hat\pi_k(a'|s')}[\hat Q_k(s',a')] - Q(s,a)\Big]^2
\hat \pi_{k+1} = \argmax_\pi \mathbb{E}_{s\sim\mathcal D,\, a\sim\pi_k(a|s)} \Big[\hat Q_{k+1}(s,a)\Big]

Read Levine, 2020 for a comprehensive overview of offline RL.

Bellman Operators and Q-Learning

The foundation of Q-learning lies in the concept of Bellman operators, which provide a mathematical framework for iteratively improving value estimates. Let’s build this up step by step.

The Optimal Bellman Operator

The optimal Bellman operator \mathcal B^* transforms one Q-function into another, and when applied repeatedly, converges to the optimal Q-function Q^* (Bellman, 1957):

\mathcal B^*Q(s,a) = r(s,a) + \gamma\mathbb{E}_{s'\sim P(s'|s,a)} \big[\max_{a'} Q(s',a')\big]

Think of the Bellman operator as an "improvement machine" - you put in a Q-function, and it returns a better one. The optimal operator always chooses the best possible next action (that's what the max does), assuming perfect knowledge of state transitions.

The Policy Bellman Operator

In practice, we often work with a specific policy \pi rather than always taking the maximum action. This gives us the policy Bellman operator \mathcal B^\pi (Sutton & Barto, 2018):

\mathcal B^{\pi}Q(s,a) = r(s,a) + \gamma\mathbb{E}_{s'\sim T(s'|s,a),\ a'\sim\pi(a'|s')} \big[Q(s',a')\big]

Here, T(s'|s,a) represents the true environment dynamics - how states transition given actions. However, we typically don't have access to these true dynamics.

Empirical Bellman Operators

In real-world settings, we work with a finite dataset \mathcal D of transitions (s,a,r,s'). This leads us to use an empirical Bellman operator that estimates values from samples (see Kumar et al, 2020 and Haskell et al, 2013)

\hat {\mathcal B}^\pi \hat{Q}_k(s,a) = r(s,a) + \gamma\mathbb{E}_{a'\sim\hat\pi_k(a'|s')}\big[\hat Q_k(s',a')\big]

This allows us to write the practical Q-learning update as:

\hat Q_{k+1} = \argmin_Q \mathbb{E}_{s,a,s'\sim \mathcal D} \Big[\hat{\mathcal B}^{\pi_k} Q_k - Q(s,a)\Big]^2

Challenges in Offline Reinforcement Learning

Offline RL faces several fundamental challenges that make it significantly more difficult than online RL (Levine et al., 2020):

1. Action Distribution Shift

When learning from a fixed dataset, the learned policy \pi_k may attempt to take actions that were never seen in the training data. This creates an action distribution shift: the Q-function is queried at out-of-distribution actions whose values are never grounded in data, so errors in those estimates are never corrected.

Imagine trying to learn to cook only from watching cooking shows, without ever touching ingredients yourself. You might think certain techniques are easier than they really are because you can't verify your assumptions through practice.

2. Compounding Errors

Unlike online RL, where the policy can correct its mistakes through environment interaction, in offline RL errors in the bootstrapped value targets compound over the horizon and cannot be corrected by collecting new data (Kumar et al., 2019).

3. Limited State Coverage

The fixed dataset may not cover important parts of the state space (Fu et al., 2020), so the learned policy may reach states at test time for which it has no reliable value estimates.

Conservative Value Estimation

(see Kumar et al, 2020)

The key insight of Conservative Q-Learning (CQL) is that Q-functions learned from offline data tend to be overoptimistic about unseen actions. To address this, CQL introduces a conservative value estimation approach.

Distribution Matching

Let's define two key distributions: the marginal state distribution d^{\pi_\beta}(s) of the behaviour policy that generated the dataset, and an action distribution \mu(a|s) under which we will penalise the learned Q-values.

Together, these form the joint distribution \mu(s,a) = d^{\pi_\beta}(s)\mu(a|s) that we'll use for conservative estimation. The constraint \text{supp}(\mu)\subset\text{supp}(\pi_\beta) ensures we only consider actions that exist in our dataset.

Conservative Q Update

The first CQL update equation adds a conservative term to the standard Bellman update:

\hat Q_{k+1} = \argmin_Q \Bigg[\underbrace{\alpha\,\mathbb{E}_{s\sim\mathcal D, a\sim\mu}[Q(s,a)]}_{\text{conservative term}} + \underbrace{\frac{1}{2}\mathbb{E}_{s,a,s'\sim \mathcal D}\Big[\mathcal B^{\pi_k} Q_k - Q(s,a)\Big]^2}_{\text{standard Bellman error}}\Bigg]

This equation has two parts:

  1. The standard Bellman error that drives Q-values toward their TD targets
  2. A new conservative term that pushes down Q-values under the distribution \mu

The conservative term acts as a regularizer - it prevents Q-values from becoming too large for actions we haven't seen in our dataset. The hyperparameter \alpha controls how conservative we want to be.

Tightening the Lower Bound

While pushing down all Q-values helps prevent overoptimism, it might be too conservative. We can improve this by simultaneously maximizing Q-values for actions we’ve actually seen in our dataset:

\hat Q_{k+1} = \argmin_Q \Bigg[\underbrace{\alpha\Big[\mathbb{E}_{s\sim\mathcal D, a\sim\mu}[Q(s,a)] - \mathbb{E}_{s,a\sim\mathcal D}[Q(s,a)]\Big]}_{\text{relative conservative term}} + \frac{1}{2}\mathbb{E}_{s,a,s'\sim \mathcal D}\Big[\mathcal B^{\pi_k} Q_k - Q(s,a)\Big]^2\Bigg]

This modified objective:

  1. Still pushes down Q-values for out-of-distribution actions (first expectation)
  2. But pushes up Q-values for in-distribution actions (second expectation)
  3. The difference between these terms creates a “gap” between seen and unseen actions

This relative conservative term helps maintain a meaningful ordering of Q-values while still preventing overoptimism about unseen actions. It’s like saying “make unseen actions look worse than the actions we know about” rather than just “make all actions look bad.”

The 1/2 term is a scaling factor commonly used in squared error loss functions because when we take the derivative, it cancels with the 2 from the power rule to give cleaner gradients. While this doesn't change which parameters minimize the loss, it makes the optimization process more numerically stable.

Conservative Q-Learning Algorithm

CQL faces a key design choice: how should we use our conservative Q-values for policy optimization? There are two main approaches:

  1. Full Policy Iteration: We could alternate between:

    • Computing conservative Q-values for the current policy \hat{\pi}^k, and performing one step of policy improvement
    • However, this approach is computationally expensive.
  2. Online Algorithm: Since our policy \hat{\pi}^k is typically derived from the Q-function anyway, we can make the algorithm more efficient by:

    • Choosing \mu(a|s) to approximate the policy that would maximize the current Q-function iterate
    • This gives us an “online” algorithm where Q-value estimation and policy improvement happen simultaneously

The authors implement the second approach by solving a min-max optimization problem:

\hat Q_{k+1} = \argmin_Q \Bigg[\underbrace{\alpha\Big[\max_{\mu} \mathbb{E}_{s\sim\mathcal D, a\sim \mu}[Q(s,a)] - \mathbb{E}_{s,a\sim\mathcal D}[Q(s,a)]\Big]}_{\text{worst-case conservative estimate}} + \underbrace{\frac{1}{2}\mathbb{E}_{s,a,s'\sim \mathcal D}\Big[\mathcal B^{\pi_k} Q_k - Q(s,a)\Big]^2}_{\text{standard Bellman error}} + \underbrace{\mathcal R(\mu)}_{\text{regularizer}}\Bigg]

The \max_{\mu} term actively searches for actions that might be overvalued, making our conservative estimate more robust. The regularizer \mathcal R(\mu) prevents this search from becoming too aggressive. This formulation gives rise to a family of CQL variants, denoted as CQL(\mathcal{R}), each characterized by a different choice of regularizer.

CQL(H): A Practical Variant

CQL(H) is one of two important variants in the CQL family, demonstrating how different choices of regularizer \mathcal{R}(\mu) lead to different practical algorithms.

In CQL(H), we choose:

  1. The regularizer to be the negative KL-divergence against a prior: \mathcal{R}(\mu) = -D_{KL}(\mu || \rho)
  2. The prior \rho(a|s) to be a uniform distribution over actions

This choice leads to two key results:

  1. The optimal \mu has a closed-form solution (see Appendix A of the paper): \mu^*(a|s) \propto \rho(a|s) \cdot \exp(Q(s,a)) \propto \exp(Q(s,a)) (The second proportionality follows because \rho is uniform)

  2. When we plug this optimal \mu^* back into the objective, the first term becomes a soft maximum over Q-values:

\hat Q_{k+1} = \argmin_Q \Bigg[\alpha\,\mathbb{E}_{s\sim\mathcal D}\Big[\underbrace{\log\sum_a \exp(Q(s,a))}_{\text{soft maximum}} - \underbrace{\mathbb{E}_{a\sim\hat\pi_\beta(a|s)}[Q(s,a)]}_{\text{dataset values}}\Big] + \frac{1}{2}\mathbb{E}_{s,a,s'\sim \mathcal D}\Big[\mathcal B^{\pi_k} Q_k - Q(s,a)\Big]^2\Bigg]

The soft maximum (or "logsumexp") is always greater than or equal to the true maximum, making our conservative estimates even more conservative. This aligns with CQL's goal of preventing overoptimism about unseen actions.
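For discrete actions this objective is straightforward to write down; here is a sketch that uses a Q-learning-style backup for the Bellman target (the batch layout and networks are assumptions, not the paper's reference code).

```python
import torch

def cql_h_loss(q_net, target_net, batch, alpha=1.0, gamma=0.99):
    """CQL(H)-style objective for discrete actions: standard TD error plus the
    conservative (logsumexp - dataset Q) penalty."""
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                         # Q(s, ·) for all actions
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a) on dataset actions
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    bellman_error = 0.5 * ((q_data - target) ** 2).mean()
    # Conservative term: soft maximum over all actions minus dataset-action values
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return alpha * conservative + bellman_error
```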

There's also a second variant where \rho(a|s) is set to the previous policy \hat{\pi}^{k-1}, which results in an exponentially weighted average of Q-values from actions sampled from \hat{\pi}^{k-1}(a|s).