
An Introduction to Reinforcement Learning

A comprehensive guide to modern Reinforcement Learning, from MDPs, through classic algorithms and Policy Gradients to Offline RL.



Goal

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.

Modeling and Terminology

Basic reinforcement learning is modelled as a Markov decision process (MDP), consisting of a set of states, a set of actions, state-transition probabilities, and a reward function (formalised below).

The agent's action selection is modelled as a map called a policy function \pi: A \times S \rightarrow [0,1] where \pi(a,s) = \Pr(a_t = a \mid s_t = s). The policy map gives the probability of taking action a in state s. The policy can alternatively be a deterministic map from s to a.

The goal of a reinforcement learning agent is to learn an optimal policy \pi^* which maximises the expected cumulative reward.

The value function V_\pi(s) (or state-value function) is defined as the expected return starting with state s and successively following policy \pi. Hence, roughly speaking, the value function estimates "how good" it is to be in a given state.

V_\pi(s) = \mathbb{E}[R \mid s_0 = s] = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\Big]

The value function quantifies the long-term desirability of states. It sums up all future rewards the agent can expect to receive when starting from state s and following policy \pi, with future rewards discounted by \gamma to prioritize immediate rewards.

The value V_\pi of a trajectory or policy \pi at a state s is the expected total reward R from the state to the final destination; here \gamma is the discount rate of future rewards.

The total reward is R = \sum_{t=0}^{\infty} \gamma^t r_t where \gamma \in [0,1).

Given a state s, action a and policy \pi, the action-value function of the pair (s,a) under policy \pi is the Q-function defined as

Q_\pi(s,a) = \mathbb{E}[R \mid s, a, \pi]

It is the expected total reward if we take action a in state s and follow policy \pi thereafter.

MDP Formulation

An MDP provides a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker. The "Markov" property means that the future state depends only on the current state and action, not on the history of previous states.

RL problems are formulated as an infinite-horizon discounted Markov decision process (MDP), defined by the tuple (S, A, P, r, \rho_0, \gamma), where S is a finite set of states, A is a finite set of actions, P: S \times A \times S \rightarrow \mathbb{R} is the transition probability distribution, r: S \rightarrow \mathbb{R} is the reward function, \rho_0: S \rightarrow \mathbb{R} is the distribution of the initial state s_0, and \gamma \in (0,1) is the discount factor.

Let \pi denote a stochastic policy \pi: S \times A \rightarrow [0,1], and let \eta(\pi) denote its expected discounted reward.

Goal (Criterion of optimality)

The optimal policy \pi^* is defined in terms of the state-value function as

V^*(s) = \max_\pi V_\pi(s)

i.e. the optimal policy maximises the total reward possible from state s.

MDP Algorithms

Assuming full knowledge of the MDP, the two basic approaches to compute the optimal action-value function are value iteration and policy iteration. Both algorithms compute a sequence of functions Q_k (k = 0, 1, 2, \ldots) that converge to Q^*.

Computing these functions involves computing expectations over the whole state-space, which is impractical for all but the smallest (finite) MDPs. Instead, expectations are approximated by averaging over samples and using function approximation techniques to cope with the need to represent value functions over large state-action spaces.

Both algorithms are based on the Bellman optimality equation.

Bellman Optimality

While this equation may look intimidating, it's essentially saying something intuitive: the best value we can achieve from a state equals the reward we get immediately plus the best value we can get from wherever we end up next. This recursive relationship is what makes dynamic programming solutions possible.

The Bellman optimality equation expresses the relationship between the value of a state and the values of its successor states under optimal behavior. It states that the value of a state under an optimal policy must equal the expected return for the best action from that state.

For the state-value function, the Bellman optimality equation is:

V^*(s) = \max_a \sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma V^*(s')\big]

And for the action-value function:

Q^*(s,a) = \sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma \max_{a'} Q^*(s',a')\big]

The Bellman optimality equation is particularly important because its solution gives us the optimal value function, from which we can derive the optimal policy. Once we have V^*, the optimal policy is to select actions that maximize the expected value:

\pi^*(s) = \argmax_a \sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma V^*(s')\big]

Policy Iteration

Policy iteration alternates between two steps:
(1) Policy evaluation - compute the value function for the current policy, and
(2) Policy improvement - update the policy to choose actions that maximize value based on the new value function.

In policy iteration (Howard, 1960), the state-value function V(s) is updated once, then the policy \pi(s, \cdot) is updated, and both steps are repeated until the policy converges.

The algorithm starts at i = 0 with a guess of the initial value and policy. The steps are repeated until the policy stops changing.

V_{i+1}(s) := \sum_{s'} P_{\pi_i(s)}(s,s')\big[R_{\pi_i(s)}(s,s') + \gamma V_i(s')\big]
\pi_{i+1}(s) := \argmax_a \Big[\sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma V_{i+1}(s')\big]\Big]
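As a concrete illustration, here is a minimal tabular sketch of policy iteration, assuming the dynamics are available as arrays P[a, s, s'] (transition probabilities) and R[a, s, s'] (rewards); the array layout and function name are illustrative, not taken from any particular library.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration. P[a, s, s'] are transition probabilities,
    R[a, s, s'] are rewards (illustrative array layout)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: one sweep of the V update for the current policy
        V = np.array([
            np.sum(P[policy[s], s] * (R[policy[s], s] + gamma * V))
            for s in range(n_states)
        ])
        # Policy improvement: greedy action w.r.t. the new value estimate
        Q = np.einsum('asn,asn->as', P, R + gamma * V)   # Q[a, s]
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):           # policy stopped changing
            return policy, V
        policy = new_policy
```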

Value Iteration

Value iteration directly computes the optimal value function by repeatedly applying the Bellman optimality equation. Unlike policy iteration, it does not maintain an explicit policy during the iterations; the optimal policy is derived from the converged value function afterwards.

In value iteration (Bellman, 1957), which is also called backward induction, the policy \pi is not used.

V_{i+1}(s) := \max_a \Big\{\sum_{s'} P_a(s,s')\big[R_a(s,s') + \gamma V_i(s')\big]\Big\}

Here, i is the iteration number. Value iteration starts at i = 0 with a guess V_0 and iteratively applies the equation until the left- and right-hand sides converge.
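A matching sketch of value iteration, under the same assumed P[a, s, s'] / R[a, s, s'] array layout as in the policy iteration example:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Tabular value iteration. P[a, s, s'] are transition probabilities,
    R[a, s, s'] are rewards (illustrative array layout)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = np.einsum('asn,asn->as', P, R + gamma * V)   # Q[a, s]
        V_new = Q.max(axis=0)                            # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:              # LHS ≈ RHS: converged
            return Q.argmax(axis=0), V_new               # greedy policy and V*
        V = V_new
```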

Stochastic Model-free Algorithms

When the complete model (MDP) of the environment is not available, the agent can learn from experience (i.e. by interacting with the environment).

Monte Carlo Value Iteration

Monte Carlo methods are computational algorithms that rely on repeated random sampling to obtain numerical results. They're used to solve problems that might be deterministic in principle, by using randomness to approximate solutions, especially useful for complex systems with many coupled degrees of freedom.

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns: they sample and average returns for each state-action pair, much as bandit methods sample and average rewards for each action.

They are typically applied to episodic tasks. The policy is updated only at the end of an episode.

(Note: episodic tasks are ones where the task terminates after some point; a game is an example.)

s_i, a_i \sim \pi(\cdot,\cdot) \quad \text{for one episode}
R_{total}(s_i) = \sum_{j=i}^N R(s_i \rightarrow s_j, a_{ij})
V(s_i) = \text{avg}(R_{total}(s_i))
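A minimal every-visit Monte Carlo sketch of this idea; run_episode is an assumed helper that rolls out one episode with the given policy and returns its (state, action, reward) sequence.

```python
from collections import defaultdict

def mc_value_estimate(run_episode, policy, gamma=1.0, n_episodes=1000):
    """Every-visit Monte Carlo estimate of V(s). `run_episode(policy)` is an
    assumed helper that plays one episode and returns a list of (s, a, r)."""
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = run_episode(policy)
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns[s].append(G)
    # V(s) is the average of the sampled returns from s
    return {s: sum(g) / len(g) for s, g in returns.items()}
```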

Temporal Difference Learning

TD methods only wait until the next time step to update the value estimates.

The update is done by transitioning from s_t \rightarrow s_{t+1}, collecting the reward r_{t+1} and then correcting V(s_t) using it.

The difference between the expected reward \hat{r}_{t+1} = V(s_t) - \gamma V(s_{t+1}) and the observed reward r_{t+1} is fed back to V(s_t) and is called the "temporal difference (TD) error".

V(s_t) \leftarrow V(s_t) + \alpha(r_{t+1} - \hat{r}_{t+1}) = V(s_t) + \alpha\big(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big)

After the optimal V(s) is learnt, the policy \pi(\cdot,\cdot) is determined either with a greedy policy, where the action taken at any state maximises the total reward, or with an ε-greedy policy, where the best action is taken with probability 1-ε and a random action with probability ε.
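A rough TD(0) sketch, assuming a gym-style environment with reset() and a step() that returns (next_state, reward, done); the interface is an assumption for illustration.

```python
from collections import defaultdict

def td0_value_estimate(env, policy, alpha=0.1, gamma=0.99, n_episodes=1000):
    """TD(0) state-value learning. `env` is assumed to follow a
    reset()/step(action) -> (next_state, reward, done) interface."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            td_target = r + gamma * V[s_next] * (not done)
            V[s] += alpha * (td_target - V[s])   # feed the TD error back into V(s)
            s = s_next
    return V
```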

Forms of TD Error

The temporal difference target can include many steps of observed rewards before using the estimated value V(s_{t+k}).

Q-Learning

Q-learning is similar to TD-learning, but the action-value (Q) function is updated instead of the state-value (V) function.

Q(s_t, a_t) \leftarrow Q(s_t,a_t) + \alpha\big(r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t,a_t)\big)
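A tabular Q-learning sketch with an ε-greedy behaviour policy, under the same assumed environment interface as the TD(0) example.

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.99, eps=0.1, n_episodes=1000):
    """Tabular Q-learning. `env` is assumed to expose reset()/step() as above;
    `actions` is the list of available actions."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # ε-greedy: random action with probability ε, greedy otherwise
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            target = r + gamma * best_next * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```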

Deep Q-Learning

This is the same as Q-learning but the Q-function is approximated using a neural network Q(s,a,\theta) where \theta are the weights of the network.

The loss function is called “mean squared TD error” (Sutton & Barto, 2018, p. 199) and is computed by squaring the TD-error.

Loss = \big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a, \theta) - Q(s_t, a_t, \theta)\big]^2

(A stationary distribution is one that doesn't change over time - its statistical properties remain constant.)

This loss function causes divergence because states are correlated and rewards are non-stationary (e.g. the reward for the same state changes over time).

To deal with correlated states, the agent builds a dataset of experience and then takes random samples from the dataset, to break the correlation. This is called Experience Replay (Mnih et al., 2013).

To deal with non-stationary targets, the agent computes the TD target with a fixed copy of the parameters \theta^- and only periodically updates this copy to the current parameters (i.e. on every K-th iteration).

Loss = \big[r_{t+1} + \gamma \max_{a'} Q(s',a',\theta^-) - Q(s,a,\theta)\big]^2

This can be improved by prioritising the samples picked from the experience dataset, using two different networks for \theta and \theta^-, etc.
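A hedged PyTorch-style sketch of the resulting DQN loss on a replay minibatch with a separate target network; the network and batch layout are assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss for a DQN-style update. `q_net`/`target_net` map states to
    per-action Q-values; `batch` is an assumed tuple of tensors sampled
    from the experience-replay dataset."""
    s, a, r, s_next, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a, θ)
    with torch.no_grad():
        # Bootstrapped target uses the periodically-updated parameters θ⁻
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q, target)
```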

Regular experience replay can struggle with sparse rewards where successful experiences are rare. For example, a robot might try thousands of times to grab an object without succeeding, giving no useful learning signal.

Hindsight Experience Replay (Andrychowicz et al., 2017) addresses this by treating failures as successes for a different goal.

When the agent fails its intended goal, HER saves both the original experience and a modified version where the achieved outcome becomes the goal. This lets the agent learn from every attempt, even failed ones. The approach works by augmenting the experience tuple (s_t, a_t, r_t, s_{t+1}, g) with additional versions using achieved goals g' from the same episode.
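A possible sketch of the HER relabelling step; the episode tuple layout and the compute_reward helper are assumptions about how goals, achieved outcomes and rewards are represented in a given task.

```python
import random

def her_relabel(episode, compute_reward, k=4):
    """Hindsight relabelling sketch: for each transition, also store copies
    whose goal is an outcome actually achieved later in the same episode.
    `episode` holds (s, a, r, s_next, goal, achieved) tuples and
    `compute_reward(achieved, goal)` is an assumed task-specific helper."""
    relabelled = []
    for t, (s, a, r, s_next, goal, achieved) in enumerate(episode):
        relabelled.append((s, a, r, s_next, goal))             # original goal
        future = episode[t:]
        for _ in range(min(k, len(future))):
            _, _, _, _, _, new_goal = random.choice(future)    # an achieved outcome
            new_r = compute_reward(achieved, new_goal)
            relabelled.append((s, a, new_r, s_next, new_goal))
    return relabelled
```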

Policy Gradients

Here we learn the policy \pi(\cdot,\cdot) directly instead of learning V or Q functions and deriving a policy from them. The policy is approximated with a neural network \pi(s, a, \omega) and the parameters \omega are updated using policy gradients.

The expected cumulative reward for a policy with parameters \omega is given by J(\omega).

The best policy \pi^*(\cdot, \cdot) is found by doing gradient ascent on the cumulative reward, i.e. by finding the \omega resulting in the largest cumulative reward.

The notation \tau \sim \pi(\omega) means \tau is sampled from (or follows) the distribution defined by the policy \pi with parameters \omega. It's similar to how x \sim \mathcal N(\mu, \sigma^2) means x is sampled from a normal distribution with mean \mu and variance \sigma^2.

J(\omega) = \mathbb{E}_{\tau\sim\pi(\omega)}[R(\tau)] = \mathbb{E}_{\tau\sim\pi(\omega)}\Big[\sum_{i=0}^N \gamma^i r_{t+i}\Big] = \int \pi(\tau,\omega) R(\tau)\, d\tau

Here, \tau is a trajectory of (s,a) pairs, and \pi(\omega) is the distribution of trajectories defined by the policy \pi with parameters \omega.

The \omega parameters are then updated as \omega_{t+1} = \omega_t + \alpha \nabla_\omega J(\omega_t).

Since the gradient of J(\omega) involves differentiating products, we use logarithms to simplify. This is done using the policy gradient theorem, which uses the following identity.

\nabla_\omega f(x,\omega) = f(x,\omega) \frac{\nabla_\omega f(x,\omega)}{f(x,\omega)} = f(x,\omega)\nabla_\omega \log f(x,\omega)

(The policy gradient theorem assumes: 1) the policy is differentiable, 2) state transitions and rewards are Markovian, 3) episodes have finite length.)

With this, the J-function's gradient is computed as

\nabla_\omega J(\omega) = \int\nabla_\omega \pi(\tau,\omega)\, r(\tau) = \int \pi(\tau,\omega)\nabla_\omega \log \pi(\tau,\omega)\, r(\tau) = \mathbb{E}_{\tau\sim\pi(\omega)}\big[r(\tau)\nabla_\omega \log \pi(\tau,\omega)\big]

See Lilian Weng, 2018 for a detailed derivation.

The gradient of J(\omega) is represented as an expectation of a process, and we can sample from the process to approximate the expectation (i.e. Monte Carlo).

The log-policy \log \pi_\omega(\tau) can be expanded as (assuming a discrete infinite state space)

\pi(\tau,\omega) = \pi_\omega(s_1, a_1, s_2, ...) = p(s_1)\prod_{t=1}^\infty \pi_\omega(a_t|s_t)\, p(s_{t+1}|s_t,a_t)
\log \pi_\omega(\tau) = \log p(s_1) + \sum_{t=1}^\infty \log \pi_\omega(a_t|s_t) + \sum_{t=1}^\infty \log p(s_{t+1}|s_t,a_t)

The derivative of the log-policy is then

\nabla_\omega\log\pi_\omega(\tau) = \sum_{t=1}^\infty \nabla_\omega \log \pi_\omega(a_t|s_t)

Substituting the reward function as a sum, limiting the time horizon to T, and expanding the expectation as a mean over N sampled trajectories, the loss function becomes

Loss = g := \nabla_\omega J(\omega) = \frac{1}{N}\sum_{i=1}^N \Big[\sum_{t=1}^T \nabla_\omega \log \pi_\omega(a_{i,t}|s_{i,t})\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t})\Big]

The complete vanilla policy gradient algorithm is

  1. Sample N trajectories \{\tau_i\} from \pi_\omega(a_t|s_t), each with a maximum duration of T
  2. Compute the loss function with existing policy neural network
  3. Back-propagate the loss into the policy neural network with an optimiser

See Sutton et al, 2000.

Note: The total reward is now represented in an un-discounted way. This is discussed later.

The vanilla policy gradient algorithm has a major issue: it updates the policy based on the total reward of the trajectory, which means actions early in the trajectory are credited/blamed for rewards that came after them. This is inefficient and can lead to incorrect learning.

REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) solves this by using temporal causality - each action is only updated based on the rewards that came after it. The algorithm is:

  1. Sample trajectory \tau from policy \pi_\omega
  2. For each timestep t:
    • Compute return G_t = \sum_{k=t}^T \gamma^{k-t} r_k (only future rewards)
    • Update policy: \omega \leftarrow \omega + \alpha\gamma^t G_t \nabla_\omega\log\pi_\omega(a_t|s_t)

This gives a more focused learning signal for each action based only on its consequences.

See Sutton et al, 2000 and Williams, 1992 for more details.
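A minimal REINFORCE sketch of the update above, assuming policy(state) returns a torch.distributions.Categorical and the episode is a list of (state, action, reward) tuples; the interfaces are assumptions for illustration.

```python
import torch

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """One REINFORCE update from a single episode."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):            # reward-to-go G_t
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    loss = 0.0
    for t, ((s, a, _), G_t) in enumerate(zip(episode, returns)):
        log_prob = policy(s).log_prob(torch.as_tensor(a))
        loss = loss - (gamma ** t) * G_t * log_prob   # minimise -J for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```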

Benefit of log-policy over naive policy

Converting the policy to the log-policy prevents the loss from exploding or vanishing when the states are highly correlated (which would otherwise cause gradients to be very large or very small).

High Variance of Policy Gradients

The policy gradient algorithm suffers from high variance because:

  1. The stochastic policy \pi can generate different actions for the same states, leading to different trajectories
  2. These different trajectories can result in vastly different total rewards
  3. This variance in rewards creates noisy gradient estimates, making it difficult for the algorithm to converge to an optimum

While Monte Carlo sampling in policy gradients is unbiased (meaning it gives the correct gradient in expectation), its high variance means individual gradient estimates can point in conflicting directions.

Solutions to reduce variance:

  1. Larger batch sizes: Averaging over more trajectories reduces variance in the gradient estimate. However, this comes at the cost of reduced sample efficiency since more environment interactions are needed per update.

  2. Baseline subtraction: Introduce a baseline (typically the state-value function V(s)) to center the rewards. This changes the objective from tracking absolute rewards to tracking relative advantages:

    • Original: \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, R_t\big]
    • With baseline: \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, (R_t - V(s_t))\big]

The baseline reduces variance by normalizing the scale of rewards. To understand why:

  1. Consider two actions in a state:
    • Action A: Usually gives rewards of either 90 or 110
    • Action B: Usually gives rewards of either -10 or 10
  2. Without a baseline:
    • For action A, the policy gradient will use large values (90 or 110)
    • For action B, the policy gradient will use smaller values (-10 or 10)
    • These different scales make learning difficult and increase variance
  3. With a baseline V(s) = 100 for action A's state and V(s) = 0 for action B's state:
    • Action A's advantages become: -10 or +10 (from 90-100 or 110-100)
    • Action B's advantages remain: -10 or +10
    • Now both actions use similar scales, reducing variance

The baseline doesn't change which action is better (their relative ordering stays the same), but it makes the learning signal more consistent by removing the effect of different reward scales across states.

The baseline reduces variance without introducing bias, since \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a_t|s_t)\, V(s_t)] = 0.

Advantage of an action in a state over a baseline

We can add or subtract a term in the optimisation problem as long as the term is not related to the parameters \omega.

g = \frac{1}{N}\sum_{i=1}^N \Big[\sum_{t=1}^T \nabla_\omega \log \pi_\omega(a_t|s_t)\Big]\Big[\sum_{t=1}^T r(s_{i,t},a_{i,t}) - b(s_{i,t})\Big]

The first term of the reward sum is the action-value function Q_\pi(s,a) and the second term is a baseline term. Their difference is called the advantage A_\pi(s,a) of an action a in state s, when using policy \pi, over the baseline b.

g = \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \nabla_\omega \log \pi_\omega(a_t|s_t)\; A_\pi(s_i, a_i)

The most common baseline is the state-value function V_\pi(s), which represents the total reward expected at state s when following policy \pi. So A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s).

With a baseline, the updated algorithm is

  1. Sample N trajectories \{\tau_i\} from \pi_\omega(a_t|s_t), each with a maximum duration of T
    1. At each time-step in each trajectory, compute the total reward R_t = \sum_{k=t}^T r_k and the advantage estimate A_t = R_t - b(s_t)
    2. The total reward can also be discounted: R_t = \sum_{k=t}^T \gamma^{k-t} r_k
  2. Re-fit the baseline by minimising the error ||b(s_t) - R_t||^2 over all trajectories and time steps
  3. Compute the new loss function
  4. Back-propagate the loss into the policy neural network with an optimiser

See Schulman et al, 2016 for more details.
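A rough sketch of this variant with a learned state-value baseline; the policy, value network and optimizers are assumed to be ordinary PyTorch modules, and the trajectory layout is an assumption.

```python
import torch

def pg_with_baseline_update(policy, value_fn, pi_opt, v_opt, trajectories, gamma=0.99):
    """Policy-gradient update with a learned baseline. `policy(s)` returns a
    torch.distributions.Categorical; `value_fn(s)` returns a scalar estimate."""
    policy_loss, value_loss = 0.0, 0.0
    for episode in trajectories:                  # each: list of (s, a, r) tuples
        G, returns = 0.0, []
        for _, _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, a, _), R_t in zip(episode, returns):
            b = value_fn(s)
            advantage = R_t - b.detach()          # A_t = R_t - b(s_t)
            policy_loss -= policy(s).log_prob(torch.as_tensor(a)) * advantage
            value_loss += (b - R_t) ** 2          # re-fit the baseline to R_t
    for opt, loss in ((pi_opt, policy_loss), (v_opt, value_loss)):
        opt.zero_grad()
        loss.backward()
        opt.step()
```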

Reward variations

  1. Causal Rewards: The loss function can be updated to make the reward calculation causal, by computing rewards from t \rightarrow T instead of from 1 \rightarrow T
  2. Discounted Rewards: The reward discount (\gamma) reduces variance by reducing the impact of distant actions. This is also strictly necessary to interpret the loss function as a true gradient.

General Policy Gradient Loss

Schulman et al, 2016 explains that policy gradient methods maximise the expected total reward by repeatedly estimating the gradient g := \nabla_\theta \mathbb{E}\big[\sum_{t=0}^\infty r_t\big]. There are several different related expressions for the policy gradient, which have the form

g = \mathbb{E}_{\tau\sim \pi}\Big[\sum_{t=0}^\infty \Psi_t \nabla_\theta \log \pi_\theta(a_t | s_t)\Big]

Where \Psi_t may be one of the following

  1. \sum_{t=0}^\infty r_t : total reward of the trajectory
  2. \sum_{t=T}^\infty r_t : total reward following action a_T
  3. \sum_{t=T}^\infty r_t - b(s_T) : total reward following (s_T, a_T) w.r.t. a baseline
  4. Q_\pi(s_t,a_t) : state-action value function
  5. A_\pi(s_t,a_t) : advantage function
  6. r_{t+1} + \gamma V_\pi(s_{t+1}) - V_\pi(s_t) : one-step TD residual

Problems with Policy Gradients

An analogy might help understand these challenges:

Imagine teaching someone to cook by giving them recipe feedback. Policy Gradients is like:

  1. Having to throw away all previous cooking notes after each attempt (sample efficiency)
  2. Tiny adjustments to ingredient amounts sometimes completely ruining the dish (parameter space mismatch)
  3. Not properly considering how early cooking steps affect the final taste (gradient estimation)

This is why modern RL often uses more sophisticated methods that build guardrails around these issues, similar to how cooking schools provide structured learning environments rather than pure trial and error.

There are three main problems with policy gradients:

  1. Poor Sample Efficiency

    • Each policy update invalidates previously collected data since the policy has changed
    • Unlike Deep Q-Networks where old experiences remain useful, PG methods require fresh data for each update
    • This is because PG uses on-policy expectations - we can only use data collected from the current policy
  2. Parameter-Policy Space Mismatch

    • The parameter space (\omega) and policy space (\pi) are different manifolds with different metrics, meaning distances/lengths in one space don't correspond to equivalent distances in the other
    • This means tiny parameter updates can cause unexpectedly large policy changes, making optimization unstable, and making it difficult to control how much the policy actually changes during updates
    • The disconnect between parameter updates and policy behavior makes finding appropriate learning rates challenging
  3. Gradient Estimation Issues

    • The policy gradient estimator g(\cdot,\theta) doesn't properly incorporate discounted rewards
    • This leads to biased gradient estimates that can harm convergence
    • The bias comes from treating all time steps equally instead of properly weighting future rewards

See the callout for more details on these issues.

Let's break down the math behind issues 1 and 2:

1. Sample Efficiency Issue

When we collect data using policy π₁ and then update to π₂, we need to compare how likely each trajectory is under both policies. This is done using the probability ratio:

P(τ|π₂)/P(τ|π₁) = ∏ᵢ π₂(aᵢ|sᵢ)/π₁(aᵢ|sᵢ)

This ratio represents how much more/less likely the trajectory τ is under the new policy π₂ compared to the old policy π₁. Since it's a product of many terms (one for each timestep), even small differences between π₁ and π₂ get amplified exponentially, making old data unusable.

2. Parameter-Policy Space Mismatch

The parameter space θ and policy space π have fundamentally different distance metrics. In parameter space, we use Euclidean distance ||θ₁ - θ₂||, while in policy space we measure distances using KL divergence KL(πθ₁||πθ₂). These metrics don't align:

KL(πθ₁||πθ₂) ≠ c||θ₁ - θ₂||

where KL is Kullback-Leibler divergence and c is any constant. This means a small step in parameter space can cause a large change in policy space. The relationship is highly non-linear and makes optimization unstable.

Importance Sampling

Basics

Sample efficiency can be improved by taking trajectories from older policies and changing the loss function to account for the difference.

Importance sampling is a technique for estimating the expectation of a function f(x), x \sim P(x) using samples drawn from a different distribution Q(x).

\mathbb{E}_{x\sim P}[f(x)] = \mathbb{E}_{x\sim Q}\Big[f(x)\frac{P(x)}{Q(x)}\Big]

The ratio W(x) = P(x)/Q(x) is the importance sampling weight for x. Let \hat{\mu}_Q be our importance sampling estimator - the average of f(x)W(x) over N samples drawn from Q. The variance of this estimator is:

var(\hat{\mu}_Q) = \frac{1}{N} var\big(f(x)W(x)\big) = \frac{1}{N}\Big(\mathbb{E}_{x\sim Q}\big[(f(x)W(x))^2\big] - \big[\mathbb{E}_{x\sim Q} f(x)W(x)\big]^2\Big) = \frac{1}{N}\Big(\mathbb{E}_{x\sim P}\Big[\frac{P(x)}{Q(x)} f(x)^2\Big] - \big[\mathbb{E}_{x\sim P} f(x)\big]^2\Big)

The first term can be a major problem because it involves the ratio P(x)/Q(x). If Q(x) is small in regions where P(x)f(x)² is large, this ratio explodes and causes extremely high variance. This insight leads to several key developments in later sections:

  1. In policy gradients, this manifests as the ratio of policy probabilities becoming unstable when policies diverge too much
  2. Trust region methods like TRPO address this by explicitly constraining how much policies can change
  3. PPO simplifies this by clipping the ratio to a fixed range, providing a more practical solution

This variance issue is fundamental to why naive policy gradient methods can be unstable and why constraining policy updates becomes so important.

For a detailed treatment of importance sampling, see Owen, 2013 which covers both the theory and practical considerations.
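A small numerical illustration of the estimator and its weights, using two unit-variance Gaussians where the true expectation is known; the distributions and function are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{x~P}[f(x)] for P = N(1, 1) using samples drawn from Q = N(0, 1).
def f(x):
    return x ** 2

def log_density(x, mean):            # log of a unit-variance Gaussian density
    return -0.5 * (x - mean) ** 2    # the 1/sqrt(2π) factor cancels in the ratio

x = rng.normal(loc=0.0, scale=1.0, size=100_000)           # x ~ Q
w = np.exp(log_density(x, 1.0) - log_density(x, 0.0))      # W(x) = P(x)/Q(x)
estimate = np.mean(f(x) * w)

# Ground truth: for x ~ N(1, 1), E[x^2] = variance + mean^2 = 2
print(estimate)   # ≈ 2, but with higher variance than sampling from P directly
```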

Usage in Policy Gradients

Using the importance sampling technique described above, we can improve the efficiency of policy gradient methods. Instead of sampling trajectories from the current policy \omega being optimized, we can reuse trajectories from an older policy \omega' by applying importance sampling weights:

g = \nabla_\omega J(\omega) = \mathbb{E}_{\tau\sim\omega}\Big[\sum_{t}\gamma^t\nabla_\omega\log \pi_\omega(a_t|s_t)\, A_\omega^\pi(s_t,a_t)\Big] = \mathbb{E}_{\tau\sim\textcolor{red}{\omega'}}\Big[\sum_{t} \textcolor{red}{\frac{P(\tau_t|\omega)}{P(\tau_t|\omega')}}\, \gamma^t\nabla_\omega\log \pi_\omega(a_t|s_t)\, A_\omega^\pi(s_t,a_t)\Big]

However, this introduces a new challenge. Each term P(\tau|\cdot) is a product of state transition and policy probabilities: p(s_0)\prod p(s_{t+1}|s_t,a_t)\pi(a_t|s_t). Due to this product structure, even small differences between policies can cause the importance sampling weight to become extremely large or small, leading to the high variance problem discussed earlier. This variance issue will be addressed by Natural Policy Gradient (NPG) and subsequent methods.

Monotonic Improvement Algorithms

Monotonic Improvement Algorithms are a class of policy optimization methods that guarantee each policy update will improve (or at least not decrease) the expected reward. This directly addresses the variance issues we saw with importance sampling in policy gradients, where large policy changes could lead to unstable updates.

The key idea is to bound policy changes while still maximizing improvement. By carefully controlling how much policies can diverge from each other, these methods avoid the exploding importance sampling ratios we discussed earlier.

Relative Policy Performance Identity

For two policies \pi and \pi', the relative performance can be computed as (see Schulman et al, 2017 Appendix A for proof)

J(\pi') - J(\pi) = \mathbb{E}_{\tau\sim\pi'}\Big[\sum_t \gamma^t A_\pi(s_t,a_t)\Big]

We can use this identity to sample from the old policy \pi and use it to improve the current one \pi'.

\max_{\pi'} J(\pi') = \max_{\pi'} \big[J(\pi') - J(\pi)\big] = \max_{\pi'} \mathbb{E}_{\tau\sim\pi'}\Big[\sum_t \gamma^t A_\pi(s_t,a_t)\Big]

This still requires trajectories sampled from \pi'. To fix this, first rewrite the expectation in terms of states and actions (see the Ergodic Theorem).

J(\pi') - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{a\sim\pi', s\sim d_{\pi'}}\Big[A_\pi(s,a)\Big] = \frac{1}{1-\gamma}\, \mathbb{E}_{a\sim\textcolor{red}{\pi}, s\sim d_{\pi'}}\Big[\textcolor{red}{\frac{\pi'(a|s)}{\pi(a|s)}}\, A_\pi(s,a)\Big]

where d_\pi(s) is the discounted future state distribution (normalized to 1) and is defined as

d_\pi(s) = (1-\gamma) \sum_{t=0}^\infty \gamma^t P(s_t=s|\pi) = (1-\gamma)\big[P(s_0=s) + \gamma P(s_1=s) + \gamma^2 P(s_2=s) + \ldots\big]

The discounted future state distribution dπ(s) represents the normalized discounted sum of probabilities of being in state s at each timestep t when following policy π. The (1-γ) term normalizes the infinite sum to create a proper probability distribution.

While similar in concept, this is distinct from the stationary distribution of the Markov chain induced by π. The stationary distribution, if it exists, is the limiting distribution as t→∞, while dπ(s) explicitly weights earlier timesteps more heavily through the discount factor.

The state visitation frequency ρπ(s) measures how often a policy visits each state in practice. For infinite horizon problems with γ=1, dπ(s) converges to ρπ(s), making it a useful theoretical bridge between discounted and average reward settings.

This equation still has states sampled from \pi', which can be fixed as

J(\pi') - J(\pi) \approx \frac{1}{1-\gamma}\, \mathbb{E}_{a\sim\pi, s\sim \textcolor{red}{d_{\pi}}}\Big[\frac{\pi'(a|s)}{\pi(a|s)}\, A_\pi(s,a)\Big] = \mathcal{L}_\pi(\pi')

where \mathcal L_\pi(\pi') is the lower bound or local approximation of the relative performance.

It can be proven that the residual error J(\pi') - J(\pi) - \mathcal L_\pi(\pi') has an upper bound related to the KL-divergence between the two policies. See Constrained Policy Optimization; Achiam et al, 2017 for proof. So if the divergence is very small, this approximation is very accurate.

J(\pi') - J(\pi) - \mathcal L_\pi(\pi') \le C \sqrt{\mathbb{E}_{s\sim d_\pi}\big[D_{KL}(\pi'||\pi)[s]\big]}

Majorize-Minimization

Since \mathcal L - C\sqrt{\mathbb{E}[D_{KL}(..)]} is a lower bound on the relative performance between \pi and \pi', we can use it as a surrogate objective, instead of the main objective. Maximising it will always maximise the true objective. See MM algorithm.

The Majorize-Minimization (MM) algorithm is a general optimization technique that works by iteratively constructing and optimizing surrogate functions.

At each step, it creates a surrogate function that "majorizes" (upper bounds) the objective function and touches it at the current point. By minimizing this surrogate, we're guaranteed to improve or maintain the original objective.

The key idea is that the surrogate should be easier to optimize than the original function while still preserving the essential properties that ensure convergence. This approach is particularly useful when the original objective is difficult to optimize directly.

Here, both \mathcal L and the KL-divergence can be estimated from samples of the policy \pi.

Note: The value of C provided in the paper can be high if \gamma is near 1. Their recommendation is to use a KL constraint instead of the KL penalty implied in the original equation.

Constraint Equation

\pi_{k+1} = \argmax_{\pi'} \mathcal L_{\pi_k}(\pi') \quad \text{such that } \mathbb{E}_{s\sim d_{\pi_k}}\big[D_{KL}(\pi',\pi_k)[s]\big] \le \delta

The region defined by the threshold \delta is called a trust region (see Trust region).

This algorithm addresses the first two failure modes discussed earlier: poor sample efficiency and the mismatch between distances in parameter space and policy space.

Natural Policy Gradient

While the policy gradient methods above work, they can be inefficient since they don’t account for the geometry of the policy space. Natural Policy Gradient (NPG) addresses this by using information geometry to find better update directions (see A Natural Policy Gradient).

Metric Tensor Basics

A metric tensor G defines how we measure distances in a space. In the context of policy optimization, this is crucial because it helps us measure how "different" two policies are.

Think of the metric tensor like a "ruler" that changes based on where you are in the space. Just like measuring distance on a curved surface requires knowing how the surface bends, measuring distance between policies requires knowing how the policy space curves.

The squared distance between nearby points \vec{v} and \vec{v}+\delta v is given by:

dist^2(\vec{v}, \vec{v}+\delta v) = \delta v^T G(v)\, \delta v

In the simplest case (orthogonal coordinates), G is just the identity matrix, giving us the familiar Euclidean distance:

dist^2(\vec{v}, \vec{v}+\delta v) = \delta v^T \delta v = \sum_i (\delta v_i)^2

When we change coordinate systems from \vec{v} to \vec{w}, the relationship between small changes in each system is:

\delta v_i = \sum_j \frac{\partial v_i}{\partial w_j} \delta w_j

We can write this more compactly using the Jacobian matrix A where A_{ij} = \partial v_i/\partial w_j:

\delta v = A\, \delta w

The metric tensors in the two coordinate systems (G_v and G_w) are related by:

G_w = A^T G_v A

Similarly, gradients transform as:

g_w = A^T g_v

The natural gradient directions in each coordinate system (\vec{dv} = G^{-1}_v g_v and \vec{dw} = G^{-1}_w g_w) are related by:

\vec{dw} = A^{-1}\vec{dv}

This last equation shows that G^{-1}g transforms properly under coordinate changes - it transforms like the coordinate displacements themselves (a contravariant vector), unlike the plain gradient. This is why natural gradients are "natural": they give us update directions that are invariant to how we parameterize our policy.

Metric Tensor for Probability Spaces

(see Amari, S. (1998))

When optimizing policies, we want to find the direction that improves our objective J(\theta) most efficiently. This is the direction of steepest descent - the vector d\theta that minimizes J(\theta + d\theta) while keeping the step size |d\theta|^2 constant.

If we measure distances using a metric tensor G, then the steepest descent direction is G^{-1}\nabla J(\theta). Crucially, this metric should depend on the underlying probability space our policy operates in, not on how we happened to parameterize it.

The Fisher Information Matrix measures how much a probability distribution changes when we adjust its parameters. It characterizes the local curvature of the probability space by measuring changes in the log-likelihood – regions where small parameter changes cause large changes in log-likelihood will have high Fisher Information.

For our policy optimization, Fs(θ) measures how sensitive our policy's action distribution is to parameter changes in state s. Small changes to parameters that dramatically alter the policy's log-likelihood in a state will result in high Fs(θ) values for that state.

The full metric G averages these state-specific sensitivities according to how often we visit each state. This means we're naturally more careful with parameters that affect frequently-visited states, helping prevent destabilizing changes to critical parts of our policy.

The Fisher Information Matrix provides the appropriate metric tensor for our optimization, as it captures the local geometry of the probability space. For each state s, our policy \pi(a;s,\theta) defines a probability distribution over actions, parameterized by \theta. The Fisher Information Matrix for this distribution is:

F_s(\theta) = \mathbb{E}_{\pi(a;s,\theta)}\Big[\frac{\partial \log \pi(a;s,\theta)}{\partial \theta_i} \frac{\partial \log \pi(a;s,\theta)}{\partial \theta_j}\Big]

Since our policy operates over many states, we average these matrices according to how often we visit each state, giving us our final metric tensor:

G \equiv F(\theta) = \mathbb{E}_{s\sim d_\pi}[F_s(\theta)]

where d_\pi is the stationary state distribution under policy \pi.

Changes needed

The update rule for the network parameters is \theta_{k+1} = \theta_k + \alpha g where g is the policy gradient estimator.

The first change is to use the gradient with a proper metric, i.e. g \rightarrow F^{-1}g.

The second change is to compute \alpha so as to satisfy the KL-divergence constraint specified in the monotonic improvement section (see the constraint equation).

Derivation of metric and coefficient

The lower-bound \mathcal L and the KL-divergence are expanded via Taylor series as

L_{\theta_k}(\theta) \approx L_{\theta_k}(\theta_k) + g\,\Delta\theta
D_{KL}(\theta||\theta_k) \approx D_{KL}(\theta_k||\theta_k) + \Delta\theta^T \nabla_\theta D_{KL}(\theta||\theta_k)\big|_{\theta=\theta_k} + \frac{1}{2} \Delta\theta^T \nabla^2_\theta D_{KL}(\theta||\theta_k)\big|_{\theta=\theta_k}\Delta\theta

where \Delta\theta = \theta - \theta_k is the proposed update of the parameters.

Here, the first term of both expansions is zero. The first-derivative term of the KL-divergence expansion is also zero, as shown in Slide 33 of Lecture 13: Natural Policy Gradients.

The second-derivative part simplifies to \nabla^2_\theta D_{KL}(..) = \mathbb{E}_{\theta_k}[\nabla_\theta\log\pi\cdot\nabla_\theta\log\pi^T], which is the Fisher information matrix.

F(\theta_k) = \nabla^2_\theta D_{KL}(\pi_\theta||\pi_{\theta_k})
D_{KL}(\pi_\theta||\pi_{\theta_k}) = \frac{1}{2}\Delta\theta^T F(\theta_k)\Delta\theta

The best \Delta\theta that we can get is found by solving the following optimization problem, which is equivalent to the constrained problem specified earlier.

\Delta\theta^* = \argmax_{\Delta\theta=\theta-\theta_k} \Big[L_{\theta_k}(\theta) - \lambda D_{KL}(\pi_\theta||\pi_{\theta_k})\Big] = \argmin_{\Delta\theta} \Big[-g\Delta\theta + \frac{1}{2}\lambda\Delta\theta^T F(\theta_k)\Delta\theta\Big]

(\argmax became \argmin by multiplying through by -1)

The term inside the \argmin can be differentiated w.r.t. \Delta\theta and set to 0 to find the minimum.

0 = -g + \lambda F(\theta_k)\Delta\theta^* \quad \text{or:} \quad \Delta\theta^* = \lambda' F(\theta_k)^{-1} g

The update rule is now \theta_{k+1} \leftarrow \theta_k + \alpha F(\theta_k)^{-1} g_k

To derive the learning rate \alpha, we simplify the KL-divergence expansion and substitute \Delta\theta = \alpha F^{-1}g. Since F is symmetric, F = F^T and (F^{-1})^T = F^{-1}.

D_{KL}(\theta||\theta_k) \approx \frac{1}{2}\Delta\theta^T F(\theta_k)\Delta\theta = \frac{1}{2}(\alpha F^{-1}g)^T F(\alpha F^{-1} g) \le \delta \Rightarrow \alpha \le \sqrt{\frac{2\delta}{g^T F^{-1} g}}

Algorithm

  1. Collect trajectories on the policy \pi_k = \pi(\theta_k)
  2. Estimate advantages A_t^{\pi_k} using any advantage estimation algorithm
  3. Form sample estimates for
    1. Policy gradient g_k using the advantage estimates
    2. Fisher information (or KL-divergence hessian) F_k
  4. Compute the parameter update as follows and backpropagate
\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{g_k^T F_k^{-1} g_k}}\, F_k^{-1} g_k

Truncated NPG

While Natural Policy Gradient provides theoretically sound updates, it faces a practical computational challenge: the Fisher Information Matrix F grows quadratically with the number of parameters N. Storing it requires O(N^2) memory and inverting it requires O(N^3) computation, which is infeasible for modern neural network policies.

To make NPG practical, we can avoid explicitly computing F^{-1} by using the conjugate gradient algorithm. Instead of directly computing x = F^{-1}g, we reformulate the problem as solving the linear system:

Fx = g

The conjugate gradient method is an iterative algorithm for solving linear systems Ax=b where A is symmetric and positive definite. Rather than computing A⁻¹ directly, it:

  1. Starts with an initial guess x₀
  2. Iteratively improves the solution by moving along carefully chosen directions
  3. These directions are "conjugate" (A-orthogonal) to each other, ensuring efficient convergence
  4. Often finds good approximate solutions in far fewer than n iterations

Think of conjugate gradient as finding the valley floor of a quadratic bowl (Fx=g) by taking clever steps that are always perpendicular to previous steps with respect to the bowl's shape (F). This is much more efficient than trying to map out the entire bowl (computing F⁻¹).

The conjugate gradient method is particularly efficient because:

  1. We never need to store F explicitly - we only need to compute matrix-vector products Fv
  2. These products can be computed efficiently using automatic differentiation
  3. We can truncate the iteration early once we have a “good enough” solution
  4. The algorithm naturally exploits any sparsity or structure in F

This “truncated” version of NPG maintains most of the benefits of natural gradient descent while being computationally tractable for large neural networks. See Lecture 5: Advanced Policy Gradient Methods for more details.
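A sketch of the two ingredients, assuming a compute_mean_kl(policy, states) helper (a hypothetical name) that evaluates the average KL-divergence to the old policy; the Fisher-vector product uses double backprop rather than forming F explicitly.

```python
import torch

def fisher_vector_product(policy, states, v, damping=1e-2):
    """Compute F·v without forming F, using F = ∇²_θ D_KL(π_θ_old‖π_θ) at θ_old."""
    kl = compute_mean_kl(policy, states)                    # assumed helper
    grads = torch.autograd.grad(kl, policy.parameters(), create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * v).sum()
    hvp = torch.autograd.grad(grad_v, policy.parameters())  # Hessian-vector product
    return torch.cat([h.reshape(-1) for h in hvp]) + damping * v

def conjugate_gradient(Avp, b, iters=10, tol=1e-10):
    """Solve Ax = b (here F x = g) using only matrix-vector products Avp(v)."""
    x = torch.zeros_like(b)
    r, p = b.clone(), b.clone()
    rs_old = r @ r
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:              # truncate early once the residual is small
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```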

Trust Region Policy Optimization

While NPG provides a theoretically sound way to compute update directions, there's still a practical challenge: choosing the right step size. The step derived from the trust region parameter \delta relies on linear and quadratic approximations of the objective and the KL-divergence, so in practice it can overshoot the trust region or fail to improve the surrogate objective.

A trust region is like a safety zone where we believe our approximations are reliable. If we step outside this zone, our estimates of improvement might be wrong. Line search helps us find the largest safe step we can take within this zone.

To address this, TRPO uses line search - an adaptive way to scale the update. The idea is to try progressively smaller steps until we find one that satisfies our constraints and improves the policy. See Schulman et al. (2015) for more details.

TRPO combines two key ideas:

  1. Natural gradient for computing good update directions
  2. Line search for finding the largest safe step size

This makes it more robust than basic policy gradient methods while still being computationally tractable.

Line Search Algorithm

  1. Compute the proposed step direction \Delta\theta_k (as we’ll see below)
  2. For j = 0 to L:
    • Try the update: \theta = \theta_k + \alpha^j \Delta\theta_k
    • If both conditions are met:
      • Policy improves: \mathcal L_{\theta_k}(\theta) \ge 0
      • Within trust region: D_{KL}(\theta\|\theta_k) \le \delta
    • Then accept: \theta_{k+1} = \theta_k + \alpha^j \Delta\theta_k
    • Otherwise continue with a smaller step (the step size \alpha^j decreases exponentially with j)
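A direct sketch of this backtracking loop; surrogate(θ) and kl(θ) are assumed helpers that evaluate \mathcal L_{\theta_k}(\theta) and D_{KL}(\theta\|\theta_k) on the sampled trajectories.

```python
def backtracking_line_search(theta_k, delta_theta, surrogate, kl, delta,
                             alpha=0.8, max_backtracks=10):
    """Backtracking line search for TRPO over exponentially shrinking steps."""
    for j in range(max_backtracks):
        theta = theta_k + (alpha ** j) * delta_theta
        # Accept the first step that both improves the surrogate and stays
        # inside the trust region
        if surrogate(theta) >= 0 and kl(theta) <= delta:
            return theta
    return theta_k    # no acceptable step found: keep the old parameters
```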

Complete TRPO Algorithm

  1. Sample Collection:

    • Collect trajectories using current policy \pi_k = \pi(\theta_k)
  2. Advantage Estimation:

    • Estimate advantages A_t^{\pi_k} for each state-action pair
  3. Gradient Computation:

    • Compute policy gradient g_k
    • Estimate Fisher information matrix F_k
  4. Natural Gradient:

    • Use conjugate gradient (with N_{cg} iterations) to compute x_k \approx F_k^{-1} g_k
  5. Step Sizing:

    • Compute the proposed step: \Delta\theta_k = \sqrt{\frac{2\delta}{x_k^T F_k x_k}}\, x_k
  6. Line Search:

    • Use backtracking line search to find final update: \theta_{k+1} = \theta_k + \alpha^j \Delta\theta_k

Proximal Policy Optimization

Instead of using the natural gradient F^{-1}g and line search to exactly enforce the KL-divergence constraint, PPO uses simpler methods to approximately enforce this constraint. This achieves similar benefits to TRPO's trust region optimization but with less computational overhead.

There are two variants

Adaptive KL Penalty

The first PPO variant uses an adaptive penalty coefficient to softly enforce the KL-divergence constraint. Instead of explicitly constraining how much the policy can change (as in TRPO), it adds a penalty term to the objective:

  1. The policy update solves an unconstrained optimization problem: \theta_{k+1} = \argmax_\theta \mathcal L_{\theta_k}(\theta) - \beta_k D_{KL}(\theta||\theta_k)

    Here, \beta_k controls how strongly we penalize changes to the policy - higher values of \beta_k mean smaller policy updates.

  2. After each update, we adjust \beta_k based on whether the actual KL-divergence was too large or too small:

    • If the policy changed too much (KL too high), we increase \beta_k to enforce smaller updates
    • If the policy changed too little (KL too low), we decrease \beta_k to allow larger updates

This adaptive scheme maintains the benefits of trust region optimization while being simpler to implement than TRPO’s exact constraint enforcement.

Algorithm

  1. Collect partial trajectories on the policy \pi_k = \pi(\theta_k)
  2. Estimate advantages A_t^{\pi_k} using any advantage estimation algorithm
  3. Compute policy update as
    1. \theta_{k+1} = \argmax_\theta \mathcal L_{\theta_k}(\theta) - \beta_k D_{KL}(\theta||\theta_k)
    2. by taking K steps of minibatch SGD (via Adam)
  4. if D_{KL}(\theta_{k+1}||\theta_k) \ge 1.5\delta then \beta_{k+1} = 2\beta_k
  5. if D_{KL}(\theta_{k+1}||\theta_k) \le \delta/1.5 then \beta_{k+1} = \beta_k/2
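The β adjustment in steps 4-5 amounts to a few lines; a minimal sketch:

```python
def update_kl_penalty(beta, measured_kl, delta):
    """Adaptive KL-penalty coefficient schedule from the algorithm above."""
    if measured_kl >= 1.5 * delta:
        return beta * 2.0      # policy moved too far: penalise harder
    if measured_kl <= delta / 1.5:
        return beta / 2.0      # policy barely moved: allow larger updates
    return beta
```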

Clipped objective function

The second PPO variant takes an even simpler approach to constraining policy updates. Rather than using an adaptive KL penalty, it directly clips the importance sampling ratio to prevent large policy changes.

Recall that in the previous approach, we used a KL penalty term to softly constrain how much the policy could change. Here, we instead modify the objective function \mathcal L_{\theta_k}(\theta) to clip the importance sampling ratio r_t(\theta) to stay within [1-\epsilon, 1+\epsilon], where:

r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}

The clipped objective function is then:

\mathcal L_{\theta_k}^{CLIP}(\theta) = \mathbb{E}_{\tau\sim\pi_k}\Big[\sum_{t=0}^T \min\Big(r_t(\theta)A^{\pi_k}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)A^{\pi_k}_t\Big)\Big]

This clipping has an intuitive effect: when the advantage is positive, the objective gains nothing from pushing the ratio above 1+\epsilon, and when the advantage is negative, it gains nothing from pushing the ratio below 1-\epsilon. The policy therefore has no incentive to move far from the old policy in a single update.

Like the adaptive KL approach, the policy update is simply:

\theta_{k+1} = \argmax_\theta \mathcal L_{\theta_k}^{CLIP}(\theta)

This clipped objective provides similar benefits to TRPO’s trust region constraint and the adaptive KL penalty, but with even simpler implementation since we only need to clip a ratio rather than compute KL divergences or adjust penalty coefficients.

Algorithm

  1. Collect partial trajectories on the policy \pi_k = \pi(\theta_k)
  2. Estimate advantages A_t^{\pi_k} using any advantage estimation algorithm
  3. Compute policy update as
    1. \theta_{k+1} = \argmax_\theta \mathcal L_{\theta_k}^{CLIP}(\theta)
    2. by taking K steps of minibatch SGD using the Adam optimizer:
      • Sample a minibatch of transitions (s_t, a_t, A_t) from the collected trajectories
      • Compute the clipped objective \mathcal L_{\theta_k}^{CLIP}(\theta) for this minibatch
      • Update \theta using Adam to maximize \mathcal L_{\theta_k}^{CLIP}(\theta)
      • Repeat for K iterations
      • K is typically between 3-10 epochs over the entire dataset
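A sketch of the clipped loss used inside these minibatch updates, assuming a categorical policy and precomputed old log-probabilities and advantages (the tensor layout is an assumption).

```python
import torch

def ppo_clip_loss(policy, old_log_probs, states, actions, advantages, eps=0.2):
    """Clipped PPO objective, negated so it can be minimised with Adam.
    `policy(states)` is assumed to return a torch.distributions.Categorical."""
    log_probs = policy(states).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)              # r_t(θ)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    objective = torch.min(ratio * advantages, clipped * advantages)
    return -objective.mean()
```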

Modern Q-Learning Approaches

Modern Q-learning approaches have built upon the foundational DQN algorithm by incorporating key insights from policy gradient methods. These improvements address several limitations of vanilla DQN:

  1. Sample efficiency - DQN requires many environment interactions to learn
  2. Value overestimation - The max operator in Q-learning can lead to systematic overestimation
  3. Exploration - Simple ε-greedy exploration doesn’t scale well to complex environments
  4. Offline learning - DQN struggles to learn from static datasets without environment interaction

Recent algorithms like SAC (Soft Actor-Critic) and CQL (Conservative Q-Learning) have made significant progress on these challenges by regularising the learned values and policies - for example through entropy-based exploration in SAC and conservative value estimation in CQL.

Let’s examine some of these modern approaches and how they build upon DQN while incorporating lessons from policy optimization.

Conservative Q-Learning

Offline RL Basics

In offline RL techniques, we learn a policy \pi that maximizes its expected future discounted reward \eta(\pi), not by interacting with the environment and receiving rewards, but by learning from a static dataset \mathcal D of transitions (s, a, s', r) which were generated by a different behaviour policy \pi_\beta and its marginal state distribution d^{\pi_\beta}(s).

(this behaviour policy can be another NN, or an expert policy, a human demonstrator etc)

Since \mathcal D does not contain all possible transitions (s, a, s'), the algorithms use TD(1) methods (i.e. backing up a single sample) to learn Q_\theta(s,a).

TD(1) is used here because we can't compute expectations over all possible next states and actions - we only have the transitions that exist in our dataset. So instead of trying to estimate the true expected value, we just use the single observed next state and reward as our target. This makes the learning more stable given the limited data.

\hat Q_{k+1} = \argmin_Q \mathbb{E}_{s,a,s'\sim \mathcal D} \Big[r(s,a) + \gamma\mathbb{E}_{a'\sim\hat\pi_k(a'|s')}[\hat Q_k(s',a')] - Q(s,a)\Big]^2
\hat \pi_{k+1} = \argmax_\pi \mathbb{E}_{s\sim\mathcal D,\, a\sim\pi_k(a|s)} \Big[\hat Q_{k+1}(s,a)\Big]

Read Levine, 2020 for a comprehensive overview of offline RL.

Bellman Operators and Q-Learning

The foundation of Q-learning lies in the concept of Bellman operators, which provide a mathematical framework for iteratively improving value estimates. Let’s build this up step by step.

The Optimal Bellman Operator

The optimal Bellman operator \mathcal B^* transforms one Q-function into another, and when applied repeatedly, converges to the optimal Q-function Q^* (Bellman, 1957):

\mathcal B^*Q(s,a) = r(s,a) + \gamma\mathbb{E}_{s'\sim P(s'|s,a)} \big[\max_{a'} Q(s',a')\big]

Think of the Bellman operator as an "improvement machine" - you put in a Q-function, and it returns a better one. The optimal operator always chooses the best possible next action (that's what the max does), assuming perfect knowledge of state transitions.

The Policy Bellman Operator

In practice, we often work with a specific policy \pi rather than always taking the maximum action. This gives us the policy Bellman operator \mathcal B^\pi (Sutton & Barto, 2018):

\mathcal B^{\pi}Q(s,a) = r(s,a) + \gamma\mathbb{E}_{s'\sim T(s'|s,a),\ a'\sim\pi(a'|s')} \big[Q(s',a')\big]

Here, T(s'|s,a) represents the true environment dynamics - how states transition given actions. However, we typically don't have access to these true dynamics.

Empirical Bellman Operators

In real-world settings, we work with a finite dataset \mathcal D of transitions (s,a,r,s'). This leads us to use an empirical Bellman operator that estimates values from samples (see Kumar et al, 2020 and Haskell et al, 2013)

\hat {\mathcal B}^\pi \hat{Q}_k(s,a) = r(s,a) + \gamma\mathbb{E}_{a'\sim\hat\pi_k(a'|s')}\big[\hat Q_k(s',a')\big]

This allows us to write the practical Q-learning update as:

\hat Q_{k+1} = \argmin_Q \mathbb{E}_{s,a,s'\sim \mathcal D} \Big[\hat{\mathcal B}^{\pi_k} Q_k - Q(s,a)\Big]^2

Challenges in Offline Reinforcement Learning

Offline RL faces several fundamental challenges that make it significantly more difficult than online RL (Levine et al., 2020):

1. Action Distribution Shift

When learning from a fixed dataset, the learned policy \pi_k may attempt to take actions that were never seen in the training data. This creates an action distribution shift: the Q-function is queried at out-of-distribution actions whose values are never grounded in data, so errors in those estimates are never corrected.

Imagine trying to learn to cook only from watching cooking shows, without ever touching ingredients yourself. You might think certain techniques are easier than they really are because you can't verify your assumptions through practice.

2. Compounding Errors

Unlike online RL, where the policy can correct its mistakes through environment interaction, in offline RL errors in the bootstrapped value targets compound over the horizon and cannot be corrected by collecting new data (Kumar et al., 2019).

3. Limited State Coverage

The fixed dataset may not cover important parts of the state space (Fu et al., 2020), so the learned policy may reach states at test time for which it has no reliable value estimates.

Conservative Value Estimation

(see Kumar et al, 2020)

The key insight of Conservative Q-Learning (CQL) is that Q-functions learned from offline data tend to be overoptimistic about unseen actions. To address this, CQL introduces a conservative value estimation approach.

Distribution Matching

Let's define two key distributions: the marginal state distribution d^{\pi_\beta}(s) of the behaviour policy that generated the dataset, and an action distribution \mu(a|s) under which we will penalise the learned Q-values.

Together, these form the joint distribution \mu(s,a) = d^{\pi_\beta}(s)\mu(a|s) that we'll use for conservative estimation. The constraint \text{supp}(\mu)\subset\text{supp}(\pi_\beta) ensures we only consider actions that exist in our dataset.

Conservative Q Update

The first CQL update equation adds a conservative term to the standard Bellman update:

\hat Q_{k+1} = \argmin_Q \Bigg[\underbrace{\alpha\,\mathbb{E}_{s\sim\mathcal D, a\sim\mu}[Q(s,a)]}_{\text{conservative term}} + \underbrace{\frac{1}{2}\mathbb{E}_{s,a,s'\sim \mathcal D}\Big[\mathcal B^{\pi_k} Q_k - Q(s,a)\Big]^2}_{\text{standard Bellman error}}\Bigg]

This equation has two parts:

  1. The standard Bellman error that drives Q-values toward their TD targets
  2. A new conservative term that pushes down Q-values under the distribution \mu

The conservative term acts as a regularizer - it prevents Q-values from becoming too large for actions we haven't seen in our dataset. The hyperparameter \alpha controls how conservative we want to be.

Tightening the Lower Bound

While pushing down all Q-values helps prevent overoptimism, it might be too conservative. We can improve this by simultaneously maximizing Q-values for actions we’ve actually seen in our dataset:

\hat Q_{k+1} = \argmin_Q \Bigg[\underbrace{\alpha\Big[\mathbb{E}_{s\sim\mathcal D, a\sim\mu}[Q(s,a)] - \mathbb{E}_{s,a\sim\mathcal D}[Q(s,a)]\Big]}_{\text{relative conservative term}} + \frac{1}{2}\mathbb{E}_{s,a,s'\sim \mathcal D}\Big[\mathcal B^{\pi_k} Q_k - Q(s,a)\Big]^2\Bigg]

This modified objective:

  1. Still pushes down Q-values for out-of-distribution actions (first expectation)
  2. But pushes up Q-values for in-distribution actions (second expectation)
  3. The difference between these terms creates a “gap” between seen and unseen actions

This relative conservative term helps maintain a meaningful ordering of Q-values while still preventing overoptimism about unseen actions. It’s like saying “make unseen actions look worse than the actions we know about” rather than just “make all actions look bad.”

The 1/2 term is a scaling factor commonly used in squared error loss functions because when we take the derivative, it cancels with the 2 from the power rule to give cleaner gradients. While this doesn't change which parameters minimize the loss, it makes the optimization process more numerically stable.

Conservative Q-Learning Algorithm

CQL faces a key design choice: how should we use our conservative Q-values for policy optimization? There are two main approaches:

  1. Full Policy Iteration: We could alternate between:

    • Computing conservative Q-values for the current policy \hat{\pi}^k, and performing one step of policy improvement
    • However, this approach is computationally expensive.
  2. Online Algorithm: Since our policy \hat{\pi}^k is typically derived from the Q-function anyway, we can make the algorithm more efficient by:

    • Choosing \mu(a|s) to approximate the policy that would maximize the current Q-function iterate
    • This gives us an “online” algorithm where Q-value estimation and policy improvement happen simultaneously

The authors implement the second approach by solving a min-max optimization problem:

\hat Q_{k+1} = \argmin_Q \Bigg[\underbrace{\alpha\Big[\max_{\mu} \mathbb{E}_{s\sim\mathcal D, a\sim \mu}[Q(s,a)] - \mathbb{E}_{s,a\sim\mathcal D}[Q(s,a)]\Big]}_{\text{worst-case conservative estimate}} + \underbrace{\frac{1}{2}\mathbb{E}_{s,a,s'\sim \mathcal D}\Big[\mathcal B^{\pi_k} Q_k - Q(s,a)\Big]^2}_{\text{standard Bellman error}} + \underbrace{\mathcal R(\mu)}_{\text{regularizer}}\Bigg]

The \max_{\mu} term actively searches for actions that might be overvalued, making our conservative estimate more robust. The regularizer \mathcal R(\mu) prevents this search from becoming too aggressive. This formulation gives rise to a family of CQL variants, denoted as CQL(\mathcal{R}), each characterized by a different choice of regularizer.

CQL(H): A Practical Variant

CQL(H) is one of two important variants in the CQL family, demonstrating how different choices of regularizer \mathcal{R}(\mu) lead to different practical algorithms.

In CQL(H), we choose:

  1. The regularizer to be the negative KL-divergence against a prior: \mathcal{R}(\mu) = -D_{KL}(\mu || \rho)
  2. The prior \rho(a|s) to be a uniform distribution over actions

This choice leads to two key results:

  1. The optimal \mu has a closed-form solution (see Appendix A of the paper): \mu^*(a|s) \propto \rho(a|s) \cdot \exp(Q(s,a)) \propto \exp(Q(s,a)) (The second proportionality follows because \rho is uniform)

  2. When we plug this optimal \mu^* back into the objective, the first term becomes a soft maximum over Q-values:

\hat Q_{k+1} = \argmin_Q \Bigg[\alpha\,\mathbb{E}_{s\sim\mathcal D}\Big[\underbrace{\log\sum_a \exp(Q(s,a))}_{\text{soft maximum}} - \underbrace{\mathbb{E}_{a\sim\hat\pi_\beta(a|s)}[Q(s,a)]}_{\text{dataset values}}\Big] + \frac{1}{2}\mathbb{E}_{s,a,s'\sim \mathcal D}\Big[\mathcal B^{\pi_k} Q_k - Q(s,a)\Big]^2\Bigg]

The soft maximum (or "logsumexp") is always greater than or equal to the true maximum, making our conservative estimates even more conservative. This aligns with CQL's goal of preventing overoptimism about unseen actions.
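For discrete actions this objective is straightforward to write down; here is a sketch that uses a Q-learning-style backup for the Bellman target (the batch layout and networks are assumptions, not the paper's reference code).

```python
import torch

def cql_h_loss(q_net, target_net, batch, alpha=1.0, gamma=0.99):
    """CQL(H)-style objective for discrete actions: standard TD error plus the
    conservative (logsumexp - dataset Q) penalty."""
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                         # Q(s, ·) for all actions
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a) on dataset actions
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    bellman_error = 0.5 * ((q_data - target) ** 2).mean()
    # Conservative term: soft maximum over all actions minus dataset-action values
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return alpha * conservative + bellman_error
```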

There's also a second variant where \rho(a|s) is set to the previous policy \hat{\pi}^{k-1}, which results in an exponentially weighted average of Q-values from actions sampled from \hat{\pi}^{k-1}(a|s).