An Introduction to Reinforcement Learning
A comprehensive guide to modern Reinforcement Learning, from MDPs, through classic algorithms and Policy Gradients to Offline RL.
Goal
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.
Modeling and Terminology
Basic reinforcement learning is modelled as a Markov decision process (MDP):
- $\mathcal{S}$: a set of environment and agent states
- $\mathcal{A}$: a set of actions of the agent
- $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$: the probability of transition (at time $t$) from state $s$ to state $s'$ under action $a$
- $R_a(s, s')$: the immediate reward after the transition from $s$ to $s'$ under action $a$
The agent’s action selection is modelled as a map called the policy function, $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, where $\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$. The policy map gives the probability of taking action $a$ in state $s$. The policy can alternatively be a deterministic map from $\mathcal{S}$ to $\mathcal{A}$.
The goal of a reinforcement learning agent is to learn an optimal policy which maximises the expected cumulative reward.
The value function (or state-value function) $V^{\pi}(s)$ is defined as the expected return starting from state $s$ and successively following policy $\pi$. Hence, roughly speaking, the value function estimates “how good” it is to be in a given state.
The value of a policy at a state is the expected total reward from that state onwards: $V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]$. Here $\gamma$ is the discount rate of future rewards.
The total reward (return) is $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$, where $r_{t+k}$ is the reward received $k$ steps after time $t$ and $0 \le \gamma < 1$.
Given a state $s$, an action $a$ and a policy $\pi$, the action-value function of the pair $(s, a)$ under $\pi$ is the Q-function defined as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s, a_t = a \right]$.
It is the expected total reward if we take action $a$ in state $s$ and follow $\pi$ thereafter.
MDP Formulation
RL problems are formulated as an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability distribution, $r : \mathcal{S} \to \mathbb{R}$ is the reward function, $\rho_0$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.
Let $\pi_\theta$ denote a stochastic policy $\pi_\theta : \mathcal{S} \times \mathcal{A} \to [0, 1]$, and let $\eta(\pi_\theta) = \mathbb{E}_{\tau}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(s_t) \right]$ denote its expected discounted reward.
Goal (Criterion of optimality)
The optimal policy is defined in terms of the state-value function as $\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for every state $s$,
i.e. the optimal policy maximises the total expected reward achievable from any state $s$.
MDP Algorithms
Assuming full knowledge of the MDP, the two basic approaches to compute the optimal action-value function are value iteration and policy iteration. Both algorithms compute a sequence of functions $Q_k$ ($k = 0, 1, 2, \ldots$) that converge to $Q^*$.
Computing these functions involves computing expectations over the whole state-space, which is impractical for all but the smallest (finite) MDPs. Instead, expectations are approximated by averaging over samples and using function approximation techniques to cope with the need to represent value functions over large state-action spaces.
Both algorithms are based on the Bellman optimality equation.
Bellman Optimality
The Bellman optimality equation expresses the relationship between the value of a state and the values of its successor states under optimal behavior. It states that the value of a state under an optimal policy must equal the expected return for the best action from that state.
For the state-value function, the Bellman optimality equation is: $V^*(s) = \max_{a} \sum_{s'} P_a(s, s') \left[ R_a(s, s') + \gamma V^*(s') \right]$
And for the action-value function: $Q^*(s, a) = \sum_{s'} P_a(s, s') \left[ R_a(s, s') + \gamma \max_{a'} Q^*(s', a') \right]$
The Bellman optimality equation is particularly important because its solution gives us the optimal value function, from which we can derive the optimal policy. Once we have $Q^*$, the optimal policy is to select actions that maximize the expected value: $\pi^*(s) = \arg\max_{a} Q^*(s, a)$.
Policy Iteration
(1) Policy evaluation - compute the value function for the current policy, and
(2) Policy improvement - update the policy to choose actions that maximize value based on the new value function.
In policy iteration (Howard, 1960), the state-value function is computed for the current policy, then the policy is updated, and both steps are repeated until the policy converges.
The algorithm starts at $k = 0$ with an initial guess for the value function and the policy. The steps are repeated until the policy stops changing.
Value Iteration
In value iteration (Bellman 1957), which is also called backward induction, the policy is not used.
The update is $V_{k+1}(s) = \max_{a} \sum_{s'} P_a(s, s') \left[ R_a(s, s') + \gamma V_k(s') \right]$. Here, $k$ is the iteration number. Value iteration starts at $k = 0$ with a guess $V_0$ and iteratively applies the equation until the left- and right-hand sides converge.
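To make the backup concrete, here is a minimal value iteration sketch on a small randomly generated MDP. The toy transition tensor `P`, reward tensor `R`, discount factor and tolerance are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Toy MDP (illustrative): P[a, s, s'] is the transition probability under action a,
# R[a, s, s'] the immediate reward for that transition.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                 # rows must sum to 1
R = rng.random((n_actions, n_states, n_states))

V = np.zeros(n_states)                            # initial guess V_0
for k in range(1000):
    # Q[a, s] = sum_{s'} P_a(s, s') [R_a(s, s') + gamma * V_k(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=0)                         # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:          # stop when both sides agree
        V = V_new
        break
    V = V_new

pi_star = (P * (R + gamma * V)).sum(axis=2).argmax(axis=0)   # greedy policy from V*
print("V*:", np.round(V, 3), "pi*:", pi_star)
```

Policy iteration would use the same one-step lookahead, but alternate between evaluating the current policy and improving it greedily, instead of taking the max inside every backup.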
Stochastic Model-free Algorithms
When the complete model (MDP) of the environment is not available, the agent can learn from experience (i.e. by interacting with the environment).
Monte Carlo Value Iteration
Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. Monte Carlo methods sample and average returns for each state-action pair, much as bandit methods sample and average rewards for each action.
They are typically applied to episodic tasks. The policy is updated only at the end of an episode.
(Note: episodic tasks are ones that terminate at some point; a game is an example.)
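A minimal first-visit Monte Carlo sketch for estimating state values by averaging sampled returns; the episode format (a list of `(state, reward)` pairs) is an assumption made for illustration:

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.99):
    """First-visit Monte Carlo: average the sampled returns observed from each state."""
    returns = defaultdict(list)
    for episode in episodes:                      # episode: [(state, reward), ...]
        # record the first time-step at which each state is visited
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # walk backwards accumulating the discounted return G_t
        returns_at_t = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns_at_t[t] = G
        for s, t in first_visit.items():
            returns[s].append(returns_at_t[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}

# toy usage: two episodes over states "a" and "b"
episodes = [[("a", 0.0), ("b", 1.0)], [("a", 0.0), ("a", 0.0), ("b", 1.0)]]
print(mc_value_estimate(episodes, gamma=0.9))
```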
Temporal Difference Learning
TD methods only wait until the next time step to update the value estimates.
The update is done by transitioning from $s_t$ to $s_{t+1}$, collecting the reward $r_{t+1}$, and then correcting the current estimate using it: $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$.
The difference between the observed target $r_{t+1} + \gamma V(s_{t+1})$ and the current estimate $V(s_t)$ is fed back to $V(s_t)$ and is called the “temporal difference (TD) error”.
After the optimal value function is learnt, the policy is determined either with a greedy policy, where the action taken at any state maximises the expected total reward, or with an ɛ-greedy policy, where the best action is taken with probability 1-ɛ and a random action with probability ɛ.
Forms of TD Error
Temporal difference can use many steps of rewards before using the estimated value $V$.
- TD(1) error : $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
- TD(2) error : $r_{t+1} + \gamma r_{t+2} + \gamma^2 V(s_{t+2}) - V(s_t)$
- TD(k) error : $r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} V(s_{t+k}) - V(s_t)$
Q-Learning
Q-learning is similar to TD learning, but the action-value (Q) function is updated instead of the state-value (V) function: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$.
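A tabular Q-learning sketch with an ɛ-greedy behaviour policy. The `env.reset()` / `env.step(action)` interface returning `(next_state, reward, done)` is an assumed Gym-like convention, not something defined in the text:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # TD target bootstraps from the max over next actions (off-policy)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```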
Deep Q-Learning
This is the same as Q-learning but the Q-function is approximated using a neural network $Q(s, a; \theta)$, where $\theta$ are the weights of the network.
The loss function is called “mean squared TD error” (Sutton & Barto, 2018, p. 199) and is computed by squaring the TD-error.
This loss function can cause divergence because successive states are correlated, and the targets are non-stationary (e.g. the target for the same state changes over time as the network is updated).
To deal with correlated states, the agent builds a dataset of experience and then takes random samples from the dataset, to break the correlation. This is called Experience Replay (Mnih et al., 2013).
To deal with non-stationary targets, the agent computes targets with a fixed copy of the parameters (a target network) and only periodically updates this copy (i.e. on every K-th iteration).
This can be improved further by prioritising which transitions are sampled from the experience dataset, by using two different networks for selecting and evaluating actions (as in Double DQN), etc.
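A condensed PyTorch sketch of the two stabilisation tricks just described: an experience replay buffer sampled at random to break correlations, and a frozen target network copied every K steps. Network sizes, hyperparameters and the transition format are illustrative assumptions:

```python
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma, K = 4, 2, 0.99, 500   # assumed sizes/hyperparameters

def make_qnet():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

q_net = make_qnet()
target_net = make_qnet()
target_net.load_state_dict(q_net.state_dict())    # frozen copy of the parameters
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=50_000)                      # buffer of (s, a, r, s', done) tuples

def train_step(step, batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)      # random sampling breaks correlations
    s, a, r, s2, done = zip(*batch)
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; theta)
    with torch.no_grad():                                         # target uses frozen parameters
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)                   # mean squared TD error

    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % K == 0:                                             # periodic hard update
        target_net.load_state_dict(q_net.state_dict())
```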
Regular experience replay can struggle with sparse rewards where successful experiences are rare. For example, a robot might try thousands of times to grab an object without succeeding, giving no useful learning signal.
Hindsight Experience Replay (Andrychowicz et al., 2017) addresses this by treating failures as successes for a different goal.
When the agent fails its intended goal, HER saves both the original experience and a modified version where the achieved outcome becomes the goal. This lets the agent learn from every attempt, even failed ones. The approach works by augmenting the experience tuple with additional versions using achieved goals from the same episode.
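A sketch of the relabelling step under the common “final” strategy, where the goal achieved at the end of the episode is substituted for the original goal. The transition fields and the `compute_reward` helper are hypothetical names chosen for illustration:

```python
def her_relabel(episode, compute_reward):
    """Return the original transitions plus copies relabelled with the achieved goal.

    episode: list of dicts with keys state, action, next_state, achieved_goal, goal, reward
    compute_reward(achieved_goal, goal): reward the transition would have received
        had `goal` been the intended goal.
    """
    final_goal = episode[-1]["achieved_goal"]      # what the agent actually ended up achieving
    relabelled = []
    for tr in episode:
        new_tr = dict(tr)
        new_tr["goal"] = final_goal                # pretend this was the goal all along
        new_tr["reward"] = compute_reward(tr["achieved_goal"], final_goal)
        relabelled.append(new_tr)
    return episode + relabelled                    # both versions go into the replay buffer
```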
Policy Gradients
Here we learn the policy directly instead of learning V or Q functions and deriving a policy from them. The policy $\pi_\theta$ is approximated with a neural network and the parameters $\theta$ are updated using policy gradients.
The expected cumulative reward for a policy with parameters $\theta$ is given by $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \sum_{t} r(s_t, a_t) \right]$
The best policy is found by doing gradient ascent on the cumulative reward, i.e. finding the $\theta^* = \arg\max_{\theta} J(\theta)$ resulting in the largest cumulative reward.
Here, $\tau$ is a trajectory of $(s_t, a_t)$ pairs, and $p_\theta(\tau)$ is the distribution of trajectories defined by the policy with parameters $\theta$.
The parameters are then updated as $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
Since the gradient of $p_\theta(\tau)$ involves differentiating products of probabilities, we use logarithms to simplify. This is done using the policy gradient theorem, which uses the following identity: $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$
With this, the J-function’s gradient is computed as $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right]$
See Lilian Weng, 2018 for a detailed derivation.
The gradient of $J(\theta)$ is represented as an expectation, so we can sample trajectories from the policy to approximate it (i.e. Monte Carlo).
The log-policy (the log of the trajectory probability) can be expanded as (assuming a discrete state space) $\log p_\theta(\tau) = \log \rho_0(s_0) + \sum_{t} \left[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right]$
The derivative of the log-policy is then $\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, since the initial-state and transition terms do not depend on $\theta$.
Substituting the reward function as a sum, limiting the time horizon to T, and expanding the expectation as a mean over N sampled trajectories, the loss function becomes $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)$
The complete vanilla policy gradient algorithm is
- Sample $N$ trajectories from $\pi_\theta$, each with a maximum duration of T
- Compute the loss function with existing policy neural network
- Back-propagate the loss into the policy neural network with an optimiser
See Sutton et al, 2000.
Note: The total reward is now represented in an un-discounted way. This is discussed later.
The vanilla policy gradient algorithm has a major issue: it updates the policy based on the total reward of the trajectory, which means actions early in the trajectory are credited/blamed for rewards that came after them. This is inefficient and can lead to incorrect learning.
REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) solves this by using temporal causality - each action is only updated based on the rewards that came after it. The algorithm is:
- Sample a trajectory $\tau$ from the policy $\pi_\theta$
- For each timestep t:
  - Compute the return $G_t = \sum_{t'=t}^{T} r_{t'}$ (only future rewards)
  - Update the policy: $\theta \leftarrow \theta + \alpha \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
This gives a more focused learning signal for each action based only on its consequences.
See Sutton et al, 2000 and Williams, 1992 for more details.
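A compact PyTorch sketch of REINFORCE with reward-to-go returns; the network architecture, discounting and environment interface are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.99):
    """One REINFORCE update from a single sampled trajectory."""
    # reward-to-go: G_t = sum_{t' >= t} gamma^{t'-t} r_{t'}
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    dist = policy(torch.as_tensor(states, dtype=torch.float32))
    log_probs = dist.log_prob(torch.as_tensor(actions))
    # maximise E[G_t * log pi(a_t | s_t)]  ==  minimise the negative
    loss = -(returns * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```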
Benefit of log-policy over naive policy
The conversion of the policy to the log-policy turns a product of many per-step probabilities into a sum, which prevents the loss from exploding or going to zero over long trajectories (either of which would cause gradients to be very large or very small).
High Variance of Policy Gradients
The policy gradient algorithm suffers from high variance because:
- The stochastic policy can generate different actions for the same states, leading to different trajectories
- These different trajectories can result in vastly different total rewards
- This variance in rewards creates noisy gradient estimates, making it difficult for the algorithm to converge to an optimum
While Monte Carlo sampling in policy gradients is unbiased (meaning it gives the correct gradient in expectation), its high variance means individual gradient estimates can point in conflicting directions.
Solutions to reduce variance:
- Larger batch sizes: Averaging over more trajectories reduces variance in the gradient estimate. However, this comes at the cost of reduced sample efficiency since more environment interactions are needed per update.
- Baseline subtraction: Introduce a baseline (typically the state-value function $V(s)$) to center the rewards. This changes the objective from tracking absolute rewards to tracking relative advantages:
  - Original: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i,t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{G}_{i,t}$
  - With baseline: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i,t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \hat{G}_{i,t} - V(s_{i,t}) \right)$
The baseline reduces variance by normalizing the scale of rewards. To understand why:
- Consider two actions, each taken in its own state:
- Action A: Usually gives rewards of either 90 or 110
- Action B: Usually gives rewards of either -10 or 10
- Without a baseline:
- For action A, the policy gradient will use large values (90 or 110)
- For action B, the policy gradient will use smaller values (-10 or 10)
- These different scales make learning difficult and increase variance
- With a baseline V(s) = 100 for action A's state and V(s) = 0 for action B's state:
- Action A's advantages become: -10 or +10 (from 90-100 or 110-100)
- Action B's advantages remain: -10 or +10
- Now both actions use similar scales, reducing variance
The baseline doesn't change which action is better (their relative ordering stays the same), but it makes the learning signal more consistent by removing the effect of different reward scales across states.
The baseline reduces variance without introducing bias, since $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, b(s) \right] = b(s) \, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s) \, \nabla_\theta 1 = 0$.
Advantage of an action in a state over a baseline
We can add or subtract a term to the optimisation problem as long as the term does not depend on the parameters $\theta$.
The first term of the reward sum is an estimate of the action-value function $Q^{\pi}(s_t, a_t)$ and the second term $b(s_t)$ is a baseline term. Their difference $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - b(s_t)$ is called the advantage of action $a_t$ in state $s_t$, when using policy $\pi$, over the baseline.
The most common baseline is the state-value function $V^{\pi}(s_t)$, which represents the total reward expected at state $s_t$ when following policy $\pi$. So $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$.
With a baseline, the updated algorithm is
- Sample $N$ trajectories from $\pi_\theta$, each with a maximum duration of T
- At each time-step in each trajectory, compute the total reward (reward-to-go) $\hat{G}_{i,t}$ and the advantage estimate $\hat{A}_{i,t} = \hat{G}_{i,t} - V_\phi(s_{i,t})$
- The total reward can also be discounted
- Re-fit the baseline by minimising $\left\| V_\phi(s_{i,t}) - \hat{G}_{i,t} \right\|^2$ over all trajectories and time steps
- Compute the new loss function $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i,t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{A}_{i,t}$
- Back-propagate the loss into the policy neural network with an optimiser
See Schulman et al, 2016 for more details.
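A sketch of the advantage and baseline-refit steps of the algorithm above, with a small value network $V_\phi$ as the baseline. The input dimension, optimiser and number of fitting steps are illustrative assumptions:

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # assumed obs_dim=4
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t for a single trajectory."""
    out, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return torch.tensor(list(reversed(out)), dtype=torch.float32)

def advantages_and_baseline_fit(states, rewards, gamma=0.99, fit_iters=20):
    """Advantage = reward-to-go minus the value baseline V_phi(s), then re-fit the baseline."""
    states = torch.as_tensor(states, dtype=torch.float32)
    returns = discounted_returns(rewards, gamma)
    with torch.no_grad():
        adv = returns - value_net(states).squeeze(-1)       # A_t = G_t - V(s_t)
    # re-fit the baseline by minimising ||V_phi(s_t) - G_t||^2
    for _ in range(fit_iters):
        value_opt.zero_grad()
        loss = nn.functional.mse_loss(value_net(states).squeeze(-1), returns)
        loss.backward()
        value_opt.step()
    return adv
```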
Reward variations
- Causal Rewards: The loss function can be updated to make the reward calculation causal, by computing rewards only from time $t$ onwards ($\sum_{t'=t}^{T} r_{t'}$) instead of from the start of the trajectory ($\sum_{t'=0}^{T} r_{t'}$)
- Discounted Rewards: Reward discounting ($\gamma$) reduces variance by reducing the impact of distant actions. This is also strictly necessary to interpret the loss function as a true gradient.
General Policy Gradient Loss
Schulman et al, 2016 explains that policy gradient methods maximise the expected total reward by repeatedly estimating the gradient $g = \nabla_\theta \mathbb{E}\left[ \sum_{t=0}^{\infty} r_t \right]$. There are several different related expressions for the policy gradient, which have the form
$g = \mathbb{E}\left[ \sum_{t=0}^{\infty} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$
where $\Psi_t$ may be one of the following:
- $\sum_{t=0}^{\infty} r_t$ : total reward of the trajectory
- $\sum_{t'=t}^{\infty} r_{t'}$ : total reward following action $a_t$
- $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$ : total reward following action $a_t$, w.r.t. a baseline
- $Q^{\pi}(s_t, a_t)$ : state-action value function
- $A^{\pi}(s_t, a_t)$ : advantage function
- $r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$ : one-step TD residual
Problems with Policy Gradients
An analogy might help understand these challenges:
Imagine teaching someone to cook by giving them recipe feedback. Policy Gradients is like:
- Having to throw away all previous cooking notes after each attempt (sample efficiency)
- Tiny adjustments to ingredient amounts sometimes completely ruining the dish (parameter space mismatch)
- Not properly considering how early cooking steps affect the final taste (gradient estimation)
This is why modern RL often uses more sophisticated methods that build guardrails around these issues, similar to how cooking schools provide structured learning environments rather than pure trial and error.
There are three main problems with policy gradients:
- Poor Sample Efficiency
  - Each policy update invalidates previously collected data since the policy has changed
  - Unlike Deep Q-Networks where old experiences remain useful, PG methods require fresh data for each update
  - This is because PG uses on-policy expectations - we can only use data collected from the current policy
- Parameter-Policy Space Mismatch
  - The parameter space ($\theta$) and policy space ($\pi$) are different manifolds with different metrics, meaning distances/lengths in one space don’t correspond to equivalent distances in the other
  - This means tiny parameter updates can cause unexpectedly large policy changes, making optimization unstable, and making it difficult to control how much the policy actually changes during updates
  - The disconnect between parameter updates and policy behavior makes finding appropriate learning rates challenging
- Gradient Estimation Issues
  - The policy gradient estimator doesn’t properly incorporate discounted rewards
  - This leads to biased gradient estimates that can harm convergence
  - The bias comes from treating all time steps equally instead of properly weighting future rewards
See the callout for more details on these issues.
Let's break down the math behind issues 1 and 2:
1. Sample Efficiency Issue
When we collect data using policy π₁ and then update to π₂, we need to compare how likely each trajectory is under both policies. This is done using the probability ratio:
P(τ|π₂)/P(τ|π₁) = ∏ᵢ π₂(aᵢ|sᵢ)/π₁(aᵢ|sᵢ)
This ratio represents how much more/less likely the trajectory τ is under the new policy π₂ compared to the old policy π₁. Since it's a product of many terms (one for each timestep), even small differences between π₁ and π₂ get amplified exponentially, making old data unusable.
2. Parameter-Policy Space Mismatch
The parameter space θ and policy space π have fundamentally different distance metrics. In parameter space, we use Euclidean distance ||θ₁ - θ₂||, while in policy space we measure distances using KL divergence KL(πθ₁||πθ₂). These metrics don't align:
KL(πθ₁||πθ₂) ≠ c||θ₁ - θ₂||
where KL is Kullback-Leibler divergence and c is any constant. This means a small step in parameter space can cause a large change in policy space. The relationship is highly non-linear and makes optimization unstable.
Importance Sampling
Basics
Sample efficiency can be improved by taking trajectories from older policies and changing the loss function to account for the difference.
Importance sampling is a technique for estimating the expectation of a function $f(x)$ under a distribution $P$ using samples drawn from a different distribution $Q$: $\mathbb{E}_{x \sim P}[f(x)] = \mathbb{E}_{x \sim Q}\left[ \frac{P(x)}{Q(x)} f(x) \right]$.
The ratio $P(x)/Q(x)$ is the importance sampling weight for x. Let $\hat{\mu}_Q = \frac{1}{N} \sum_{i=1}^{N} \frac{P(x_i)}{Q(x_i)} f(x_i)$ be our importance sampling estimator - the average of $\frac{P(x)}{Q(x)} f(x)$ over N samples drawn from Q. The variance of this estimator is:
$\operatorname{Var}\left[\hat{\mu}_Q\right] = \frac{1}{N} \left( \mathbb{E}_{x \sim P}\left[ \frac{P(x)}{Q(x)} f(x)^2 \right] - \left( \mathbb{E}_{x \sim P}\left[ f(x) \right] \right)^2 \right)$
The first term can be a major problem because it involves the ratio P(x)/Q(x). If Q(x) is small in regions where P(x)f(x)² is large, this ratio explodes and causes extremely high variance. This insight leads to several key developments in later sections:
- In policy gradients, this manifests as the ratio of policy probabilities becoming unstable when policies diverge too much
- Trust region methods like TRPO address this by explicitly constraining how much policies can change
- PPO simplifies this by clipping the ratio to a fixed range, providing a more practical solution
This variance issue is fundamental to why naive policy gradient methods can be unstable and why constraining policy updates becomes so important.
For a detailed treatment of importance sampling, see Owen, 2013 which covers both the theory and practical considerations.
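A small numerical illustration of the estimator and its variance problem, using two Gaussians chosen arbitrarily for the example: when the proposal Q sits close to P the estimate of $\mathbb{E}_P[x^2]$ is tight, and when Q is far away the weights $P(x)/Q(x)$ blow up and the estimate becomes very noisy:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_estimate(f, p_mu, q_mu, sigma=1.0, n=10_000):
    """Estimate E_P[f(x)] with samples from Q via the weights P(x)/Q(x)."""
    x = rng.normal(q_mu, sigma, size=n)               # samples from the proposal Q
    w = gauss_pdf(x, p_mu, sigma) / gauss_pdf(x, q_mu, sigma)
    vals = w * f(x)
    return vals.mean(), vals.std(ddof=1) / np.sqrt(n)  # estimate and its standard error

f = lambda x: x ** 2          # true value E_P[x^2] = 1 for P = N(0, 1)

print(is_estimate(f, p_mu=0.0, q_mu=0.5))   # Q close to P: accurate, small standard error
print(is_estimate(f, p_mu=0.0, q_mu=4.0))   # Q far from P: the ratio explodes, huge variance
```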
Usage in Policy Gradients
Using the importance sampling technique described above, we can improve the efficiency of policy gradient methods. Instead of sampling trajectories from the current policy being optimized, we can reuse trajectories from an older policy $\pi_{\theta_{\text{old}}}$ by applying importance sampling weights: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_{\theta_{\text{old}}}(\tau)}\left[ \frac{p_\theta(\tau)}{p_{\theta_{\text{old}}}(\tau)} \nabla_\theta \log p_\theta(\tau) \, r(\tau) \right]$
However, this introduces a new challenge. Each trajectory probability is a product of state transition and policy probabilities: $p_\theta(\tau) = \rho_0(s_0) \prod_{t} P(s_{t+1} \mid s_t, a_t) \, \pi_\theta(a_t \mid s_t)$. Due to this product structure, even small differences between policies can cause the importance sampling weight $\prod_{t} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ to become extremely large or small, leading to the high variance problem discussed earlier. This variance issue will be addressed by Natural Policy Gradient (NPG) and subsequent methods.
Monotonic Improvement Algorithms
Monotonic Improvement Algorithms are a class of policy optimization methods that guarantee each policy update will improve (or at least not decrease) the expected reward. This directly addresses the variance issues we saw with importance sampling in policy gradients, where large policy changes could lead to unstable updates.
The key idea is to bound policy changes while still maximizing improvement. By carefully controlling how much policies can diverge from each other, these methods avoid the exploding importance sampling ratios we discussed earlier.
Relative Policy Performance Identity
For two policies $\pi$ and $\pi'$, the relative performance can be computed as (see Schulman et al, 2017 Appendix A for proof)
$J(\pi') - J(\pi) = \mathbb{E}_{\tau \sim \pi'}\left[ \sum_{t=0}^{\infty} \gamma^{t} A^{\pi}(s_t, a_t) \right]$
We can use this identity to sample from the old policy $\pi$ and use it to improve the current one $\pi'$.
This still requires trajectories sampled from $\pi'$. To fix this, first rewrite the expectation in terms of states and actions (see the Ergodic Theorem).
$J(\pi') - J(\pi) = \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d^{\pi'},\, a \sim \pi'}\left[ A^{\pi}(s, a) \right]$, where $d^{\pi}$ is the discounted future state distribution (normalized to 1) and is defined as $d^{\pi}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s \mid \pi)$
The discounted future state distribution dπ(s) represents the normalized discounted sum of probabilities of being in state s at each timestep t when following policy π. The (1-γ) term normalizes the infinite sum to create a proper probability distribution.
While similar in concept, this is distinct from the stationary distribution of the Markov chain induced by π. The stationary distribution, if it exists, is the limiting distribution as t→∞, while dπ(s) explicitly weights earlier timesteps more heavily through the discount factor.
The state visitation frequency ρπ(s) measures how often a policy visits each state in practice. For infinite horizon problems with γ=1, dπ(s) converges to ρπ(s), making it a useful theoretical bridge between discounted and average reward settings.
This equation still has states sampled from $d^{\pi'}$, which can be fixed by approximating them with states sampled from $d^{\pi}$ (and importance-sampling the actions):
$J(\pi') - J(\pi) \approx \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi}\left[ \frac{\pi'(a \mid s)}{\pi(a \mid s)} A^{\pi}(s, a) \right] \doteq \mathcal{L}_{\pi}(\pi')$
where $\mathcal{L}_{\pi}(\pi')$ is the lower bound or local approximation of the relative performance.
It can be proven that the residual error has an upper bound related to the KL-divergence between the two policies. See Constrained Policy Optimization; Achiam et al, 2017 for proof. So if the divergence is very small, this approximation is very accurate.
Majorize-Minimization
Since $\mathcal{L}_{\pi}(\pi') - C \, D_{KL}^{\max}(\pi, \pi')$ is a lower bound on the relative performance between $\pi'$ and $\pi$, we can use it as a surrogate objective, instead of the main objective. Maximising it will always maximise the true objective. See MM algorithm
The Majorize-Minimization (MM) algorithm is a general optimization technique that works by iteratively constructing and optimizing surrogate functions.
At each step, it creates a surrogate function that "majorizes" (upper bounds) the objective function and touches it at the current point. By minimizing this surrogate, we're guaranteed to improve or maintain the original objective.
The key idea is that the surrogate should be easier to optimize than the original function while still preserving the essential properties that ensure convergence. This approach is particularly useful when the original objective is difficult to optimize directly. (Here we use the dual, minorize-maximization form: we maximise a lower bound that touches the true objective at the current policy.)
Here, both $\mathcal{L}_{\pi}(\pi')$ and the KL-divergence can be estimated from samples of the policy $\pi$: $J(\pi') - J(\pi) \geq \mathcal{L}_{\pi}(\pi') - C \, D_{KL}^{\max}(\pi, \pi')$
Note: The value of C provided in the paper can be high if $\gamma$ is near 1. Their recommendation is to use a KL constraint instead of the KL-penalty implied in the original equation.
Constraint Equation
$\max_{\pi'} \mathcal{L}_{\pi}(\pi') \quad \text{subject to} \quad \bar{D}_{KL}(\pi' \,\|\, \pi) \leq \delta$. The region $\bar{D}_{KL}(\pi' \,\|\, \pi) \leq \delta$ is called a trust region (see Trust region)
This algorithm fixes the two failure modes of poor sample efficiency and incorrect distance measurements between policies.
Natural Policy Gradient
While the policy gradient methods above work, they can be inefficient since they don’t account for the geometry of the policy space. Natural Policy Gradient (NPG) addresses this by using information geometry to find better update directions (see A Natural Policy Gradient).
Metric Tensor Basics
A metric tensor defines how we measure distances in a space. In the context of policy optimization, this is crucial because it helps us measure how “different” two policies are.
The squared distance between nearby points $x$ and $x + dx$ is given by: $ds^2 = \sum_{i,j} g_{ij} \, dx^i \, dx^j$
In the simplest case (orthogonal coordinates), $g_{ij}$ is just the identity matrix, giving us the familiar Euclidean distance: $ds^2 = \sum_{i} (dx^i)^2$
When we change coordinate systems from $x$ to $y$, the relationship between small changes in each system is: $dx^i = \sum_{j} \frac{\partial x^i}{\partial y^j} \, dy^j$
We can write this more compactly using the Jacobian matrix $J$, where $J_{ij} = \frac{\partial x^i}{\partial y^j}$: $dx = J \, dy$
The metric tensors in the two coordinate systems ($G_x$ and $G_y$) are related by: $G_y = J^{\top} G_x J$
Similarly, gradients transform as: $\nabla_y f = J^{\top} \nabla_x f$
The natural gradient directions in each coordinate system ($G_x^{-1} \nabla_x f$ and $G_y^{-1} \nabla_y f$) are related by: $G_y^{-1} \nabla_y f = J^{-1} G_x^{-1} \nabla_x f$
This last equation shows that $G^{-1} \nabla f$ transforms properly under coordinate changes - it transforms like a coordinate displacement. This is why natural gradients are “natural”: they give us update directions that are invariant to how we parameterize our policy.
Metric Tensor for Probability Spaces
(see Amari, S. (1998))
When optimizing policies, we want to find the direction that improves our objective most efficiently. This is the direction of steepest descent - the vector $d\theta$ that minimizes the objective while keeping the step size $|d\theta|^2 = \sum_{i,j} g_{ij} \, d\theta^i \, d\theta^j$ constant.
If we measure distances using a metric tensor $G$, then the steepest descent direction is $-G^{-1} \nabla L(\theta)$. Crucially, this metric should depend on the underlying probability space our policy operates in, not on how we happened to parameterize it.
The Fisher Information Matrix measures how much a probability distribution changes when we adjust its parameters. It characterizes the local curvature of the probability space by measuring changes in the log-likelihood – regions where small parameter changes cause large changes in log-likelihood will have high Fisher Information.
For our policy optimization, Fs(θ) measures how sensitive our policy's action distribution is to parameter changes in state s. Small changes to parameters that dramatically alter the policy's log-likelihood in a state will result in high Fs(θ) values for that state.
The full metric G averages these state-specific sensitivities according to how often we visit each state. This means we're naturally more careful with parameters that affect frequently-visited states, helping prevent destabilizing changes to critical parts of our policy.
The Fisher Information Matrix provides the appropriate metric tensor for our optimization, as it captures the local geometry of the probability space. For each state $s$, our policy defines a probability distribution over actions, parameterized by $\theta$. The Fisher Information Matrix for this distribution is:
$F_s(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right]$
Since our policy operates over many states, we average these matrices according to how often we visit each state, giving us our final metric tensor:
$G(\theta) = \mathbb{E}_{s \sim \rho^{\pi}}\left[ F_s(\theta) \right]$
where $\rho^{\pi}(s)$ is the stationary state distribution under policy $\pi$.
Changes needed
The update rule for the network parameters is $\theta_{k+1} = \theta_k + \alpha \, \hat{g}_k$, where $\hat{g}_k$ is the policy gradient’s estimator.
The first change is to use the gradient with a proper metric, i.e. $\theta_{k+1} = \theta_k + \alpha \, F^{-1} \hat{g}_k$
The second change is to compute the step size $\alpha$ to satisfy the KL-divergence constraint specified in the importance sampling section (see constraint equation).
Derivation of metric and coefficient
The lower-bound and KL-divergence terms are expanded via Taylor series as
$\mathcal{L}_{\theta_k}(\theta) \approx \mathcal{L}_{\theta_k}(\theta_k) + g^{\top} \Delta\theta \quad \text{and} \quad \bar{D}_{KL}(\theta \,\|\, \theta_k) \approx \bar{D}_{KL}(\theta_k \,\|\, \theta_k) + \nabla_\theta \bar{D}_{KL}(\theta \,\|\, \theta_k)\big|_{\theta_k}^{\top} \Delta\theta + \frac{1}{2} \Delta\theta^{\top} H \, \Delta\theta$
where $\Delta\theta = \theta - \theta_k$ is the proposed update of parameters.
Here, the first term of both expansions is zero. The first-derivative term of the KL-divergence expansion is also zero, as shown in Slide 33 of Lecture 13: Natural Policy Gradients
The second-derivative part simplifies to $\frac{1}{2} \Delta\theta^{\top} F \, \Delta\theta$, where $F$ is the Fisher information matrix.
The best $\Delta\theta$ that we can get is found by solving the following optimization problem, which is equivalent to the constrained problem specified earlier:
$\Delta\theta^* = \arg\min_{\Delta\theta} \left[ -g^{\top} \Delta\theta + \frac{\lambda}{2} \left( \Delta\theta^{\top} F \, \Delta\theta - \delta \right) \right]$
(the $\max$ became a $\min$ by a multiplication of $-1$ throughout)
The term inside the $\arg\min$ can be differentiated w.r.t. $\Delta\theta$ and set to 0 to find the minimum.
The update rule is now $\theta_{k+1} = \theta_k + \alpha \, F^{-1} g$
To derive the learning rate $\alpha$, we simplify the KL-divergence expansion and substitute $\Delta\theta = \alpha F^{-1} g$. Since F is symmetric, $\frac{1}{2} \alpha^2 \, g^{\top} F^{-1} F F^{-1} g = \frac{1}{2} \alpha^2 \, g^{\top} F^{-1} g = \delta$ and $\alpha = \sqrt{\frac{2\delta}{g^{\top} F^{-1} g}}$
Algorithm
- Collect trajectories on the policy $\pi_{\theta_k}$
- Estimate advantages $\hat{A}_t$ using any advantage estimation algorithm
- Form sample estimates for:
  - the policy gradient $\hat{g}_k$ using the advantage estimates
  - the Fisher information matrix $\hat{F}_k$ (or KL-divergence Hessian)
- Compute the parameter update as follows and backpropagate: $\theta_{k+1} = \theta_k + \sqrt{\frac{2\delta}{\hat{g}_k^{\top} \hat{F}_k^{-1} \hat{g}_k}} \, \hat{F}_k^{-1} \hat{g}_k$
Truncated NPG
While Natural Policy Gradient provides theoretically sound updates, it faces a practical computational challenge: the Fisher Information Matrix grows quadratically with the number of parameters $n$. Specifically:
- The matrix has size $n \times n$ (or $n^2$ elements) as it is the Hessian of the KL-divergence
- Computing its inverse has complexity $O(n^3)$
- For modern neural networks where $n$ can be millions, this is computationally infeasible
To make NPG practical, we can avoid explicitly computing $F^{-1}$ by using the conjugate gradient algorithm. Instead of directly computing $x = F^{-1} g$, we reformulate the problem as solving the linear system: $F x = g$
The conjugate gradient method is an iterative algorithm for solving linear systems Ax=b where A is symmetric and positive definite. Rather than computing A⁻¹ directly, it:
- Starts with an initial guess x₀
- Iteratively improves the solution by moving along carefully chosen directions
- These directions are "conjugate" (A-orthogonal) to each other, ensuring efficient convergence
- Often finds good approximate solutions in far fewer than n iterations
Think of conjugate gradient as finding the valley floor of a quadratic bowl (Fx=g) by taking clever steps that are always perpendicular to previous steps with respect to the bowl's shape (F). This is much more efficient than trying to map out the entire bowl (computing F⁻¹).
The conjugate gradient method is particularly efficient because:
- We never need to store $F$ explicitly - we only need to compute matrix-vector products $F v$
- These products can be computed efficiently using automatic differentiation
- We can truncate the iteration early once we have a “good enough” solution
- The algorithm naturally exploits any sparsity or structure in $F$
This “truncated” version of NPG maintains most of the benefits of natural gradient descent while being computationally tractable for large neural networks. See Lecture 5: Advanced Policy Gradient Methods for more details.
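A sketch of the truncated conjugate gradient solve, written only against a matrix-vector product `fvp(v) ≈ F v`; the iteration count, tolerance and the toy matrix used for the check are assumptions:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v)."""
    x = np.zeros_like(g)
    r = g.copy()            # residual r = g - F x (x starts at 0)
    p = g.copy()            # first search direction
    rs_old = r @ r
    for _ in range(iters):  # truncated: stop after a fixed number of steps
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p   # next direction is F-conjugate to the previous ones
        rs_old = rs_new
    return x

# toy check with an explicit symmetric positive-definite matrix standing in for F
A = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
print(conjugate_gradient(lambda v: A @ v, g), np.linalg.solve(A, g))  # should match closely
```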
Trust Region Policy Optimization
While NPG provides a theoretically sound way to compute update directions, there’s still a practical challenge: choosing the right step size. The step size derived earlier from the trust region parameter $\delta$ might be:
- Too large, causing the policy to change too drastically
- Too small, leading to slow learning
- Inappropriate due to the quadratic approximation breaking down
To address this, TRPO uses line search - an adaptive way to scale the update. The idea is to try progressively smaller steps until we find one that satisfies our constraints and improves the policy. See Schulman et al. (2015) for more details.
TRPO combines two key ideas:
- Natural gradient for computing good update directions
- Line search for finding the largest safe step size
This makes it more robust than basic policy gradient methods while still being computationally tractable.
Line Search Algorithm
- Compute the proposed step $\Delta_k$ (as we’ll see below)
- For $j = 0$ to $L$:
  - Try the update: $\theta' = \theta_k + \alpha^{j} \Delta_k$
  - If both conditions are met:
    - Policy improves: $\mathcal{L}_{\theta_k}(\theta') \geq 0$
    - Within trust region: $\bar{D}_{KL}(\theta' \,\|\, \theta_k) \leq \delta$
  - Then accept: $\theta_{k+1} = \theta'$
  - Otherwise continue with a smaller step ($\alpha^{j}$ decreases exponentially, with $\alpha \in (0, 1)$)
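A sketch of the backtracking line search, written against hypothetical helpers `surrogate(theta)` (estimated surrogate improvement over the old policy) and `kl(theta)` (mean KL-divergence to the old policy):

```python
def backtracking_line_search(theta_old, full_step, surrogate, kl,
                             delta, alpha=0.8, max_backtracks=10):
    """Try exponentially smaller fractions of the proposed step until the
    surrogate improves and the KL constraint is satisfied."""
    for j in range(max_backtracks):
        theta_new = theta_old + (alpha ** j) * full_step
        if surrogate(theta_new) >= 0.0 and kl(theta_new) <= delta:
            return theta_new          # accept this step
    return theta_old                  # no acceptable step found; keep the old policy
```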
Complete TRPO Algorithm
- Sample Collection:
  - Collect trajectories using the current policy $\pi_{\theta_k}$
- Advantage Estimation:
  - Estimate advantages $\hat{A}_t$ for each state-action pair
- Gradient Computation:
  - Compute the policy gradient $\hat{g}_k$
  - Estimate the Fisher information matrix $\hat{F}_k$
- Natural Gradient:
  - Use conjugate gradient (with $n_{cg}$ iterations) to compute $x_k \approx \hat{F}_k^{-1} \hat{g}_k$
- Step Sizing:
  - Compute the proposed step: $\Delta_k = \sqrt{\frac{2\delta}{x_k^{\top} \hat{F}_k x_k}} \, x_k$
- Line Search:
  - Use backtracking line search to find the final update: $\theta_{k+1} = \theta_k + \alpha^{j} \Delta_k$
Proximal Policy Optimization
Instead of using the natural gradient and line search to exactly enforce the KL-divergence constraint, PPO uses simpler methods to approximately enforce this constraint. This achieves similar benefits to TRPO’s trust region optimization but with less computational overhead.
There are two variants
Adaptive KL Penalty
The first PPO variant uses an adaptive penalty coefficient to softly enforce the KL-divergence constraint. Instead of explicitly constraining how much the policy can change (as in TRPO), it adds a penalty term to the objective:
- The policy update solves an unconstrained optimization problem: $\theta_{k+1} = \arg\max_{\theta} \; \mathcal{L}_{\theta_k}(\theta) - \beta_k \, \bar{D}_{KL}(\theta \,\|\, \theta_k)$
  Here, $\beta_k$ controls how strongly we penalize changes to the policy - higher values of $\beta_k$ mean smaller policy updates.
- After each update, we adjust $\beta$ based on whether the actual KL-divergence was too large or too small:
  - If the policy changed too much (KL too high), we increase $\beta$ to enforce smaller updates
  - If the policy changed too little (KL too low), we decrease $\beta$ to allow larger updates
This adaptive scheme maintains the benefits of trust region optimization while being simpler to implement than TRPO’s exact constraint enforcement.
Algorithm
- Collect partial trajectories on the policy $\pi_{\theta_k}$
- Estimate advantages $\hat{A}_t$ using any advantage estimation algorithm
- Compute the policy update as $\theta_{k+1} = \arg\max_{\theta} \; \mathcal{L}_{\theta_k}(\theta) - \beta_k \, \bar{D}_{KL}(\theta \,\|\, \theta_k)$
  - by taking K steps of minibatch SGD (via Adam)
- if $\bar{D}_{KL}(\theta_{k+1} \,\|\, \theta_k) \geq 1.5\,\delta$ then $\beta_{k+1} = 2\beta_k$
- if $\bar{D}_{KL}(\theta_{k+1} \,\|\, \theta_k) \leq \delta / 1.5$ then $\beta_{k+1} = \beta_k / 2$
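A sketch of the β adjustment step; the factors 1.5 and 2 follow the commonly used heuristic and are assumed values here:

```python
def adapt_kl_penalty(beta, measured_kl, target_kl):
    """Grow beta when the policy moved too much, shrink it when it moved too little."""
    if measured_kl >= 1.5 * target_kl:
        beta *= 2.0        # penalise harder -> smaller future updates
    elif measured_kl <= target_kl / 1.5:
        beta /= 2.0        # penalise less -> allow larger future updates
    return beta
```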
Clipped objective function
The second PPO variant takes an even simpler approach to constraining policy updates. Rather than using an adaptive KL penalty, it directly clips the importance sampling ratio to prevent large policy changes.
Recall that in the previous approach, we used a KL penalty term to softly constrain how much the policy could change. Here, we instead modify the objective function to clip the importance sampling ratio to stay within $[1 - \epsilon, 1 + \epsilon]$, where: $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}$
The clipped objective function is then: $\mathcal{L}^{CLIP}_{\theta_k}(\theta) = \mathbb{E}_{t}\left[ \min\left( r_t(\theta) \, \hat{A}_t, \; \operatorname{clip}\left( r_t(\theta), \, 1 - \epsilon, \, 1 + \epsilon \right) \hat{A}_t \right) \right]$
This clipping has an intuitive effect:
- When the advantage is positive, the ratio is clipped at $1 + \epsilon$, preventing the new policy from moving too far in that direction
- When the advantage is negative, the ratio is clipped at $1 - \epsilon$, again limiting how much the policy can change
Like the adaptive KL approach, the policy update is simply: $\theta_{k+1} = \arg\max_{\theta} \mathcal{L}^{CLIP}_{\theta_k}(\theta)$
This clipped objective provides similar benefits to TRPO’s trust region constraint and the adaptive KL penalty, but with even simpler implementation since we only need to clip a ratio rather than compute KL divergences or adjust penalty coefficients.
Algorithm
- Collect partial trajectories on the policy $\pi_{\theta_k}$
- Estimate advantages $\hat{A}_t$ using any advantage estimation algorithm
- Compute the policy update as $\theta_{k+1} = \arg\max_{\theta} \mathcal{L}^{CLIP}_{\theta_k}(\theta)$
- by taking K steps of minibatch SGD using the Adam optimizer:
  - Sample a minibatch of transitions from the collected trajectories
  - Compute the clipped objective $\mathcal{L}^{CLIP}$ for this minibatch
  - Update $\theta$ using Adam to maximize the objective
  - Repeat for K iterations
- K is typically 3-10 epochs over the entire dataset
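A sketch of the clipped surrogate loss in PyTorch: `log_probs_old` are stored at data-collection time, and the sign is flipped so that a standard (minimising) optimiser maximises the objective. The names and the default ε are illustrative:

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Clipped PPO objective: E[min(r * A, clip(r, 1-eps, 1+eps) * A)]."""
    ratio = torch.exp(log_probs_new - log_probs_old)        # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()            # negate: maximise via SGD
```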
Modern Q-Learning Approaches
Modern Q-learning approaches have built upon the foundational DQN algorithm by incorporating key insights from policy gradient methods. These improvements address several limitations of vanilla DQN:
- Sample efficiency - DQN requires many environment interactions to learn
- Value overestimation - The max operator in Q-learning can lead to systematic overestimation
- Exploration - Simple ε-greedy exploration doesn’t scale well to complex environments
- Offline learning - DQN struggles to learn from static datasets without environment interaction
Recent algorithms like SAC (Soft Actor-Critic) and CQL (Conservative Q-Learning) have made significant progress on these challenges by:
- Using entropy regularization to encourage exploration and prevent premature convergence
- Employing conservative value estimates to avoid overestimation
- Leveraging techniques from policy gradients like importance sampling and trust regions
- Enabling robust offline learning from static datasets
Let’s examine some of these modern approaches and how they build upon DQN while incorporating lessons from policy optimization.
Conservative Q-Learning
Offline RL Basics
In offline RL techniques, we learn a policy that maximizes its expected future discounted reward not by interacting with the environment and receiving rewards, but by learning from a static dataset of transitions $\mathcal{D} = \{(s, a, r, s')\}$ which were generated by a different behaviour policy $\pi_\beta$ and its marginal state distribution $d^{\pi_\beta}(s)$.
(This behaviour policy can be another NN, an expert policy, a human demonstrator, etc.)
Since $\mathcal{D}$ does not contain all possible transitions, the algorithms use TD(1) methods (i.e. backing up a single sample) to learn.
Read Levine, 2020 for a comprehensive overview of offline RL.
Bellman Operators and Q-Learning
The foundation of Q-learning lies in the concept of Bellman operators, which provide a mathematical framework for iteratively improving value estimates. Let’s build this up step by step.
The Optimal Bellman Operator
The optimal Bellman operator $\mathcal{B}^*$ transforms one Q-function into another, and when applied repeatedly, converges to the optimal Q-function (Bellman, 1957): $(\mathcal{B}^* Q)(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ \max_{a'} Q(s', a') \right]$
The Policy Bellman Operator
In practice, we often work with a specific policy rather than always taking the maximum action. This gives us the policy Bellman operator (Sutton & Barto, 2018): $(\mathcal{B}^{\pi} Q)(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\left[ Q(s', a') \right]$
Here, $P$ represents the true environment dynamics - how states transition given actions. However, we typically don’t have access to these true dynamics.
Empirical Bellman Operators
In real-world settings, we work with a finite dataset of transitions $\mathcal{D} = \{(s, a, r, s')\}$. This leads us to use an empirical Bellman operator $\hat{\mathcal{B}}^{\pi}$ that estimates values from samples (see Kumar et al, 2020 and Haskell et al, 2013)
This allows us to write the practical Q-learning update as: $\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s, a) \right)^2 \right]$
Challenges in Offline Reinforcement Learning
Offline RL faces several fundamental challenges that make it significantly more difficult than online RL (Levine et al., 2020):
1. Action Distribution Shift
When learning from a fixed dataset, the learned policy may attempt to take actions that were never seen in the training data. This creates an action distribution shift where:
- The Q-function must estimate values for unseen state-action pairs
- These estimates are often unreliable since they’re based on extrapolation
- The policy might exploit these overoptimistic estimates
2. Compounding Errors
Unlike online RL where the policy can correct its mistakes through environment interaction (Kumar et al., 2019):
- Errors in Q-value estimates compound over time
- The policy can’t verify its decisions through exploration
- This often leads to systematic overestimation of Q-values
3. Limited State Coverage
The fixed dataset may not cover important parts of the state space (Fu et al., 2020):
- Some critical states might be missing entirely
- The distribution of states might be skewed
- The policy needs to handle out-of-distribution states gracefully
Conservative Value Estimation
(see Kumar et al, 2020)
The key insight of Conservative Q-Learning (CQL) is that Q-functions learned from offline data tend to be overoptimistic about unseen actions. To address this, CQL introduces a conservative value estimation approach.
Distribution Matching
Let’s define two key distributions:
- $d^{\pi_\beta}(s)$ : The state distribution from our offline dataset (generated by the behavior policy $\pi_\beta$)
- $\mu(a \mid s)$ : A new action distribution that we’ll use to lower-bound our Q-values
Together, these form the joint distribution $\mu(s, a) = d^{\pi_\beta}(s)\,\mu(a \mid s)$ that we’ll use for conservative estimation. The constraint ensures we only consider actions that exist in our dataset.
Conservative Q Update
The first CQL update equation adds a conservative term to the standard Bellman update:
$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \alpha \, \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\left[ Q(s, a) \right] + \frac{1}{2} \, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s, a) \right)^2 \right]$
This equation has two parts:
- The standard Bellman error that drives Q-values toward their TD targets
- A new conservative term that pushes down Q-values under distribution
The conservative term acts as a regularizer - it prevents Q-values from becoming too large for actions we haven’t seen in our dataset. The hyperparameter $\alpha$ controls how conservative we want to be.
Tightening the Lower Bound
While pushing down all Q-values helps prevent overoptimism, it might be too conservative. We can improve this by simultaneously maximizing Q-values for actions we’ve actually seen in our dataset:
$\hat{Q}^{k+1} \leftarrow \arg\min_{Q} \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\left[ Q(s, a) \right] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[ Q(s, a) \right] \right) + \frac{1}{2} \, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s, a) \right)^2 \right]$
This modified objective:
- Still pushes down Q-values for out-of-distribution actions (first expectation)
- But pushes up Q-values for in-distribution actions (second expectation)
- The difference between these terms creates a “gap” between seen and unseen actions
This relative conservative term helps maintain a meaningful ordering of Q-values while still preventing overoptimism about unseen actions. It’s like saying “make unseen actions look worse than the actions we know about” rather than just “make all actions look bad.”
The 1/2 term is a scaling factor commonly used in squared error loss functions because when we take the derivative, it cancels with the 2 from the power rule to give cleaner gradients. While this doesn't change which parameters minimize the loss, it makes the optimization process more numerically stable.
Conservative Q-Learning Algorithm
CQL faces a key design choice: how should we use our conservative Q-values for policy optimization? There are two main approaches:
- Full Policy Iteration: We could alternate between:
  - Computing conservative Q-values for the current policy, and performing one step of policy improvement
  - However, this approach is computationally expensive.
- Online Algorithm: Since our policy is typically derived from the Q-function anyway, we can make the algorithm more efficient by:
  - Choosing $\mu(a \mid s)$ to approximate the policy that would maximize the current Q-function iterate
  - This gives us an “online” algorithm where Q-value estimation and policy improvement happen simultaneously
The authors implement the second approach by solving a min-max optimization problem:
$\min_{Q} \max_{\mu} \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\left[ Q(s, a) \right] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[ Q(s, a) \right] \right) + \frac{1}{2} \, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[ \left( Q(s, a) - \hat{\mathcal{B}}^{\pi} \hat{Q}^{k}(s, a) \right)^2 \right] + \mathcal{R}(\mu)$
The $\max_{\mu}$ term actively searches for actions that might be overvalued, making our conservative estimate more robust. The regularizer $\mathcal{R}(\mu)$ prevents this search from becoming too aggressive. This formulation gives rise to a family of CQL variants, denoted as CQL($\mathcal{R}$), each characterized by a different choice of regularizer.
CQL(H): A Practical Variant
CQL(H) is one of two important variants in the CQL family, demonstrating how different choices of regularizer lead to different practical algorithms.
In CQL(H), we choose:
- The regularizer to be the negative KL-divergence against a prior distribution $\rho(a \mid s)$: $\mathcal{R}(\mu) = -D_{KL}(\mu \,\|\, \rho)$
- The prior $\rho$ to be a uniform distribution over actions
This choice leads to two key results:
- The optimal $\mu$ has a closed-form solution (see Appendix A of the paper): $\mu^*(a \mid s) \propto \rho(a \mid s) \exp\left( Q(s, a) \right) \propto \exp\left( Q(s, a) \right)$ (The second proportionality follows because $\rho$ is uniform)
- When we plug this optimal $\mu$ back into the objective, the first term becomes a soft maximum over Q-values: $\mathbb{E}_{s \sim \mathcal{D}}\left[ \log \sum_{a} \exp\left( Q(s, a) \right) \right]$
There’s also a second variant where $\rho$ is set to the previous policy $\hat{\pi}^{k-1}$, which results in an exponentially weighted average of Q-values from actions sampled from $\hat{\pi}^{k-1}$.
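A sketch of the CQL(H) conservative term for a discrete action space in PyTorch: the log-sum-exp (soft maximum) of the Q-values is pushed down while the Q-values of actions actually present in the batch are pushed up; this penalty would be added to a standard Bellman error loss. Tensor shapes and the α default are assumptions:

```python
import torch

def cql_h_penalty(q_values, dataset_actions, alpha=1.0):
    """CQL(H) conservative term for discrete actions.

    q_values:        tensor [batch, n_actions] with Q(s, a) for every action
    dataset_actions: tensor [batch] of the actions actually taken in the data
    """
    # soft maximum over all actions (comes from the closed-form optimal mu)
    logsumexp_q = torch.logsumexp(q_values, dim=1)
    # Q-values of the in-dataset actions
    data_q = q_values.gather(1, dataset_actions.unsqueeze(1)).squeeze(1)
    # push down unseen/overvalued actions, push up seen ones
    return alpha * (logsumexp_q - data_q).mean()
```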