Trust Region Policy Optimization
Trust Region Policy Optimization (TRPO) is a reinforcement learning algorithm used for optimizing policies in Markov Decision Processes (MDPs). It is a model-free, on-policy algorithm that uses a trust region approach to ensure that the policy update does not deviate too far from the current policy.
Background
TRPO is a type of policy gradient method, which is a class of reinforcement learning algorithms that optimize policies by directly computing the gradient of the expected reward with respect to the policy parameters. Policy gradient methods have been shown to be effective in high-dimensional and continuous action spaces, where value-based methods such as Q-learning may struggle.
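To make the idea concrete, below is a minimal sketch of the simplest policy gradient method: a REINFORCE-style update for a softmax policy on a toy two-armed bandit. The bandit setup and all numbers are made up for illustration; TRPO itself adds the trust-region machinery described in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

# Toy two-armed bandit: arm 1 pays more on average than arm 0.
true_means = np.array([0.0, 1.0])

theta = np.zeros(2)   # policy parameters: one logit per arm
alpha = 0.1           # step size

for step in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)              # sample an action from the policy
    r = rng.normal(true_means[a], 1.0)      # observe a noisy reward
    # Score-function (REINFORCE) estimate of the gradient of expected reward:
    # for a softmax policy, grad log pi(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi        # plain policy gradient ascent

print("learned action probabilities:", softmax(theta))
```

After a few hundred updates the policy concentrates on the better arm; the difficulty TRPO targets is that the raw step size `alpha` has no principled bound, so a single large update can badly degrade the policy.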
TRPO was introduced by Schulman et al. in 2015 as a way to address some of the limitations of previous policy gradient methods, such as the difficulty in choosing appropriate step sizes and the tendency for updates to be too aggressive and lead to policy collapse.
Algorithm
The TRPO algorithm iteratively improves the policy by taking steps in the direction of the policy gradient while ensuring that each update does not deviate too far from the current policy. This is achieved by constraining the size of the policy update with a trust region: a region around the current policy within which the local approximation of the objective remains reliable, so that the update can be expected to improve policy performance.
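A minimal sketch of one such iteration is shown below. The names `collect_rollout`, `estimate_advantages`, and `constrained_step` are hypothetical placeholders (they fabricate data and take a plain scaled step) standing in for the environment rollout, advantage estimation, and KL-constrained update that a real TRPO implementation would use.

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_rollout(theta, n_steps=128):
    """Placeholder rollout: a real implementation would step an environment
    with the current policy; here we fabricate states, actions, and rewards."""
    states = rng.normal(size=(n_steps, 4))
    actions = rng.integers(0, 2, size=n_steps)
    rewards = rng.normal(size=n_steps)
    return states, actions, rewards

def estimate_advantages(rewards, gamma=0.99):
    """Placeholder advantage estimate: discounted reward-to-go, mean-centred."""
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - returns.mean()

def constrained_step(theta, grad, step_scale=0.1):
    """Placeholder update: real TRPO solves a KL-constrained subproblem
    (natural gradient plus line search); here we take a small plain step."""
    return theta + step_scale * grad

theta = np.zeros(8)  # policy parameters (size chosen arbitrarily)
for iteration in range(3):
    states, actions, rewards = collect_rollout(theta)
    advantages = estimate_advantages(rewards)
    # The surrogate gradient would be computed from states, actions, and
    # advantages; a random vector keeps this skeleton self-contained.
    surrogate_grad = rng.normal(size=theta.shape)
    theta = constrained_step(theta, surrogate_grad)
    print(f"iteration {iteration}: mean rollout reward {rewards.mean():+.3f}")
```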
The objective function for TRPO is the expected total reward along a trajectory, written here in importance-sampled form so that it can be estimated from trajectories collected with the old policy:

$$J(\theta) = \mathbb{E}_{\tau \sim P(\tau \mid \theta_{\text{old}})}\!\left[\frac{P(\tau \mid \theta)}{P(\tau \mid \theta_{\text{old}})} \sum_{t=0}^{T} r(s_t, a_t)\right]$$

where $\theta$ are the policy parameters, $\theta_{\text{old}}$ are the parameters from the previous iteration, $\tau = (s_0, a_0, \ldots, s_T, a_T)$ is a trajectory sampled from the current policy $\pi_{\theta_{\text{old}}}$, $P(\tau \mid \theta)$ and $P(\tau \mid \theta_{\text{old}})$ are the probabilities of the trajectory under the new and old policies, $r$ is the reward function, and $T$ is the time horizon.
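As a sketch of how this expectation can be estimated in practice: the environment dynamics cancel in the ratio $P(\tau \mid \theta)/P(\tau \mid \theta_{\text{old}})$, leaving a product of per-step action-probability ratios. The snippet below assumes each sampled trajectory already carries its per-step rewards and action probabilities under both parameter settings; the field names are illustrative.

```python
import numpy as np

def objective_estimate(trajectories):
    """Monte Carlo estimate of J(theta) from trajectories sampled under the
    old policy. Each trajectory is a dict with per-step arrays:
    'rewards', 'pi_new' (action probs under theta), 'pi_old' (under theta_old)."""
    values = []
    for traj in trajectories:
        # P(tau | theta) / P(tau | theta_old): the dynamics terms cancel,
        # leaving the product of per-step action-probability ratios.
        ratio = np.prod(traj["pi_new"] / traj["pi_old"])
        values.append(ratio * np.sum(traj["rewards"]))
    return np.mean(values)

# Tiny hand-made example with two short trajectories.
trajs = [
    {"rewards": np.array([1.0, 0.0, 1.0]),
     "pi_new": np.array([0.6, 0.5, 0.7]),
     "pi_old": np.array([0.5, 0.5, 0.5])},
    {"rewards": np.array([0.5, 0.5]),
     "pi_new": np.array([0.4, 0.6]),
     "pi_old": np.array([0.5, 0.5])},
]
print(objective_estimate(trajs))
```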
The policy update is then given by

$$\theta_{\text{new}} = \arg\max_{\theta} \; L_{\theta_{\text{old}}}(\theta),$$

where $L_{\theta_{\text{old}}}$ is the surrogate objective function:

$$L_{\theta_{\text{old}}}(\theta) = \mathbb{E}_{s,\, a \sim \pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A_{\theta_{\text{old}}}(s, a)\right]$$

where $\pi_{\theta}(a \mid s)$ is the probability of taking action $a$ in state $s$ under the new policy, $\pi_{\theta_{\text{old}}}(a \mid s)$ is the probability under the old policy, and $A_{\theta_{\text{old}}}(s, a)$ is the advantage function, which measures how much better taking action $a$ in state $s$ is than the old policy's average action in that state.
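Below is a minimal sketch of estimating $L_{\theta_{\text{old}}}(\theta)$ from a batch of state–action samples collected under the old policy; the probability and advantage values are fabricated for illustration.

```python
import numpy as np

def surrogate_objective(pi_new, pi_old, advantages):
    """Sample estimate of L_{theta_old}(theta): the mean of the probability
    ratio pi_theta(a|s) / pi_theta_old(a|s) weighted by the advantage."""
    ratios = pi_new / pi_old
    return np.mean(ratios * advantages)

# Illustrative batch: probabilities of the taken actions under the new and
# old policies, and the corresponding advantage estimates.
pi_new = np.array([0.30, 0.55, 0.20, 0.70])
pi_old = np.array([0.25, 0.50, 0.25, 0.60])
advantages = np.array([1.2, -0.3, 0.8, 0.1])

print(surrogate_objective(pi_new, pi_old, advantages))  # > 0 suggests improvement
```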
The trust region constraint is then imposed by solving the following optimization problem:

$$\max_{\theta} \; L_{\theta_{\text{old}}}(\theta) \quad \text{subject to} \quad \bar{D}_{\mathrm{KL}}(\theta_{\text{old}}, \theta) \le \delta$$

where $\bar{D}_{\mathrm{KL}}(\theta_{\text{old}}, \theta) = \mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)\right)\right]$ is the average Kullback-Leibler divergence between the old policy and the new policy, and $\delta$ is the maximum allowed divergence. This constraint keeps the policy update close to the old policy, while still allowing for significant improvements in policy performance.
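In practice the constrained problem is solved only approximately: TRPO takes a natural-gradient step (computed with the conjugate gradient method) and then runs a backtracking line search that shrinks the step until the average KL divergence is below $\delta$ and the surrogate estimate has not decreased. The sketch below shows just the line-search check for a categorical policy whose per-state logits are updated directly; all inputs are fabricated, and a real implementation would update shared network parameters instead.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_kl(p_old, p_new):
    """Average KL(pi_old || pi_new) over a batch of categorical distributions."""
    return np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=1))

def surrogate(p_new, p_old, actions, advantages):
    """Ratio-weighted advantage estimate of the surrogate objective."""
    idx = np.arange(len(actions))
    ratios = p_new[idx, actions] / p_old[idx, actions]
    return np.mean(ratios * advantages)

def line_search(logits_old, full_step, actions, advantages,
                max_kl=0.01, backtracks=10):
    """Backtracking line search: accept the largest shrunk step that keeps the
    average KL below max_kl and does not decrease the surrogate objective."""
    p_old = softmax(logits_old)
    base = surrogate(p_old, p_old, actions, advantages)   # ratio = 1 everywhere
    for exponent in range(backtracks):
        step = full_step * (0.5 ** exponent)
        p_new = softmax(logits_old + step)
        if (mean_kl(p_old, p_new) <= max_kl
                and surrogate(p_new, p_old, actions, advantages) >= base):
            return logits_old + step
    return logits_old                                      # reject the update

# Fabricated batch: per-state action logits, sampled actions, advantages,
# and a proposed (deliberately too large) update direction.
rng = np.random.default_rng(0)
logits_old = rng.normal(size=(32, 3))
full_step = rng.normal(size=(32, 3)) * 0.5
actions = rng.integers(0, 3, size=32)
advantages = rng.normal(size=32)

logits_new = line_search(logits_old, full_step, actions, advantages)
print("mean KL after update:", mean_kl(softmax(logits_old), softmax(logits_new)))
```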
Further Readings
- Proximal Policy Optimization
- Deep Deterministic Policy Gradient
- Asynchronous Advantage Actor-Critic