Generalized Advantage Estimation (GAE)
Background
In our last post, we derived the basic policy gradient method. However, the basic policy gradient estimator can suffer from very high variance: the sum of rewards along a trajectory can follow a very complicated distribution, and we estimate its expectation from only a few sampled trajectories. Recall the estimator we derived last time:
\[ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}(\sum_{t=0}^{T-1}\nabla_{\theta}\log\pi_{\theta}(a_{i,t} \mid s_{i,t}))(\sum_{t=0}^{T-1}r(s_{i,t},a_{i,t})) \]
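To make the estimator concrete, here is a minimal sketch of how the corresponding surrogate objective could be set up in PyTorch, assuming a small discrete-action policy network and a fake batch of trajectories. The names `policy_net`, `obs`, `acts`, and `rews` are illustrative placeholders, not part of the original derivation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

N, T, obs_dim, n_actions = 4, 10, 3, 2            # tiny fake batch of N trajectories of length T
policy_net = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))

obs  = torch.randn(N, T, obs_dim)                 # states  s_{i,t}
acts = torch.randint(0, n_actions, (N, T))        # actions a_{i,t}
rews = torch.randn(N, T)                          # rewards r(s_{i,t}, a_{i,t})

logits   = policy_net(obs)                        # (N, T, n_actions)
log_prob = Categorical(logits=logits).log_prob(acts)   # log pi(a_{i,t} | s_{i,t}), shape (N, T)

# Surrogate whose gradient matches the estimator above:
# (sum of log-probs per trajectory) x (total reward of that trajectory), averaged over N.
surrogate = (log_prob.sum(dim=1) * rews.sum(dim=1)).mean()
surrogate.backward()        # policy_net parameters now hold the gradient estimate
```

Because every timestep of a trajectory is multiplied by the same noisy total return, a single unlucky sample can swing the whole gradient, which is the variance problem the rest of this post addresses.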
Reducing variance
Reward-to-go
One way to reduce variance is to exploit causality: the policy at time $t$ cannot affect rewards received before $t$. We can therefore replace the full sum of rewards with the reward-to-go from each timestep:
\[ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\nabla_{\theta}\log\pi_{\theta}(a_{i,t} \mid s_{i,t})(\sum_{t'=t}^{T-1}r(s_{i,t'},a_{i,t'})) \]
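As a quick illustration, here is how the reward-to-go vector for a single trajectory could be computed in NumPy; the `rews` array is an assumed placeholder for one sampled trajectory's rewards.

```python
import numpy as np

def rewards_to_go(rews: np.ndarray) -> np.ndarray:
    """rtg[t] = sum_{t'=t}^{T-1} rews[t'] (reverse cumulative sum)."""
    return np.cumsum(rews[::-1])[::-1]

rews = np.array([1.0, 0.0, 2.0, 1.0])
print(rewards_to_go(rews))   # [4. 3. 3. 1.]
```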
Discount factor
Another technique is to weight rewards that arrive soon more heavily than rewards that arrive far in the future. As a conceptual example, if you promised me a prize next week I would be very excited, but if you promised me the same prize in ten years I would care much less about it. This reduces variance because distant rewards typically have much higher variance than near-term rewards. We introduce a discount factor $\gamma \in (0, 1]$ into the equation:
\[ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\nabla_{\theta}\log\pi_{\theta}(a_{i,t} \mid s_{i,t})(\sum_{t'=t}^{T-1} \gamma^{t'-t}r(s_{i,t'},a_{i,t'})) \]
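A minimal sketch of the discounted reward-to-go, computed backward so that each entry reuses the one after it; the value of `gamma` and the example rewards are illustrative.

```python
import numpy as np

def discounted_rewards_to_go(rews: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """rtg[t] = r_t + gamma * rtg[t+1], computed from the end of the trajectory."""
    rtg = np.zeros_like(rews)
    running = 0.0
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        rtg[t] = running
    return rtg

print(discounted_rewards_to_go(np.array([1.0, 0.0, 2.0, 1.0]), gamma=0.9))
```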
Baseline
The core idea of a baseline is to subtract a constant or an estimated value from the “rewards-to-go” (or Q-value), so that the policy is only encouraged toward actions that lead to a better-than-average return. We can use a learned function $V_{\phi}^{\pi}(s_t)$ to estimate the expected sum of future rewards from a given state:
\[ V_{\phi}^{\pi}(s_t) \approx \sum_{t'=t}^{T-1} E_{\pi_{\theta}}[r(s_{t'},a_{t'}) \mid s_t] \]
We can use the V-function to approximate the baseline and write our policy gradient as:
\[ \nabla_{\theta} J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\nabla_{\theta}\log\pi_{\theta}(a_{i,t} \mid s_{i,t})(\sum_{t'=t}^{T-1} \gamma^{t'-t}r(s_{i,t'},a_{i,t'}) - V_{\phi}^{\pi}(s_{i,t})) \]
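Below is a sketch of how the baselined weights for one trajectory might be computed, assuming a learned critic is available as a callable `value_fn` (mocked here with a constant); the names and example numbers are illustrative, not part of the derivation.

```python
import numpy as np

def advantages_with_baseline(rews, states, value_fn, gamma=0.99):
    # discounted reward-to-go, computed backward
    rtg = np.zeros_like(rews)
    running = 0.0
    for t in reversed(range(len(rews))):
        running = rews[t] + gamma * running
        rtg[t] = running
    baseline = np.array([value_fn(s) for s in states])   # V_phi(s_t) per timestep
    return rtg - baseline            # only better-than-average timesteps get a positive weight

states   = np.zeros((4, 3))          # dummy states
value_fn = lambda s: 1.5             # stand-in for the learned critic
print(advantages_with_baseline(np.array([1.0, 0.0, 2.0, 1.0]), states, value_fn, gamma=0.9))
```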
Generalized Advantage Estimation (GAE)
With the techniques above in place, we can now define Generalized Advantage Estimation (GAE). First, we define the Q-function as a replacement for the “rewards-to-go”:
\[ Q^{\pi}(s_t, a_t) = \sum_{t'=t}^{T-1}E_{\pi_{\theta}}[r(s_{t'}, a_{t'}) \mid s_t, a_t] \]
Then the “rewards-to-go” minus the baseline term is defined as the advantage $A^{\pi}$:
\[ A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \]
Given the current state, and once we have taken our action, the immediate reward $r(s_t, a_t)$ is no longer stochastic; we can bootstrap the remaining return with the value function and approximate the advantage with the one-step temporal-difference (TD) error:
\[
A^{\pi}(s_t, a_t) \approx \delta_t = r(s_t, a_t) + \gamma V_{\phi}^{\pi}(s_{t+1}) - V_{\phi}^{\pi}(s_t)
\]
In this way, we get rid of the high-variance Monte Carlo returns, but we also introduce bias into the estimate due to the modeling error of $V_{\phi}^{\pi}(s_t)$. In practice, the value function (the “critic”) is trained on the discounted rewards-to-go, so we need the $\gamma$ factor in front of $V_{\phi}^{\pi}(s_{t+1})$ for the discounting to work in a recursive fashion.
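Here is a small sketch of the TD-error computation, assuming the critic's predictions are stored in a `values` array of length $T+1$, with a trailing entry for the state after the last action (0 for a terminal state); the names and numbers are illustrative.

```python
import numpy as np

def td_errors(rews: np.ndarray, values: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), vectorized over the trajectory."""
    return rews + gamma * values[1:] - values[:-1]

rews   = np.array([1.0, 0.0, 2.0, 1.0])
values = np.array([3.0, 2.5, 2.0, 1.0, 0.0])   # length T + 1, terminal value 0
print(td_errors(rews, values, gamma=0.9))
```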
Trade-off between methods
| Method | Bias | Variance |
| --- | --- | --- |
| Monte Carlo | No bias | Higher |
| Critic | Higher bias | Lower |
Given this trade-off, it is natural to combine the two methods so that we can control the balance between bias and variance. We can sum the rewards over $n$ steps from the current timestep $t$ to get an $n$-step advantage estimate:
\[ A_n^{\pi} = \sum_{t'=t}^{t+n} \gamma^{t'-t}r(s_{t'},a_{t'}) + \gamma^{n+1} V_{\phi}^{\pi}(s_{t+n+1}) - V_{\phi}^{\pi}(s_t) = \sum_{t'=t}^{t+n} \gamma^{t'-t}\delta_{t'} \]
Notice that when $n=0$, this reduces to the single TD error $\delta_t$ from the section above, which has lower variance but higher bias. When $n=T-t-1$, it becomes the (baselined) Monte Carlo return, which has no bias but high variance.
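As a sketch, the $n$-step advantage for a single starting timestep can be computed directly from the reward-sum form above, using the same assumed `values` convention (length $T+1$, terminal value 0); it clips at the end of the trajectory so that $n = T-t-1$ recovers the Monte Carlo case.

```python
import numpy as np

def n_step_advantage(rews, values, t, n, gamma=0.99):
    T = len(rews)
    last = min(t + n, T - 1)                       # clip at the end of the trajectory
    disc_rews = sum(gamma ** (k - t) * rews[k] for k in range(t, last + 1))
    # discounted rewards for n+1 steps, plus the bootstrapped tail, minus the baseline
    return disc_rews + gamma ** (last + 1 - t) * values[last + 1] - values[t]

rews   = np.array([1.0, 0.0, 2.0, 1.0])
values = np.array([3.0, 2.5, 2.0, 1.0, 0.0])
print(n_step_advantage(rews, values, t=0, n=1, gamma=0.9))   # two-reward lookahead
```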
To generalize this estimator, we do not have to commit to a single value of $n$; instead we can take a weighted sum over different $n$ values. A standard way to define GAE is as an exponentially-weighted average controlled by $\lambda$ [1]: \[ A_{GAE}^{\pi}(s_t, a_t) = \sum_{t'=t}^{T-1} (\lambda \gamma)^{t'-t}\delta_{t'} \]
In practice, we can compute the generalized advantage estimator recursively, working backward from the end of the trajectory, which is more efficient:
\[ A_{GAE}^{\pi}(s_t, a_t) = \delta_t + \lambda \gamma A_{GAE}^{\pi}(s_{t+1}, a_{t+1}) \]
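Here is a sketch of that backward recursion over a whole trajectory, again assuming a `values` array of length $T+1$ with a trailing terminal value of 0; the `gamma` and `lam` values are illustrative defaults, not prescribed by the derivation.

```python
import numpy as np

def gae(rews: np.ndarray, values: np.ndarray, gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    T = len(rews)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rews[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # A_t = delta_t + gamma*lambda*A_{t+1}
        adv[t] = running
    return adv

rews   = np.array([1.0, 0.0, 2.0, 1.0])
values = np.array([3.0, 2.5, 2.0, 1.0, 0.0])
print(gae(rews, values, gamma=0.99, lam=0.95))
```

Setting `lam=0` recovers the one-step TD error, while `lam=1` recovers the baselined Monte Carlo return, so $\lambda$ directly tunes the bias-variance trade-off discussed above.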
Reference
[1] Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.