3 Actor-critic algorithms

Samuele Bolotta
35 min read · May 25, 2022

This is the third of a series of articles in which I summarize the lectures from CS285 held by Professor Sergey Levine, to whom all credit goes. All images are taken from his lectures. The first article I wrote in this series is an introduction to deep reinforcement learning. Actor-critic algorithms build on the policy gradient framework that we discussed in the previous article; on top of that, they are also augmented with learned value functions and Q-functions.

Improving the policy gradient

The reward to go is an estimate of the expected reward if we take action a in state s. Can we get a better estimate of this quantity?

Let’s imagine that this curvy line represents one of the trajectories that you sampled, and that your reward to go is calculated at a particular time step. The green circle represents the state coming from sample i at time step t. At that point in time, we’re going to calculate an estimate of our reward to go, and then we’re going to multiply it by the gradient of the log-probability of the action we took there.

The way that we calculate this estimate is by summing up the rewards that we actually got along that trajectory:
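
Written out in the notation used throughout this series, this single-sample estimate of the reward to go is roughly:

$$\hat{Q}_{i,t} \approx \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$$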

But that trajectory represents just one of the many possibilities, so if we were to somehow accidentally land in the exact same state again and then run our policy just like we did on this rollout, we might get a different outcome, simply because the policy and the MDP have some randomness in them. Right now, we’re using a single-sample estimate of the reward to go, but in reality there are many possibilities for what might happen next, so we would have a better estimate of the reward to go if we could actually compute a full expectation over all these different possibilities.

And the reason that there are many possibilities is simply because there’s randomness in the system: our policy has randomness and our MDP has randomness, but this randomness can be quite significant, which means that our single sample estimate that we got by summing up the rewards that we actually obtained in that trajectory might be quite far off from the actual expected value.

This problem directly relates to the high variance of the policy gradient. The connection is that the policy gradient way of calculating the reward to go is a single-sample estimate of a very complex expectation. The fewer samples you use to estimate an expectation, the higher the variance of your estimator will be. If we could somehow generate a million samples, we would have much lower variance; if we could somehow calculate the expectation exactly, we would have much lower variance. So if we had access to the true expected reward to go, defined as the true expected value of the sum of rewards we get starting from state s(i,t) and action a(i,t), the variance of our policy gradient would be much lower. This quantity is exactly the Q-function, and if we had it, we could simply plug it in in place of the reward to go and get a lower-variance policy gradient.

In the previous article, we also saw how baselines can lower the variance of a policy gradient even further. Can we apply a baseline even when we have the true Q-function? The answer is yes. We can subtract some quantity b:
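
The policy gradient with a baseline then looks roughly like this:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,\big(Q(s_{i,t}, a_{i,t}) - b\big)$$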

Here, the average reward is a good choice for b, although not the optimal choice. So, what do we average? We could average Q values, and then we will have the appealing property that the policy gradient will increase the probability of actions that are better than average and decrease the probability of actions that are worse than average. But it turns out that we can lower the variance even further, because the baseline can actually depend on the state. It can’t depend on the action, because that leads to bias, but you can make it depend on the state. If you make the baseline depend on the state, a better thing to do is to compute the average reward over all the possibilities that start in that state: not just the average reward at that time step, but specifically the average in that state. And if you average your Q values over all the actions in a particular state, that’s simply the definition of the value function. So a very good choice for the baseline is the value function.

The difference between the Q value and the value function represents your estimate of how much better that action a is on average than the average action you would take in the state s from sample i at timestep t. This Q-V term is so important that we have a special name for it: we call it the advantage function, because it represents how advantageous the action a from sample i at timestep t is as compared to the average performance that you would expect the policy to get in the state s from sample i at timestep t.

State and state-action value functions

A Q function, or state-action value function, represents the total expected reward that you expect to get if you start in state s, take action a and then follow your policy:

The value function is the expected value over all the actions in state s under your current policy:

The advantage function is the difference between these two quantities; it represents how much better the action a is compared to the average performance of your policy in state s. So we can get a very good estimate of the policy gradient if we simply multiply the gradient of the logarithm of the policy by the advantage.
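
Putting the three quantities and the resulting policy gradient in formula form (roughly):

$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} \mathbb{E}_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big]$$

$$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\big[Q^\pi(s_t, a_t)\big]$$

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$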

Of course, in reality we won’t have the correct value of the advantage, we’ll have to estimate it for example using some function approximator. So, the better our estimate of the advantage, the lower our variance will be. It’s also worth mentioning that the kind of actor-critic methods that we’ll discuss in today’s article don’t necessarily produce unbiased estimates of the advantage function. So while the policy gradient we’ve discussed so far has been unbiased, if your advantage function is incorrect, then your entire policy gradient can also be biased. Usually, we’re okay with that because the enormous reduction in variance is often worth the slight increase in bias that we obtain from using approximate Q-values and value functions.

To summarize: the conventional policy gradient uses a kind of Monte Carlo estimate of the advantage, calculated from the single sample that you have (the remainder of the current trajectory) by summing up the rewards in your trajectory and subtracting a baseline. This is an unbiased but high-variance single-sample estimate, and we can replace it with an approximate advantage function, itself usually calculated from an approximate Q function or an approximate value function, and get a much lower-variance estimate.

Value function fitting

Fitting value functions: we have three possible quantities, namely Q, V and A. Ultimately, we want A, but the question we might ask is: which of these three should we fit and what should we fit it to?

The Q function is the expected value of the reward that we’ll get when we start from state s, take action a and then follow our policy.

One very convenient property of this is that because s and a are not actually random variables, we can rewrite the Q function as simply the current reward plus the expected value of the reward in the future — because the current reward depends on the state and on the action, and they are not random.

This quantity that we’re adding is simply the expected value of the value function at the state s(t+1) that we will get when we take action a(t) in state s(t).

We can similarly write the Q function in terms of the value function, as the current reward plus the expected value of the value function at the next time step. The expectation here is of course taken with respect to the transition dynamics.

We can make a small approximation where we say that the actual state s(t+1) that we saw in the current trajectory is representative of the average s(t+1) that we’ll get:

At this point, we’ve made an approximation. This is not an exact equality, we’re essentially approximating the distribution over states of the next time step with again a single sample estimator. But now it’s a single sample estimator for just that one time step and everything after that is still integrated out as represented by the value function.

The reason that we would want to do that is because if we then substitute this approximate equation for the Q value into the equation for the advantage, we get this very appealing expression where the advantage is now approximately equal to the current reward plus the next value minus the current value:
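
That is, approximately:

$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$$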

This is still an approximation because, to be exact, the next value would need to be an expectation over all possible values of s(t+1), whereas we’ve just substituted the actual s(t+1) that we saw. But what’s very appealing about this equation is that it now depends entirely on V, and V is more convenient to learn than Q or A, because Q and A both depend on the state and the action, whereas V depends only on the state. When your function approximator depends on fewer things, it’s easier to learn because you won’t need as many samples. So maybe what we should do is just fit the value function.

When we fit the value function, we would have some kind of model, such as a neural network, that maps states s to approximate values. Let’s talk about the process of fitting the value function; this process is sometimes referred to as policy evaluation, because the value function represents the value of the policy at every state, so calculating it amounts to evaluating the policy.

Policy evaluation

One thing that we could do is Monte Carlo policy evaluation; in a sense, this is what policy gradients do. We run our policy many, many times, sum together the rewards obtained along the trajectories generated by the policy, and use that as an unbiased but high-variance estimate of the policy’s total reward. So we could say that the value at state s(t) is approximately the sum of the rewards that we saw after visiting state s(t) along the trajectory that visited it. Ideally, what we would like to do is sum over all possible trajectories that could occur when you start from that state, because there’s more than one possibility. Unfortunately, in the model-free setting this is generally impossible, because it requires us to be able to reset back to the state s(t) and run multiple trials starting from that state. Generally, we don’t assume that we’re able to do this; we only assume that we’re able to run multiple trials from the initial state.

Monte Carlo evaluation with function approximation

What happens if we use a neural network function approximator for the value function with this kind of Monte Carlo evaluation scheme? At every state that we visit, we sum together the remaining rewards, and that produces our target values; but then, instead of plugging the reward-to-go values directly into our policy gradient, we fit a neural network to those values, and that will actually reduce our variance. This is because even though we can’t visit the same state twice, our function approximator will realize that different states visited in different trajectories are similar to one another. This is essentially generalization: it means that your function approximator understands that nearby states should take on similar values. It’s not as good as making multiple rollouts from the same state, but it’s still pretty good.

The way that we would do this is we would generate training data by taking all of our rollouts and for each state along every rollout we create a tuple consisting of the state s from sample i at timestep t and a label corresponding to the sum of rewards that we saw starting from s(i,t). We are going to call these labels y(i,t), which are the targets. Then we’ll solve a supervised regression problem; we’ll train our neural network value function, so that its parameters minimize the sum over all of our samples of the squared error between the value function’s prediction and the single sample Monte Carlo estimate of the value at that state. Of course if our function approximator massively overfits and produces exactly the training label at every single state, then we wouldn’t have gained much as compared to just directly using the y(i,t) values in our policy gradient, but if we get generalization we’ll actually get lower variance, because this function will now average out the dissimilar labels at similar states.
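
As a small illustration of this regression step, here is a minimal sketch in PyTorch; the state dimension, network sizes, and the `rollouts` format (a list of per-trajectory (states, rewards) pairs) are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

# Hypothetical value network: maps a state vector (assumed 4-dimensional) to a scalar value.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def reward_to_go(rewards):
    """Monte Carlo targets y_{i,t}: sum of rewards from t to the end of the rollout."""
    targets, running = [], 0.0
    for r in reversed(rewards):
        running += r
        targets.append(running)
    return list(reversed(targets))

def fit_value_function(rollouts, epochs=50):
    # rollouts: list of (states, rewards) pairs, one per trajectory (assumed format).
    states = torch.cat([torch.as_tensor(s, dtype=torch.float32) for s, _ in rollouts])
    targets = torch.cat([torch.tensor(reward_to_go(r), dtype=torch.float32) for _, r in rollouts])
    for _ in range(epochs):
        pred = value_net(states).squeeze(-1)
        loss = ((pred - targets) ** 2).mean()   # supervised squared-error regression
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```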

Can we do better?

The ideal target that we would like to have when training our value function is the true expected value of rewards starting from the state s(i,t).

The Monte Carlo target that we used before uses a single sample estimate of this quantity.

But we can also use the relationship that we wrote out before, where we saw that the Q function is simply equal to the reward at the current time step plus the expected reward starting from the next time step. If we write out this quantity, we can perform the same substitution as before and replace the second term in the summation with our estimate of the value function. This is a better, lower-variance estimate of the reward to go than our single-sample estimator. It says: let’s use the actual reward that we saw at the current time step, plus the value at the actual state that we saw at the next time step. Of course we don’t know the value, so we’re going to approximate it simply by using our previous function approximator. We’ll assume that our previous value function was okay and we’ll plug it in in place of the true value. This is what is called a bootstrap estimator: we directly use the previous fitted value function to estimate this quantity.

So now our training data will consist of tuples of the states that we saw and labels that correspond to the reward that we actually got at that time step plus the estimate of the value function at the actual next state.

This estimate might be incorrect, but as we repeat this process, these values will hopefully get closer and closer to the correct values; and because the values average together all possible future returns, we expect their variance to be lower. Our target value is now y(i,t):
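
Roughly, with $\hat{V}^\pi_\phi$ denoting the previously fitted value function:

$$y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1})$$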

Our training process just like before is going to be supervised regression onto these targets. This is sometimes referred to as a bootstrapped estimate and the bootstrapped estimate has lower variance because it’s not using a single sample estimator, but it also has higher bias because it might be incorrect.

From evaluation to actor critic

A basic batch actor-critic algorithm can look something like this:

Step one is to generate samples by running rollouts through our policy. Step two is to fit our approximate value function to those sampled rewards, instead of just naively summing up all the rewards. Step three, for every state-action tuple that we sampled, we evaluate the approximate advantage as the reward plus the approximate value of the next state minus the value of the current state. Step four, we use these advantage values to construct a policy gradient estimator by taking the gradient of the logarithm of the policy at every time step and multiplying it by the approximate advantage. Finally, step five is to take a gradient ascent step.
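
A rough sketch of these five steps, assuming PyTorch; `sample_trajectories` and `fit_value_function` are hypothetical helpers (the latter standing in for the value regression discussed above), and the policy network is assumed to return a torch distribution over actions.

```python
import torch

def batch_actor_critic_step(policy_net, value_net, policy_opt, sample_trajectories, fit_value_function):
    # Step 1: generate samples by running the current policy (hypothetical helper).
    trajectories = sample_trajectories(policy_net)

    # Step 2: fit the value function to the sampled rewards (Monte Carlo or bootstrapped
    # regression, as discussed above; hypothetical helper).
    fit_value_function(value_net, trajectories)

    policy_loss = 0.0
    for states, actions, rewards in trajectories:   # assumed per-trajectory tensors
        values = value_net(states).squeeze(-1)      # V(s_t) for every step
        # Assume a terminal value of zero after the last step; a discount factor can be
        # placed in front of next_values, as discussed later in the article.
        next_values = torch.cat([values[1:], values.new_zeros(1)])
        # Step 3: approximate advantage A(s_t, a_t) ~ r_t + V(s_{t+1}) - V(s_t).
        advantages = (rewards + next_values - values).detach()
        # Step 4: policy gradient term grad log pi(a_t | s_t) * advantage.
        log_probs = policy_net(states).log_prob(actions)
        policy_loss = policy_loss - (log_probs * advantages).sum()

    # Step 5: gradient ascent on the return (descent on the negated objective).
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
```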

The part that we covered when discussing policy evaluation is mostly step two: how do we actually fit the value function? We talked about how we can make a number of different choices. We could fit it to single-sample Monte Carlo estimates, meaning that we actually sum up the rewards obtained along the trajectory, and that gives us our target values. We also talked about how we could use bootstrapped estimates, where we use the actual observed reward plus the estimated value at the next state, computed with our previous value function estimator. This gives us a few different options for how to fit the critic.

A little aside: let’s discuss what happens when we fit value functions with this bootstrap rule in infinite-horizon settings. The trouble we might get into is that if the episode length is infinite, then the target values can keep growing each time we apply this bootstrap rule, and the values can become infinitely large. How can we modify this rule to ensure that we always have finite values and can handle infinite-horizon settings? One very simple trick is to say that we want a larger reward sooner rather than later. The way we’re going to do this is by introducing a little multiplier, gamma, in front of the value:
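
The bootstrapped target then becomes, roughly:

$$y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma\,\hat{V}^\pi_\phi(s_{i,t+1}), \qquad \gamma \in [0, 1]$$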

A gamma of 0.99 works really well in practice. This number changes your MDP. Imagine having four states. You can transition between those four states, and those transitions are governed by some probability distribution. When we add gamma, one way we can think about this is that we’re adding a fifth state, a death state, and we have a probability of one minus gamma of transitioning to that death state at every time step:

Once we enter the death state, we never leave, so there’s no resurrection in this MDP, and our reward is always zero — so that means that the expected value for the next time step will always be expressed as gamma times its expected value in the original MDP plus one minus gamma times zero.

Discount factors

First, could we introduce discount factors into regular Monte Carlo policy gradients? Yes, we can just take that single sample reward to go calculation and we can put gamma into it:
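
This first option looks roughly like this:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T}\gamma^{\,t'-t}\, r(s_{i,t'}, a_{i,t'})\right)$$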

The equation here is exactly the reward-to-go calculation we had before, except that now there is a gamma raised to the power t prime minus t in front of the reward. That means that the first reward has a multiplier of one, the next one gets a multiplier of gamma, the next one gets a multiplier of gamma squared, and so on. So, we’re much more affected by rewards that happen closer to us in time. This type of estimator is essentially the single-sample version of what you would get if you were to use the value function with a discounted bootstrap.

There is another way that we can introduce a discount into the Monte Carlo policy gradient, which is very similar but has a subtle and important difference:
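
Roughly, the second option discounts the gradient term at each time step as well:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\gamma^{\,t-1}\,\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T}\gamma^{\,t'-t}\, r(s_{i,t'}, a_{i,t'})\right)$$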

This option is saying that because you have this discount, not only do you care less about rewards further in the future, you also care less about decisions further in the future. So if you’re starting at time step one, rewards in the future matter less but also your decisions matter less — because your decisions further in the future will only influence future rewards. So, as a result you actually discount your gradient at every time step by gamma to the t minus one and essentially it means that making the right decision at the first time step is more important than making the right decision at the second time step, because the second time step will not influence the first time step’s reward.

This is in fact the right thing to do if you truly want to solve a discounted problem. But in reality, the version that we usually use is option one. Think about a cyclic continuous RL task in which the goal is to make the agent run as far as possible. While we can model this as a task with discounted reward, in reality we really do want the agent to run as far as possible, ideally infinitely far, so we don’t really want the discounted problem. What we want is to use the discount to get finite values so that we can actually do RL, but then what we’d really like is a solution that works for running for arbitrarily long periods of time.

What happens when we introduce the discount into our actor-critic algorithm? Well, the only thing that changes is step three:

In step three, you can see that we’ve added a gamma in front of the value of the next step. One of the things we can do with actor-critic algorithms, once we take them into the infinite-horizon setting, is derive a fully online actor-critic method. So far, when we talked about policy gradients, we always used them in a kind of episodic batch-mode setting: we collect a batch of trajectories, each trajectory runs all the way to the end, we use that batch to evaluate our gradient, and finally we update our policy. But we can also have an online version when we use actor-critic, where we update our policy at every single time step. Here’s what an online actor-critic algorithm looks like. In step one, we take an action sampled from the policy and get a transition. In step two, we update our value function by using the reward plus the value of the next state as our target. Because we’re using a bootstrapped update, we don’t need to know what happens for the rest of the trajectory; we just need the one next state s prime. In step three, we evaluate the advantage as the reward plus the value function of the next state minus the value function of the current state. Using this, in step four we construct an estimate of the policy gradient by simply taking the gradient of the logarithm of the policy multiplied by the advantage that we just calculated, and in step five we update the policy parameters with that policy gradient. Then we repeat this process at every single time step.
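
A sketch of a single iteration of this online procedure, again in PyTorch; the gym-style environment interface, the discrete action space, and the helper objects are assumptions for the example, and in practice you would batch these updates, as discussed below.

```python
import torch

def online_actor_critic_step(env, obs, policy_net, value_net, policy_opt, value_opt, gamma=0.99):
    obs_t = torch.as_tensor(obs, dtype=torch.float32)

    # Step 1: take one action a ~ pi(a|s) and observe a single transition (s, a, r, s').
    dist = policy_net(obs_t)                 # assumed to return a torch.distributions object
    action = dist.sample()
    next_obs, reward, done, _ = env.step(action.item())   # assumes an (older) gym-style env
    next_obs_t = torch.as_tensor(next_obs, dtype=torch.float32)

    # Step 2: update V_phi with the bootstrapped target r + gamma * V_phi(s').
    with torch.no_grad():
        next_value = 0.0 if done else value_net(next_obs_t).squeeze()
        target = reward + gamma * next_value
    value_loss = (value_net(obs_t).squeeze() - target) ** 2
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Step 3: evaluate the advantage A(s, a) ~ r + gamma * V(s') - V(s).
    with torch.no_grad():
        advantage = target - value_net(obs_t).squeeze()

    # Steps 4 and 5: single-sample policy gradient, then one gradient step on the policy.
    policy_loss = -policy_net(obs_t).log_prob(action) * advantage
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    return next_obs, done
```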

Actor-critic design decisions

There are a few problems with this recipe when we try to do DRL. In order to actually instantiate these algorithms as DRL algorithms, we have to pick how we’re going to represent the value function and the policy. There are a couple of choices we could make.

One very reasonable starting choice is to have two completely separate networks — one network that maps a state to the value and one separate network that maps that same state to the distribution over actions. These networks have nothing in common.

This is a convenient choice because it’s relatively simple to implement and it tends to be fairly stable to train. The downside is that it may be regarded as somewhat inefficient, because there’s no sharing of features between the actor and the critic. This can be a more important issue if, for example, you are learning directly from images and both networks are convolutional neural nets; in that case you might really want them to share their internal representations, so that if the value function figures out good representations first, the policy can benefit from them. You might then opt for a shared network design, where you have one trunk that represents, say, the convolutional layers, and then separate heads: one for the value and one for the policy’s action distribution.

This shared network design is a little bit harder to train, it can be a little bit more unstable because those shared layers are getting hit with very different gradients. The gradients from the value regression and the gradients from the policy gradient will be on different scales, they’ll have different statistics and therefore it might require more hyperparameter tuning in order to stabilize this approach. But it can in principle be more efficient because you have these shared representations.
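
As an illustration of the two design options, here is a minimal PyTorch sketch; the observation dimension, hidden sizes, and discrete action space are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Option 1: two completely separate networks (sizes are placeholders).
value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
policy_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))  # logits over 4 actions

# Option 2: one shared trunk (for images this would be the convolutional layers)
# with two separate heads, one for the value and one for the action distribution.
class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.value_head = nn.Linear(hidden, 1)
        self.policy_head = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        features = self.trunk(obs)
        value = self.value_head(features)
        action_dist = torch.distributions.Categorical(logits=self.policy_head(features))
        return value, action_dist
```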

There is another important point that we have to discuss before we get an actual practical deep reinforcement learning actor-critic method, and that’s the question of batch sizes. As described here, this algorithm is fully online, meaning that it learns one sample at a time: it takes an action, gets a transition, updates the value function on that transition and then updates the policy on that transition. Both updates use just one sample. Now, we know from the basics of deep learning that updating deep neural nets with stochastic gradient descent using just one sample is not going to work very well; those updates are going to have too much variance. These updates will work best if we have a batch, and one of the ways we can get a batch is by using parallel workers.

Here’s the idea. This is the most basic kind of parallelized actor-critic. Instead of having just one data collection thread, you might run multiple simulators and each of them will choose an action in step one and generate a transition, but they’re going to use different random seeds so they’ll do things that are a little bit different. And then you will update, in step two and step four, using data from all of the threads together, so the update is synchronous, meaning that you take one step in step one for each of the threads, then collect all the data into your batch and use it to update the value function — and then use it to update the policy synchronously. Then you repeat this process.

So this will give you a batch size equal to the number of worker threads. It can be a little bit expensive, because if you want a batch size of 32, then you need 32 worker threads; but it does work decently well. It can be made even faster if we turn it into asynchronous parallel actor-critic, meaning that we basically drop the synchronization point. Now these different threads are all running at their own speed, and when it comes time to update, we pull in the latest parameters and make an update for that thread, but we don’t actually synchronize all the threads together. As soon as we accumulate some number of transitions from all the workers, we make an update.

The problem with this approach, of course, is that the transitions might not all have been collected by exactly the same parameters. If one of the threads is lagging behind, its transition may have been generated by an older actor, and you won’t actually update until you get transitions from the faster threads, which will be using a newer actor. So in general, the transitions that you’re pulling together into your batch may have been generated by slightly different actors. They’re not going to be too different, because these threads aren’t running at egregiously different rates, but some will be lagging a little behind. In fact, the asynchronous update is not mathematically equivalent to the standard synchronous update, because of this small amount of lag, which is similar to what you get with asynchronous SGD; but in practice, it usually turns out that making the method asynchronous leads to gains in performance that outweigh the bias incurred from using slightly older actors. The crucial word here is “slightly”, because the actors are not going to be too old; if they are, then of course this won’t work.

This might get us thinking about another question. In the asynchronous actor-critic algorithm, the whole point was that we could use transitions that were generated by slightly older actors. If we can somehow get away with using transitions that were generated by much older actors, then maybe we don’t even need multiple threads; maybe we could use older transitions from the same actor. Basically, we could keep a history and load in transitions from that history, without bothering with multiple threads. That’s the principle behind off-policy actor-critic. The design of off-policy actor-critic is that you now have one thread and you update with that one thread, but when you update, you use a replay buffer of all the transitions you’ve seen and you load your batch from that replay buffer. So you’re not necessarily going to use the latest transition; rather, you collect a transition, store it in the replay buffer, and then sample an entire batch from that replay buffer.

At this point we have to modify the algorithm, because doing this naively won’t work. The batch that we loaded in from the replay buffer definitely came from much older policies. It’s not like the asynchronous actor-critic, where the transitions came from just slightly older actors and we could ignore that; now they’re coming from much older actors, and we can’t ignore it: we have to actually change our algorithm. Then, of course, we’re going to form a batch for each of these updates by using previously seen transitions.

In an off-policy actor-critic algorithm, we’re going to take an action as usual from our latest policy, get the corresponding transition and then instead of using that transition for learning we’ll actually store it in our replay buffer; then, we will sample a batch from that replay buffer. Then we’re going to update our value function using targets for each of these transitions in our batch. Then we’ll evaluate our advantage for each of the samples in our batch and then we’ll update our policy gradient by using that batch. So now the policy gradient is also averaged over n samples and then we’ll apply the policy gradient like before.

However, this algorithm is not going to work. The first problem is that when you load these transitions from the replay buffer, remember that the actions in those transitions were taken by older actors, so when you use those older actors to get the action and compute the target values, that’s not going to give you the right target value; it’ll give you the value of some other actor, not your latest actor. The second issue is that for that same reason you can’t compute the policy gradient this way; it is very very important, when computing the policy gradient, that we actually get actions that were sampled from our policy, because this needs to be an expected value under the policy. If that is not the case we need to employ some kind of correction such as importance sampling.

First, let’s talk about fixing the value function. If the value function tells you the expected reward you will get if you start in state s and then follow the policy, the Q function tells you the reward you’ll get if you start in state s, then take action a and then follow the policy. Notice here that there’s no assumption that the action actually came from your policy, so the Q function is a valid function for any action; it’s just that in all subsequent steps you follow the policy. So we will not learn V, but we will instead learn Q; this neural network will take in a state and an action and output a Q value. Otherwise, the principle behind the update is the same: we’re going to compute target values and then we will regress onto those target values. It’s just that now we’ll give the action as an input to the Q function. Another way to think about it is that we can no longer assume that our action came from our latest policy, so we’ll instead learn a state-action value function that is valid for any action so that we can train it even using actions that didn’t come from the policy, but then query it using actions from the policy.

The issue here is that before, we were learning V and using V in our targets; but now we’re learning Q, and yet we still need V for our target values. So where do we get that? Well, remember that the value function can also be expressed as the expected value of the Q function, where the expectation is taken under your policy. So what we can do is replace the V in our target value with Q evaluated at the next action, except that this action is not coming from our replay buffer: it is the action that your current policy would have taken if it had found itself in that state. We are exploiting the fact that we have functional access to our policy, so we can ask it what it would have done if it had found itself in this old state, even though that never actually happened.
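
Concretely, for a transition (s_j, a_j, r_j, s'_j) sampled from the buffer, the target becomes roughly:

$$y_j \approx r_j + \gamma\, \hat{Q}^\pi_\phi\big(s'_j,\, a'_j\big), \qquad a'_j \sim \pi_\theta(\cdot \mid s'_j)$$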

Now we’ve resolved our issue with the value function, but we still have to deal with the policy gradient. All we’re going to do is use the same trick, but this time, instead of applying it to the next action, we apply it to the current action: in order to evaluate the policy gradient, we just need an action sampled from the latest policy at that state.

What else is left?

There is still a little bit of an issue, because the state that we’re using didn’t come from the state marginal of the latest policy; it came from the state marginal of an old policy. Unfortunately, there’s basically nothing we can do here, so this is going to be a source of bias in this procedure and we’ll just have to accept it. The intuition for why it’s not so bad is that, ultimately, we want the optimal policy on pθ(s), but we get the optimal policy on a broader distribution: our replay buffer contains samples from the latest policy as well as many samples from older policies, so the distribution is broader than the one we want. For this reason, we’re not going to miss out on the states from our latest policy; we just also have to be good on a bunch of other states that we might never visit. We’re doing some extra work, but we’re not missing out on the important stuff, and that’s the intuition for why this tends to work.

Some implementation details

Oftentimes there are much fancier things we can do for step four; for example, we could use something called the reparameterization trick, which I’ll discuss in another article. There are also many fancier ways to fit the Q function, which we’ll discuss in the next two articles, when we talk about Q-learning.

Critics as baselines

Now I’m going to talk about another way to use the critic: incorporating it as a baseline for the policy gradient. This has some interesting trade-offs compared to the standard actor-critic algorithm that we’ve discussed so far.

This is the equation for the actor critic that we discussed so far:

The actor critic consists of the grad log term multiplied by the reward plus gamma times the next value minus the current value.

The policy gradient consists of the grad log term times the sum of the rewards to go minus a baseline.

The actor-critic policy gradient estimator has the advantage that it drastically lowers the variance, because the function approximator in the critic integrates over all those different possibilities for what might happen, instead of relying on a single sample. Unfortunately, the actor-critic gradient estimator also has the disadvantage that it’s no longer unbiased. The reason is that if your value function is slightly incorrect, which it will be because it’s a function approximator trained on a finite number of samples, then you can no longer show that this gradient comes out to the right answer in expectation. On the other hand, the original policy gradient is unbiased, so even though it might have high variance, it will come out to the right value in expectation.

So, could we get a policy gradient estimator that is still unbiased but uses the critic in some other way to lower the variance? The answer is yes: we can actually construct a policy gradient estimator that has slightly higher variance than the actor-critic version, but no bias, like the basic policy gradient. The way we do this is by using what’s called a state-dependent baseline. It turns out that we can prove that the policy gradient remains unbiased not only when you subtract any constant b, but also when you subtract any function that depends only on the state and not on the action. A very good choice for the baseline is the value function, because you would expect the single-sample estimator to come out, in expectation, equal to the value function; so if you use the value function as the baseline, the numbers multiplying the grad-log term should be smaller in expectation, which means their variance is smaller, which means the variance of your entire policy gradient is smaller. This doesn’t lower the variance as much as the full actor-critic algorithm, because you still have the sum over future rewards, but it’s much better than a constant baseline and it’s still unbiased.
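
Concretely, the policy gradient with the value function as a state-dependent baseline is roughly:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\left(\sum_{t'=t}^{T}\gamma^{\,t'-t}\, r(s_{i,t'}, a_{i,t'}) - \hat{V}^\pi_\phi(s_{i,t})\right)$$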

What if we make the baseline depend on even more things, for example on the state and the action? Will we get an even lower variance that way? The answer is yes, but at that point things get much more complicated, so that’s what we’ll talk about now. Methods that use state and action dependent baselines are sometimes referred to as control variates in the literature.

Control variates: action-dependent baselines

Here is the approximate advantage that we use in policy gradients when we have a state-dependent baseline: the sum of all future rewards minus the current value estimate:

This is nice because it has lower variance, but we can make the variance even lower if we subtract the Q value. If we do so, this has the nice property that it actually goes to zero if your critic is correct. So if your critic is correct and your future behavior is not too random, then you’d expect these quantities to eventually just converge to zero.

Unfortunately, if you plug this advantage estimator into your policy gradient, the policy gradient is no longer correct, because there’s an error term that you have to compensate for. Unlike the standard baseline, which integrates to zero in expectation, an action-dependent baseline no longer integrates to zero; it integrates to an error term, and you have to account for that error term. So if you incorporate a baseline that depends on both the state and the action, and account for the error term, you get this equation:
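
With $\hat{Q}_{i,t}$ denoting the single-sample reward to go and $Q^\pi_\phi$ the learned action-dependent baseline, the resulting expression is roughly:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i,t}\nabla_\theta\log\pi_\theta(a_{i,t}\mid s_{i,t})\big(\hat{Q}_{i,t} - Q^\pi_\phi(s_{i,t}, a_{i,t})\big) + \frac{1}{N}\sum_{i,t}\nabla_\theta\,\mathbb{E}_{a\sim\pi_\theta(a\mid s_{i,t})}\big[Q^\pi_\phi(s_{i,t}, a)\big]$$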

This equation is a valid estimator for the policy gradient even if your baseline doesn’t depend on the action, but in that case the second term simply vanishes; if your baseline does depend on the action, the second term is no longer zero. So the first term is just your policy gradient with your baseline, and the second term is the gradient of the expected value, under the policy, of your baseline. This second term looks a lot like the original policy gradient, so is this really any better? It turns out that it is, in some cases, a lot better. In many cases, the second term can actually be evaluated very accurately. If you have discrete actions, you can sum over all possible actions; if you have continuous actions, you can sample a very large number of actions, because evaluating the expectation over actions doesn’t require sampling new states. It doesn’t require actually interacting with the world, which means you can generate many more samples from the same state, something we could not do before when we had to make entire rollouts. Furthermore, in many continuous-action cases, if you make a careful choice of the class of distributions and the class of Q functions, this integral has an analytic solution. For example, the expected value of a quadratic function under a Gaussian distribution has an analytic solution. So in many cases the second term can be evaluated in such a way that its variance is zero or very close to zero, and the first term is low variance. This kind of trick can be used to provide a very low variance policy gradient, especially if you can get a good Q function estimator.

So far, we talked about ways that we can use critics and get policy gradient estimators that are unbiased, but can we also use critics and get policy gradients that are still biased, but only a little bit biased?

Eligibility traces & n-step returns

First, let’s take a look at the advantage estimator in an actor-critic algorithm:

This has much lower variance than the policy gradient but higher bias if the value is wrong and it’s always at least a little bit wrong. We can also take a look at the Monte Carlo advantage estimator that we used in policy gradient algorithms:

This has no bias, but it has a much higher variance because of the single sample estimate. So at this point you might wonder, can we get something in between? Could we for example sum over the rewards over the next five time steps instead of infinite time steps and then put the value at the end of that? It turns out that we can and we can use this to get a very nice way to control the bias variance trade-off. There are a couple of factors that make this interesting. First, when you’re using a discount typically your reward will decrease over time because you’re discounting it, so that means that the bias you get from the value function is much less of a problem if you put the value function not at the next time step, but further into the future where the rewards are smaller. On the other hand, the variance that you get from the single sample estimator is also much more of a problem further into the future. In general, we can be way more sure about what will happen tomorrow than about what will happen in ten years. Usually if you have these many possible trajectories emanating from the current state, they’ll branch out a lot more further into the future and they’ll be clustered much closer together in the present, which means that you have much higher variance far in the future and much lower variance closer to the present. That means that if you’re going to use your single sample estimator for a chunk of this trajectory, you’d rather use it close to the present and then cut it off before the variance gets too big and then replace it with your value function, which has much lower variance.

The value function will then account for all these different possibilities, but potentially with some bias. The way that you can do this is by constructing what’s called an n-step return estimator. In an n-step return estimate, you sum up rewards for n steps, then cut the sum off and replace the rest with the value function. Here’s the equation for the n-step return estimator:
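
Roughly:

$$\hat{A}^\pi_n(s_{i,t}, a_{i,t}) \approx \sum_{t'=t}^{t+n}\gamma^{\,t'-t}\, r(s_{i,t'}, a_{i,t'}) - \hat{V}^\pi_\phi(s_{i,t}) + \gamma^{\,n}\,\hat{V}^\pi_\phi(s_{i,t+n})$$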

You sum up your rewards, but now you don’t sum them up to infinity; you sum them from t until t + n. You still subtract off your baseline, but then there is the remaining chunk, everything from n + 1 onwards, which you replace with your value function: you evaluate the value function at the state at timestep t + n and multiply it by gamma to the n. This is an n-step return estimator, and oftentimes choosing n greater than one works better than n = 1. Generally, the larger n is, the lower the bias and the higher the variance. Very often the sweet spot is neither at n = 1 nor at n = infinity, and you may want to use an intermediate value. The last trick that I’m going to discuss is a way to generalize the n-step return estimator and construct a hybrid estimator that uses many different n-step returns.

Generalized advantage estimation

Do we have to choose just one n? Do we have to make this hard slice at a particular point in time? What if we want to actually construct all possible n-step return estimators and average them together? Here is the equation for the n-step return estimator:

We can construct a kind of fused estimator, which we’re going to call GAE, for generalized advantage estimation. It consists of a weighted average of all possible n-step return estimators, with a different weight for each n. The way you choose the weights is by using the insight that you’ll have more bias if you use small n and more variance if you use large n. So you mostly prefer cutting earlier, because then you have less variance, but you want to keep around some of the traces that go further into the future.

A pretty decent choice that leads to an especially simple algorithm is to use an exponential falloff. If you use an exponential falloff, at every timestep you have one minus lambda times the value function at that timestep plus lambda times the GAE estimator from there onwards.

That gives you this weighted sum of all possible n-step return estimators, but it turns out that we can collect the terms and get an even simpler form by expressing the GAE estimator this way:
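
With $\delta_{t'}$ denoting the one-step advantage estimate, this works out to roughly:

$$\hat{A}^\pi_{\mathrm{GAE}}(s_t, a_t) \approx \sum_{t'=t}^{\infty}(\gamma\lambda)^{\,t'-t}\,\delta_{t'}, \qquad \delta_{t'} = r(s_{t'}, a_{t'}) + \gamma\,\hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})$$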

If you construct a weighted sum of these one-step advantage estimators at every time step, weighted by (gamma times lambda) raised to the power t prime minus t, you recover exactly the weighted sum of all possible n-step returns. So this allows you to trade off bias and variance just by choosing lambda, which acts very similarly to a discount: larger lambdas look further into the future, whereas smaller lambdas rely on the value function closer to the present. This has a very similar effect to a discount, which suggests that even if we didn’t have lambda, the role of gamma would be a kind of bias-variance trade-off, and that’s in fact the case.
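
In practice this weighted sum is usually computed with a backward recursion over the trajectory; here is a small NumPy sketch (the array layout is an assumption for the example).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: array of r_t, length T
    values:  array of V(s_t), length T + 1 (includes the value of the state after the
             last step, or 0 if the trajectory ended in a terminal state)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step advantage: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Accumulate: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```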

So the discount can itself be interpreted as a kind of variance reduction, because smaller discounts make your Monte Carlo single-sample estimator put lower weight on rewards far in the future, which are exactly the rewards that you’d expect to have high variance.

Summary

We discussed how an actor-critic algorithm consists of several parts: an actor, which is the policy, and a critic, which is the value function. The actor-critic algorithm can be viewed as a version of policy gradient with substantially reduced variance, and like the policy gradient and all other RL algorithms, it consists of three parts: one component where we generate samples, one component where we estimate our return, which now corresponds to fitting the value function, and one component where we use gradient ascent to update our policy. Just like in policy gradients, policy evaluation refers to the process of fitting the value function, and discount factors are something we can use to make policy evaluation feasible with infinite horizons. The discount has several different interpretations: you can interpret it as the fear of death, meaning that you would like to receive rewards sooner rather than later, before you die, but you can also interpret it as a kind of variance reduction trick. We talked about the design of actor-critic algorithms: you could have one network with two heads or two separate networks, and you could have batch-mode or online actor-critic algorithms. You can also use parallelism to get mini-batch sizes larger than one. We also talked about state-dependent baselines and even action-dependent control variates as another way to use the critic while remaining unbiased. Finally, we talked about how we can combine these ideas with n-step returns, or even with the generalized advantage estimator, which averages together many different n-step return estimators.

