5 Deep reinforcement learning with Q-functions

Samuele Bolotta
32 min read · May 29, 2022

This is the fifth of a series of articles in which I summarize the lectures from CS285 held by Professor Sergey Levine, to whom all credit goes. All images are taken from his lectures. This article covers deep reinforcement learning with Q-functions. It builds on the policy gradient and actor-critic framework, and on the learned value functions and Q-functions, that we discussed in the previous articles.

Although we learned that value-based methods in general are not guaranteed to converge, in practice we can actually get some very powerful and very useful reinforcement learning algorithms out of them and that’s what we’ll talk about today.

Problems with the online Q-iteration algorithm

First, we learned that the update in step 3, even though it looks like a gradient update, is not actually the gradient of any well-defined objective — so Q-learning is not gradient descent.

Second, when you sample one transition at a time, sequential transitions are highly correlated. The state I see at time step t is likely to be quite similar to the state I see at time step t+1, which means that when I take gradient steps on those samples in step three, I’m taking gradient steps on highly correlated transitions. This violates the assumptions commonly made by stochastic gradient methods, and there is a pretty intuitive reason why it doesn’t work very well. To understand why, let’s rewrite the algorithm by plugging in the equation for the target value right into the gradient update.
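Written out with the target value plugged in, the step-three update looks like this (a reconstruction in standard notation, since the original slide is not reproduced here):

$$\phi \leftarrow \phi - \alpha \frac{dQ_\phi}{d\phi}(s_i, a_i)\Big(Q_\phi(s_i, a_i) - \big[r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s'_i, a')\big]\Big)$$

The "target" inside the brackets depends on φ itself, which is exactly why this is not the gradient of any well-defined objective.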

States that you see one right after the other are going to be strongly correlated and your target values are always changing, so even though your optimization procedure looks like supervised regression, in a sense it’s kind of chasing its own tail — it’s trying to catch up to itself and then changing out from under itself and the gradient doesn’t account for this change. The reason this might be a problem is if you imagine that this is your trajectory and you get a few transitions, you’ll locally overfit to these transitions, because you’re seeing very similar transitions:

Then you get some other transitions and then you overfit a little bit to those:

And so on. Then, when you start a new trajectory, your function approximator will have been left off at the point where it overfitted to the end of the previous one and will again be bad. So if it had seen all the transitions all at once it might have actually fitted all of them accurately, but because it’s seeing this very local highly correlated window at a time, it has just enough time to overfit to each window and not enough broader context to accurately fit the whole function.

We could borrow the same idea that we had in actor-critic: the particular solution we discussed was to have multiple workers that each collect a different transition, collect a batch consisting of samples from all of these workers, update on that batch and repeat. This procedure can in principle address the problem of sequential states being highly correlated; the sequential states are still correlated, but now you get a batch of different transitions from different workers, and across workers they are hopefully not correlated. So it doesn’t solve the problem fully, but it can mitigate it. Of course, just like we learned about in actor-critic, you could have an asynchronous version of this recipe where the individual workers don’t wait for a synchronization point, but instead query a parameter server for the latest parameters and then proceed at their own pace.

In fact, with Q-learning this recipe should in theory work even better, because you don’t even need the workers to use the latest policy, so the fact that you might have a slightly older policy that is being executed on some of the workers is in theory not a major problem. However there is another solution to the correlated samples problem that tends to be quite easy to use and works very well in practice: replay buffers.

Replay buffers

Let’s think back to the full fitted Q-iteration algorithm: we would collect the dataset of transitions using whatever policy we wanted and then make multiple updates on that dataset of transitions. So we would label all the transitions with target values, then we might make a large number of gradient steps regressing onto those target values, and then we might even go back and compute new target values before we even collect more data:

The online Q-learning algorithm is simply the special case of this, when k is set to one and we take one gradient step in step three and of course any policy for collection will work so long as it has broad support. In fact, we could even omit step one altogether; we could just load data from the buffer. This view presents Q-learning as a kind of data driven method where we just have a big bucket of transitions, we don’t really care so much where these transitions came from so long as they cover the space of all possibilities pretty well and we’re just going to crank away taking more and more updates on those transitions, alternating between computing target values and regressing onto those target values. So we could still take one gradient step in step three if we want, and then we’d get something that looks a lot like online Q-learning, only without the data collection. This gives us a modified version of the Q-learning algorithm with a replay buffer.

In step one, instead of actually taking steps in the real world, we simply sample a batch of multiple transitions from our buffer and then in step two we sum up our gradient over all of the entries in that batch. So each time around this for loop, we might sample a different batch, IID (independently and identically distributed), from our buffer.

We have multiple samples in the batch, so we have a low-variance gradient, and our samples are no longer correlated, which satisfies the assumptions of stochastic gradient methods. We still unfortunately don’t have a real gradient, because we’re not computing the derivative through the second term, but at least the samples are not correlated. Where will the data come from? What we need to do is periodically feed the replay buffer, because initially our policy will be very bad and maybe our initial very bad policy just won’t visit all the interesting regions of the space — so we still want to occasionally feed the replay buffer by using our latest policy, maybe with some epsilon-greedy exploration, to collect some data that achieves better coverage.

So, we have a buffer of transitions, on which we do off-policy Q-learning, and then periodically we deploy some policy back into the world, for example the greedy policy with epsilon greedy exploration, to collect a little bit more data and bring back some more transitions to add back to our buffer. That will refresh our buffer with behavior that hopefully gets better coverage.

Putting it all together, we can devise a full Q-learning recipe with replay buffers. Step one, collect a dataset of transitions, using some policy, and add this data to your buffer B. Step two, sample a batch from B and do some learning on that batch by summing this Q-learning pseudo-gradient over the batch. So, we’re going to calculate target values for every entry in the batch, take the current Q-value minus the target value times the derivative, and sum it over the whole batch. Since we sum over the whole batch, we get a lower-variance gradient, and since we sample the batch IID from our buffer, our samples are going to be decorrelated so long as our buffer is big enough. We can repeat this process k times.
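As a rough illustration, here is a minimal sketch of steps two and three of this recipe in PyTorch. The network q_net (mapping states to one Q-value per action), its optimizer, and the code that fills the buffer are assumed to exist elsewhere, and there is no target network yet:

```python
import random
from collections import deque

import numpy as np
import torch

buffer = deque(maxlen=1_000_000)  # the replay buffer B; old transitions fall out when full

def q_learning_step(q_net, optimizer, batch_size=32, gamma=0.99):
    # Step two: sample a batch IID from the buffer.
    batch = random.sample(buffer, batch_size)
    s, a, r, s2, done = (np.array(x) for x in zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    # Target values y_i = r_i + gamma * max_a' Q(s'_i, a'); no gradient flows
    # through the target, which is why this is only a pseudo-gradient.
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_net(s2).max(dim=1).values

    # Step three: regress Q(s_i, a_i) onto the targets over the whole batch.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```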

Target networks

At this point we’ve almost developed a practical deep Q-learning algorithm that we could actually use, but there’s another component that we need to discuss to really get a stable and reliable procedure. We dealt with the problem of our samples being correlated by introducing a replay buffer — where we store all the data that we’ve collected so far and each time we have to take a step on the parameters of our Q function, we actually take that step using a batch of transitions that are sampled IID from our buffer.

But we still have this other problem to contend with, which is that Q-learning is not gradient descent and it has a moving target. You could think of it as squared-error regression, except that the regression target itself changes all the time; it changes out from under us, which makes it very hard for the learning process to ever converge.

What does Q-learning really have to do with regression?

In the full fitted Q-iteration algorithm that I described before, step 3 performs what looks a lot like supervised learning, essentially regression onto target values.

And in fact, step three in the full fitted Q-iteration algorithm will converge, but then your targets will change, so maybe you don’t want it to converge. Essentially, training to convergence on the wrong targets isn’t necessarily a good thing to do, which is why in practice we often use a much smaller number of gradient steps, as few as one. But then we have this moving target problem, where at every gradient step our targets change and our gradient does not account for that change.

So, intuitively, what we would like in order to resolve this issue is something that’s a little bit in between, where we get some of the stability of the full fitted Q-iteration algorithm but at the same time don’t actually train to convergence. Here’s how we can do Q-learning with a replay buffer and a target network — and this is going to look like a kind of mix between the online Q-learning algorithm and the full batch fitted Q-learning algorithm.

We’re going to collect our dataset using some policy and add that data to our buffer (this is now step two, as step one will be revealed later). We then sample a batch from this buffer and make an update on the parameters of our Q-function. This update looks a lot like the update from before with replay buffers, but now the parameters we update and the parameters inside the max are different parameter vectors. After making k of these back-and-forth updates, I go out and collect more data. Then I have a big outer loop, where after n steps of data collection I actually update φ’ and set it to be φ.

This looks a lot like the fitted Q-iteration procedure I had before, because essentially I’m making multiple updates with the same target values (if φ’ stays the same, then the entire target value stays the same), except that I might still be collecting more data in that inner loop. The data collection in step two now sits inside the loop that updates φ’, and the reason for doing this is that in practice you often want to collect data as much as possible, whereas for stability you typically don’t want to update your target network parameters quite as often. So k could be between 1 and 4, meaning we might take between 1 and 4 gradient steps between each time we collect more data, but n might be around ten thousand, so it might take as many as ten thousand steps before we change our target values. That’s to make sure we’re not trying to hit a moving target, because it’s very hard to hit a moving target with supervised regression.

Initially we initialize both φ and φ’ to essentially a random initialization, and then after the first n steps we update φ’, set it equal to φ, and φ’ then remains static for the next n steps. That means step four starts looking a lot more like supervised regression, and for that reason step four is easier to do, much more stable, and much more likely to give you an algorithm that learns a meaningful Q-function. Your targets don’t change in the inner loop, so steps two, three and four look a lot like supervised regression, with the only real difference being that you might collect more data, and that data could be collected using, for instance, your latest epsilon-greedy policy.

Based on this general recipe, one of the things we can derive is the classic deep Q-learning algorithm, which is sometimes called DQN. This particular special case looks like this: in step one, take an action and observe the resulting transition, then add it to the buffer. So this looks a lot like online Q-learning. In step two, sample a mini-batch from your buffer uniformly at random, so this mini-batch might not even contain the transition that you just took in the real world. In step three, compute a target value for every element in your mini-batch, using your target network Qφ’. In step four, update the current network parameters φ by essentially taking the gradient for the regression onto those target values. Now, notice that so far φ has not been used anywhere else, except maybe in step one if you’re using an epsilon-greedy policy, and then in step five, which is only done every n steps, to update φ’ by replacing it with φ.
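For reference, the target and the updates in steps three to five can be written as (a reconstruction in standard notation):

$$y_j = r_j + \gamma \max_{a'} Q_{\phi'}(s'_j, a'), \qquad \phi \leftarrow \phi - \alpha \sum_j \frac{dQ_\phi}{d\phi}(s_j, a_j)\big(Q_\phi(s_j, a_j) - y_j\big), \qquad \phi' \leftarrow \phi \ \text{every } n \text{ steps}$$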

Something to note is that this procedure is basically a special case of the more general procedure at the top: you get this algorithm if you choose k=1. The numbering of the steps has been rearranged a little bit, but they are basically the same method.

There are some other ways to handle target networks that have been used in the literature and that could be worth experimenting with. There’s something a little strange about this way of updating target networks and here’s some intuition to illustrate some of the strangeness. Let’s say that I sampled my transition and then I updated my φ and then I sampled another transition and I updated my φ again. Blue boxes are step one, while green boxes are steps two, three and four:

Over here, on this step maybe my target network is obtained from the first step. So perhaps at the first step, when I started out, φ’ is equal to φ, so at the third step I get my target values back from the first step.

And at the second step I get them from the first step:

And at the fourth step I get them from the first step:

And then if the fourth step is where I update φ’ to be equal to φ (so basically if my n is equal to 4), then at the fifth step I get my φ’ from the preceding step. So it seems like at different steps the target values are lagged by very different amounts: if you’re at the step right after the update, your target network is only one step old, and if you’re right before a flip, it’s n − 1 steps old. So at different points in time your target values look more like a moving target than at others; if you’re right after the point where you set φ’ to φ, then your target really looks like a moving target, and if it’s been a long time, then it really doesn’t. This is not actually that big of a problem, but it feels a little bit off. One common choice is to use a different kind of update, which is similar to Polyak averaging: a popular alternative is to set φ’ at every single step to be tau times the old φ’ plus one minus tau times the new φ.

So, you can think of this as if φ’ is gradually interpolating between its old value and the new value defined by φ. You would choose tau to be a pretty big number, so for instance you might choose it to be 0.999, which means that one part out of a thousand is essentially coming from φ and the rest is coming from φ’. Of course the caveat is that this only makes sense if φ’ is similar to φ; because you’re gradually making φ’ more and more similar to φ, this procedure is actually all right.
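In symbols, this Polyak-style update, applied at every step, is:

$$\phi' \leftarrow \tau\, \phi' + (1 - \tau)\, \phi, \qquad \text{e.g. } \tau = 0.999$$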

A general view of Q-learning

Here, I want to discuss another view that we can take of these Q learning algorithms that maybe provides a little bit more of a unified perspective.

Here is the general Q-learning with replay buffers and target networks that I discussed before:

And here is the fitted Q-iteration algorithm:

They’re really kind of the same thing and if you view them as having the same basic processes running in parallel at different rates, then all of these methods can be unified into one kind of parallel framework.

In the fitted Q-iteration algorithm, the inner inner loop is just SGD, and the DQN method we had before is a special case where we collect one transition and take one gradient step each iteration, but all of them are really just special cases of this more general view. We have our dataset of transitions, our replay buffer:

We periodically interact with the world and when we interact with the world, what we typically do is we take our latest vector φ, we construct some policy out of φ, for instance using epsilon-greedy or Boltzmann exploration, we send it out into the world and it brings back one or more transitions.

And you can think of this not as a discrete decision that we make periodically, but as a continuous process that is always running. All of this is process 1: the data collection process, which takes steps in the environment and sends each step it takes back to our replay buffer. Our replay buffer has a finite size, we can’t just keep adding stuff to it forever, so we also have another process, an eviction process, which periodically throws things out of the buffer when it gets too big:

A very simple and reasonable choice is to simply structure the replay buffer as a ring buffer, where the oldest thing in the buffer gets thrown out when a new thing gets added in. Then you have your target parameters φ’, which are used to compute those target values, and you have your current parameters φ, which are the ones that you’re going to give to process one in order to construct that epsilon greedy policy to collect more data. Between these two parameters, there is process two, which updates the target parameters and is typically very slow:

And then you have process 3, which is kind of the main learning process. It loads a batch of transitions from the replay buffer, it loads in the target parameters φ’, it uses the target parameters to calculate target values for every transition in the batch that was sampled, it uses those to update the current parameters φ and then saves them back out into the current parameters:

This is a graphical depiction of a general Q-learning recipe that encompasses all of the algorithms we’ve discussed. All of them can essentially be instantiated as special cases of these general processes.

  • Online Q-learning is a special case where you evict immediately, meaning the size of your buffer is one. And then process one, two and three all run at the same speed and they all take one step sequentially. So process one takes one step, which means you collect one transition; process two takes one step, which means that your target values are always computed using the latest parameters; process three takes one step, which means that you make one gradient update.
  • The DQN algorithm that we mentioned before is also pretty similar: process one and process three run at the same speed, and then process two is very slow and the replay buffer is quite large (so you might store up to a million transitions). Part of why this starts looking so weird is that when your replay buffer is large, process one and process three are pretty heavily decoupled, because once it’s large enough the probability you’ll sample the transition that you just collected becomes pretty low. It does turn out that it’s actually quite important to collect data pretty quickly, so the performance of your Q-learning algorithm can degrade rapidly if you don’t collect data fast enough, but nonetheless process one and process three have quite a bit of buffer space between them.
  • The fitted Q-iteration algorithm can also be viewed as a special case of this. Process three is in the inner loop of process two, which itself is in the inner loop of process one. In this algorithm, you do your regression all the way to convergence, then you update your target network parameters, and you might alternate these a few times; and then in the outer outer loop you pop all the way back out and collect more data.

But these are really not all that different, they’re just particular choices about the rates at which we run all these different processes and there’s something of course a little bit deeper about this, because if process two and process one are completely halted, then process three is just faced with a standard convergent supervised learning problem. By varying the rates of these different processes, by making them all run at different rates, we’re essentially mitigating the effects of non-stationarity. Because if the rate of process two, for example, is very different from process three, then to process three (which is running a lot faster) it will kind of look like everything is almost stationary. And this is the deeper reason why having these three different processes running at different rates can help Q-learning algorithms converge more effectively.

Improving Q-learning

Let’s discuss some practical considerations that we need to take into account when actually implementing Q-learning algorithms, and then some improvements that can make them work a bit better.

Are our Q-values actually accurate?

We can think of Q-values as an abstract object that we use inside reinforcement learning to help us improve our policy and get that argmax, but a Q-function is also a prediction: it’s a prediction about the total reward that you will get in the future if you start in a particular state, take a particular action, and then follow your policy. So, it makes sense to ask whether these predictions are actually accurate. Do they match up with reality?

If we look at the basic learning curve, where the x-axis is the number of iterations of Q-learning and the y-axis is the average reward per episode, and we look at it on a bunch of Atari games, we’ll see that for all of these Atari games our average reward per episode is going up. So, things are getting better. If we look at the average Q-values that are being predicted (the two plots on the right), we’ll see that the Q-function is predicting larger and larger Q-values as training progresses. And that intuitively makes sense: as training progresses, our policy gets better and it’s getting higher rewards, so our Q-function should also predict higher Q-values.

Interestingly, the relative Q-values of different states and actions seem to make sense on a successful training run, but their actual absolute values will in practice not be very predictive of real values. In particular, the Q-function seems to systematically think it’s going to get larger rewards than it actually gets. This problem is sometimes referred to as overestimation in Q-learning and it has actually a fairly straightforward and intuitive reason.

Overestimation

Let’s look at how we’re computing our target values. When you compute your target value, you take your current target Q-function and you take the max of that Q-function with respect to the action. This max is the problem:

Now, let’s forget about Q-values for a minute and just imagine that you have two random variables, x1 and x2. You could think of x1 and x2 as normally distributed random variables, so they have some true value plus some noise. You can prove that the expected value of the max of x1 and x2 is always greater than or equal to the max of the expected value of x1 and the expected value of x2.

The intuition for why this is true is that when you take the max of x1 and x2, you’re essentially picking the value that has the larger noise, and even though the noise on x1 and x2 might be zero mean, the max of two zero-mean noises is not in general zero mean. You can imagine that each noise is positive half the time and negative half the time, but when you take the max of the two, if either of them is positive you’ll get a positive answer, and the probability that at least one of the two noises is positive is pretty high. Now, if you imagine that your Q-function has some noise, then when you take this max in the target value, you’re doing exactly this. Imagine that Qφ’ for each action represents the true Q-value of that action plus some noise: when you take the max in the target value, you’re actually selecting the positive errors, and for the same reason that the expected value of the max of x1 and x2 is greater than or equal to the max of their expected values, the max over the actions will systematically select the errors in the positive direction. This means it will systematically overestimate the true Q-values, even if your Q-function’s errors are not initially biased in either direction. For this reason, the max preferentially selects errors in the positive direction.
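As a quick numerical sanity check of this inequality (a minimal sketch with NumPy), two zero-mean Gaussian "Q-value errors" give an expected max well above zero, even though the max of the expectations is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two noisy estimates whose true values are both 0.
x1 = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
x2 = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

print(max(x1.mean(), x2.mean()))   # max of the expectations: close to 0
print(np.maximum(x1, x2).mean())   # expectation of the max: close to 1/sqrt(pi), about 0.56
```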

How can we fix this? If we think back to the Q-iteration, the way that we got this max was by basically modifying the policy iteration procedure. We had our greedy policy and we then sent that argmax back into our Q-function to get its value. This is just another way of saying that:
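In symbols, the observation is simply that the max can be rewritten in terms of the argmax (a reconstruction of the identity this paragraph refers to):

$$\max_{a'} Q_{\phi'}(s', a') = Q_{\phi'}\big(s',\ \arg\max_{a'} Q_{\phi'}(s', a')\big)$$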

This is actually the observation that we’re going to use to try to mitigate this problem. See, the trouble is that we select our action according to Qφ’, so if Qφ’ erroneously thinks that some action is a little bit better because of some noise, then that’s the action we’ll select and then the value that we’ll use for our target value is the value for that same action which has that same noise. But if we can somehow decorrelate the noise in the action selection mechanism from the noise in the value evaluation mechanism, then maybe this problem can go away. So, the problem is that the value also comes from the same Qφ’, which has the same noise as the rule that we used to select our action.

Double Q-learning

If the function that gives us the value is decorrelated from the function that selects the action, then in principle this problem should go away. So the idea is to just not use the same network to choose the action as the network that we use to evaluate the value. Double Q-learning uses two networks, one which we’re going to call φa and another which we’re going to call φb. φa uses the values from φb to evaluate the target values, but selects the action according to φa. So if you assume that φa and φb are decorrelated, then the action that φa selects for the argmax will be corrupted by some noise, but that noise will be different from the noise that φb has, which means that when φb evaluates that action, if the action was selected because it had a positive noise, then φb will actually give it a lower value. So the system is kind of self-correcting. Analogously, φb is updated by using φa to evaluate the values, while selecting the action according to φb. This is the essence of double Q-learning.
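In symbols, the two-network scheme can be sketched like this (a reconstruction in standard double Q-learning notation):

$$Q_{\phi_A}(s,a) \leftarrow r + \gamma\, Q_{\phi_B}\big(s',\ \arg\max_{a'} Q_{\phi_A}(s', a')\big)$$

$$Q_{\phi_B}(s,a) \leftarrow r + \gamma\, Q_{\phi_A}\big(s',\ \arg\max_{a'} Q_{\phi_B}(s', a')\big)$$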

In practice, the way that we can implement double Q-learning is without actually adding another Q-function, but actually using the two Q-functions we already have. We can use the current and target networks.

If this is standard Q-learning:
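In standard Q-learning, the target network both selects and evaluates the action (a reconstruction of the target):

$$y = r + \gamma\, Q_{\phi'}\big(s',\ \arg\max_{a'} Q_{\phi'}(s', a')\big)$$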

This is double Q-learning:
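In double Q-learning, the current network selects the action and the target network evaluates it (again a reconstruction):

$$y = r + \gamma\, Q_{\phi'}\big(s',\ \arg\max_{a'} Q_{\phi}(s', a')\big)$$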

We select the action using Qφ, but evaluate it using Qφ’. As long as φ’ and φ are not too similar, then these will be decorrelated. You could say that we do still have a little bit of the moving targets problem, because as our φ changes so does our action, but presumably the change in the argmax is a very sudden discrete change and it doesn’t happen all the time, so if you have three different actions, the argmax isn’t going to change as often. Of course, φ’ and φ are not totally separate from each other, because periodically you do set φ’ to be equal to φ, but in practice it actually tends to work pretty well and it mitigates a large fraction of the problems with overestimation.

Multi-step returns

There’s another trick that we can use to improve Q-learning algorithms and it’s similar to something that we saw in the actor-critic article. It’s the use of multi-step returns. This is our Q-learning target:

Where does the signal come from? If your initial Q-function is very bad, then almost all of your learning has to come from the reward, and the second term is just contributing noise. On the other hand, if the Q-function is good, the target values do most of the heavy lifting. Early on in training your Q-function is pretty bad, so almost all of your learning signal really comes from the reward. Later on in training, your Q-function becomes better and the Q-values are much larger in magnitude than the rewards, so later on in training the Q-values dominate. But your initial learning period can be very slow if your Q-function is bad, because the target value is mostly dominated by the Q-value. This is quite similar to what we saw in actor-critic when we talked about how the actor-critic style update that uses the reward plus the next value has lower variance but is not unbiased, because if the value function is wrong then your advantage values are completely messed up. And Q-learning is the same way: if the Q-function is wrong, then your target values are really messed up and you’re not going to be making much learning progress. The alternative that we had in the actor-critic lecture is to use a Monte Carlo sum of rewards, because the rewards are always the truth; they’re just higher variance, because they represent a single-sample estimate. We can use the same basic idea in Q-learning. So, Q-learning by default does this kind of one-step backup, which has maximum bias and minimum variance, but you could construct a multi-step target just like in actor-critic:
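A reconstruction of the n-step target described below:

$$y_{t} = \sum_{t'=t}^{t+n-1} \gamma^{\,t'-t}\, r_{t'} \;+\; \gamma^{\,n} \max_{a'} Q_{\phi'}(s_{t+n}, a')$$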

The way that you can construct a multi-step target is by not just using one reward, but summing the rewards from t’ = t to t + n − 1. You can verify that if n = 1, then you recover exactly the standard rule that we had for Q-learning, but for n larger than one you sum together multiple reward values and then you use your target network for the t + n step, multiplied by gamma to the n. This is sometimes called an n-step return estimator, because instead of summing the reward for one step, you sum it for n steps. Just like with actor-critic, the trade-off is that it gives you higher variance, because of that single-sample estimate of the rewards, but lower bias, because even if your Q-function is incorrect, it is now multiplied by gamma to the n, and for large values of n, gamma to the n might be a very small number. So let’s talk about Q-learning with these n-step returns: it’s less biased, because the Q-values are multiplied by a small number, and it’s typically faster early on, because when the target values are bad, those sums of rewards really give you a lot of useful learning signal. Unfortunately, once you use n-step returns, this is actually only a correct estimate of the Q-value when you have an on-policy sample. The reason is that if you have a sample collected with a different policy, then the actions at step t+1 and onward might be different for your new policy, and if your new policy would take a different action at those steps, the rewards you observed in the sample no longer match what your policy would actually get. So, n-step returns are technically not correct with off-policy data; off-policy, you’re technically only allowed to use n equals one. In other words, the estimator is only correct when learning on-policy, because off-policy you end up using the actions from the sample for those intermediate steps, which are not the actions that your new policy would have taken. How can we fix this?

  • We can ignore the problem
  • We can dynamically cut the trace, so we can choose n to only get on policy data. Essentially, we can look at what our deterministic greedy policy would do, we could look at what we actually did in the sample and we can choose n to be the largest value such that all of the actions exactly match what our policy would have done and that will also remove the bias. But this works well when data is mostly on policy and the action space is pretty small.
  • Another thing we can do is importance sampling: we can construct a stochastic policy and importance weight these n-step return estimators.
  • You condition the Q-function with some other additional information that allows you to make it off policy.

Q-learning with continuous actions

It is actually possible but somewhat more complicated to extend Q-learning procedures to the case when we have continuous actions.

What’s the problem with continuous actions? Well, the problem is that when you select your actions, you need to perform the argmax over actions.

And doing it over discrete actions is straightforward: you simply evaluate the Q-value for every possible action and take the best one. But when you have continuous actions, it’s much harder. It is harder in two places: when evaluating the argmax policy, and when computing the target value, which requires the max (or, in the case of double Q-learning, also an argmax).

Moreover, the target value max is particularly problematic because it happens in the inner loop of training, so you really want it to be very fast and very efficient. So how can you perform this max when you have continuous actions? There are three choices:

Option 1: optimization

Option one is to use a continuous optimization procedure, like for instance gradient descent. Gradient descent by itself can be pretty slow, because it requires multiple gradient computations and it happens in the inner loop of an outer-loop learning procedure, so there are better choices that we could use. Plus, our action space is typically low-dimensional, so in some sense it presents a slightly easier optimization problem than the kind of problems we typically take on with SGD.

So, it turns out that for evaluating the max with optimization, a particularly good choice is to use a derivative free stochastic optimization procedure. A very simple solution is to simply approximate the max over a continuous action as the max over a discrete set of actions that are sampled randomly.

For instance, you could sample a set of n actions, maybe uniformly at random, from the set of valid actions and then take the largest Q-value among those actions. That’s not going to give you an exact max, it’ll give you a very approximate max, but if your action space is pretty low-dimensional and you can bombard it with enough samples, this max might actually be pretty good. And of course if overestimation is your problem, this might actually suffer from overestimation less, because the max is less effective. This also has the advantage of being dead simple, and it’s very efficient to parallelize, because you can essentially use your favorite deep learning framework and just treat these different actions as different points in a mini-batch and evaluate all of them in parallel. The problem is that it’s not very accurate, especially as the action space dimensionality gets larger: this random sampling method just doesn’t give you a very accurate max.
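Here is a minimal sketch of this random-sampling trick, assuming a hypothetical q_net that scores a batch of (state, action) pairs and a box-shaped action space:

```python
import torch

def approx_max_q(q_net, state, num_samples=1000, action_dim=2,
                 action_low=-1.0, action_high=1.0):
    """Approximate max_a Q(s, a) by scoring randomly sampled actions in one batch."""
    # Sample candidate actions uniformly from the valid action box.
    actions = torch.rand(num_samples, action_dim) * (action_high - action_low) + action_low
    states = state.expand(num_samples, -1)        # repeat the single (1, state_dim) state
    values = q_net(states, actions).squeeze(-1)   # evaluate all candidates in parallel
    best = values.argmax()
    return values[best], actions[best]            # approximate max and approximate argmax
```

Because every candidate action is just another row in the mini-batch, the whole max is a single forward pass through the network.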

If you do want a more accurate solution, there are better algorithms based on basically the same principle.

  • Cross-entropy method (CEM) is a simple iterative stochastic optimization scheme. It consists of sampling sets of actions just like in the simple solution above, but then, instead of simply taking the best one, it refines the distribution from which you sample so that more samples land in the good regions, and then repeats (see the sketch after this list). This can also be a very fast algorithm if you’re willing to parallelize and you have a low-dimensional action space.
  • CMA-ES: you can kind of think of it as a much fancier version of CEM, which is substantially less simple, but structurally very similar. And these methods work okay for up to about 40 dimensional action spaces, so if you use one of these solutions you simply plug this in place of your argmax to find the best action and the rest of your Q-learning procedure stays basically the same.
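As a rough illustration of the CEM variant mentioned above, here is a sketch assuming a box-shaped action space and a hypothetical q_fn that scores a batch of candidate actions for a single state:

```python
import numpy as np

def cem_argmax_q(q_fn, state, action_dim, n_iters=3, pop_size=64, n_elite=6,
                 action_low=-1.0, action_high=1.0):
    """Approximate argmax_a Q(state, a) with the cross-entropy method (CEM)."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(n_iters):
        # Sample a population of candidate actions from the current Gaussian.
        actions = np.random.randn(pop_size, action_dim) * std + mean
        actions = np.clip(actions, action_low, action_high)
        values = q_fn(state, actions)                       # one Q-value per candidate
        elites = actions[np.argsort(values)[-n_elite:]]     # keep the best candidates
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit and repeat
    return mean
```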

Option 2: function class that is easy to optimize

Alternatively, you can use a function class that is inherently easy to optimize. Arbitrary neural networks are not easy to optimize with respect to one of their inputs, but other function classes have closed-form solutions for their optima. One example of such a function class is the quadratic: you could, for example, express your Q-function as a function that is quadratic in the action, and the optimum of a quadratic can be computed in closed form.

One of the ways you could do this is with something called the Normalized Advantage Function (NAF). You have a neural network that outputs three quantities: a scalar-valued bias, a vector, and a matrix. The vector and matrix together define a quadratic function in the action, so this function is completely non-linear in the state (it can represent any function of the state), but for a given state the shape of the Q-value in terms of the action is always quadratic, and when it’s always quadratic, you can always find the maximum. In this case the argmax is just μφ(s). The reason it’s called “Normalized” is that if you exponentiate it, you get a normalized probability distribution. Now we’ve made this maximization operation very easy, at the cost of reducing the representational capacity of our Q-function: if the true Q-function is not quadratic in the action, then of course we can’t represent it exactly. So there’s no change to the algorithm, and it’s just as efficient as Q-learning, but it loses some representational power.
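In symbols, the quadratic parameterization and its closed-form optimum can be written as (a reconstruction; the scalar, vector and matrix heads are written here as Vφ, μφ and Pφ):

$$Q_\phi(s,a) = -\tfrac{1}{2}\big(a - \mu_\phi(s)\big)^{\top} P_\phi(s)\big(a - \mu_\phi(s)\big) + V_\phi(s)$$

$$\arg\max_a Q_\phi(s,a) = \mu_\phi(s), \qquad \max_a Q_\phi(s,a) = V_\phi(s)$$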

Option 3: learn an approximate maximizer

This is going to be a little bit similar to option one, only instead of running the optimization separately for every single argmax that we have to take, we’ll actually train a second neural network to perform the maximization. Remember that:

So as long as we can do that argmax, we can perform Q-learning. The idea is to train another network μθ(s) such that μθ(s) is approximately the argmax of Qφ. You can also think of μθ(s) as a kind of policy, because it looks at a state and outputs an action, specifically the argmax action. To do this, we just solve for θ:
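In symbols, the new network is trained so that its output maximizes the Q-value, which we can do by backpropagating through both networks:

$$\theta \leftarrow \arg\max_{\theta}\ Q_\phi\big(s, \mu_\theta(s)\big), \qquad \frac{dQ_\phi}{d\theta} = \frac{da}{d\theta}\,\frac{dQ_\phi}{da}$$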

Now these are our targets:
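A reconstruction of these targets, where the target critic Qφ’ is evaluated at the action proposed by the target actor μθ’:

$$y_j = r_j + \gamma\, Q_{\phi'}\big(s'_j,\ \mu_{\theta'}(s'_j)\big)$$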

The algorithm would thus look like this:

First, we take some action and we get an observation that we add to the buffer. Then, we sample a mini-batch from the buffer uniformly at random. Step three, we compute the target value using the target nets Qφ’ and μθ’. Step four, just like in Q-learning, we perform a gradient update on φ. Step five, we perform a gradient update on θ. The gradient on θ takes the derivative of the Q-value with respect to the action and then multiplies that by the derivative of the action with respect to θ, which is just backpropagation through μ. Then, we update our target parameters φ’ and θ’, for example using Polyak averaging. Then, we repeat this process.
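As a rough sketch of steps three to five in PyTorch (assuming q_net(s, a), actor(s), their target copies q_target and actor_target, and their optimizers already exist; this is an illustration of the idea, not the exact lecture pseudocode):

```python
import torch
import torch.nn.functional as F

def update(batch, q_net, q_target, actor, actor_target, q_opt, actor_opt, gamma=0.99):
    s, a, r, s2, done = batch                      # tensors sampled from the replay buffer

    # Steps three and four: regress Q(s, a) onto r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        y = r + gamma * (1 - done) * q_target(s2, actor_target(s2))
    q_loss = F.mse_loss(q_net(s, a), y)
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # Step five: push mu(s) towards the argmax of Q; autograd chains dQ/da with da/dtheta.
    actor_loss = -q_net(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```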

Implementation tips and examples

Q-learning methods are generally quite a bit more finicky to use than policy gradient methods, so they tend to require a little bit more care to use correctly. It takes some care to stabilize Q-learning algorithms, and what I would recommend is to start off by testing your algorithm on some easy, reliable problems where you know it should work. This is just to make sure your implementation is correct, because essentially you have to go through several phases of troubleshooting: you first have to make sure that you have no bugs, then you have to tune your hyperparameters, and only then get it to work on your real problems. You want to do the debugging before the hyperparameter tuning, which means you want to do it on really easy problems, where basically any correct implementation should work. Q-learning performs very differently on different problems.

Large replay buffers do tend to help improve stability quite a lot, so using a replay buffer with a size of about 1 million can be a pretty good choice. At that point the algorithm really starts looking a lot more like fitted Q-iteration, which is perhaps part of the explanation for its improved stability. Lastly, Q-learning takes a lot of time, so be patient: it might be no better than random for a long time while that random exploration finds the good transitions, and then it might take off once those good transitions are found. Start with high exploration, that is, with large values of epsilon, and then gradually reduce exploration as you go, because initially your Q-function is garbage anyway, so it’s the random exploration that will be doing most of the heavy lifting; later on, once your Q-function gets better, you can decrease epsilon. So it often helps to put epsilon on a schedule.

More advanced tips for Q-learning

  • The gradients of the Bellman error can be very big; something that’s a little troublesome is that if you have a really bad action, you don’t really care about the value of that action, but your squared-error objective really cares about figuring out exactly how bad it is. So if you have some good actions that are around +10, +9, +8 and you have some bad action that is −1,000,000, the bad action will create a huge gradient even though you don’t really care that it’s exactly −1,000,000. In other words, it could also be −1,000 and it would result in the same policy, but your Q-function objective really cares about that, and that will result in big gradients. What you can do is either clip your gradients or use what’s called a Huber loss:
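The Huber loss on the error e, with threshold δ, has the form:

$$L_\delta(e) = \begin{cases} \tfrac{1}{2}e^2 & \text{if } |e| \le \delta \\ \delta\big(|e| - \tfrac{1}{2}\delta\big) & \text{otherwise} \end{cases}$$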

It is an interpolation between a squared-error loss and an absolute-value loss: far away from the minimum, the Huber loss looks like the absolute value, and close to the minimum it flattens out into a quadratic.

  • Double Q-learning helps a lot in practice; it is simple and it has no downsides.
  • N-step returns also help a lot, but have some downsides: they will systematically bias your objective, especially for larger values of n.
  • Scheduling exploration and scheduling learning rates can help, and so can using the Adam optimizer.
  • When debugging your algorithm make sure to run multiple random seeds, because you’ll see a lot of variation between random seeds, so you should run a few different random seeds to make sure that things are really working the way you expect.

Summary

We talked about Q-learning in practice, how we can use replay buffers and target networks to stabilize it; we talked about a generalized view of fitted-Q iteration in terms of three processes, we talked about how double Q-learning can make Q-learning algorithms work a lot better; we talked about multi-step Q-learning and finally about how to make Q-learning work with continuous actions.

Feel free to drop me a message or:

  1. Connect and reach me on LinkedIn and Twitter
  2. Follow me on Medium
