4 Value function methods

Samuele Bolotta
24 min read · May 26, 2022

This is the fourth of a series of articles in which I summarize the lectures from CS285 held by Professor Sergey Levine, to whom all credit goes. All images are taken from his lectures. This article I wrote is an introduction to deep reinforcement learning, and actor-critic algorithms build on the policy gradient framework that we discussed in this article; on top of that, they are augmented with learned value functions and Q-functions.

Can we omit the policy gradient entirely?

What if we just learn a value function and then try to use that value function to figure out how to act? The intuition for why this should be possible is that the value function tells us which states are better than others, so if we simply select actions that go into the better states maybe we don’t need an explicit policy neural network anymore.

Here’s the way to make this intuition a bit more formal. The advantage is:

It is the difference between our Q-value and our value function; intuitively, the advantage says how much better this action is than the average action according to the policy.
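Written out (a reconstruction of the slide's definition, since the original image is from the lecture):

```latex
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
```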

Then the arg max of the advantage is the best action that we could take if we followed the policy thereafter. This means that the arg max is going to be at least as good as any action that we would have sampled from our current policy — we know it's at least as good because it's actually the best. The interesting thing is that this is true regardless of what the policy actually is. This arg max should immediately suggest to us that regardless of which policy we had before, even if it was a very bad random policy, we ought to be able to improve it by selecting actions according to the arg max of the advantage.

Maybe we could forget about representing policies explicitly and we could just use this argmax. This is the basis for value-based methods. We will construct new policies implicitly, so at every iteration we can construct a new policy that assigns a probability of one to the action a if it is the argmax of the advantage, and 0 otherwise.
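Concretely, the implicit greedy policy described here can be written as (my reconstruction of the slide's formula):

```latex
\pi'(a_t \mid s_t) =
\begin{cases}
1 & \text{if } a_t = \arg\max_{a_t} A^{\pi}(s_t, a_t) \\
0 & \text{otherwise}
\end{cases}
```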

So now instead of taking a gradient ascent step on an explicit policy, we will construct this implicit policy as the arg max, so there’s no actual learning that happens.

This is called policy iteration. It’s called policy iteration because we iterate between evaluating the policy in step one and updating the policy in step two. Step two is pretty straightforward, but the big puzzle is really how to do step one. How do you evaluate the advantage for a particular state-action tuple for a given previous policy?

Like before, we can express the advantage as the reward plus gamma times the expected value of the value function at the next timestep, minus the value at the current timestep. Let's evaluate the value.
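In symbols (again a reconstruction of the lecture's formula):

```latex
A^{\pi}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\big[ V^{\pi}(s_{t+1}) \big] - V^{\pi}(s_t)
```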

Dynamic programming

One way to evaluate the value in order to then estimate these advantages for policy iteration is to use dynamic programming. For now, let's assume that we know the transition probabilities and that the state and action spaces are both discrete and small. This is not the setting we usually operate in with model-free RL, but we'll assume it for now just so that we can derive the simple dynamic programming algorithm and then turn it into a model-free algorithm later.

We could imagine that we can essentially enumerate our entire state and action space — we can represent it with a table. For instance, in this gridworld you have 16 states and 4 actions per state.

Your transition probabilities T are represented by a 16 by 16 by 4 tensor. So, when we say we’re doing tabular reinforcement learning or tabular dynamic programming, what we’re really referring to is a setting like this.

Now we can write down the bootstrapped update for the value function that we saw in the previous article, in terms of these explicit known probabilities. If we want to update the value of a state, we can set it to be the expected value with respect to the action sampled from the policy of the reward plus gamma times the expected value over the next state sampled from the transition probabilities of the value of the next state.
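As a formula (my reconstruction of the slide):

```latex
V^{\pi}(s) \leftarrow \mathbb{E}_{a \sim \pi(a \mid s)}\Big[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V^{\pi}(s') \big] \Big]
```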

If you have a tabular MDP, meaning you have a small discrete state space and you know the transition probabilities, each of the expected values can be computed by summing over all values of that random variable and multiplying the value inside the parentheses by its probability. Then of course we need to know the value of the next state, so we’re just going to use our current estimate of the value function for that value. Once we calculate a value function this way, then we can construct a better policy pi prime by assigning probability 1 to the action that is the argmax of the advantage. This also means that our policy will be deterministic, so expected values with respect to this policy will be pretty easy to compute. We can simplify our bootstrapped update by removing the expectation with respect to the policy and just directly plugging in the only action that has non-zero probability:
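Written out (my reconstruction of that simplified update, since the slide image is from the lecture):

```latex
V^{\pi}(s) \leftarrow r\big(s, \pi(s)\big) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, \pi(s))}\big[ V^{\pi}(s') \big]
```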

Now we can plug this procedure into our policy iteration algorithm, specifically in step 1, when we evaluate the value of the state. You can prove that repeating this recursion eventually converges to a fixed point and this fixed point is the true value function.

Even simpler dynamic programming

You have probably noticed how policy evaluation works: values propagate consistently on each iteration, but slowly. Yet even if we truncated policy evaluation after a single iteration, we could still improve upon the initial policy by taking the greedy policy of the Q-function estimate after a single state-space sweep of policy evaluation. This algorithm is another fundamental algorithm in RL: it's called value iteration (VI). VI can be thought of as "greedily greedifying policies," because we compute the greedy policy as soon as we can, greedily. VI doesn't wait until we have an accurate estimate of the policy's value before it improves the policy; instead, VI truncates the policy-evaluation phase after a single state sweep.

Mathematically, the idea is that the arg max of the advantage is equal to the arg max of the Q-function: if you remove the value of the state, which you can do because the arg max is taken with respect to the action and the value does not depend on the action, then you are left with just the Q-function.
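In symbols (my reconstruction):

```latex
\arg\max_{a_t} A^{\pi}(s_t, a_t) = \arg\max_{a_t} \big[ Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \big] = \arg\max_{a_t} Q^{\pi}(s_t, a_t)
```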

Instead of getting the index from the arg max of the advantage and then plugging it into the Q-function to get the highest value, we can skip the step where we recover the index and take the value directly.

In step one, we set the Q-values to be the reward plus the expected value of the value function at the next timestep, and then in step two we set the value function to be the max over the actions in the Q-function table. Here, explicit policy computation is skipped: we don't ever actually represent the policy explicitly, but you can think of it as showing up implicitly in step two, because setting the value to be the max over the actions in the Q-value table is analogous to taking the arg max and then plugging the index of the arg max into the table to recover the value. But since taking the arg max and then plugging it into the table is the same as just taking the max, we can short-circuit that step and get this procedure.
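To make this concrete, here is a minimal tabular value iteration sketch in Python; the array names and shapes are illustrative assumptions, not code from the lecture:

```python
import numpy as np

def value_iteration(T, R, gamma=0.99, n_iters=1000, tol=1e-6):
    """Tabular value iteration. T[s, s2, a] = p(s2 | s, a), R[s, a] = r(s, a)."""
    n_states, _, n_actions = T.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Step 1: Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * np.einsum("spa,p->sa", T, V)
        # Step 2: V(s) = max_a Q(s, a) -- the implicit greedy improvement
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    greedy_policy = Q.argmax(axis=1)  # recover the implicit policy if needed
    return V, greedy_policy
```

For the 4x4 gridworld mentioned above, T would have shape (16, 16, 4) and R shape (16, 4).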

Fitted value iteration and Q-iteration

So far, we talked about how we can learn value functions represented in a tabular form, so there's no neural net. Now let's talk about how we can introduce neural networks and function approximation.

First, how do we represent the value function? So far, we talked about how we can represent it as a big table. However, let's say that you're playing a video game from images: with a 200 by 200 pixel image, the number of possible states is 255 (the number of values each pixel can take on) raised to the third power (because there are three color channels), all raised to the power 200 times 200 (because there are that many pixels). Maintaining a table with this many entries is impossible: it is more than the number of atoms in the universe. This is known as the curse of dimensionality: the simple fact that if you have a state space with many dimensions, the number of entries that you need in a table for tabular reinforcement learning is exponential in the number of dimensions.

So we'll use a function approximator. We're going to have a neural net value function that maps from states to scalar values. We can fit our neural net value function by doing least-squares regression onto target values — and if we use the value iteration procedure from the previous section, then our target values are just the max over the action of the Q-function.

Then our fitted value iteration algorithm would look like this:

Step one, compute your target values by constructing the Q-function for every possible action at each sampled state (we still assume that we have a discrete action space, so we can perform this enumeration exactly); for every action we evaluate its reward plus gamma times the expected value of the value function at the next state. Take the max over that — that gives us our target value — and then in step two regress onto those target values.
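Putting the two steps into symbols (a reconstruction of the fitted value iteration update):

```latex
y_s \leftarrow \max_{a}\Big[ r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ V_{\phi}(s') \big] \Big]
\qquad
\phi \leftarrow \arg\min_{\phi} \tfrac{1}{2} \sum_{s} \big\| V_{\phi}(s) - y_s \big\|^2
```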

This is a reasonable algorithm that we could use, but it still requires us to know the transition dynamics:

There are two ways in which this requires knowledge of the transition dynamics. First, it requires being able to compute that expected value; second, and perhaps more importantly, it requires us to be able to try multiple different actions from the same state, which we can't do in general if we can only run policies in the environment instead of teleporting to a state multiple times and trying multiple actions from that exact same state.

Let's go back to policy iteration. In policy iteration, we alternated between evaluating the Q-function and setting the policy to be the greedy arg max policy. Step one in policy iteration involved policy evaluation, which required repeatedly applying the value function recurrence that we saw before. So what if, instead of applying the value function recurrence to learn the value function, we directly constructed a Q-function recurrence in an analogous way?

If I wanted to construct the Q-function at a particular state-action tuple, I could write exactly the same recurrence, except that now, since the Q-function is a function of a state and an action, I don't need to evaluate the next state given the state and the policy; I just evaluate the next state given the state-action tuple that I'm training my Q-function on. This might at first seem like a very subtle difference, but it's a very important one, because now, as my policy changes, the action for which I need to sample the next state doesn't actually change. This means that if I have a bunch of samples (state, action, next state), I can use those samples to fit my Q-function regardless of what policy I have. The only place where the policy shows up is as an argument to the Q-function at the next state, inside the expectation. And it turns out that this seemingly very simple change allows us to perform policy-iteration-style algorithms without actually knowing the transition dynamics, just by sampling some tuples, which we can get by running any policy we want. If we do this for step one in policy iteration, we no longer require knowing the transition probabilities.
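The Q-function recurrence then reads (my reconstruction):

```latex
Q^{\pi}(s, a) \leftarrow r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ Q^{\pi}\big(s', \pi(s')\big) \big]
```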

Can we do the “max” trick again?

Now we seemingly took a step back, because before, we derived policy iteration and then simplified it to get value iteration, and the way that we got value iteration was by using the max trick. In value iteration, we saw that when we construct the policy we take the arg max, but then we simply take the value of that arg max action, so evaluating the value of the arg max is just like taking the max — which lets us forgo step two and directly perform value iteration.

Can we do the same max trick with Q functions? This is what we did before:

The way that we construct the fitted Q iteration algorithm is very much analogous to fitted value iteration. We construct our target value as the reward plus gamma times the expected value of the value function at the next state. Then, in step two we simply regress our Q function onto those target values:

The trick of course is that we have to evaluate step one without knowing the transition probabilities. We’re going to do two things: first, we’re going to replace the value of the next state with the max over the action at the Q function:

This is because we’re only approximating the Q function and not the value function.

Second, instead of taking a full expectation over all possible next states, we’re going to use the sampled state that we got when we generated that sample. And now all we need to run this algorithm is samples (state, action, next state), which can be constructed by rolling out our policy. This doesn’t require simulation of different actions, it only requires the actions that you actually sampled last time when you ran your policy.
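So the target value for each sampled transition becomes (a reconstruction of the fitted Q-iteration target):

```latex
y_i \approx r(s_i, a_i) + \gamma \max_{a'} Q_{\phi}(s'_i, a')
```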

This works even for off-policy samples, so this algorithm does not make any assumptions that the actions were actually sampled from the latest policy — the actions could have been sampled from anything. You can store all the data you've collected so far; it doesn't need to come from your latest policy.

Unfortunately, it turns out this procedure does not have any convergence guarantees for non-linear function approximation; however, if you use a tabular representation it is actually guaranteed to converge.

To put the pieces together, this is the fitted Q-iteration algorithm:

Step one, collect a dataset of transitions using some policy — the algorithm works for a wide range of different policies. One of the parameters you have to choose is the number of such transitions to collect.

Step two, for every transition that you sampled, calculate a target value.

Step three, train a new Q function, which means find a new parameter vector by minimizing the difference between the values of Q and the corresponding target value. One parameter you have to choose is the number of gradient steps that you will make in performing this optimization. Doing step three once doesn’t actually get you the best possible Q function; you could alternate step two and step three some number of times before going out and collecting more data.
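To make the loop concrete, here is a minimal PyTorch-style sketch of fitted Q-iteration for a discrete action space; q_net and collect_transitions are hypothetical placeholders (not the lecture's code), and q_net is assumed to map a batch of states to a (batch, n_actions) tensor of Q-values:

```python
import torch

def fitted_q_iteration(q_net, env, n_outer=100, n_transitions=1000,
                       n_inner=10, n_grad_steps=100, gamma=0.99, lr=1e-3):
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    for _ in range(n_outer):
        # Step 1: collect transitions with any policy (off-policy is fine).
        # a: LongTensor of action indices, shape (batch,)
        s, a, r, s_next = collect_transitions(env, q_net, n_transitions)
        for _ in range(n_inner):
            # Step 2: y = r + gamma * max_a' Q(s', a'), with no gradient
            # flowing through the target.
            with torch.no_grad():
                y = r + gamma * q_net(s_next).max(dim=1).values
            # Step 3: regress Q(s, a) onto the targets for several gradient steps.
            for _ in range(n_grad_steps):
                q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                loss = 0.5 * ((q_sa - y) ** 2).mean()
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return q_net
```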

From Q-iteration to Q-learning

Let's talk a little bit more about what it means for fitted Q-iteration to be an off-policy algorithm. Just as a reminder: off-policy means that you do not need samples from the latest policy in order to keep running your RL algorithm. Typically, what that means is that you can take many gradient steps on the same set of samples or reuse samples from previous iterations, so you don't have to throw out your old samples and you can keep using them — which in practice gives you more data to train on.

Intuitively, the main reason that fitted Q-iteration allows us to get away with using off-policy data is that the one place where the policy is used involves querying the Q-function rather than stepping through the simulator. So as our policy changes, what really changes is this max:

Remember that the way that we got this max was by taking the argmax and then plugging it back into the Q value to get the actual value for the policy. So inside of that max you can kind of unpack it and pretend that it’s the Q function of the next state and of the argmax of Q. And that argmax is basically our policy, so this is the only place where the policy shows up and conveniently enough it shows up as an argument to the Q function, which means that as our policy changes, we do not need to generate new rollouts — you can almost think of this as a kind of model. The Q function allows you to sort of simulate what kind of values you would get if you were to take different actions and then of course you take the best action if you want to most improve your behavior. This max approximates the value of our greedy policy at the next state.
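In other words (my reconstruction of the identity being used):

```latex
\max_{a'} Q_{\phi}(s', a') = Q_{\phi}\Big(s', \arg\max_{a'} Q_{\phi}(s', a')\Big)
```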

One way that you can think of fitted Q iteration structurally is that you have this big bucket of different transitions and what you’ll do is you’ll back up the values along each of those transitions and each of those backups will improve your Q value, but you don’t really care so much about which specific transitions they are so long as they kind of cover the space of all possible transitions quite well. So you could imagine you have this dataset of transitions and you’re just plugging away on this dataset, running fitted Q iteration, improving your Q function each time you go around the loop.

What is fitted Q-iteration optimizing?

The step where you take the max improves your policy, so in the tabular case this would literally be your policy improvement. Step three is minimizing the error of fit; if you had a tabular update, you would just directly write those targets into your table.

But since you have a neural network, you have to perform some optimization to minimize an error against those y's. You could think of fitted Q-iteration as optimizing an error, the error being the Bellman error — the difference between the Q-function and the targets. That is kind of the closest thing to an actual optimization objective, but of course that error itself doesn't really reflect the goodness of your policy; it's just the accuracy with which you're able to copy your target values. If the error is zero, then you know that you have the optimal Q-function, but if the error is not zero, then you can't really say much about the performance of this policy.

Now let’s discuss a few special cases.

Online Q-learning algorithms

You can instantiate a special case of the previous algorithm with particular choices for those hyperparameters, which corresponds to an online algorithm. In the online algorithm, you take exactly one action and observe one transition; then in step two, you compute one target value for the transition that you just took, very much analogous to how you calculate the advantage in online actor-critic for the one transition that you just took; then in step three you take one gradient descent step on the error between your Q-values and the target value that you just computed.

The equation that I have here looks a little complicated, but I basically just applied the chain rule to the objective inside the arg min in step three. The error in those parentheses is sometimes referred to as the temporal difference error.
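For reference, the online Q-learning update being described can be written as (my reconstruction; α is the learning rate):

```latex
\phi \leftarrow \phi - \alpha \, \frac{d Q_{\phi}}{d \phi}(s_i, a_i) \Big( Q_{\phi}(s_i, a_i) - \big[ r(s_i, a_i) + \gamma \max_{a'} Q_{\phi}(s'_i, a') \big] \Big)
```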

Exploration with Q-learning

While learning is progressing, using the greedy policy may not be such a good idea. The reason is that our arg max policy is deterministic, and if our initial Q-function is quite bad, it's not going to be random — it's going to be arbitrary. It will then essentially commit our arg max policy to take the same action every time it enters a particular state, and if that action is not a very good action, we might be stuck taking that bad action essentially in perpetuity and never discover that better actions exist. So in practice, when we run fitted Q-iteration, it's highly desirable to modify the policy that we use in step one to not just be the arg max policy, but to inject some additional randomness to produce better exploration. There are a number of choices that we can make in practice to facilitate this.

One common choice is called epsilon greedy; this is one of the simplest exploration rules that we can use with discrete actions and it simply says that with probability 1 minus epsilon you will take the greedy action and then with probability epsilon you will take one of the other actions uniformly at random.
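As a formula (my sketch; |A| is the number of actions):

```latex
\pi(a_t \mid s_t) =
\begin{cases}
1 - \epsilon & \text{if } a_t = \arg\max_{a_t} Q_{\phi}(s_t, a_t) \\
\epsilon / (|\mathcal{A}| - 1) & \text{otherwise}
\end{cases}
```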

If we choose epsilon to be some small number, it means that most of the time we take the action that we think is best, and that's usually a good idea — but we always have a small but non-zero probability of taking some other action, which ensures that if our Q-function is bad, eventually we'll just randomly do something better. A very common practical choice is to vary the value of epsilon over the course of training, and that makes a lot of sense, because you expect your Q-function to be pretty bad initially, so at that point you might want to use a larger epsilon; then, as learning progresses and your Q-function gets better, you can reduce epsilon.

Another exploration rule that you could use is Boltzmann exploration, where you select your actions in proportion to some positive transformation of your Q-values. A particularly popular positive transformation is exponentiation.

So if you take actions in proportion to the exponential of your Q-values, the best actions will be the most frequent, and actions that are almost as good as the best action will also be taken quite frequently, because they'll have similarly high probabilities; but if some action has an extremely low Q-value, it will almost never be taken. In some cases this kind of exploration rule can be preferred over epsilon-greedy, for two reasons (a code sketch of both rules follows below):

  • With epsilon-greedy, the action that happens to be the max gets much higher probability, and if there are two actions that are about equally good, the second best one has a much lower probability; with the exponentiation rule, if you really have two equally good actions, you'll take them about an equal number of times.
  • If you have a really bad action and you've already learned that it's just a really bad action, you probably don't want to waste your time exploring it — whereas epsilon-greedy won't make use of that knowledge.
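Here is a minimal sketch of both exploration rules; it is illustrative only — the epsilon-greedy branch below samples uniformly over all actions (a common simplification of "one of the other actions"), and the temperature parameter in the Boltzmann rule is my addition:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    # q_values: 1-D array of Q(s, a) for the current state.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: greedy action

def boltzmann(q_values, temperature=1.0):
    # Sample actions in proportion to exp(Q / temperature); subtracting the
    # max is for numerical stability and doesn't change the probabilities.
    logits = (q_values - np.max(q_values)) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return int(np.random.choice(len(q_values), p=probs))
```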

Summary

We've discussed value-based methods, which don't learn a policy explicitly, but just learn a value function or Q-function. We've discussed how, if you have a value function, you can recover a policy by using the arg max, and how we can devise the fitted Q-iteration method, which does not require knowing the transition dynamics — so it's a true model-free method — and how we can instantiate it in various ways, as a batch-mode, off-policy method or as an online Q-learning method.

Value functions in theory

Let's now delve into a bit of theory to explain what I meant before when I said that value-based methods with neural networks don't in general converge to the optimal solution. To get started, let's start with the value iteration algorithm that we covered before.

It's a pretty simple algorithm and it's a little easier for us to think about, but we'll get back to the Q-iteration methods a little bit later. To remind everybody, we can think of value iteration as having two steps. Step one, construct your table of Q-values as the reward plus gamma times the expected value at the next state; step two, set your value function to be the max over the actions in that table. So you can think of it as constructing a table of values and then iterating this procedure. The question we could ask is: does this algorithm converge? And if it does converge, what does it converge to? One of the ways that we can get started with this analysis is to define a Bellman operator:

When applied to a value function, this operator performs the following operations (the full expression is written out after the list):

  • First, it takes V and applies the operator T subscript a, a matrix of dimensionality |S| x |S| in which every entry is the probability of the next state given the current state and the action a (the action over which the max is later taken). So this is basically computing the expectation over the next state, which is a linear operation.
  • We multiply the result by gamma.
  • We add the vector r subscript a, a vector of rewards where for every state you pick the reward for the corresponding action, and then outside of all this we perform a max over the action. Crucially, this max is per element, so for every state we take a separate max.
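Putting the pieces together, the operator can be written as (a reconstruction of the slide; the max is taken element-wise, i.e. separately for each state):

```latex
B V = \max_{a} \big( r_a + \gamma \, T_a V \big), \qquad (T_a)_{s, s'} = p(s' \mid s, a), \qquad (r_a)_s = r(s, a)
```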

This funny way of writing the Bellman backup basically just captures the value iteration algorithm, which consists of repeatedly applying the operator B to the vector V.

One interesting property that we can show is that v star (the value function for the optimal policy) is a fixed point of B.

If we find a value function that satisfies this equation, we have the optimal value function, and if we use the arg max policy with respect to it, we'll get the optimal policy — the policy that maximizes total rewards. That means that V star is equal to B applied to V star. That's very nice: if we find a fixed point of B, then we have the optimal value function. Furthermore, it's possible to show that V star always exists: this fixed point always exists, it's always unique, and it always corresponds to the optimal policy. So the only question that we're left with is: does repeatedly applying B to V actually find this fixed point? In other words, does the fixed point iteration algorithm converge? If it does converge, it will converge to the optimal policy, and the solution is unique.

I won't go through the proof in detail, but the high-level sketch behind how we argue that value iteration converges is by arguing that B is a contraction. What does it mean to be a contraction? It means that if you have any two vectors V and V bar, then applying B to both V and V bar will bring those vectors closer together, meaning that the norm of BV minus BV bar is less than or equal to the norm of V minus V bar. In fact, it's a contraction by some coefficient, and that coefficient happens to be gamma: not only is the norm of BV minus BV bar less than or equal to the norm of V minus V bar, it's less than or equal to gamma times the norm of V minus V bar. So you will contract, and you'll contract by some non-trivial amount, which means that V and V bar always get closer together as you apply B to them.
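In symbols (a reconstruction of the fixed-point property and the contraction property in the infinity norm):

```latex
V^{\star} = B V^{\star}, \qquad \| B V - B \bar{V} \|_{\infty} \le \gamma \, \| V - \bar{V} \|_{\infty}
```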

Non-tabular value function learning

Regular value iteration can be written extremely concisely as just repeatedly applying this one step: V goes to BV.

Now, let's go to the fitted value iteration algorithm. Fitted value iteration has another operation: a step two where you perform the arg min with respect to the parameters. One of the ways you can think of supervised learning is that you have some set of value functions that you can represent. That set is a continuous set consisting of all possible neural nets with your particular architecture but with different weight values; we'll denote that set as Omega. In supervised learning, we sometimes refer to this as the hypothesis set or the hypothesis space. Supervised learning consists of finding an element in your hypothesis space that optimizes your objective, and our objective is the squared difference between the value of the state and our target value. Now, what is our target value? Our target value is basically BV. You can think of the entire fitted value iteration algorithm as repeatedly finding a new value function V prime, which is the arg min within the set Omega of the squared difference between V prime and BV, where the V inside BV is your previous value function.
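So each round of fitted value iteration performs (my reconstruction of the update):

```latex
V' \leftarrow \arg\min_{V' \in \Omega} \tfrac{1}{2} \sum_{s} \big\| V'(s) - (B V)(s) \big\|^2
```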

This supervised learning step is itself also a contraction. When you perform this supervised learning, you can think of it as a projection in the L2 norm. You have your old V, and you have your set of possible neural nets, represented by this line:

Omega is basically all the points on that line. When we construct BV, we might step off this line, so the point BV doesn’t lie in the set omega. When we perform supervised learning, when we perform step two of fitted value iteration, what we’re really doing is we’re finding a point in the set omega that is as close as possible to BV. And “as close as possible” means that it’s going to be at a right angle, so we’ll project down onto the set omega and it’ll be a right angle projection. That’ll get us v prime.

We can define this as a new operator. We'll call this operator Π, for "projection," and we'll say that ΠV is just the arg min within the set Omega of this objective, which is the squared L2 distance to V.
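In symbols (a reconstruction), the projection operator and the resulting fitted value iteration update are:

```latex
\Pi V = \arg\min_{V' \in \Omega} \tfrac{1}{2} \sum_{s} \big\| V'(s) - V(s) \big\|^2, \qquad V \leftarrow \Pi B V
```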

B is a contraction with respect to the infinity norm. Π is a contraction with respect to the L2 norm.

The reason why Π is a contraction is that if you take any two points in Euclidean space and project them onto a line, they can only get closer to each other, never farther apart. Unfortunately, ΠB is not a contraction of any kind.

If you imagine that the blue dot is your starting point and the yellow star is the optimal value function:

You take a step, so your regular value iteration will gradually get closer and closer to the star:

If you have a projected value iteration algorithm, a fitted value iteration algorithm, then you’re going to restrict your value function to this line each step of the way, while your Bellman backup will get you closer to the star in terms of infinity norm:

Then your projection will move you back onto the line:

And while both of those operations are contractions, notice that V prime is now actually further from the star than V is. And you can get these situations where each step of the way actually gets you further and further from V star.

The sad conclusions from all this are that:

  • Value iteration does converge in the tabular case
  • Fitted value iteration does not converge in general and it often doesn’t converge in practice
  • Fitted Q-iteration is the same: ΠB is not a contraction of any kind.

Importantly, some might think that step 3 in this algorithm might be just gradient descent, which converges:

However, Q-learning is not gradient descent. It is not taking gradient steps on a well-defined objective, because the target values in Q-learning themselves depend on the Q-values — but you're not considering the gradient through those target values. So the gradient that you're actually using is not the true gradient of a well-defined function, and that's why it might not converge. It's probably worth mentioning that you could turn this algorithm into a gradient descent algorithm by actually computing the gradient through those target values. The resulting algorithm, called the residual algorithm, unfortunately has very poor numerical properties and doesn't work very well in practice.

A sad corollary

Our actor-critic algorithm that we discussed before is also not guaranteed to converge under function approximation, for the same reason. There, we also do a Bellman backup when we use a bootstrapped update, and we do a projection when we update our value function. The composition of the two is not a convergent operator, so fitted bootstrapped policy evaluation also doesn't converge.

Summary

We talked about some value iteration theory: we discussed the operator for the backup and the operator for the projection; we talked about how the backup is a contraction and how tabular value iteration converges. We then talked about convergence properties with function approximation, where the projection is also a contraction, but because it's a contraction in a different norm, backup followed by projection is not actually a contraction — and therefore fitted value iteration does not in general converge, with the same implications for Q-learning. We will find out in the next article that in practice we can actually make all these algorithms work very well, but their theoretical properties leave a lot to be desired.

Feel free to drop me a message or:

  1. Connect and reach me on LinkedIn and Twitter
  2. Follow me on Medium
