What if we have a variable size input?
A sequence input can be of variable length. For example, they might be sentences in English (the first sequence has four elements, the second one has three elements and the third one has five elements):
This could involve:
- Sequence of words → The problem of sentiment classification: you have a review online and you want to guess whether this is a positive or a negative review.
- Sequence of sounds → The problem of recognizing a phoneme from a sound
- Sequence of images → The problem of classifying the activity in a video
You need a network that can:
- Accommodate multiple inputs and multiple different numbers of inputs
- Share features
The number of layers becomes the same as the number of inputs. In other words, you get a separate input for every layer:
If you have a sequence of length four, then you’ll have four layers; if you have a sequence of length three, then you’ll have three layers; if you have a sequence of length five, then you’ll have five layers.
At each layer, we have two inputs now:
- The activations of the previous layer
- The new input x
Then we apply a regular linear operation to this:
As usual, we apply a non-linearity:
This is very much like the standard linear layers; the only difference is that, instead of applying the linear layer to the previous layer activations, now you concatenate the new input to those previous layer activations and then pass them through the linear and non-linear operations.
What happens to the missing layers? If you have a sequence of length five, then you’re gonna have five layers, whereas if you have a sequence of length three, then you’re gonna have three layers.
Here’s a funny trick: if we have a shorter sequence, we’ll just pretend that the activations at the layer prior to the first input (this layer doesn’t exist) are all zero:
In fact, we’ll do this regardless of the length of the sequence; even for the longest sequence, we’ll still have a virtual preceding layer where the activations are all zero. When performing our usual operations:
At the very first layer, we will just concatenate the input to a big vector of zeros
One big problem with this design is that we still have a different weight matrix at every layer. Really long sequences will end up requiring many of those weight matrices. However, the last few layers will be trained for all the sequences, whereas the the very first layer that is only used for the longest sequences might have been trained very rarely.
The one additional modification that we need to turn this design into a full-fledged recurrent neural network is to share the weight matrices.
What that means is W(l) will be the same for all of these layers
- Everything about how you evaluate this network doesn’t change
- At test time, everything is exactly the same. We just initialize this virtual first layer to zero and then we proceed with these operations:
- This only affects training. During training we have to force the matrices W(l) to be the same for all the layers:
The very long sequences might require those weights at the very early layers (that we only see for very long sequences), but now those weights are the same as the ones used in the later layers and they have already been trained pretty well.
If we force the weight matrices to be exactly the same at every layer, we can have as many layers as we want
Basic design of a RNN:
- Very deep network whose depth is equal to the length of your sequence
- Every layer gets a different input: it concatenates the previous layer’s activations with the input at that time step, it passes them through a linear layer and then a non-linearity
- The very first layer gets a zero as input from the from the virtual previous layer
- The weight matrices are shared at all the layers, which means that we don’t need a number of weight matrices equal to the length of the longest sequence
How do we train RNNs?
The short answer is that we’re going to slightly modify backpropagation (you can check out my previous article on it).
When we run backpropagation, the basic design is the same. First, we run a forward pass to calculate all the a’s and z’s at each step. Then, for the backward pass we proceed in the same way. We initialize delta to be the derivative of the final loss with respect to the last layer output and then for every linear layer and for every ReLU we calculate the derivative with respect to its parameters and the derivative with respect to its inputs (which is the previous layer).
The issue is that the parameters are now the same at all these layers. That means that θf at layer 4 is the same variable as θf at layer 3. If we were to run this algorithm on the following layer:
Starting on the right, we would compute delta to be the derivative of the loss with respect to the output at layer four and then at layer four we would compute the derivative of the loss with respect to the weights and with respect to the bias. Then, we would compute the new delta, we would go back to layer 3 and at layer 3 we would override the gradient with the gradient of the w and b at layer 3. So, taken literally, the gradient at layer l-1 will overwrite the gradient of layer l — and we don’t want that. We want the derivative of the loss with respect to w and the derivative of the loss with respect to b to account for the effect of w and b on every layer, not just the first layer.
It turns out to be very very simple to fix this problem. Instead of setting the derivative “df / dθf” to be “df / dθf” x “δ”, you just need to add it to the value of the derivative:
You’ll do this:
Instead of this:
At the beginning of backpropagation, you initialize all your gradients all your derivatives to zero and then at every step of backpropagation, for every layer you add “df / dθf” x “δ” to the current value of the derivative.
What if we have variable-size outputs?
For example, we may want to generate a text caption for an image, to predict a sequence of future video frames or to generate an audio sequence. Before, we had an input at every layer; now, we are going to have an output at every layer.
One immediate thing that comes up is that each of these outputs has their own loss. At each step, just like before, we’re going to perform these operations:
Now, y hat at that layer will be some function applied to a (l) and that’s some kind of readout function — sometimes referred to as a decoder. The parameters of f are going to be shared across all the time steps, just like the parameters of the linear layer. We’re going to have a loss on each y hat(l) and our total loss will just be the sum of the losses at every step. That’s a very simple way to combine the losses.
The problem we have is that it’s not completely obvious how to use backpropagation at this point, because regular neural networks are a chain — where you get your delta from the next layer and you backpropagate it into the previous layer. Now, we’re in a situation in which a neural network has branches; at every time step, we’ll have a delta coming in from the loss of that step and we’ll have another signal coming in from the next step. We then need to produce a delta going back into the previous step.
We’re going to draw the computation graph for this procedure (you can check out my article on computation graphs here).
You have your input x, which goes into a linear layer, then a non-linearity and then we have our first readout function f1 — which could be a linear layer followed by a softmax and goes into the loss of the first time step:
Sigma 1 also goes into the second linear layer. So, it produces a value, let’s say a1, and that same a1 is used by two downstream functions: f1 and lin2.
Then lin2 goes into sigma 2 and again, same mechanism. The losses at all time steps are added up and that’s what ends up producing the final loss value.
We can think of each layer as a function f that takes as input xf and produces as output yf. We always start backpropagation at the last function, where initial delta is just one. This is a graphical way of representing backpropagation for regular neural networks:
The algorithm for regular nets is just a loop that starts at the last layer and proceeds backwards, and in each of those layers it’s going to calculate these delta “yf’s” and use them to calculate delta “xf’s” and “dL/dθ’s”. If you are unsure about this, check out my article about backpropagation.
Now, we’re going to generalize this a bit. The issue we have to deal with with RNNs is that we have some nodes whose output goes into multiple nodes. If you have a look at the computation graph I drew above, you’ll notice that the output of sigma1 goes into f1 and also goes into lin2. During backpropagation, there’s going to be one delta coming in from one of those outputs and another delta coming in from another one of those outputs. All we do is we just add them together and then we just plug them into the same algorithm.
This is also sometimes referred to as “reverse mode automatic differentiation”
What if you have multiple inputs and multiple outputs at each step?
We simply combine the two concepts we saw before. At each step, we:
- Concatenate the activations of the previous layer with the new inputs
- Apply linear layer
- Apply non-linear layer
- Have some readout function (which could be just a linear layer and a softmax) to produce the output y hat at that step
What makes RNNs difficult to train?
RNNs at their most basic level are very deep networks
We have to reconsider why it’s so difficult to train very deep networks. The chain rule just looks like a multiplication of a bunch of jacobians; for some arbitrarily deep network, the derivative of the loss with respect to the weight matrix of the first layer will be a product of a huge number of jacobian matrices. Then at the end you multiply that number dL / dz(n):
These jacobian matrices can be all sorts of different things. Derivatives of non-linearities, derivatives of linear layers, derivatives of convolutions. If we simplify for a second and think about scalars, the problem with multiplying many numbers together is that there are two possible outcomes: if most of the numbers are less than one, we’re going to get zero (vanishing gradients), whereas if if most of the numbers are larger than one, we get infinity (exploding gradients). There’s a very small set of cases where the answer is more interesting and that small set of cases is when all the numbers are close to one: the only time that you get a reasonable answer from multiplying together many scalars is if they’re all close to one. By the way, this is we we prefer ReLUs instead of sigmoids; the derivative of a rectified linear unit is the indicator of whether the input is positive, which means that a large fraction of those derivatives are just 1.
For matrices, being close to one doesn’t mean that every entry is one, it means that the eigenvalues of the matrix need to be close to one. Exploding gradients are not too difficult to deal with because we can always clip our gradients to prevent them from exploding; the big challenge with RNNs comes from vanishing gradients. An intuitive interpretation is that if your gradient vanishes, the gradient signal from later steps never reaches the earlier steps: when that happens, your neural network becomes unable to maintain memory
Promoting better gradient flow through an RNN
At each layer, we’re going to concatenate the new input to the previous layer activations and then pass them through the linear and non-linear operations. We call this the “RNN dynamics”.
The particular derivative that we care about when we talk about vanishing gradients is the jacobian of the RNN dynamics:
This is the derivative of q with respect to previous activations. Of course, we don’t just want to force the derivative to always have eigenvalues that are close to 1 (or to be close to identity), because sometimes we do want to forget things and sometimes we want to transform them in various interesting ways. We only want it to be close to identity when we really want to remember.
In other words, don’t just them to be identity, but come up with some design where the network can decide that it wants to remember something; when it decides it wants to remember, then this derivative really ought to be close to identity.
The intuition is that for each unit, we have a little neural circuit that decides whether to remember or to forget. If you’re remembering, you just copy the previous activation as it is and don’t change it; if you’re forgetting, then you override it with something else
We’ll have a cell state and multiply it by some number f(t), which is between 0 and 1. This is referred to as a forget gate, because if it’s set to 0 you forget what you had, otherwise you remember:
Then, we add some number g(t)to it.
- If f(t)is close to zero, then the cell state is replaced by g(t)
- If f(t)is close to one and g(t)is close to zero, then the cell state is retained
- if f(t)is close to one and g(t)is non-zero, then the cell state is modified in an additive way and that becomes our new cell state a(t)
Where do we get these f(t) and g(t)?
The way that we’re going to decide whether to remember or to forget is based on another temporal signal h (t-1). H (t-1) comes as an output from the previous time step, along with a (t-1).
Then we have our input at this time step, x(t). We’ll do the same thing that we did before: we concatenate x(t) with h (t-1) and we apply a linear and a non-linear layers to them. This linear layer is going to produce an output that is four times larger than the output we had before:
Each of those 4 rows is going to serve different functions. One of them is the f(t) we saw as a forget gate earlier and one of them is the g(t) we added to the product earlier.
1. We take f(t) and pass it through a sigmoid, which puts it in the range of 0 and 1. That becomes the forget gate.
2. We take i (t) and also pass it through a sigmoid. This is called the input gate and it controls the modification to the cell state
3. We take g (t) and we pass it through some non-linearity. It could be a tanh or a ReLU
g (t) is multiplied point-wise by i (t) and then you add the product to the cell state after the forget gate. Intuitively, i (t) determines whether you want to have a modification of the cell state and g (t) determines what that modification will be. This is very nice because it allows the network to independently choose whether to modify or not and separately choose how to modify
4. o (t) is called the output gate, it’s also passed through a sigmoid and it controls the next h (t). After the forget gate and after the addition of g (t), we take our cell state a (t) and pass it through a non-linearity. Then, we point-wise multiply by o (t) and that becomes the new h (t)
This new h (t) is also what we use if we want a readout function / a decoder at that time step.
What is going on here is: we’re trying to preserve the role of a (t) as a linear signal. All the non-linearities are applied to the h’s and as a result the gradients on a (t) are going to be pretty simple
- Not having non-linear functions affecting a (t) makes its derivatives very well behaved
- At the same time, we still retain the benefit of non-linearity by having a non-linear modification through g (t) and a non-linear readout through h (t)
Why do these LSTM cells work better? The simplest reason is that a (t) is modified in a very simple way at each step
You can think of a(t) as the long-term memory, while h(t) is the short-term memory — it changes all the time and performs complicated non-linear processing.
- RNNs almost always have both an input and an output at each step
- Naive RNNs like the ones in part one almost never work; in practice, if you want an RNN you will probably be using something like an LSTM cell, even though it requires more hyperparameter tuning than standard fully connected or convolutional networks
- There are also some alternatives to RNNs for processing sequences that can work a lot better in practice (temporal convolutions and transformers)
- There are some variants of the LSTM that are a bit simpler and work just as well (gated recurrent units)
Autoregressive models and structured prediction
Most RNNs that we actually use in practice have multiple inputs and multiple outputs, because most problems that require multiple outputs have strong dependencies between these outputs.
These kinds of problems are sometimes referred to as structured prediction, because the thing that you are predicting has structure, in contrast to something like a label. A classic example is text generation. Regardless of whether the output of text is the right answer or not, there are relationships between words and text that determine whether it’s a valid text.
Let’s say that you have a training set that consists of three sentences:
- “I think therefore i am”
- “I like machine learning”
- “I am not just a neural network”
Let’s say for now that you don’t care about generalization: you just want your model to memorize these three sentences. We’ll treat words as categorical variables; every time step will be a different word and it’ll just be one of n possible words.
Your neural network will read as input the word “I”. Then we ask it to complete the sentence and it can complete it in three different ways: “think”, “like”, “am”. Then we have a softmax distribution that turns these words into probabilities:
We’ll sample randomly from this softmax distribution and pick the second word. Now the network is completing the sentence and says “I think”. We then proceed to the next time step; we we have a linear layer that uses the previous activations to compute the new activations and then we have our output at the second time step, which again goes through a softmax distribution.
The trouble is that the network doesn’t know which word was randomly sampled from that softmax:
The network thought it produced an output that is 30% “think”, 30% “like” and 40% “am”. The problem is that you’re trying to sample these words independently. At the second time step you might randomly get “machine” and then at the third time step you might randomly get “just”. This generation makes no sense, even though the network did a great job at learning this distribution. We thus get a nonsense output and that’s because in this task the covariances between the words matter just as much as the probabilities of getting the correct words.
This is why it’s so important to have networks that take in variable length inputs and variable length outputs. When we run this network to complete a sentence, we’re going to sample from the softmax of the first time step and then we will feed our sample as an input at the second time step. This will basically tell the network what we sampled. Now the network knows that it’s not just predicting the third word in any sentence: it’s predicting the third word in a sentence where the first two words are “I think”. It’ll therefore correctly predict that if the first two words are “I think”, the third word is probably going to be “therefore”.
The key idea is that past outputs should influence future outputs and the way you get past outputs to influence future outputs is by treating yesterday’s output as today’s input.
This basic design is used for all RNNs that have to output a structured sequence, like image captioning models or video prediction models.
The simplest way to train it is to set the inputs to be the entire training sequence and the ground truth outputs to be the same sequence, but offset by one step. Instead of asking the neural network during training to just dream up a sentence, you’re asking to complete a sentence. You’re saying: at time step t, take all the words from time step 1 to t and output the word at time step t+1.
Your x’s are going to be all the tokens in the sentence and your ground truth outputs are going to be all the tokens shifted back by one. The last one will be replaced with a stop token.
Basically this network is being taught that if it sees “I”, it should output “think”; if it sees “think”, it should output “therefore” and so on.
There’s an issue with this. The basic design I showed is going to work well, but if you use it to train really complex networks for very long sequences, you might start seeing some problems, which are due to distributional shift.
Let’s say that you have the same example as before, but the network is making some small mistakes:
Why is it doing that? Maybe it didn’t train very well. Let’s say you got unlucky and you sampled the 10% word. The problem is that now you’re going to use it as the input of the next time step. Since it’s just some random token, the network has never seen it before and is going to be completely confused. At the next time step, it will thus output some completely crazy thing and then it’s not going to produce anything reasonable:
This is a discrepancy between training and test: the network always saw true sequences as inputs, but at test-time it gets as input its one (potentially incorrect) predictions. This is called distributional shift, because the input distribution shifts from true strings during training to synthetic strings at test time.
The basic idea is that during training we’re going to make a random decision at every step in time. With some probability, we’ll give the network the true input at that time step and with 1 minus that probability, we’ll give it its previous output as an input. Crucially, we don’t differentiate through this edge, so the network doesn’t know that it’s getting its own output as an input.
You can think of this as randomly replacing the true input with the network’s own output of the previous time step with some probability. This really helps because now during training the network sees its own outputs as inputs and therefore, when it sees that a test time, it’s not completely confused.
How do we want to choose this probability for replacing the true input with the model’s own previous output? Intuitively, we want to mostly feed in the true inputs at the beginning and then once the network gets pretty good we want to mostly feed the model’s own predictions to mitigate this distributional shift. We therefore put a schedule on this probability: the probability of using the true input starts off really high and then gradually drops. And since the probability of using the model’s output is 1 minus that, over time you’re mostly training the network on its own previous outputs
One of the big advantages of RNNs is that they afford a great deal of flexibility:
With regular feed forward networks, you have one-to-one mappings, whereas with RNNs you can have one input to many outputs, many inputs to one output, many inputs to many outputs and so on.
One of the things that we often do with RNNs is that we have an RNN backbone in our model somewhere in the middle and then we use a non-recurrent encoder (a regular convolutional network which then feeds into the RNN). You could also have some kind of complicated decoder and take the output of your RNN or LSTM cell and feed that into multiple layers before producing the output:
You could also have multiple RNN layers, take the output of your RNN at a particular time step feed it into another RNN layer and then eventually into a softmax.
Another trick you can use with RNNs is something called bidirectional models. For example, in speech recognition your inputs are sequences of sounds and you want to produce the corresponding words. Mostly, the word at a particular time step will depend on the the sound at that time step, but sometimes the rest of the sound will influence what word you think it is. So, the word at a particular time step might be hard to guess without looking at the whole utterance. A bidirectional model is an RNN where you first have a layer of recurrent connections running forward and then another layer above it running backward. That means that at you have information from both the past and the future.
If you’re interested, I discussed in this article how we can use recurrent neural networks to train and utilize sequence to sequence models.
Feel free to drop me a message or:
All images taken from Levine, CS182, 2021.