Generative modeling

Samuele Bolotta
55 min read · Jan 2, 2022

This is the fourth of a series of articles in which I summarize the lectures from CS182 held by Professor Sergey Levine, to whom all credit goes. All images are taken from his lectures. These are the first, the second and the third articles.

The central idea of generative modelling is to train a generative model whose samples x ∼ pθ(x) come from the same distribution as the training data distribution, x ∼ pd(x).

Consider a medical diagnosis problem in which we have taken an X-ray image of a patient, and we wish to determine whether the patient has cancer or not. Generative modelling first solves the inference problem of determining the class-conditional densities p(x|Ck) for each class Ck individually. Also, it separately infers the prior class probabilities p(Ck). Then, it uses Bayes’ theorem in the form:
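
Since the formula from the slide is not reproduced here, in standard notation Bayes' theorem for this setting reads:

```latex
p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)},
\qquad
p(x) = \sum_k p(x \mid C_k)\, p(C_k)
```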

Equivalently, we can model the joint distribution p(x, Ck) directly and then normalize to obtain the posterior probabilities. Having found the posterior probabilities, we use decision theory to determine class membership for each new input x. Approaches that explicitly or implicitly model the distribution of inputs as well as outputs are known as generative models, because by sampling from them it is possible to generate synthetic data points in the input space.

While the range of applications to which generative models are applied continues to grow, we can identify three fundamental inference queries for evaluating a generative model:

  1. Density estimation: given a data point x, what is the probability assigned by the model, i.e., pθ(x)?
  2. Sampling: how can we generate novel data from the model distribution, i.e., xnew∼pθ(x)?
  3. Unsupervised representation learning: how can we learn meaningful feature representations for a data point x?

This is a taxonomy of generative models:

Specifically, we will discuss four generative models:

  1. Autoregressive Models
  2. Variational Autoencoders
  3. Normalizing Flow Models
  4. Generative Adversarial Networks

Autoregressive generative models

Autoregressive generative models implicitly define a distribution over sequences using the chain rule for conditional probability, whereby in each step the distribution of the next sequence element is predicted given the previous elements. The main principles for training autoregressive generative models are:

  1. We divide x into its dimensions x1, …, xn
  2. We discretize each of these dimensions into k values (for example, when we’re generating images, the image pixels are naturally discretized into 256 values because pixels can only take on 256 different colors)
  3. We represent the full joint over all the pixels or over all the dimensions of x by the chain rule
  4. Use your favorite sequence model to actually model this p(x)

Let’s walk through one popular class of models in this category, that is specific to autoregressive generative modeling of images. This is called “PixelRNN”; the idea is that we’re going to generate the pixels one at a time, left to right, top to bottom in scan-line order. So, the probability of the whole image is given by the product over all the pixels and each pixel depends on its predecessors, which are defined as all preceding scan lines and all preceding pixels to the left of the current pixel in the current scan line.
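
In symbols, with the pixels x1, …, xn listed in scan-line order, this is just the chain rule factorization (standard notation, not copied from the slides):

```latex
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})
```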

Each pixel consists of three color channels, so we also have to generate one color channel at a time — there’s actually a small network for every pixel that generates the red color given all the previous pixels, then the green color given all previous pixels and this pixel’s own red color and then generates the blue color given all previous pixels and this pixel’s red and green color. And then every color channel is a 256-way softmax, so every color channel’s brightness can take on 256 possible values.

With this model, once you’ve trained it, you can give it for instance the top half of an image and it can complete different random realizations for the bottom half.
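
As a rough illustration of what this scan-order sampling loop looks like, here is a minimal sketch in PyTorch; `pixel_rnn` is a hypothetical callable that returns 256-way logits for one channel given everything generated so far, not the actual PixelRNN interface. To complete a half-image, you would pre-fill the top rows and start the loops at the first missing pixel.

```python
import torch

def sample_image(pixel_rnn, height=32, width=32, channels=3):
    """Generate one image pixel by pixel, channel by channel, in scan-line order."""
    img = torch.zeros(height, width, channels, dtype=torch.long)
    for row in range(height):
        for col in range(width):
            for ch in range(channels):
                # 256-way logits for this channel, conditioned on everything
                # generated so far (previous rows, previous pixels in this row,
                # and this pixel's previously generated channels)
                logits = pixel_rnn(img, row, col, ch)      # assumed shape: (256,)
                probs = torch.softmax(logits, dim=-1)
                img[row, col, ch] = torch.multinomial(probs, 1).item()
    return img
```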

Some practical considerations:

  1. It is quite slow, because while the basic recipe is the same as a language model, if the image is 32 by 32 pixels, that means that there are about a thousand pixels in that image — so that would be like a sentence of a thousand words. Plus, each pixel itself has three color channels, so that raises it to 3000.
  2. Generating the image row by row like this might fail to capture some of the spatial context that is present in images, because pixels located directly above the current pixel in the previous scan line end up far away in the RNN ordering.
  3. There are many practical improvements and better architectures that are possible to improve PixelRNN (PixelCNN, PixelTransformer, …).

Conditional autoregressive models

Conditional autoregressive models solve this problem: what if we want to generate something that is conditioned on another piece of information?

For example, we could generate images of specific types of objects or we could have a conditional autoregressive model for imitation learning that generates distributions over actions.

The recipe is very similar to a conditional language model; you have something that you’re conditioning on, which we will call “context”. The context is processed using some kind of encoder network, which could be just a feed-forward model, a CNN, or another RNN, and its output is used to start the first time step of the autoregressive generative model.

The encoder could be extremely simple, if for example you’re generating images of a class, or it could be very complex, if for example you represent the policy in imitation learning.

To sum up:

  • Autoregressive generative models are basically like language models, but for other types of data — although it’s more accurate to say that language models are just a special type of autoregressive generative model. In other words, the notion of an autoregressive generative model is actually more general than a language model
  • You can represent autoregressive models in many different ways (RNNs, LSTMs, PixelCNNs, Transformers, …)
  • There are some interesting trade-offs for these models compared to other models we’ll talk about later. One really good thing about these kinds of autoregressive generative models is that they provide full distributions with probabilities, so they will assign a probability to every possible x and those probabilities are easy to get. The downside is that they are slow for large datapoints (like images) and generally limited in image resolution.

To summarize, some pros of PixelRNNs and PixelCNNs:

  • They allow you to explicitly compute the likelihood p(x). It’s an explicit density that we can optimize.
  • Being able to do this also gives another benefit: a good evaluation metric. You can measure how good your model is by the likelihood of the data, which you can compute.
  • It’s able to produce pretty good samples

However, this is still an active area of research and the main disadvantage of these methods is that the generation is sequential and so it can be pretty slow.

Autoencoders

Autoencoders are a very widely used class of models for all sorts of unsupervised learning tasks. Not all autoencoders are generative models; in fact, the autoencoders that we’ll discuss here are difficult to use for generation, but we’ll talk about a special kind of autoencoder called the variational autoencoder that is actually a very good generative model.

More broadly, an autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). The encoding is validated and refined by attempting to regenerate the input from the encoding. The autoencoder learns a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore insignificant data (“noise”). Variants exist, aiming to force the learned representations to assume useful properties.

Let’s start with a very high-level view of probabilistic unsupervised learning models, for example language models. At a very high level you can think of them as having inputs where the input corresponds to the pixels in the image and the goal is to produce that same image — but in the middle you have some structure that makes this task of reconstructing the image somewhat non-trivial. So, a PixelRNN for instance doesn’t get to look at pixels that it hasn’t generated yet; it has to generate a pixel before it actually sees what that pixel is.

You can think of this as a general design for generative models: you have some input (which could be an image, a link, a natural language sentence) that goes into a model and that model has a loss, which is to produce the same thing as output. An arbitrary model could just learn the identity function, but some kind of structure is imposed on the model that makes this task of reconstructing the input image not easy and forces the model to learn meaningful representations. Examples of that structure in the middle:

  • RNN/LSTM sequence models that have to predict a pixel’s value based only on previous pixels
  • PixelCNN models that must predict a pixel’s value based on a masked neighborhood
  • Pixel transformer, which makes predictions based on masked self-attention

All of these models have spatial structure, which has to do with the arrangement of the pixels in the image or the arrangement of the words in the sentence. But can we use more abstract structure once we recognize that all these models basically read in x and then output that same x? More broadly, models that take in some input and then are tasked with producing that same thing as output are called autoencoders. The basic idea behind an autoencoder is to train a network that encodes an image into some hidden state and then decodes that image as accurately as possible from that hidden state.

Something about the design of the model or something in the data processing or regularization has to force the autoencoder to learn a structured representation.

  • Dimensionality: make the hidden state smaller (in terms of the number of dimensions) than the input/output, so that the network has to compress it.
  • Sparsity: force the hidden state to be sparse, meaning that most entries in the hidden state are zero, so the network has to compress the input.
  • Denoising: corrupt the input with noise, forcing the autoencoder to learn to distinguish noise from signal
  • Probabilistic modelling: force the hidden state to agree with a prior distribution. This is the approach that we can use to get autoencoders to also be very good generative models.
  1. A bottleneck autoencoder relies on the fact that whatever the dimensionality of your input is, you pick the dimensionality of the hidden state to be much smaller. If you have a 100 by 100 pixel image and the hidden state is only 128 dimensions, it can’t possibly learn the identity function, because it can’t store the values of every single pixel in those 128 dimensions. This is very straightforward and it does have some interesting properties: it is a non-linear generalization of principal component analysis, because if both the encoder and the decoder are linear, then training this kind of bottleneck autoencoder with mean squared error for reconstruction exactly recovers principal component analysis — and the dimensionality of the hidden state is the number of principal components that you’re going to get. To sum up, the upside is that it is very simple to implement (see the sketch after this list); the downside is that simply reducing dimensionality often does not provide the structure we want.
  2. A sparse autoencoder relies on this basic principle: we can describe the input with a small set of attributes — instead of describing the image of a dog by specifying the color of each pixel, which is not structured and interpretable, you could give a 1 for each property that the dog possesses (ears, wings, wheels, etc.). An interesting property of these kinds of representations is that most attributes will be zero for most images, because there are a huge number of possible objects in the world that do not possess most attributes. That goes hand in hand with the structure; the sparsity principle says that what you want is this kind of attribute dictionary, that is very structured and that tends to be sparse. This is neat, because then the attributes that are present are very specific to that object and they’re very descriptive of it. As an aside, this idea originated in neuroscience, where researchers believe that the brain uses these kinds of sparse representations to represent things in the world (“sparse coding”). Practically speaking, the only thing we really have to do to get a sparse autoencoder is to choose a sparsity loss. We’re going to train the sparse autoencoder with backpropagation to minimize some loss function on reconstructing the input, but it will have an additional loss function applied to the hidden state that will encourage that hidden state to be sparse. To sum up, the upside is that it is a principled approach that can provide a disentangled representation; the downside is that it is harder in practice as it requires choosing the regularizer and adjusting hyperparameters.
  3. A denoising autoencoder relies on the idea that a good model that has learned meaningful structure should be able to fill in the blanks. If you take an image and you corrupt it in some way and then you ask your model to remove the corruption, then your model will have to learn something about what realistic images look like. To sum up, the upside is that it is very simple to implement; the downside is that it is not clear which layer to choose for the bottleneck and there are many ad-hoc choices to make.
  4. A variational autoencoder performs probabilistic modelling: it connects up autoencoders to probabilistic generative models. It also addresses a big problem that other autoencoders have: sampling or generation from them is very hard, which limits their uses. In general, variational autoencoders are meant to compress the input information into a constrained multivariate latent distribution (encoding) to reconstruct it as accurately as possible (decoding).
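
To make the bottleneck idea from point 1 concrete, here is a minimal sketch in PyTorch; the layer sizes are illustrative and not taken from the lecture.

```python
import torch.nn as nn

class BottleneckAutoencoder(nn.Module):
    def __init__(self, input_dim=100 * 100, hidden_dim=128):
        super().__init__()
        # encoder compresses the flattened input into a small hidden state
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, hidden_dim),
        )
        # decoder reconstructs the input from that hidden state
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# reconstruction loss: mean squared error between input and output
loss_fn = nn.MSELoss()
```

A sparse autoencoder would add a sparsity penalty (for example an L1 term) on z to this loss, and a denoising autoencoder would corrupt x before encoding while still reconstructing the clean x.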

Autoregressive models define a tractable density function and based on that we can directly optimize the likelihood of training data. On the other hand, variational autoencoders define an intractable density function with latent z. Our data likelihood p(x) is now an integral over z, i.e. an expectation over all possible values of z, and we cannot optimize it directly.

Latent variable models

As an introduction to variational autoencoders, let’s discuss latent variable models first. In statistics, latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models.

More generally, let’s say that our distribution p(x) is very complex and we can’t represent it easily. Let’s then introduce another distribution over another variable z; p(z) is just a zero mean unit variance normal distribution. Then, we’ll build our generative model as a probabilistic mapping from z to x. In other words, we are trying to represent p(x) as a composition of two very simple distributions. One of them is p(z), which is just a Gaussian; the other one is p(x|z), which is a simple distribution as well, but its parameters are very complex and deterministic functions of z.

p(x) is now the integral over p(x|z) p(z) dz. All the complex models we will discuss are based on this principle: you have an easy distribution over a very abstract latent variable z and you have an easy conditional distribution over your complicated variable x, but the mapping from z to x is represented by a neural network.
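
Written out in standard notation (with a Gaussian conditional as one common choice), the construction is:

```latex
p(z) = \mathcal{N}(0, I), \qquad
p_\theta(x \mid z) = \mathcal{N}\!\left(\mu_\theta(z),\, \sigma_\theta(z)\right), \qquad
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz
```

where μθ and σθ are neural networks that take z as input.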

How do we train latent variable models?

We’re going to have some model pθ(x) and we have a dataset of x’s. We train with maximum likelihood:

Since each of your p(x) is:

We substitute:
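
Reconstructed in standard notation (the slides themselves are not reproduced here), these three steps read:

```latex
\theta \leftarrow \arg\max_\theta \frac{1}{N}\sum_i \log p_\theta(x_i),
\qquad
p_\theta(x_i) = \int p_\theta(x_i \mid z)\, p(z)\, dz,
\qquad
\theta \leftarrow \arg\max_\theta \frac{1}{N}\sum_i \log \int p_\theta(x_i \mid z)\, p(z)\, dz
```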

But this objective is completely intractable, because you can’t integrate over all possible z’s — this integral does not have a closed form solution. Instead, when we train latent variable models, we need to use something a little more manageable: the expected log likelihood.

Instead of integrating over z, we essentially guess what z is by estimating p(z|x) for every image, and then we maximize the probability of that x with that z. But how do we calculate p(z|x)? This is called probabilistic inference.

Latent variable models in deep learning

Some variable z has some prior distribution p(z). This prior distribution is typically set to something very simple, because the meaning of z is not fixed; the meaning of z is entirely determined by your model, so you don’t need to learn or design p(z) and you can choose something simple like a zero mean identity covariance multivariate normal distribution. Then, you have a decoder, which maps vector z to distributions over x: this is a neural network.

Using the model for generation works like this:

  1. Sample z from p(z): random number generation.
  2. Sample x from p(x|z): turn that vector of random numbers into an image

A latent variable deep generative model is usually just a model that turns random numbers into valid samples.

How do we represent latent variable models?

The easy part is representing p(z): generating random Gaussian numbers. Representing pθ(x|z) is a bit more complex.

  1. Option one is to say that your pixels are continuous values; you would represent them as a multivariate normal distribution with a diagonal covariance, which just amounts to saying that every pixel is going to have a mean and every pixel is going to have a variance. Both the mean and the variance of every pixel could in general be functions of z, so they might be the output of a neural network. Very often, we will simplify this and make sigma be a constant — in which case you just get a mean squared error loss
  2. Another option is to say that your pixels are discrete valued and in that case pixels all depend on z, but conditioned on z they can be seen as independent. This actually works very well, but it’s a little bit slow.
  3. There are other choices (discretized logistic, binary cross-entropy, …)

What architecture should we use for pθ(x|z)?

An easy choice for tiny images or for non-image data is to just use a big fully connected network. A better choice for more complex data is to (a) run the z vector through a fully connected layer to turn it into a low-resolution convolutional map and (b) use transpose convolutions and unpooling to increase the resolution until it reaches the full resolution that you want for your final image.
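
A rough sketch of that kind of decoder in PyTorch; the specific layer sizes (128-dimensional z, a 4x4 starting map, a 64 by 64 RGB output) are made up for illustration and are not the lecture's architecture:

```python
import torch.nn as nn

decoder = nn.Sequential(
    # (a) fully connected layer: latent vector -> low-resolution feature map
    nn.Linear(128, 256 * 4 * 4), nn.ReLU(),
    nn.Unflatten(1, (256, 4, 4)),
    # (b) transpose convolutions double the resolution at every step
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 8x8
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),   # 16x16
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),    # 32x32
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),                # 64x64, 3 channels
)
```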

Training latent variable models

  1. Choice one is to perform inference to figure out p(z|x) for each training image x and then minimize the expected negative log likelihood (this is what variational autoencoders do)
  2. Another choice is to use an invertible mapping z to x
  3. A third choice is to not bother getting the z’s for each individual x, but to try to match the distributions themselves: sample a z at random, sample the x given that z and try to match those samples at the population level to the training data (this is what generative adversarial networks do)

The way that we’re going to guess the z is by taking an expected value under p(z|xi) for every data point xi in our data set and then we’ll maximize the joint log probability of xi and the z that we guessed. In reality, p(z|xi) is a distribution, meaning that we don’t know exactly which individual z goes with which xi, but we can assign probabilities to the different possibilities: what we have to do is average over all those possibilities and that’s what an expected value does. The big challenge with this scheme is the way we calculate p(z|xi): we’ve essentially transformed this problem of evaluating an intractable integral into the problem of somehow estimating and then sampling from p(z|xi). This is called probabilistic inference, because we are inferring which z goes with a particular x.

The variational approximation

The basic idea behind what we’re going to do is called variational approximation. Instead of actually calculating the real distribution p(z|xi), we’re going to approximate it with some much simpler distribution that’s much easier to deal with. For example, we might just say that p(z|xi) for a particular image xi is some Gaussian qi(z) with some mean and variance — and it’s going to be a different mean and a different variance for every image.

Of course, the real p(z|xi) is not just a Gaussian, it might be some really complicated thing — so this is an approximation and perhaps a fairly crude one. The remarkable fact is that for any choice of qi(z) we can construct a lower bound on log p(xi), so the quantity that we’ll work out is not equal to log p(xi), but it is less than or equal to it. This means that if we’re trying to maximize log p(xi), we can maximize this lower bound quantity instead.

Here’s how we can do it: let’s start with the expression for log p(xi) that uses that intractable integral from before.

Then, we multiply the integrand by qi(z)/qi(z) (that is, by 1) and rewrite the integral inside the logarithm as an expected value under qi(z):

Now we’re going to apply a mathematical identity that is at the core of variational approximation, which is Jensen’s inequality: it allows us to bound any concave function of an expectation below by the expectation of the concave function.

In other words, if you have the logarithm of an expected value, that quantity is always greater than or equal to the expected value of the logarithm. If it’s a convex function, then the inequality goes the other way.

This is what we obtain:

Now we’ll do a little bit of algebra to turn this last line into something that we can actually optimize with gradient descent. First, we’re going to apply the very useful identity that says that the logarithm of a product is equal to a sum of logarithms:

The last term is the negative entropy of qi.
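
Written out in standard notation (a reconstruction of the missing slides, not a verbatim copy), the chain of steps is:

```latex
\log p(x_i)
  = \log \int p(x_i \mid z)\, p(z)\, dz
  = \log \mathbb{E}_{z \sim q_i(z)}\!\left[\frac{p(x_i \mid z)\, p(z)}{q_i(z)}\right]
  \ge \mathbb{E}_{z \sim q_i(z)}\!\left[\log p(x_i \mid z) + \log p(z)\right]
      - \mathbb{E}_{z \sim q_i(z)}\!\left[\log q_i(z)\right]
```

Subtracting that negative entropy term is the same as adding the entropy H(qi) to the bound.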

The conclusion of all this discussion is that maximizing this quantity that we have at the end is going to push up on a lower bound to log p(xi), which means it will eventually maximize log p(xi) as well. A brief parenthesis about entropy:

Entropy allows us to ask:

  1. How random is the random variable? Entropy represents how random a distribution is; if all outcomes are equally likely, then you have the highest entropy, whereas if one outcome is guaranteed, then you have the lowest entropy.
  2. How large is the log probability in expectation under itself? For example, if you have two distributions like these:

These are continuous-valued distributions. We could ask: which one has the higher entropy? Since entropy quantifies how random a random variable is, the top one has higher entropy because there are more values that are possible, which means the probability of each of those values must be lower (because distributions always integrate to one).

Then, what do we expect from this expression?

We’re maximizing entropy over qi and we’re also maximizing the expected value of those log probabilities.

  • The first term is going to encourage putting a lot of probability mass on a value of z for which log p(xi|z) +log p(z) is large
  • The second term is going to encourage broadening this probability mass

What that means is that we’ll capture the point with the largest probability under p, but we’ll also make the resulting distribution q as broad as we can, so that it covers as much probability mass as possible and has the highest entropy.

KL-Divergence

Another concept in information theory that we should discuss is KL-Divergence. It can be thought of as:

  1. A measure of how different two distributions are, because if q and p are exactly the same, then the divergence is zero:

  2. A measure of how small the expected log probability of one distribution under another is, minus entropy.
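
In standard notation, these two readings correspond to:

```latex
D_{\mathrm{KL}}(q \,\|\, p)
  = \mathbb{E}_{z \sim q}\!\left[\log \frac{q(z)}{p(z)}\right]
  = -\,\mathbb{E}_{z \sim q}\!\left[\log p(z)\right] - \mathcal{H}(q)
```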

The variational approximation

The expression that we derived before by using Jensen’s inequality is closely related to a KL-Divergence. We call it the evidence lower bound and it’s a function of two distributions: p, which is the distribution we’re learning, and qi, which is our approximation to the posterior. So what makes qi(z) good? Being able to accurately approximate p(z|xi), which corresponds to a low KL-Divergence between them.

One can prove that log p(xi) is equal to:

Where the second term is:

Therefore, if we want qi(z) to approximate p(z|xi) well, the KL-Divergence has to be low.
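
In symbols (standard notation), the identity is:

```latex
\log p(x_i) \;=\; \mathcal{L}_i(p, q_i) \;+\; D_{\mathrm{KL}}\!\left(q_i(z) \,\|\, p(z \mid x_i)\right)
```

where Li denotes the evidence lower bound; since the left-hand side does not depend on qi, raising the lower bound with respect to qi necessarily pushes the KL-Divergence down.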

If we want to know how to train our model, we can train our model by maximizing the evidence lower bound with respect to the model p and, if we want to get the best qi, we also maximize the same evidence lower bound with respect to qi. This is telling us that if we just maximize the evidence lower bound, we are in fact minimizing the KL-Divergence.

The goal is then to maximize the evidence lower bound: for every xi in our dataset, we’re going to calculate the gradient of the evidence lower bound with respect to the parameters of our model. The way we do that is we first sample a z from qi(z); then, we approximate our gradient as the gradient for our model for that one sample; then, we take one gradient step and we update our qi to also maximize the evidence lower bound (there are different ways of doing so).

Let’s start with something simple: let’s just say that qi is a Gaussian distribution over z, with a mean and a variance. For every image in our dataset, we have a different mean and a different variance that we’ll keep track of.

  1. Regular variational inference
    One way that we could update the mean and variance is to compute the gradient of the evidence lower bound with respect to those quantities and then just take gradient steps on them. But there are two problems with this approach. The first is that the number of parameters we need to optimize grows (at least) linearly with the number of observations, which is not ideal for massive datasets. The second is that if we get new observations, or have test observations we would like to perform inference for, it is not clear how this fits into our framework. In general, for new observations we would need to re-run the optimization procedure.

  2. Amortized variational inference
    Amortized VI is the idea that instead of optimizing a set of free parameters, we can introduce a parameterized function that maps from observation space to the parameters of the approximate posterior distribution. In practice, we might (for example) introduce a neural network that accepts an observation as input, and outputs the mean and variance parameters for the latent variable associated with that observation. We can then optimize the parameters of this neural network instead of the individual parameters of each observation.

    Notice that this directly addresses the issues mentioned earlier. First, the number of variational parameters is now constant w.r.t. the data size! In our example, we only need to specify the parameters of the neural network, and that is not dependent in any way on the number of observations we have. Second, for a new observation, all we need to do is pass it through the network, and we have an approximate posterior distribution over its associated latent variables. At the constant cost of a forward pass through a network! These gains are the source of the term amortized.

Taking a gradient ascent step on φ requires computing the gradient of L with respect to φ. φ appears in two places in the evidence lower bound: (1) as the parameters of the distribution under which we compute the expectation, and (2) inside the second entropy term.

This is our evidence lower bound:

The gradient of the entropy term is easy to handle, but the first part is a little more complex. One way that we can think about calculating the gradient of the first part is to note that all the stuff inside the expectation does not depend on φ, so we can collect those terms into a single expression I’m going to call r(xi,z):

This is similar to the objective in reinforcement learning, which is to maximize the expected value of the reward where the expectations are taken with respect to a distribution determined by the policy. Here, we have the expected value of some function r, where the distribution is determined by qφ: we could just use policy gradient.

The simple version would be:

The shortcoming of this expression is the same as the shortcoming of policy gradient in general: the variance is large, which means that we might in practice have to generate quite a few samples zj to get an accurate gradient estimate.

There is a much better way that we can train these models that’ll be much faster and more effective in practice; it’s going to leverage the fact that unlike in the case of reinforcement learning, in the case of latent variable models we can actually calculate the derivative of r. This is called the reparameterization trick.

Here’s the idea. We have our expression:

For which we want to calculate the derivative with respect to φ, but qφ is not just any distribution: it’s a normal distribution. This means that we could also write z as a deterministic function of the mean, the variance and epsilon, which is a standard Gaussian sample:

This is because a Gaussian with a non-zero mean and a variance that is not one can just be expressed as an affine transformation of a zero-one Gaussian. Now everything here is deterministic except for epsilon, but epsilon does not depend on φ. It doesn’t depend on the neural network, so we can equivalently write our objective as the expected value with respect to epsilon, where we substitute this equation for z:

φ only appears inside the expectation: the distribution with respect to which we’re taking the expectation is not dependent on φ anymore. We’ve re-parameterized the distribution and that’s why it’s called the reparameterization trick. So now we can estimate the gradient with respect to φ as follows: we can sample a bunch of epsilons from a zero mean unit variance Gaussian and then the derivative with respect to φ is obtained just by using the chain rule of differentiation. The reason that we cannot do this in reinforcement learning is that the derivatives of the rewards are not known to us.
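
Here is a minimal sketch of the trick in PyTorch; `encoder` and `decoder_log_prob` are hypothetical placeholders for the networks, not the lecture's code:

```python
import torch

def elbo_first_term(x, encoder, decoder_log_prob):
    """Single-sample estimate of E_q[ log p(x|z) + log p(z) ] that
    stays differentiable w.r.t. the encoder parameters phi."""
    mu, log_sigma = encoder(x)                  # parameters of q_phi(z|x)
    eps = torch.randn_like(mu)                  # eps ~ N(0, I), independent of phi
    z = mu + torch.exp(log_sigma) * eps         # reparameterized sample of z
    log_pz = -0.5 * (z ** 2).sum(dim=-1)        # log N(z; 0, I) up to a constant
    # backprop through z reaches mu and log_sigma via the chain rule
    return (decoder_log_prob(x, z) + log_pz).mean()
```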

Another way that we can look at our evidence lower bound objective works this way:

Theta only enters into the first term; the second and third terms are independent of theta and if we group them together we’ll notice that it’s just the negative KL-Divergence. It’s very convenient to express the evidence lower bound like this because all the hard stuff is basically in the first term

If we apply the reparameterization trick, then both theta and phi now appear inside the expectation and the distribution under which we’re taking the expectation does not depend on either of the parameters. We can estimate this using just a single sample of epsilon by just plugging in that epsilon.
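
In standard notation, that grouped form of the evidence lower bound is:

```latex
\mathcal{L}_i(\theta, \phi)
  = \mathbb{E}_{z \sim q_\phi(z \mid x_i)}\!\left[\log p_\theta(x_i \mid z)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x_i) \,\|\, p(z)\right)
```

The KL term between two Gaussians has a closed form, so only the first term needs the reparameterized sample.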

It might be instructive at this point to draw the resulting neural networks:

This is how you implement a variational autoencoder. The reason this is called a variational autoencoder is because if you look at the structure of this neural network, it looks like an autoencoder. It has an encoder with parameters φ, which has this unusual noise, mean and variance thing at the end; then, it has a decoder with parameters θ. So, it’s just an autoencoder where, unlike for the denoising autoencoder, the noise is put into the hidden state z instead of being put in at the beginning. But the variational autoencoder has this very appealing interpretation as a latent variable model.

Reparametrization trick vs policy gradient: what are the tradeoffs?

The policy gradient:

  • Can handle both discrete and continuous latent variables
  • It has high variance and requires multiple samples and small learning rates

The reparameterization trick:

  • It only handles continuous latent variables
  • It’s very simple to implement and does work quite a bit better if you can use it

Now, let’s put things together and go through the full variational autoencoder.

The variational autoencoder is a latent variable model that has latent variable z and it has observed variables x

  • It has an encoder, which performs inference:

This is a neural network that takes in x and it produces a mean and a variance over z.

  • It has a decoder:

This is a neural network that takes in z and it produces a mean and a variance over x.

This is the objective:

The first part of the objective looks like an autoencoder objective: you encode xi into mean and variance, get z and then you decode to get the resulting distribution over x and you try to maximize the probability of the real image.

The second part can be thought of as a kind of penalty that penalizes how much qφ(z|xi) deviates from the prior p(z). From this, we can derive a little bit of intuition as to why the variational autoencoder works: because that second KL-Divergence term encourages the encoded z’s to look similar to samples from p(z), at test time the decoder will know what to do with those z’s.

We can also train conditional models of variational autoencoders; if you have a conditional model, then you are mapping x to some output variable y and you would like p(y|x) to be some fairly complex distribution. We’ll use the same intuition as before: p(y|x) might be complex, but p(y|x, z) will be simple. Everything is just like before, only now we’re generating y and everything is conditioned on x.

VAEs with convolutions

Let me walk you through a potential architecture we might use for a VAE with convolutions. You have your image and first it has to go into the encoder. This is the architecture:

So far, everything is the same as any other convolutional autoencoder, but now we’re going to have one fully connected layer that outputs a mean and another fully connected layer that outputs a variance:

Then we’re going to sample our normally distributed epsilons, multiply them by sigma and add the mean, and this gives us z.

This is our bottleneck representation: if we want to use a variational autoencoder for representation learning, for example for a downstream classification task, this is the vector that we would use as our compressed representation of the image.

But to train the VAE we also need a decoder. For that, we’re going to use transpose convolutions that upsample the vector all the way back up to a 64 by 64 image. Then, we train this whole thing end to end with backpropagation by using two loss terms:

  • Negative log probability of the image
  • KL-Divergence between the distribution determined by μz and σz and our prior p(z), which is typically a zero mean unit variance Gaussian.

Our decoder outputs two numbers for each pixel: a mean and a variance.

VAEs in practice

  • One common issue is that it is very tempting for VAEs to ignore the latent codes or generate poor samples. The reason is that if the decoder can approximate x reasonably well with just a regular Gaussian distribution that ignores z, it is much easier (especially for conditional VAEs) to do exactly that, because initially z just looks like noise and figuring out how to use it can be pretty hard. You can tell that this is happening when you get back a blurry “average” image that does not look like the input.
  • A second problem you might get is that the latent code is not compressed. You can tell this is happening when your reconstructions are great, but when you sample a z from p(z) and then decode, you’ll get an unrealistic image.

For problem 1, the KL-Divergence is too low: typically, qφ(z|x) ignores x and just outputs a zero mean and a unit variance. This means the z’s carry no information about x, which means your decoder pθ doesn’t use the z’s. That is why it produces those blurry average images: there’s no information in z that would be useful for actually reconstructing x.

Problem 2 is characterized by the opposite, the KL-Divergence being too high: there is too much information being packed into z, which means that you can reconstruct x perfectly but because it’s so far from the prior when you sample z from it, your decoder doesn’t know what to do with it.

To control the KL-Divergence, a very common trick is to put a multiplier, usually called beta, in front of that KL-Divergence term. You adjust this hyperparameter depending on which problem you’re seeing: if you have problem one, then you need to decrease beta, whereas if you’re seeing problem two, then you need to increase beta.
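
In symbols, the objective with the multiplier becomes:

```latex
\mathcal{L}_i
  = \mathbb{E}_{z \sim q_\phi(z \mid x_i)}\!\left[\log p_\theta(x_i \mid z)\right]
  - \beta \, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x_i) \,\|\, p(z)\right)
```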

Sometimes it is difficult to find the beta that both produces good samples and good reconstructions early on in training. Intuitively, early on in training the VAE needs to learn to pay attention to z and later on in training it needs to do a better job of compression. It can then be very useful to actually schedule beta, that is, to alter it over the course of training. For example, you could begin with very low values of beta, so that the VAE learns to use z, and then later on raise beta to get the samples to be good. In other words, once your reconstructions look good you can increase beta and then your samples will become better too.

Invertible models and normalizing flows

We’re going to talk about a different kind of model that is structurally a little bit similar to a variational autoencoder, but can be a lot simpler to train. Here’s the idea: instead of having the decoder represent a distribution over x given z, we’ll have a deterministic decoder that directly turns z into x.

Now if we want to compute p(x) for training, we can use the change of variables formula, which says that if x is a deterministic function of z, then p(x) is equal to:

What it means is that if you have some distribution over z and z is transformed in a deterministic way to get x, then to get a density over x all you have to do is modify the density over z by how much of the volume changes according to f.
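
Since the slide's formula is not shown above, here it is in standard notation, with x = f(z):

```latex
p(x) = p(z)\,\left|\det \frac{\partial f}{\partial z}\right|^{-1}
\qquad\text{where } z = f^{-1}(x)
```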

If we can calculate the determinant of the jacobian df/dz and we can calculate the inverse, then we can implement a model of this sort with a deterministic function that maps z’s to x’s. So that determinant is just a correction for the change in local density due to f. The basic idea is therefore that if we can learn invertible mappings from z to x, that can make this determinant easy to compute and we can build these kinds of models without having to worry about all that variational inference stuff. We also don’t need lower bounds and can get exact probabilities.

This class of models is sometimes referred to as normalizing flow models. Our training objective is just maximum likelihood: maximize the log probability of all the xi’s in your dataset. We use the expression from the change of variables formula on the previous slide to write p(x) in terms of p(z); p(z) is still some very simple distribution. If we substitute this formula into the training objective, our objective is:

We are going to choose a special structure that makes this easy to calculate. A normalizing flow model consists of multiple layers of invertible transformations, so all we need to do in order to implement these types of models is figure out how to make an invertible layer and then compose many of these invertible layers to make a deep network.

A few observations:

  • If each layer is invertible, the whole thing is invertible
  • Oftentimes, invertible layers also have very convenient determinants
  • Log-determinant of whole model is just the sum of log-determinants of the layers
  • Our goal is to design an invertible layer and then compose many of them to create a fully invertible neural net

There are many different ways to design invertible layers; I’ll start with one of the simplest types, called non-linear independent components estimation (NICE).

Ordinarily, a layer in a neural net would be some mapping from z to x. For example, x is given by a ReLU applied to (Wz+b). But we can’t use this because this is non-invertible. Here’s an idea: what if we force part of the layer to keep all the information so that we can then recover anything that was changed by the non-linear part?

We can take our z and split it into two parts: the first part goes from dimension 1 to d, the second part goes from dimension d+1 to n. x will also have two parts: the first part of z is copied entirely into the first part of x, but it is also passed through some neural net whose output is added to the second part of z to produce the second part of x.

The question becomes: if we have x, can we recover z? If so, then this layer is invertible. We can then compose an entire net of such invertible layers.

  1. Recover z(1:d) = x(1:d)
  2. Recover gθ(z(1:d))
  3. Recover z(d+1:n) = x(d+1:n) − gθ(z(1:d))

The trick that we needed to make this work is that the first part is preserved losslessly and the non-linear component depends only on that first part. This means that you can recover the nonlinear components and subtract them off.
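
Here is a minimal sketch of one such additive coupling layer in PyTorch, as an illustration of the idea rather than the original NICE implementation (layer sizes are made up):

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """x[:d] = z[:d];  x[d:] = z[d:] + g_theta(z[:d]).  Jacobian determinant = 1."""
    def __init__(self, dim, d, hidden=256):
        super().__init__()
        self.d = d
        self.g = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, dim - d),
        )

    def forward(self, z):                      # z -> x
        z1, z2 = z[:, :self.d], z[:, self.d:]
        return torch.cat([z1, z2 + self.g(z1)], dim=1)

    def inverse(self, x):                      # x -> z
        x1, x2 = x[:, :self.d], x[:, self.d:]
        return torch.cat([x1, x2 - self.g(x1)], dim=1)
```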

What about the Jacobian?

We need the determinant to compute our normalizing flow objective; both x and z have these two parts, so this would be df/dz as a matrix:

The derivatives of this matrix are:

However, you might know that the determinant is not affected by the lower left block, because the upper right block is zero. A matrix like this is effectively lower triangular, so its determinant is just determined by the diagonal entries, and since the diagonal entries are all one, the determinant is 1.

That’s very convenient: because of this particular form, for these invertible layers the determinants are always one, which means that that term in the objective has no effect on training the model whatsoever. Unfortunately, it’s also representationally a bit limiting, because a determinant that is always one means the transformation never changes the scale (it is volume-preserving). This kind of network only transforms z’s into x’s, but it never rescales them, so it can be very difficult to model very complex distributions in this way.

So in practice what we see is if we take this kind of invertible layer and we stack a few of these layers and use them to train a generative model, we can do a pretty good job of modeling simple datasets (MNIST), but when we start looking at more complex images, it tends to break down (CIFAR10).

There’s a small change we can make to this model that makes it substantially more expressive; this is called Real-NVP (real-valued non-volume preserving transformations). We are going to split z and x into two parts like before, but now we’ll have two neural nets: a scaling network and a biasing network. We’ll take the second half of z, multiply it by the exponential of the output of the scaling network and then add the output of the biasing network, so that the second half of x is given by the second half of z times the exponential of the first neural net’s output, plus the second neural net’s output:

This is also an invertible operation and the way you derive the inverse is very similar to before. You recover the first half of z by just using the fact that it’s equal to the first half of x; once you’ve recovered that you can recover gθ and hθ and then you can solve for the second half of z by just taking the second half of x, subtracting g and then dividing by the exponential of h.
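
In symbols (standard notation, with h denoting the scaling network and g the biasing network):

```latex
x_{1:d} = z_{1:d}, \qquad
x_{d+1:n} = z_{d+1:n} \odot \exp\!\left(h_\theta(z_{1:d})\right) + g_\theta(z_{1:d})
```

```latex
z_{d+1:n} = \left(x_{d+1:n} - g_\theta(x_{1:d})\right) \odot \exp\!\left(-h_\theta(x_{1:d})\right)
```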

Now the jacobian has a much more interesting form:

The bottom right block now is the diagonal of these exponential values, which means the determinant is not one anymore. It is significantly more expressive at the cost of adding a second non-linear component and this can generate much higher-quality images

Some concluding remarks about normalizing flows

  • One big advantage of normalizing flows is you can get exact probabilities or likelihoods from them; you can actually evaluate p(x), whereas a variational autoencoder only gives you a bound on p(x)
  • You don’t need any lower bounds
  • It’s conceptually somewhat simpler

It also has a few major disadvantages:

  • It requires a special architecture, so you can’t just use regular layers
  • Since each layer is invertible, the dimensionality at every layer has to stay the same and this can be a really big deal with high resolution images. If you have a million pixels in your image, then you need a million dimensions in z, whereas a variational autoencoder can have a latent code that is much lower dimensional than the image and this can be a major limitation

Generative adversarial networks

The idea that we’re going to explore in today’s lecture is whether instead of training an encoder, like we did with VAEs, we could just train the whole model to generate images that look similar to real images at the population level.

In other words, instead of trying to guess for every real image what its corresponding z is and then using that to supervise our network, what if we just generate a bunch of random images and then try to alter our network so that those random images at the population level resemble the population of realistic images given to us in our training set? This means that we want to sample from a complex, high-dimensional training distribution and there is no direct way to do this; the solution, therefore, is to sample from a simple distribution (like random noise) and then learn a transformation to the training distribution (using a neural net).

An example: let’s say that I have two sets of faces. No two faces are the same, but they look kind of similar at the population level, so if I were to take five samples from one distribution and five samples from the other distribution, probably you would not be able to tell which distribution those came from. In that case, we would say that these two distributions are similar at the population level. This is how we can basically try to match distributions without having to figure out what the latent variables for any particular image actually are. I might also ask you whether you can guess which set of faces is actually real: if I tell you that some of these faces were in fact generated by a latent variable model, would you be able to guess which set it is? These kinds of models that match distributions at the population level can get extremely good and much better than this.

Here’s how we’re going to set this up: we’re going to set up a game where the only winning move is to generate realistic images. In other words, the idea is to train a network that is going to guess which images are real and which are fake; this classifier is going to output a probability that this image is a real image. In order to train a network like this, all you need is a labeled training set; the fake images can come from our generative model, whereas the true images can come from our training set.

This model can then serve as a loss function for our generator: in other words, the loss function of our generator that turns z’s into x’s could be the realism score of the resulting image as produced by this classifier. The basic intuition behind this approach is that we’re going to learn a loss function by training a classifier to classify between fake images and real images and then we’re going to use this as a negative loss for our generator to get the images that it generates to be more real according to this classifier.

  • Generator network: try to fool the discriminator by generating real-looking images
  • Discriminator network: try to distinguish between real and fake images

The core idea of a GAN is based on the “indirect” training through a discriminator, which itself is also being updated dynamically. This basically means that the generator is not trained to minimize the distance to a specific image, but rather to fool the discriminator. This enables the model to learn in an unsupervised manner.

Here is a potential algorithm:

  1. Get a true dataset of real images
  2. Get a generator Gθ(z), which can even be initialized randomly
  3. Generate a dataset of false images
  4. Train your discriminator, which is just a classifier that tries to predict whether the image came from the true dataset or the false one
  5. Use this discriminator to construct some kind of loss function for training G(z)

This recipe almost works, but it has a few major problems that we need to address before we can construct a viable method.

If we do step five only once, the big problem is that it’s very easy for G(z) to fool D(x). If D(x) was trained on fake images coming from a really bad generator, then the new generator trained in step five just needs to be better than the old one, but it might still not be very realistic. In order to make it work, we choose an iterative approach. After getting a generator, we’re going to sample our false images and then update our discriminator with only one gradient step: we won’t actually train the discriminator to convergence, we’ll just take one gradient step. That will give us a slightly better discriminator; then we will update our generator, also for only one gradient step, and then we’ll generate new false images. So instead of just generating all the false images up front and training your discriminator to convergence, you’ll actually alternate between updating the discriminator to get better at classifying real from fake and updating the generator to generate more realistic images. It’s going to be a kind of game where the discriminator will keep trying to keep up with the generator and the generator will keep trying to fool it. Because they have this kind of competitive process, the only way for the generator to win the game is to produce images that are truly indistinguishable from real images.
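
A rough sketch of that alternating procedure in PyTorch, with hypothetical `generator` and `discriminator` modules and their optimizers; this illustrates the loop structure, not a complete training script:

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real_batch, z_dim=128):
    """One alternating update: a single gradient step on D, then one on G.
    Assumes the discriminator ends in a sigmoid and outputs shape (batch, 1)."""
    batch = real_batch.shape[0]
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # one gradient step on the discriminator (cross-entropy on real vs. fake)
    fake = generator(torch.randn(batch, z_dim)).detach()   # don't backprop into G here
    d_loss = F.binary_cross_entropy(discriminator(real_batch), real_labels) + \
             F.binary_cross_entropy(discriminator(fake), fake_labels)
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # one gradient step on the generator: maximize the probability that
    # fakes are classified as real (the variant discussed near the end of this post);
    # gradients flow through the discriminator into the generator
    fake = generator(torch.randn(batch, z_dim))
    g_loss = F.binary_cross_entropy(discriminator(fake), real_labels)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```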

Summary

A known dataset serves as the initial training data for the discriminator. Training it involves presenting it with samples from the training dataset, until it achieves acceptable accuracy. The generator trains based on whether it succeeds in fooling the discriminator. Typically the generator is seeded with randomized input that is sampled from a predefined latent space. Thereafter, candidates synthesized by the generator are evaluated by the discriminator. Independent backpropagation procedures are applied to both networks so that the generator produces better samples, while the discriminator becomes more skilled at flagging synthetic samples.

A little bit more intuition for why this basic idea actually works

What does G(z) want to do? Ideally, its goal is to somehow persuade the discriminator to output 0.5 for all generated images. In order to do this, it has to generate images that look realistic; but the somewhat more subtle point is that in order to really get 0.5 on all of your generated images, you don’t just need to generate images that look realistic, you actually need to generate all possible realistic images and this is why GANs match distributions.

Let’s say that we have a dataset that has photographs of cats and dogs; half the pictures are cats and half of them are dogs. Our generator has gotten really good at generating very realistic images, but it only generates dogs, so it has not covered the whole distribution. Imagine what the discriminator is going to do in this case:

  • If we take a picture of a cat and we pass it to the discriminator, the discriminator will say that the probability that it is a true image is 1. This is because the training set has about half the true images containing cats, but none of the false images contain cats.
  • If we take the picture of a dog and we pass it to the discriminator, the discriminator will say that the probability that it is a true image is about 1/3. This is because the training set has about half the true images containing dogs, but all the false images were dogs.

This is why in order for the generator to win this game against the discriminator, it really has to not only generate realistic images, but all of the realistic images. For full disclosure, a particular issue with GANs that I will not talk about too much in this lecture is that in practical instantiations these models do often suffer from something called the mode collapse problem. The mode collapse problem is a type of overfitting and when that happens GANs do in fact fail to capture the entire distribution.

Technical details of actually setting up GANs

The classic GAN described by the previous procedure can also be expressed as a two-player game — a min-max problem where you have two players that are respectively trying to minimize (the generator) or maximize (the discriminator) the same objective. This expression is basically just cross-entropy loss: it takes the positive labeled images and maximizes the log probability of those; then, it takes the negative labeled images and maximizes the log probability of those.

In a binary classification problem, the probability of a negative label is just 1 minus the probability of a positive label, so that’s why you see “1 − D”. Both of these are expected values:

D is trying to maximize the log likelihood, such that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake); G is trying to do the opposite. It is no longer a conventional gradient descent optimization, it is in fact the problem of finding the Nash equilibrium in a two-player game. Here is the GAN game, rewritten in terms of θ (parameters of G) and φ (parameters of D):
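
Since the slide is not reproduced here, in standard notation the game reads:

```latex
\min_{\theta} \max_{\phi}\;
\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D_\phi(x)\right]
+ \mathbb{E}_{z \sim p(z)}\!\left[\log\!\left(1 - D_\phi(G_\theta(z))\right)\right]
```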

The algorithm is basically the same as alternating between a gradient ascent step on φ, because φ is maximizing, and a gradient descent step on θ, because θ is minimizing. These two steps just alternate.

There are two important details that we have to get right in order for this to work:

  1. How to make this work with stochastic gradient descent or ascent, because we want to use a mini-batch

For using stochastic gradient ascent or descent, we approximate the gradient by using a mini batch of training points for the first expectation and then for the second expectation we sample the same number of z’s and run them through the generator. That basically gives us a mini batch gradient ascent or mini batch gradient descent version of this. The loss is cross-entropy loss. The generator loss is just estimated with the same mini batch of z’s; you just run them through the generator, ask the discriminator what probability it assigns to them and then use that as the loss for your generator.

2. How to compute gradients

Computing these gradients is pretty straightforward; the gradient for the discriminator is just cross-entropy loss; the gradient for the generator may be a little bit more elaborate, because computing the loss function for the generator requires generating images and then passing them into the discriminator. The way that we calculate gradients for the generator is we backpropagate through the discriminator into the generator

What particular objective does the GAN actually optimize?

Can we somehow take this two-player game, analyze it and be able to state formally that at convergence this should minimize some measure of discrepancy between the data distribution and the distribution of images produced by our GAN?

For the purpose of theoretical analysis, we express the optimal D(x) in closed form as a function of G(z); we have an outer minimization and inner maximization, so maybe the solution of that inner maximization can be expressed in closed form as a function of G. We obviously can’t implement that in code, we need to actually train a neural net to do that, but maybe at least theoretically we can characterize the solution for D as a function of G

A very useful property for this is an identity that can be used to describe the Bayes’ optimal classifier. Let’s say that we have a set of positively labeled images (coming from p(x)) and a set of negatively labeled images (coming from q(x)). P(x) is p(data) and q(x) is the distribution of our images obtained by sampling z from p(z) and then running that through G. The optimal discriminator which I’m going to call D* is just the arg max of the log likelihood, which we can write as the expectation under p of log(D) plus an expectation under q of log(1-D).

Now we’re going to take the derivative of this with respect to D; we’re not taking the derivatives of any parameters, we’re just taking the derivative with respect to the function D and we’re going to set that derivative to zero and solve for D. If we can solve for the D that sets the derivative to zero, we will have recovered the optimal discriminator. Crucially, that doesn’t mean that your neural net can represent that optimal discriminator — this is more a setting where we have a very expressive function and we can just not worry about parameters and just characterize the function that would be the optimal solution.

What I’m going to claim is that if I plug this in:

Then the derivative will be 0. An expected value is just a sum over all possible x’s of the probability times the quantity inside the expectation, so the expected value under p of 1/D(x) is just a sum over all x’s of p(x) divided by D(x):

The same goes for the second term. In both expectations, the denominator and the distribution with respect to which we’re taking the expectation cancel out, so we’re just left with:
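Σ_x (p(x) + q(x)) - Σ_x (p(x) + q(x))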

Which equals 0. So this expression D*(x) is in fact the optimal classifier we can get if our positively labeled points come from p(x) and our negatively labeled points come from q(x): it is the Bayes-optimal classifier.

Taking pdata(x) to be p and pG(x) to be q, we can derive the optimal classifier D*_G(x) as a function of G. If G is fixed and we optimize the discriminator to convergence, and if our neural net is very expressive, in the limit this is what we should get:
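D*_G(x) = pdata(x) / (pdata(x) + pG(x))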

Now, what’s the objective for G? We can simply plug D*_G(x) into the min-max optimization above and turn it into just a minimization over G. I emphasize that this is purely a theoretical exercise; the only reason we’re going through it is to understand what the GAN objective really is. So if we substitute this in, we basically just have to write out:
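min_G E_{x~pdata}[log D*_G(x)] + E_{x~pG}[log(1 - D*_G(x))]
= min_G E_{x~pdata}[log(pdata(x) / (pdata(x) + pG(x)))] + E_{x~pG}[log(pG(x) / (pdata(x) + pG(x)))]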

What does the GAN optimize?

The expression we arrived at above has these two funny expected-value terms. Do they actually mean something?

Well, let’s define another distribution that we’re going to call q, which is the average of p data and pG. So, we’ll define q(x) as:
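q(x) = (pdata(x) + pG(x)) / 2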

Then, we can write our expression for the generator objective as:
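E_{x~pdata}[log(pdata(x) / (2 q(x)))] + E_{x~pG}[log(pG(x) / (2 q(x)))]
= E_{x~pdata}[log(pdata(x) / q(x))] + E_{x~pG}[log(pG(x) / q(x))] - 2 log 2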

Each of these terms is a KL divergence:
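KL(pdata || q) + KL(pG || q) - 2 log 2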

It turns out that this expression is, up to the additive constant and a factor of two, exactly another divergence measure called the Jensen-Shannon divergence: KL(pdata || q) + KL(pG || q) = 2 · JSD(pdata || pG). The Jensen-Shannon divergence has some interesting properties:

  • It goes to zero if the distributions match
  • Unlike the KL divergence, it’s symmetric

This means that the GAN really is trying to match the data distribution.

A small practical aside about GANs

Following all the derivations that we had so far, the conclusion should be that the generator loss is:
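min_G E_{z~p(z)}[log(1 - D(G(z)))]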

Which amounts to saying: let’s minimize the probability that the image is fake. In practice, we actually often use a slightly different loss:
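max_G E_{z~p(z)}[log D(G(z))]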

Which amounts to saying: let’s maximize the probability that the image is real. These two things are very similar and represent the same concept, but their gradients look a little different when you’re very far from the optimum. The reason why it is better to maximize the probability that the image is real has to do with the shapes of the functions log(x) and log(1 - x):

  • If you have a very bad generator, you’re going to have a large loss value, so you’re going to be on the left side of these two curves.
  • If you have a very good generator, you’re going to have a low loss value, so you’re going to be on the right side of these two curves

The problem with minimizing the probability that the image is fake is that the gradient tends to be pretty small when the generator is bad, and then becomes really steep on the right once the generator is good. We don’t really want that, because it’s most important for the generator to learn well precisely when it is bad: if it can’t make progress when it’s bad, it’s never going to get good. The two losses are sketched below.
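Here is how the two generator losses might look in code, assuming as before that D outputs the probability that its input is real (the function name is just a placeholder):

```python
import torch

def generator_loss(d_fake, non_saturating=True, eps=1e-8):
    # d_fake = D(G(z)): the discriminator's probability that generated images are real.
    if non_saturating:
        # "Maximize the probability that the image is real": strong gradient
        # when the generator is bad (d_fake close to 0).
        return -torch.log(d_fake + eps).mean()
    # "Minimize the probability that the image is fake": nearly flat when the
    # generator is bad, so the generator struggles to improve early in training.
    return torch.log(1.0 - d_fake + eps).mean()
```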

A few practical notes on GAN architectures: how do you actually set up the GAN?

  • You could have a fully connected architecture. You could take your z’s and run them through two fully connected layers with 512 units each and produce an MNIST digit. Then your discriminator could also have two fully connected layers of 512 units each and produce a true/false output, with a sigmoid non-linearity at the end.
  • You could have a more complex convolutional GAN, perhaps similar in spirit to some of the VAE architectures, where transpose convolutions in the generator produce a higher-resolution image and a convolutional discriminator produces your true/false label.
  • BigGAN is a very modern GAN that produces high-resolution images. The architecture is much more complex, but it is conceptually quite similar to the previous one. The basic principle is the same: it’s some kind of transpose-convolutional style structure.
  • Conditional GANs generate images of a particular class. All you have to do is concatenate your class label, call it y, to both the input of your generator and the input of your discriminator: you represent the class label as a one-hot vector, concatenate it to z for the generator, and concatenate it to x for the discriminator.
  • CycleGAN is concerned with a very particular problem: not just generating any image, but translating from one domain into another. For example, given a large collection of photographs of zebras and a large collection of photographs of horses, can you turn a zebra photograph into a horse photograph? The difficulty is that you are given these images but not pairs: it’s not like English-to-French translation, where you get a sentence in English and a corresponding sentence in French. You just get one collection of horses and a different collection of zebras.

We have two conditional generators and two discriminators:

G turns X into Y: it’s conditioned on x, which might be a photograph of a horse, and its job is to generate y, a photograph of a zebra. F goes the other way, turning Y back into X.

Dy is trained where positives are real pictures of zebras and negatives are generations from G, and Dx is trained where positives are real pictures of horses and negatives are generations from F. Of course, in order for F to generate a sample it needs to be given some y, so F will generate a horse given a zebra. We can now imagine training these two generators and two discriminators, and we’ll have a generator that turns horses into zebras and another that turns zebras into horses. But we’re not done yet, because nothing about this says which horse or which zebra you should generate. So the problem is: why should the translated zebra look anything like the original horse?

That’s where the cycle consistency loss comes in: if I turn this horse into a zebra and then turn that zebra back into a horse, I should get the same horse and that’s measured in terms of distance. Literally, the pixels in the image should be close together in terms of their absolute value difference. So, that is the idea behind CycleGAN: two GANs, one for each domain, and a loss that says: if you map from one domain to another and then back, you should end up in the same place where you started.
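Concretely, the cycle-consistency term added to the two GAN losses is usually an L1 penalty on the reconstruction:

L_cyc(G, F) = E_{x~pdata(x)}[ ||F(G(x)) - x||_1 ] + E_{y~pdata(y)}[ ||G(F(y)) - y||_1 ]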

Brief summary

  • The GAN is a 2-player game, with the objective min_G max_D E_{x~pdata}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]
  • We can derive the optimal discriminator and from this determine what objective is actually minimized at the Nash equilibrium

This does not guarantee that we’ll actually find the Nash equilibrium; the derivation just shows that if we do find the Nash equilibrium, the fixed point of this GAN game, it will minimize the Jensen-Shannon divergence.

  • We can train either fully connected or convolutional GANs and there’s a lot of liberty that we have in terms of determining their architecture

Modern techniques for training GANs

If you just implement GAN training the way I described until now, you’ll probably find that in all but the simplest problems, it requires an enormous amount of hyperparameter tuning to work well. One particularly difficult scenario for GANs that perhaps illustrates this problem is the following: let’s say that the x’s are one-dimensional, so that I can draw them on the slide. Now, let’s say that these blue circles represent real data and these orange circles represent samples generated by your current generator. You could imagine the distributions looking like this:

If I train my discriminator with pdata and pG, my discriminator might look something like this:

It’ll output essentially 1.0 close to the real data and essentially 0.0 close to the generated data, and the decision boundary will be somewhere in the middle, very far from either distribution.

Remember that the generator learns entirely from the gradient through the discriminator, so what will the generator’s gradient actually be near the generated data? Well, the discriminator outputs zero pretty uniformly over all those points, so its output is essentially flat there and the generator gets almost no signal about which way to move.

How can we make sure that even when the generator is very bad compared to the discriminator and when the generated data is very far from p data, the discriminator still gives us a meaningful gradient signal?

  • One thing we can do is alter our discriminator: maybe we can change how the discriminator is trained, so that even if it can perfectly classify pdata from pG, it is still encouraged to produce a smoother slope between these two distributions, giving the generator some gradient signal to guide it towards pdata.
  • Another thing we could do is alter the distributions a little bit: maybe we can modify pdata and pG so they overlap more; then the discriminator will not be quite so sharp and we’ll see a stronger gradient signal.

Both of these are viable ideas and we can actually instantiate this intuition in some practical methods. Here are some improved GAN techniques that are good to be aware of in no particular order:

  • Least-squares GAN modifies the GAN discriminator to output a real-valued number instead of a probability between zero and one, and trains the discriminator so that it produces a smoother slope.
  • The Wasserstein GAN modifies the discriminator to be Lipschitz-continuous, which also encourages it to have a cleaner slope between pdata and pG.
  • Gradient penalty is a way to improve the Wasserstein GAN by penalizing the discriminator’s slope directly, pushing it even harder towards being smooth.
  • Spectral norm is a way to actually constrain the discriminator to be Lipschitz-continuous.
  • Instance noise tries to change the distributions by adding a lot of noise both to pdata and to the generated samples, in the hope of getting their distributions to overlap more.

In this article, I’m going to focus primarily on the Wasserstein GAN and then briefly remark on gradient penalty and spectral norm, but do keep in mind that things like least-squares GAN and instance noise are also viable choices.

Wasserstein GAN

The high level intuition is that the Jensen-Shannon divergence used by the classic GAN doesn’t have any way to account for distance. Essentially if you have this scenario:

You can say these two distributions are really far apart. Or you can have this scenario and notice that these two distributions are really close together:

In terms of JS divergence, these two cases are about the same because in both cases the overlap between the distributions is negligible. So, the JS divergence, KL divergence and most other divergences that purely use the probabilities are almost identical in these two cases, even though clearly the bottom one is much much better. This is the high-level reason why it is so difficult to get a meaningful gradient in a GAN when the generator is very bad: because a good gradient should tell you that the way to improve the situation up above is to move towards a situation down below, but if your divergence measure thinks these two distributions are almost the same, then of course it’s not going to have much gradient.

The reason this is true mathematically is that, if you look at the form of the JS divergence or of the GAN objective, you’ll see that it’s expressed in terms of log probabilities under one distribution taken in expectation under another distribution. So if pdata is approximately zero for all the x’s where pG is not zero, and pG is approximately zero for all the x’s where pdata is not zero, then these expressions will just not give you any meaningful gradient. That basically explains why the top situation is considered to be very similar to the bottom situation as far as the JS divergence is concerned, despite the fact that in the bottom case the distributions are really close together.

What we can do to rectify this problem is to try to come up with a way to train a GAN with a different divergence measure that better captures how far apart two distributions are. Here’s how you can think about a better metric: consider how far apart all the bits of probability are in Euclidean space; imagine the distributions are basically piles of dirt. How far do you have to travel to move the dirt from one pile into the other? This is sometimes referred to as an optimal transport problem: how far do you have to go to transport one distribution to another? The total distance you have to travel is how far apart these two distributions are.

Wasserstein distance is formally defined as the expected value of the distance between y and x, where y and x are distributed according to the optimal choice of joint distribution gamma. So, you find the best gamma for minimizing this expected value.
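In symbols: W(pdata, pG) = inf_γ E_{(x,y)~γ}[ ||x - y|| ]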

Gamma is a joint distribution over x and y, where the marginal with respect to x is p data and the marginal with respect to y is pG. The intuition is that correlations between x and y in gamma(x,y) indicate which x should be transported to which y

In order to evaluate the Wasserstein distance, you find the gamma that minimizes the expected value of these distances, that minimizes the discrepancy between the x and the y for every y that goes to that particular x. Intuitively, finding gamma is like finding the optimal plan for moving all the points in p data into pG.

Unfortunately, actually learning gamma directly like this is very very hard, because p data is unknown and gamma(x,y) might itself be some really really complex distribution

There’s a really cool theorem that we can use based on Kantorovich-Rubinstein duality, which provides us with a much more tractable way to solve for the Wasserstein distance. The statement is that the Wasserstein distance is also equal to the supremum over all possible functions f of the expected value under p data of f minus the expected value under pG of f:
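W(pdata, pG) = sup_{||f||_L ≤ 1} ( E_{x~pdata}[f(x)] - E_{x~pG}[f(x)] )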

The crude intuition is that f comes from the same place you get a Lagrange multiplier: you have a constraint on gamma, namely that the marginals of gamma need to match pdata and pG, and you basically take the dual of that. The really appealing thing about this expression for the Wasserstein distance is that it is a difference of expectations under pdata and pG, just like a regular GAN.

The supremum was taken over this funny expression that says:
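||f||_L ≤ 1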

That’s shorthand for saying that the supremum is taken over functions that are 1-Lipschitz. This means that the difference between f(x) and f(y) should be less than or equal to the distance between x and y. This is essentially saying that the function has bounded slope, so it should never be too steep; if f could be arbitrarily steep, it could arbitrarily maximize f(x) under pdata and arbitrarily minimize it under pG. This limit on slope is a kind of speed limit on how fast you can travel when you’re carrying bits of one distribution to another.

How do we enforce that the slope of f should be bounded by some constant? That is basically the hard part of instantiating WGAN. Here’s an idea, which is not necessarily the best one, but it’s one:

If you have just a 1-layer network, so one linear layer followed by a ReLU non-linearity and all of your entries in W are between -0.01 and +0.01, then the slope can’t be bigger than 0.01 x D.

If you have a 2-layer network, then you can constrain all the entries in both weight matrices to be between -0.01 and +0.01. Then your slope is also going to be bounded; of course it’s a lot bigger than before, because you’re actually multiplying these w’s together, but it’s still bounded by a constant so this doesn’t guarantee that it’s 1-Lipschitz, but it does guarantee that it’s something-Lipschitz for some constant, which means the slope will be bounded, which means that you will get some multiple of the Wasserstein distance.

The caveat is that weight clipping is clearly a terrible way to enforce a Lipschitz constraint. If the clipping parameter is large, it can take a long time for any weights to reach their limit, making it harder to train the critic to optimality. If the clipping parameter is small, this can easily lead to vanishing gradients when the number of layers is large or batch normalization is not used.

But if we want to work through this Wasserstein GAN procedure, here’s how it works. We would have our generator just like before, and we would have a discriminator that, instead of outputting the probability that the image is real, just outputs a real number. That real number is unconstrained, so you just have a bunch of linear layers with ReLUs and no sigmoid at the end. The discriminator is going to use weight clipping, and here’s how that works.

Every iteration of your stochastic gradient algorithm, you would update the discriminator using the WGAN objective. Then, you would clip all the weight matrices inside theta to lie between some constants, which ensures that the discriminator is k-Lipschitz for some fixed finite k. Finally, you would update your generator to maximize the expected value of f(G(z)). A sketch of this loop is below.
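Here’s a rough sketch of that loop in PyTorch-style code; the networks, optimizers, clip value and number of critic steps are hypothetical placeholders:

```python
import torch

# Hypothetical setup: G is the generator, f is the critic (no sigmoid at the end),
# opt_f / opt_G are their optimizers.
def wgan_step(G, f, opt_f, opt_G, real_batch, z_dim=64, clip=0.01, n_critic=5):
    n = real_batch.shape[0]

    for _ in range(n_critic):
        # 1) Critic step: maximize E_pdata[f(x)] - E_pG[f(G(z))]
        fake = G(torch.randn(n, z_dim)).detach()
        loss_f = -(f(real_batch).mean() - f(fake).mean())
        opt_f.zero_grad(); loss_f.backward(); opt_f.step()

        # 2) Weight clipping: a crude way to keep f k-Lipschitz for some finite k
        with torch.no_grad():
            for p in f.parameters():
                p.clamp_(-clip, clip)

    # 3) Generator step: maximize E[f(G(z))], i.e. minimize -E[f(G(z))]
    loss_G = -f(G(torch.randn(n, z_dim))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```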

How do we do better discriminator regularization, given that weight clipping is a terrible way to enforce a Lipschitz constraint?

  • One very simple and very effective method is called gradient penalty. If what we started with is the idea that we want to bound the slope, why don’t we just directly write down an expression for the slope and add it to our objective? (A sketch follows this list.)
  • Spectral norm is a somewhat more complex approach. The trouble with gradient penalty is that it’s a penalty, not an actual constraint; if you really want a Lipschitz constraint, you can bound the Lipschitz constant in terms of the singular values of each weight matrix in your network.
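The gradient penalty idea (as in the WGAN-GP variant) might look roughly like this; the function below is a hypothetical sketch that penalizes the critic’s gradient norm, evaluated at random interpolations between real and generated samples, for deviating from 1:

```python
import torch

def gradient_penalty(f, real, fake, lam=10.0):
    # Random interpolation between real and generated samples
    alpha = torch.rand(real.shape[0], *([1] * (real.dim() - 1)))
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

    # Norm of the critic's gradient at the interpolated points
    grad = torch.autograd.grad(f(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grad.flatten(start_dim=1).norm(2, dim=1)

    # Penalize deviation of the slope from 1
    return lam * ((grad_norm - 1.0) ** 2).mean()
```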

If your function f(x) is a composition of multiple functions, its Lipschitz constant is bounded by the product of the Lipschitz constants of those functions. So all we have to do is bound the Lipschitz constant of every layer of our network. The maximal slope of a ReLU is one, so you don’t have to worry about those; the maximal slope of a linear layer is the spectral norm of W, which is defined as:
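σ(W) = max_{x ≠ 0} ||W x||_2 / ||x||_2, i.e., the largest singular value of W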

All you have to do is to ensure, after every step, that the spectral norm of every weight matrix in your network is less than or equal to one; to ensure this, you just divide every weight matrix by its spectral norm.
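In practice you rarely implement this by hand. PyTorch, for example, ships a spectral normalization wrapper that can be applied layer by layer; the tiny critic below is just a placeholder architecture:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Placeholder critic whose linear layers are spectrally normalized, so each
# layer's weight matrix is divided by (an estimate of) its largest singular value.
critic = nn.Sequential(
    spectral_norm(nn.Linear(784, 512)),
    nn.ReLU(),
    spectral_norm(nn.Linear(512, 1)),
)
```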

Summary

GAN training is really hard, because the discriminator can provide poor gradients. Various tricks can make this much more practical. For example, you can have:

  • Smooth real-valued discriminators (LSGAN, WGAN, WGAN-GP, spectral norm)
  • Instance noise
