18 Generative Models

18.1 Boltzmann Machines

A Boltzmann machine is an energy-based model: the joint probability distribution is defined through an energy function as follows:

P(x) = \dfrac{\exp(-E(x))}{\sum_{x'} \exp(-E(x'))}, \quad E(x) = -x^T U x - b^T x

where E(x) is the energy function, U is the weight matrix of model parameters and b is the vector of bias parameters.
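For a handful of binary units the normalizing sum can be evaluated by brute force, which makes the definition concrete. The following is a minimal sketch, assuming binary states in {0, 1}; the values of U and b are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def energy(x, U, b):
    """E(x) = -x^T U x - b^T x for a single state vector x."""
    return -x @ U @ x - b @ x

def boltzmann_distribution(U, b):
    """Enumerate all binary states and return (states, probabilities)."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    energies = np.array([energy(x, U, b) for x in states])
    unnormalized = np.exp(-energies)
    # Dividing by the sum over all states is the partition function in P(x).
    return states, unnormalized / unnormalized.sum()

# Hypothetical parameters for a 3-unit model (illustrative values only).
U = np.array([[0.0, 0.5, -0.3],
              [0.5, 0.0, 0.2],
              [-0.3, 0.2, 0.0]])
b = np.array([0.1, -0.2, 0.0])

states, probs = boltzmann_distribution(U, b)
most_likely = states[np.argmax(probs)]  # the lowest-energy state
```

The enumeration visits 2^n states, so this only works for very small n; that cost is the reason practical training relies on sampling rather than exact normalization.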

Restricted Boltzmann machines (RBMs) are undirected probabilistic graphical models containing a layer of observable (visible) variables x and a single layer of latent (hidden) variables h.

The energy function is given by

E(x, h) = - h^T Wx - b^T x - c^T h

b: visible-layer bias vector
c: hidden-layer bias vector
W: weight matrix that models the interaction between the visible and hidden layers

The difference from a general Boltzmann machine is that connections between nodes in the same layer are not allowed.

The goal of training is to adjust the RBM’s weights so that the model learns the underlying probability distribution of the training data. This is done by giving real data configurations low energy (high probability) and other configurations high energy (low probability). The training algorithm is called contrastive divergence, and its parameter update is made of a positive phase and a negative phase:

The positive phase strengthens the connections that help represent the real data.
The negative phase weakens the connections that produce the model’s hallucinations.

Concretely, a first forward pass from the data to the hidden layer computes the positive phase. A backward pass then reconstructs a ‘dreamed’ version of the data from the hidden layer, that is, what the data looks like according to the model. Finally, a second forward pass on this reconstruction computes the negative phase. The weight change is proportional to the difference between the correlations measured on the real data and those measured on the reconstruction.
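The three passes can be sketched in NumPy as a single-step contrastive divergence (CD-1) update. This is a minimal sketch under stated assumptions, not a reference implementation: units are assumed binary, so each conditional is a sigmoid of the corresponding pre-activation, and the names `cd1_step`, `lr`, and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(x, W, b, c, lr, rng):
    """One CD-1 update on a batch x of shape (batch, n_visible).

    W has shape (n_hidden, n_visible), matching E(x, h) = -h^T W x - b^T x - c^T h.
    Assumes binary visible and hidden units.
    """
    # Positive phase: hidden probabilities driven by the real data.
    ph_data = sigmoid(x @ W.T + c)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Backward pass: reconstruct ("dream") the visible units from the hidden sample.
    pv_recon = sigmoid(h_sample @ W + b)
    x_recon = (rng.random(pv_recon.shape) < pv_recon).astype(float)

    # Negative phase: hidden probabilities driven by the reconstruction.
    ph_recon = sigmoid(x_recon @ W.T + c)

    # Update = data correlations minus reconstruction correlations.
    n = x.shape[0]
    dW = (ph_data.T @ x - ph_recon.T @ x_recon) / n
    db = (x - x_recon).mean(axis=0)
    dc = (ph_data - ph_recon).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc

rng = np.random.default_rng(0)
x = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=float)  # toy binary data
W = 0.01 * rng.standard_normal((3, 4))  # 3 hidden units, 4 visible units
b = np.zeros(4)
c = np.zeros(3)
for _ in range(100):
    W, b, c = cd1_step(x, W, b, c, lr=0.1, rng=rng)
```

Using the hidden probabilities rather than binary samples in the correlation terms is a common variance-reduction choice; sampling them instead would also be a valid instance of the same procedure.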

18.2 Generative Adversarial Networks

Recall that we can think of a generative model in terms of a distribution p(x|w) in which x is a vector in the data space and w represents the learnable parameters of the model. Consider a generative model based on a nonlinear transformation from a latent space z to a data space x. We introduce a latent distribution p(z) along with a nonlinear transformation x = g(z, w) defined by a neural network, known as the generator, with learnable parameters w. Together these implicitly define a distribution over x. However, we cannot determine w by optimizing the likelihood function because the likelihood cannot be evaluated in closed form.

The key idea of a generative adversarial network (GAN) is to introduce a second network, called the discriminator, which is trained jointly with the generator network. The goal of the discriminator network is to distinguish between real examples and synthetic examples produced by the generator. The goal of the generator network is to maximize the discriminator’s error by synthesizing examples that look as if they were drawn from the training set distribution. The generator and discriminator are therefore working against each other. This is an example of a zero-sum game, in which any gain by one network represents a loss to the other.

The discriminator network has a single output unit with a sigmoid activation, whose output represents the probability that a data vector x is real, where \phi denotes the discriminator’s learnable parameters: P(t = \text{real}) = d(x, \phi)

We define t = real as t = 1 and t = synthetic as t = 0, so we can write the cross-entropy error function as follows:

E_{GAN} (w, \phi) = - \dfrac{1}{N_{\text{real}}} \sum_{n \in \text{real}} \ln d(x_n, \phi) - \dfrac{1}{N_{\text{synth}}} \sum_{n \in \text{synth}} \ln ( 1 - d(g(z_n, w), \phi))
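To get a feel for the magnitude of this error, it can be evaluated directly from discriminator outputs. The sketch below assumes the outputs have already been computed; the function name `gan_error` and the example values are hypothetical.

```python
import numpy as np

def gan_error(d_real, d_synth):
    """Cross-entropy error from discriminator outputs in (0, 1).

    d_real:  d(x_n, phi) evaluated on real examples
    d_synth: d(g(z_n, w), phi) evaluated on synthetic examples
    """
    d_real = np.asarray(d_real, dtype=float)
    d_synth = np.asarray(d_synth, dtype=float)
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_synth))

# Discriminator cannot tell real from synthetic: every output is 0.5.
e_confused = gan_error([0.5, 0.5], [0.5, 0.5])

# Discriminator is mostly right: high on real, low on synthetic.
e_confident = gan_error([0.9, 0.95], [0.1, 0.05])
```

When the discriminator outputs 0.5 everywhere the error equals 2 ln 2, and it falls toward zero as the discriminator becomes more accurate.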

The unusual aspect is the adversarial training, whereby the error is minimized with respect to \phi but maximized with respect to w. This can be done using gradient methods with the sign of the gradient reversed for the generator. Writing E_n(w, \phi) for the error evaluated on data point n (or on a mini-batch) and \lambda for the learning rate: \Delta \phi = - \lambda \nabla_{\phi} E_n(w, \phi) \quad \quad \Delta w = \lambda \nabla_w E_n (w, \phi)
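The opposite signs of the two updates can be checked on a toy problem. The following is a minimal sketch, assuming a scalar data space, a hypothetical affine generator, a hypothetical logistic-regression discriminator, and finite-difference gradients in place of backpropagation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generator(z, w):
    # Hypothetical generator: scale and shift of the latent sample.
    return w[0] * z + w[1]

def discriminator(x, phi):
    # Hypothetical discriminator: logistic regression on a scalar input.
    return sigmoid(phi[0] * x + phi[1])

def error(w, phi, x_real, z):
    d_real = discriminator(x_real, phi)
    d_synth = discriminator(generator(z, w), phi)
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_synth))

def numerical_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x_real = rng.normal(3.0, 1.0, size=64)  # toy "real" data
z = rng.normal(0.0, 1.0, size=64)       # latent samples
w = np.array([1.0, 0.0])
phi = np.array([0.1, 0.0])
lam = 0.01                              # learning rate

e_before = error(w, phi, x_real, z)
grad_phi = numerical_grad(lambda p: error(w, p, x_real, z), phi)
grad_w = numerical_grad(lambda v: error(v, phi, x_real, z), w)

phi_new = phi - lam * grad_phi  # discriminator descends the error
w_new = w + lam * grad_w        # generator ascends the error

e_after_phi = error(w, phi_new, x_real, z)  # lower than e_before
e_after_w = error(w_new, phi, x_real, z)    # higher than e_before
```

In practice the two updates are alternated over many mini-batches, and the gradients come from automatic differentiation rather than finite differences.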

Once the GAN is trained, the discriminator network is discarded and the generator network can be used to synthesize new examples in the data space by sampling from the latent space.

18.3 Normalizing Flows

Normalizing flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact. A normalizing flow is a transformation of a simple probability distribution into a more complex distribution by a sequence of invertible and differentiable mappings. The result of this approach is a mechanism to construct new distributions by choosing an initial density and then chaining together some number of parameterized, invertible and differentiable transformations.
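Exact density evaluation rests on the change-of-variables formula: for an invertible map x = f(z), \ln p_x(x) = \ln p_z(f^{-1}(x)) + \ln |\det \partial f^{-1} / \partial x|. The sketch below illustrates both sampling and density evaluation for a chain of scalar affine maps; the standard normal base density and the layer values are assumptions chosen so the result can be checked against a known Gaussian.

```python
import numpy as np

def base_log_density(z):
    """Log density of the standard normal base distribution."""
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

# Each layer is a hypothetical invertible affine map z -> a * z + s with a != 0.
layers = [(2.0, 1.0), (0.5, -3.0), (3.0, 0.5)]

def sample(n, rng):
    """Sampling: draw from the base density and push through every layer."""
    x = rng.standard_normal(n)
    for a, s in layers:
        x = a * x + s
    return x

def log_density(x):
    """Density evaluation: invert the layers in reverse order, accumulating
    the log absolute derivative of each inverse map."""
    z = np.array(x, dtype=float)
    log_det = np.zeros_like(z)
    for a, s in reversed(layers):
        z = (z - s) / a
        log_det -= np.log(abs(a))
    return base_log_density(z) + log_det

rng = np.random.default_rng(0)
xs = sample(5, rng)
log_px = log_density(xs)
```

The three layers compose to x = 3z - 7, so `log_density` should agree with the log density of a Gaussian with mean -7 and standard deviation 3. Real flows replace the affine maps with richer parameterized transformations whose inverses and Jacobian determinants remain cheap to compute.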