Summary

This week Stannis gave a high-level overview of three popular families of deep generative models. The discussion is mainly based on the original papers [1][2]. The goal is to point out the commonalities and differences between these models, and to discuss in detail the learning methods each of them employs.

Overview

When using latent variable models for probabilistic modeling, the objects of interest are the latent variables (which we denote by $z$), and the observed variables (which we denote by $x$).

There are two kinds of probabilistic models: discriminative and generative. When using discriminative models, we directly model the conditional distribution $p(z|x)$; common examples include logistic regression and neural networks. When using generative models, we focus instead on the joint distribution $p(x, z)$, which usually factors into the product of the prior distribution $p(z)$ and the likelihood function $p(x|z)$. Common examples include Markov random fields, Bayesian networks, and the topic of this discussion: deep generative models.

In this discussion, we will cover three popular families of deep generative models: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Flow-based Generative Models. When it comes to deep generative models, a central focus is easy sampling, and all three families explicitly specify a stochastic generation process: an easy-to-sample base distribution transformed by some complicated differentiable transformation (e.g. a neural network). The main differences between these models lie in how we learn their parameters: GANs use adversarial training, VAEs use variational inference, and flow-based generative models use maximum likelihood estimation.

In what follows, we look at each of the three families in more detail, examining two key aspects: the associated stochastic generation process, and the method for learning the parameters.

Generative Adversarial Networks

Stochastic Generation Process

In GANs, we start with some simple base distribution $p_z(z)$ (e.g. a multivariate Gaussian), which is easy to sample from, and a differentiable transformation $G(z; \theta_g)$, which we call the generator. The stochastic generation process involves applying the generator $G(z; \theta_g)$ to the base distribution $p_z(z)$, and the actual random variable we use to model the data is given by $x = G(z; \theta_g)$ where $z\sim p_z(z)$.
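To make this concrete, here is a minimal PyTorch sketch of the generation process; the architecture, the batch size, and the dimensions `z_dim` and `x_dim` are illustrative assumptions, not choices made in [1].

```python
import torch
import torch.nn as nn

z_dim, x_dim = 16, 784  # illustrative dimensions

# G(z; theta_g): a differentiable transformation of the base noise
G = nn.Sequential(
    nn.Linear(z_dim, 128),
    nn.ReLU(),
    nn.Linear(128, x_dim),
)

# Stochastic generation: sample z ~ p_z(z) = N(0, I), then push it through G
z = torch.randn(64, z_dim)  # a batch of 64 base samples
x = G(z)                    # x = G(z; theta_g), a batch of samples from p_g(x)
```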

Adversarial Training

We use adversarial training to learn the parameters for GANs. Note that we are working with two different distributions, a generator distribution $p_g(x)$, which is induced by the stochastic generation process $x = G(z; \theta_g)$ with $z\sim p_z(z)$, and a data distribution $p_\text{data}(x)$, which represents the empirical distribution we get from the data. The key idea of adversarial training is to employ an additional neural network, which we call the discriminator, to train the model.

Instead of laying down the formulas for adversarial training, we mention a helpful way to think about it: the discriminator is a binary classifier, and the loss function is always the likelihood associated with this classifier. Training alternates between updating the generator parameters and updating the discriminator parameters; the loss function is the same for both, and the two kinds of updates differ only in what data and labels we use. When updating the generator parameters, we regard the loss function (i.e. the likelihood) as a function of the generator parameters and estimate its gradients using samples from $p_g(x)$, all of which receive label $0$. In other words, we work with negative samples only. When updating the discriminator parameters, we regard the loss function (again the likelihood) as a function of the discriminator parameters and estimate its gradients using samples from both $p_g(x)$ (label $0$) and $p_\text{data}(x)$ (label $1$), which is just the usual way to train a binary classifier.
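Continuing the sketch above (reusing `G`, `z_dim`, and `x_dim`), one round of these alternating updates might look as follows; the discriminator architecture, the optimizers, and `real_batch` are illustrative assumptions, with `BCEWithLogitsLoss` playing the role of the binary classifier's negative log-likelihood.

```python
# Discriminator: a binary classifier that outputs a logit
D = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))
bce = nn.BCEWithLogitsLoss()  # the classifier's negative log-likelihood
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def adversarial_step(real_batch):  # real_batch: samples from p_data(x)
    n = real_batch.size(0)
    zeros, ones = torch.zeros(n, 1), torch.ones(n, 1)

    # Discriminator update: label 0 for samples from p_g(x), label 1 for data
    fake = G(torch.randn(n, z_dim)).detach()  # detach: only D is updated here
    d_loss = bce(D(fake), zeros) + bce(D(real_batch), ones)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: the same loss, now viewed as a function of G's
    # parameters, estimated with label-0 samples from p_g(x) only; G *ascends*
    # this loss (in practice the labels are often flipped to 1 instead, the
    # "non-saturating" variant)
    g_loss = -bce(D(G(torch.randn(n, z_dim))), zeros)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```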

Variational Autoencoders

Stochastic Generation Process

We use the formulation in [2] as an example. In VAEs, we again start with a simple base distribution $p_z(z)$ (e.g. a multivariate Gaussian), which is easy to sample from, together with two differentiable transformations $\mu(z; \theta)$ and $\sigma(z; \theta)$. The actual random variable we use to model the data is given by $$x_i\sim N(\mu_i(z; \theta), \sigma_i^2(z; \theta)), \quad i=1, \cdots, d_x$$ where $d_x$ is the dimension of the data.
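A minimal sketch of this generation process, with a shared trunk network and illustrative dimensions (neither of which is prescribed by [2]):

```python
import torch
import torch.nn as nn

z_dim, x_dim = 8, 784  # illustrative dimensions

# Decoder networks mu(z; theta) and sigma(z; theta); sharing a trunk is an
# implementation convenience, not part of the formulation
trunk = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU())
mu_head = nn.Linear(128, x_dim)
log_sigma_head = nn.Linear(128, x_dim)  # predict log(sigma) so sigma > 0

# Stochastic generation: z ~ N(0, I), then x_i ~ N(mu_i(z), sigma_i(z)^2)
z = torch.randn(1, z_dim)
h = trunk(z)
mu, sigma = mu_head(h), log_sigma_head(h).exp()
x = mu + sigma * torch.randn_like(sigma)  # one draw from the diagonal Gaussian
```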

Variational Inference: Basics

We make use of the variational inference framework to learn the parameters of VAEs. To understand variational inference, first observe that, ideally, we would learn the parameters by maximizing the log marginal likelihood $$\log p_{\theta}(x)=\log \int p_{\theta}(x, z) dz.$$ The problem with this approach is that evaluating the log marginal likelihood involves an integral that is usually intractable. The central idea of variational inference is that, instead of dealing with this complicated integral, we work with the so-called Evidence Lower Bound (ELBO). Using Jensen's inequality, it is easy to see that $$\log p_\theta(x) = \log \int p_\theta(x, z) dz = \log \int q_\phi(z|x) \frac{p_\theta(x, z)}{q_\phi(z|x)} dz \geq \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} dz$$ where $$\int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} dz$$ is our ELBO. In variational inference, we maximize the ELBO to get the tightest possible lower bound on the log marginal likelihood. This can also be understood as minimizing the KL divergence $D(q_\phi(z|x) \,\|\, p_\theta(z|x))$ between the variational distribution and the true posterior.
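To see why these two views coincide, note the exact decomposition (a one-line consequence of Bayes' rule) $$\log p_\theta(x) = \int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} dz + D\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big).$$ Since the left-hand side does not depend on $\phi$ and the KL divergence is nonnegative, maximizing the ELBO over $\phi$ is exactly minimizing the KL divergence to the true posterior.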

Traditionally, variational inference is done using the mean-field approximation. For VAEs, we use an "amortized" version of it: we introduce two more differentiable transformations $\mu(x; \phi)$ and $\sigma(x; \phi)$, and use $$q_\phi(z_i|x) = N(z_i \mid \mu_i(x; \phi), \sigma_i^2(x; \phi)), \quad i=1,\cdots, d_z$$ where $d_z$ is the dimension of the latent variable.
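In code, amortization simply means two more networks mapping $x$ to the parameters of $q_\phi(z|x)$; a sketch continuing the decoder example above (reusing `x_dim` and `z_dim`):

```python
# Inference networks mu(x; phi) and sigma(x; phi), again with a shared trunk
enc_trunk = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU())
enc_mu = nn.Linear(128, z_dim)
enc_log_sigma = nn.Linear(128, z_dim)

def q_params(x):
    """Return the mean and standard deviation of the diagonal Gaussian q_phi(z|x)."""
    h = enc_trunk(x)
    return enc_mu(h), enc_log_sigma(h).exp()
```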

Reparametrization Trick

A useful technique when applying variational inference to learn the parameters of VAEs is the reparametrization trick. To understand what this trick is and why it is necessary, recall that our goal is to maximize the ELBO $$\int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} dz$$ with gradient-based updates for both $\theta$ and $\phi$. The main difficulty is estimating the gradients with respect to $\phi$. A naive approach is first to note that $$\nabla_{\phi}\int q_\phi(z|x) \log \frac{p_\theta(x, z)}{q_\phi(z|x)} dz = \int q_\phi(z|x) \nabla_{\phi} \log q_{\phi}(z|x)\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)} - 1\right] dz$$ and then draw Monte Carlo samples $z^{(i)}\sim q_{\phi}(z|x), i=1,\cdots, n$ to get an unbiased estimate of the gradient: $$\frac{1}{n}\sum_{i=1}^n \nabla_{\phi} \log q_{\phi}(z^{(i)}|x)\left[\log \frac{p_\theta(x, z^{(i)})}{q_\phi(z^{(i)}|x)} - 1\right]$$ This is also referred to as the score-function estimator.
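As a toy illustration (outside the VAE setting), here is the score-function estimator for $\nabla_{\phi} \int q_\phi(z) f(z) dz$ with $q_\phi(z) = N(\phi, 1)$, where $f(z) = z^2$ stands in for the bracketed term; the whole setup is an illustrative assumption.

```python
import torch

phi = torch.tensor(1.0, requires_grad=True)
q = torch.distributions.Normal(phi, 1.0)  # q_phi(z) = N(phi, 1)

z = q.sample((100_000,))  # the samples themselves carry no gradients
f = z ** 2                # stand-in for the bracketed term

# Score-function estimator: average of grad log q_phi(z) * f(z)
surrogate = (q.log_prob(z) * f).mean()
surrogate.backward()
print(phi.grad)  # estimates d/dphi of E[z^2] = phi^2 + 1, i.e. about 2.0
```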

However, this naive approach has several problems: sometimes we cannot evaluate the score function $\nabla_{\phi}\log q_{\phi}(z|x)$; even when we can, the score-function estimator usually has very high variance; and it does not let us easily take advantage of automatic differentiation.

The solution to the above problems is the reparametrization trick. The key idea is to view $q_\phi(z|x)$ as a parameterless base distribution $p(\epsilon)$ pushed through a differentiable transformation $g_\phi(\epsilon, x)$. For VAEs, $p(\epsilon) = N(0, I)$ and $$g_\phi(\epsilon, x) = \mu(x; \phi) + \sigma(x; \phi) \odot \epsilon$$ where $\odot$ denotes elementwise multiplication.

If we can do this reparametrization, we can then sample $\epsilon^{(i)}, i=1, \cdots, n$ from $p(\epsilon)$ and use $$\frac{1}{n}\sum_{i=1}^n \log \frac{p_\theta(x, g_\phi(\epsilon^{(i)}, x))}{q_\phi(g_\phi(\epsilon^{(i)}, x)|x)}$$ as an approximation to the ELBO. This way we can easily estimate stochastic gradients w.r.t. both $\theta$ and $\phi$ using backpropagation.
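Putting the pieces together, here is a sketch of the reparametrized Monte Carlo ELBO estimate, reusing the decoder (`trunk`, `mu_head`, `log_sigma_head`) and encoder (`q_params`) sketches above:

```python
def elbo_estimate(x, n=1):
    """Reparametrized Monte Carlo estimate of the ELBO for a batch x."""
    mu, sigma = q_params(x)                        # parameters of q_phi(z|x)
    total = 0.0
    for _ in range(n):
        eps = torch.randn_like(sigma)              # eps ~ p(eps) = N(0, I)
        z = mu + sigma * eps                       # z = g_phi(eps, x)
        log_q = torch.distributions.Normal(mu, sigma).log_prob(z).sum(-1)
        # log p_theta(x, z) = log p(z) + log p_theta(x | z)
        log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
        h = trunk(z)
        dec_mu, dec_sigma = mu_head(h), log_sigma_head(h).exp()
        log_px_z = torch.distributions.Normal(dec_mu, dec_sigma).log_prob(x).sum(-1)
        total = total + (log_pz + log_px_z - log_q)
    return (total / n).mean()  # average over the n samples and the batch
```

Maximizing `elbo_estimate(x)` with any stochastic gradient optimizer then updates $\theta$ and $\phi$ jointly via backpropagation.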

Flow-based Generative Models

Stochastic Generation Process

In flow-based generative models, we repeat the previous pattern for the stochastic generation process: we start with a simple base distribution $p_z(z)$ (e.g. a multivariate Gaussian), which is easy to sample from. The essential idea of flow-based generative models is to use a reversible (invertible) differentiable transformation $x = G(z)$ for which we can easily calculate the determinant of the Jacobian, $|J G(z)|$. If so, the exact probability density function is given by $$p_x(x) = \frac{p_z(z)}{|J G(z)|}, \text{ where } z = G^{-1}(x).$$
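A minimal sketch of the density computation, using a deliberately simple reversible transformation, the elementwise affine map $x = G(z) = e^{s} \odot z + b$ (an illustrative choice; practical flows stack more expressive layers):

```python
import torch

d = 4  # illustrative data dimension
s = torch.zeros(d, requires_grad=True)       # log-scales
b = torch.zeros(d, requires_grad=True)       # shifts
base = torch.distributions.Normal(0.0, 1.0)  # p_z: factorized standard normal

def log_px(x):
    """Exact log-density of x via the change-of-variables formula."""
    z = (x - b) * torch.exp(-s)                # z = G^{-1}(x)
    log_det = s.sum()                          # log |J G(z)| = sum_i s_i
    return base.log_prob(z).sum(-1) - log_det  # log p_z(z) - log |J G(z)|
```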

Maximum Likelihood Estimation

Since we have direct access to the exact likelihood function, we can train such models by straightforward maximum likelihood estimation. We will go into more detail about how to actually specify such models in our next meeting.
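Continuing the sketch above, maximum likelihood training is then just stochastic gradient descent on the negative log-likelihood; the optimizer choice is illustrative.

```python
opt = torch.optim.Adam([s, b], lr=1e-2)

def mle_step(batch):             # batch: samples from p_data(x), shape (n, d)
    nll = -log_px(batch).mean()  # negative log-likelihood of the batch
    opt.zero_grad(); nll.backward(); opt.step()
    return nll.item()
```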

References

[1] Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. "Generative Adversarial Nets." In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2672–80. Curran Associates, Inc.

[2] Kingma, Diederik P., and Max Welling. 2013. "Auto-Encoding Variational Bayes." arXiv:1312.6114 [stat.ML].