Digging Deeper Into Flow-based Generative Models
Contents
Summary
In this week’s meeting, we discussed free-form Jacobian of reversible dynamics (FFJORD)[1], and the closely related neural ordinary differential equation (neural ODE)[2]. In this blog post, we summarize the main points of these two papers.
Overview
Recall that the essential idea of flow-based generative models is to model a complicated target distribution as the result of applying a reversible differentiable transformation to some simple base distribution. The base distribution should be easy to sample from, so that we can apply the differentiable transformation and get a sample from the target distribution. We should be able to easily evaluate the likelihood of the base distribution and the determinant of the Jacobian of the differentiable transformation, so that we can apply the change-of-variable formula and get the exact likelihood for the target distribution.
Traditional flow-based generative models place awkward restrictions on the reversible differentiable transformations, so as to make the evaluation of the determinant of the Jacobian efficient. From the perspective of generative models, the key contribution of neural ODE and FFJORD is the introduction of Continuous Normalizing Flows (CNF), which is a continuous analog of normalizing flows. It provides us with a richer family of generative models while maintaining computational efficiency.
In what follows, we are going to present some background materials on neural ODE, to set the stage for our discussion. Then we are going to talk about CNF and the trick (Hutchinson’s trace estimator) used in FFJORD for the efficient evaluation of exact likelihood.
Neural ODE
Setup
The essential idea of neural ODE is to specify ordinary differentiable equations (ODEs) whose vector fields are parametrized by neural networks, and define complicated differentiable transformations as solutions to initial value problems involving such ODEs. Concretely, the ODE is specified as
To make such differentiable transformations useful, given some scalar-valued function $L$, an important problem is to calculate the gradients of $L (T_{t_0, t_1} (z_0)) = L (z (t_1))$ w.r.t. $z_0, t_0$ and $\theta$.
A Naive Way to Calculate Gradients
A naive way to calculate such gradients is to make use of the operations of the ODE solver: if we think of the operations of the ODE solver as defining a neural network that’s adapted to the initial condition and the required accuracy, we can simply record these operations and use back-propagation to calculate the gradients, much like what we do for traditional neural networks.
The problem with this approach is that we need to store all the intermediate operations of the ODE solver, incurring high memory costs when we are dealing with complicated models.
The Adjoint Method: Continuous Analog of Back-propagation
The key contribution of [2] is to introduce the adjoint method as an alternative way to compute these gradients, without the need to access intermediate operations of an ODE solver. The adjoint method is a continuous analog of the back-propagation algorithm.
To simplify the presentation, assume we are working with the ODE \[ \frac{\mathrm{d} h (t)}{\mathrm{d} t} = g (h (t)) \] and we have some scalar-valued function $L$. Given some $t_1 > 0$, the key insight of the adjoint method is that the evolution of the so-called adjoint $a (t) = \frac{\partial L (h (t_1))}{\partial h (t)}$ (a column vector in our presentation) can be described by another ODE \[ \frac{\mathrm{d} a (t)}{\mathrm{d} t} = - \left[ \frac{\partial g (h (t))}{\partial h (t)} \right]^T a (t) \] where $\frac{\partial g (h (t))}{\partial h (t)}$ is the Jacobian matrix. Note that the adjoint method is necessary because for $t \neq t_1$, \[ a (t) = \frac{\partial L (h (t_1))}{\partial h (t)} = \left[ \frac{\partial h (t_1)}{\partial h (t)} \right]^T \frac{\partial L (h (t_1))}{\partial h (t_1)} \] We can readily evaluate $\frac{\partial L (h (t_1))}{\partial h (t_1)}$, but the Jacobian matrix $\frac{\partial h (t_1)}{\partial h (t)}$ is nontrivial to calculate.
Using the adjoint method, for some given $h_1$ at $t_1$, we can evaluate $a (t_0) = \frac{\partial L (h (t_1))}{\partial h (t_0)}$ by applying an ODE solver to the initial value problem \begin{align} \frac{\mathrm{d}}{\mathrm{d} t} \begin{bmatrix} h (t)\\ a (t) \end{bmatrix} & = \begin{bmatrix} g (h(t))\\ - \left[ \frac{\partial g (h (t))}{\partial h (t)} \right]^T a (t) \end{bmatrix}\\ \begin{bmatrix} h (t_1)\\ a (t_1) \end{bmatrix} & = \begin{bmatrix} h_1\\ \frac{\partial L (h_1)}{\partial h_1} \end{bmatrix} \end{align}
Applying the Adjoint Method to Neural ODE
To apply the adjoint method to calculate the gradients in neural ODE, the key is to note that we can regard $t$ and $\theta$ as also part of the states, and obtain an augmented ODE \[ \frac{\mathrm{d}}{\mathrm{d} t} \left[\begin{array}{c} z (t)\\ t\\ \theta \end{array}\right] = \left[\begin{array}{c} f (z (t), t, \theta)\\ 1\\ 0 \end{array}\right] \] For a scalar-valued function $L$, applying the adjoint method, the ODE for $\frac{\partial L (z (t_1))}{\partial z}, \frac{\partial L (z (t_1))}{\partial t}, \frac{\partial L (z (t_1))}{\partial \theta}$ is given by \begin{align} \frac{\mathrm{d}}{\mathrm{d} t} \left[\begin{array}{c} \frac{\partial L (z (t_1))}{\partial z (t)}\\ \frac{\partial L (z (t_1))}{\partial t}\\ \frac{\partial L (z (t_1))}{\partial \theta} \end{array}\right] & = - \left[\begin{array}{ccc} \frac{\partial f (z (t), t, \theta)}{\partial z (t)} & \frac{\partial f (z (t), t, \theta)}{\partial t} & \frac{\partial f (z (t), t, \theta)}{\partial \theta}\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{array}\right]^T \left[\begin{array}{c} \frac{\partial L (z (t_1))}{\partial z (t)}\\ \frac{\partial L (z (t_1))}{\partial t}\\ \frac{\partial L (z (t_1))}{\partial \theta} \end{array}\right]\\ & = - \left[\begin{array}{c} \left[ \frac{\partial f (z (t), t, \theta)}{\partial z (t)} \right]^T \frac{\partial L (z (t_1))}{\partial z (t)}\\ \left[ \frac{\partial f (z (t), t, \theta)}{\partial t} \right]^T \frac{\partial L (z (t_1))}{\partial z (t)}\\ \left[ \frac{\partial f (z (t), t, \theta)}{\partial \theta} \right]^T \frac{\partial L (z (t_1))}{\partial z (t)} \end{array}\right] \end{align} Given some $z_1$ and $t_1$, we can evaluate $\frac{\partial L (z (t_1))}{\partial z (t_0)}, \frac{\partial L (z (t_1))}{\partial t}, \frac{\partial L (z (t_1))}{\partial \theta}$ by applying the adjoint method, solving the initial value problem \begin{align} \frac{\mathrm{d}}{\mathrm{d} t} \left[\begin{array}{c} z (t)\\ \frac{\partial L (z (t_1))}{\partial z (t)}\\ \frac{\partial L (z (t_1))}{\partial t}\\ \frac{\partial L (z (t_1))}{\partial \theta} \end{array}\right] & = \left[\begin{array}{c} f (z (t), t, \theta)\\ - \left[ \frac{\partial f (z (t), t, \theta)}{\partial z (t)} \right]^T \frac{\partial L (z (t_1))}{\partial z (t)}\\ - \left[ \frac{\partial f (z (t), t, \theta)}{\partial t} \right]^T \frac{\partial L (z (t_1))}{\partial z (t)}\\ - \left[ \frac{\partial f (z (t), t, \theta)}{\partial \theta} \right]^T \frac{\partial L (z (t_1))}{\partial z (t)} \end{array}\right]\\ \left[\begin{array}{c} z (t_1)\\ \frac{\partial L (z (t_1))}{\partial z (t_1)}\\ \frac{\partial L (z (t_1))}{\partial t_1}\\ \frac{\partial L (z (t_1))}{\partial \theta} \end{array}\right] & = \left[\begin{array}{c} z_1\\ \frac{\partial L (z_1)}{\partial z_1}\\ \left[ \frac{\partial L (z_1)}{\partial z_1} \right]^T f (z (t_1), t_1, \theta)\\ 0 \end{array}\right] \end{align}
FFJORD
Continuous Normalizing Flows (CNF) as a Generative Model
The essential idea of CNF is to use a neural ODE as the reversible differentiable transformation. Concretely, the sampling process is done by first sampling $z_0$ from some simple base distribution, and then solve the initial value problem \begin{align} \frac{\mathrm{d} z (t)}{\mathrm{d} t} & = f (z (t), t, \theta)\\ z (t_0) & = z_0 \end{align} and use $z (t_1)$ for some given $t_1$ as the sample from the CNF.
Instantaneous Change-of-variable Formula
An important aspect of flow-based generative models is the ability to evaluate the exact likelihood. For CNF, this is made possible by the so-called instantaneous change-of-variable formula.
To understand this formula, assume we have a system whose dynamics is described by the ODE \[ \frac{\mathrm{d} z (t)}{\mathrm{d} t} = f (z (t), t, \theta) \] If we start with some probability distribution $p_0 (z)$ at some initial time $t_0$, using $p (z, t)$ to denote the probability that we find the system in state $z$ at time $t$, the evolution of $p (z, t)$ is described by the Liouville equation \[ \frac{\partial p (z, t)}{\partial t} = -\text{tr} \left( \frac{\partial}{\partial z} [f (z, t, \theta) p (z, t)] \right) \] If we can solve this PDE with initial condition $p (z, 0) = p_0 (z)$, we would also be able to evaluate the exact likelihood. But in practice, numerical solution of this PDE won’t be practical for complicated systems.
The key insight of the so-called instantaneous change-of-variable formula is that, if, instead of trying to directly solve for $p (z, t)$, we look at a simpler function $q (t) = p (z (t), t)$, where the dynamics of $z (t)$ is given by \[ \frac{\mathrm{d} z (t)}{\mathrm{d} t} = f (z (t), t, \theta) \] we can avoid solving any PDEs and instead need to solve only an ODE. Concretely, following the derivations in Appendix A.2 of [2], we can see that \begin{align} & \frac{\mathrm{d} q (t)}{\mathrm{d} t}\\ = & \left[ \frac{\partial p (z (t), t)}{\partial z (t)} \right]^T \frac{\partial z (t)}{\partial t} + \frac{\partial p (z (t), t)}{\partial t}\\ = & \left[ \frac{\partial p (z (t), t)}{\partial z (t)} \right]^T \frac{\partial z (t)}{\partial t} -\text{tr} \left( \frac{\partial}{\partial z (t)} [f (z (t), t, \theta) p (z (t), t)] \right) (\text{Liouville})\\ = & \left[ \frac{\partial p (z (t), t)}{\partial z (t)} \right]^T \frac{\partial z (t)}{\partial t} -\text{tr} \left( \frac{\partial f (z (t), t, \theta)}{\partial z (t)} \right) p (z (t), t) - \left[ \frac{\partial p (z (t), t)}{\partial z (t)} \right]^T f (z (t), t, \theta)\\ = & -\text{tr} \left( \frac{\partial f (z (t), t, \theta)}{\partial z (t)} \right) q (t) \end{align} This implies that \[ \frac{\mathrm{d} \log q (t)}{\mathrm{d} t} = -\text{tr} \left( \frac{\partial f (z (t), t, \theta)}{\partial z (t)} \right) \] which is the so-called instantaneous change-of-variable formula.
Using this formula, to solve for $q (t)$, we can simply look at the ODE \[ \frac{\mathrm{d}}{\mathrm{d} t} \left[\begin{array}{c} z (t)\\ \log q (t) \end{array}\right] = \frac{\mathrm{d}}{\mathrm{d} t} \left[\begin{array}{c} z (t)\\ \log p (z (t), t) \end{array}\right] = \left[\begin{array}{c} f (z (t), t, \theta)\\ -\text{tr} \left( \frac{\partial f (z (t), t, \theta)}{\partial z (t)} \right) \end{array}\right] \]
Likelihood Evaluation and Maximum Likelihood Estimation
Coming back to the setting of a generative model, assume our base distribution is given by $p_0 (z)$, and our reversible differentiable transformation is given by the neural ODE \[ \frac{\mathrm{d} z (t)}{\mathrm{d} t} = f (z (t), t, \theta) \] with start time $t_0$ and end time $t_1$. In order to make this generative model useful, we need to be able to
- Evaluate the exact log-likelihood for a given sample $z_1$
- Calculate the gradient of the log-likelihood w.r.t. the parameter $\theta$ for maximum likelihood estimation using gradient-based methods
The exact log-likelihood function at time $t_1$ is given by $\log p (z, t_1)$. Define \[ \Delta l (t) = \log p (z (t_1), t_1) - \log p (z (t), t) \] Using the instantaneous change-of-variable formula, it’s easy to see that \[ \frac{\mathrm{d}}{\mathrm{d} t} \left[\begin{array}{c} z (t)\\ \Delta l (t) \end{array}\right] = \left[\begin{array}{c} f (z (t), t, \theta)\\ \text{tr} \left( \frac{\partial f (z (t), t, \theta)}{\partial z (t)} \right) \end{array}\right] \] Solving this ODE with initial condition \[ \left[\begin{array}{c} z (t_1)\\ \Delta l (t_1) \end{array}\right] = \left[\begin{array}{c} z_1\\ 0 \end{array}\right] \] gives us $z (t_0)$ and $\Delta l (t_0)$. Note that $\log p (z (t_1), t_1)$ is equal to a scalar-valued function $L$ operating on $\left[\begin{array}{c} z (t)\\ \Delta l (t) \end{array}\right]$ at time $t$, where \[ L \left( \left[\begin{array}{c} z (t)\\ \Delta l (t) \end{array}\right] \right) = \log p (z (t), t) + \Delta l (t) \] In particular, since we know $p (z, t_0) = p_0 (z)$, we can evaluate $\log p (z (t_1), t_1)$ as \[ \log p (z (t_1), t_1) = L \left( \left[\begin{array}{c} z (t)\\ \Delta l (t) \end{array}\right] \right) =\log p_0 (z (t_0)) + \Delta l (t_0) \] Furthermore, $L$ is just some scalar-valued function operating on the state of this ODE, so we can easily apply the adjoint method to calculate its gradients w.r.t. $z_1, t$ and, in particular, $\theta$. Being able to calculate the gradients w.r.t. $\theta \,$ allows us to use gradient-based methods for maximum likelihood estimation.
Trace Calculation and Hutchinson’s Trace Estimator
In order to carry out the exact log-likelihood evaluation and maximum likelihood estimation, we need to be able to evaluate the trace $\text{tr} \left( \frac{\partial f (z, t, \theta)}{\partial z} \right)$. In the CNF presented in [2], this is done in a naive way using automatic differentiation. The problem with this naive approach is that it’s not very efficient: in order to calculate the trace, for a $D$ dimensional system, we need to do $D$ forward/backward passes, using $f_i (z, t, \theta), i = 1, \ldots, D$ as the loss function in each pass.
The key contribution of FFJORD is the use of the so-called Hutchinson’s trace estimator to make the trace evaluation more efficient. The Hutchinson’s trace estimator comes from the simple observation that, for any $D \times D$ square matrix $A$, if we have a $D$-dimensional random vector $\varepsilon$ which satisfies $\mathbb{E} (\varepsilon) = 0$ and $\text{Cov} (\varepsilon) = I$, then \[ \mathbb{E} (\varepsilon^T A \varepsilon) =\mathbb{E} \left[ \sum_{i, j = 1}^D A_{ij} \varepsilon_i \varepsilon_j \right] = \sum_{i, j = 1}^D A_{ij} \mathbb{E} (\varepsilon_i \varepsilon_j) =\text{trace} (A) \] This means we can estimate $\text{trace} (A)$ using a Monte Carlo estimator $\frac{1}{N} \sum_{i = 1}^N \varepsilon^{(i) T} A \varepsilon^{(i)}$, where $\varepsilon^{(1)}, \ldots, \varepsilon^{(N)}$ are i.i.d. samples of the random vector $\varepsilon$.
The efficiency gain comes from the fact that vector-Jacobian product $v^T \frac{\partial f (z, t, \theta)}{\partial z}$ is much faster to calculate: instead of $D$ forward/backward passes, we can use only one pass with $v^T f (z, t, \theta)$ as the loss function to get the vector-Jacobian product, and use a subsequent vector dot product to evaluate $v^T \frac{\partial f (z, t, \theta)}{\partial z} v$, for any $D$-dimensional vector $v$. This makes FFJORD much more efficient than vanilla CNF when $D$ is large.
References
[1] Grathwohl, Will, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. 2018. “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1810.01367. link
[2] Chen, Ricky T. Q., Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. 2018. “Neural Ordinary Differential Equations.” arXiv [cs.LG]. arXiv. link
Author Guangyao Zhou
LastMod 2019-04-04
Comments powered by Talkyard.