# Problems and Advances of Wasserstein GAN

## Contents

# Introduction

Since Generative Adversarial Nets(GAN)([1]) was proposed in 2014, there have been a lot of researches on and applications of GAN([2,3]). However the generative and discriminative models were studied before the GAN was proposed([4]). Some problems of GAN are summarized in [5].

The basic idea of classical GAN is to minimize the Kullback-Leibler(KL) divergence. However, it is possible that the KL divergence (or distance) can not be defined (or is simply infinite). In order to overcome the drawback, the Wasserstein distance is introduced to measure the distance between the generated distribution and the learned distribution([6]). This approach is referred to as WGAN.

WGAN is closely related to optimal transport problem proposed by G. Monge in 1781. The optimal transport problem has been researched mathematically in [7,8]. The relationship between WGAN and optimal transport is introduced in detail in [9,10,11].

The computation involved WGAN is a difficult problem. Some approaches are introduced in [11, 12, 13,14,15]. In [11], the optimal transport problem is solved by convex geometry. In [12,13], entropic regularization of the primal optimal transport problem results in a smooth dual optimization which can be solved by algorithms that have a provably faster convergence. In [14], WGAN is tackled by formulating a distributionally robust stochastic optimization problem which can be solved by stochastic gradient descent. In [15], a fast and scalable algorithm to project a given density on a set of structured measure defined over a compact 2D domain is proposed. In [16], WGAN is discussed as a zero-sum game between the generator and the discriminator, and bi-linear zero-sum games are solved by optimistic mirror descent.

In this blog, I will briefly introduce the basic concepts of WGAN and the computation involved.

$$ \min_{G} \ \max_{D}V(D,G)= \mathbb{E}_{x\sim p_{data}(x)}\big[ \log D(x) \big] + \mathbb{E}_{z\sim p_{z}(z)} \big[ \log (1- D(G(z))) \big] $$

By sampling from a uniform distribution, generator tries to capture the data distribution, whereas discriminator estimates whether a data sample comes from the real distribution or G. Generator is learned to maximize the probability of D making a mistake.

Since the first paper, GAN variants have shown promising results; however, they suffer from several problems, too. The balance of generator and discriminator is difficult, and there are many situations where gradients may diminish, or the model oscillate and never converges. Even in the success case, there is no guarantee that the latent space represents each node of the real data distribution (mode collapse). Furthermore, the training of GANs is highly sensitive to hyperparameter selection.

# Wasserstein GAN

## Some definitions

$\mathcal X$ denotes the data space in $R^{p}$ (for example: image space). $\mathcal P(\mathcal X)$ is the space of all probability distribution on $\mathcal X$ (object image set). Assume the learned data distribution is $\nu \in \mathcal P(\mathcal X)$. In fact $\nu$ is replaced with its empirical distribution $\nu_n$ over samples ${x_1, \dots, x_n}$， where $\nu_n(x)=\frac{1}{n}\sum_{i=1}^n \delta (x-x_i)$, and $\delta$ is a Dirac function or a Gaussian Parzen window function.

A **Generator** learns the data distribution $\nu$. Its inputs are random noise variables $z$ with distribution $\zeta$, where $z\in \mathcal Z \subseteq R^m$ and $\zeta \in \mathcal P(\mathcal Z)$, and its outputs are $g_\theta(z)\in \mathcal X$. $g_\theta$ is realized by a deep neural network with parameters $\theta$. Because $g_\theta$ is
a continuous map, $g_\theta$ induces a pushforward operator $g^{f}_{\theta}$ so that $g^{f}_{\theta}\zeta \in \mathcal P(\mathcal X)$ (for any set $B\subset \mathcal X$, $g^{f}_{\theta}\zeta(B)=\zeta({z;g(z)\in B})$ ).

A **Discriminator** $D(x;\eta)$ ($x\in \mathcal X$) outputs a single scalar that represents the probability that $x$ came from the data rather than $g_\theta$. $D(x;\eta)$ is realized by a deep neural network with parameters $\eta$.

## GAN

GAN is the following two-player minimax game with value function $$\min \limits_\theta \max \limits_\eta V(\theta,\eta)$$ where $$V(\theta,\eta)=\mbox{E}_{x\sim \nu} [\log D(x;\eta)]+\mbox{E}_{z\sim \zeta} [\log (1-D(g_\theta(z);\eta))](1)$$

- The global minimum of Eq.1 is achieved if and only if $g^{f}_ {\theta}\zeta = \nu $.([1])
- At the point of global minimum of Eq.1, its value is $-\log 4$.([1])

A key problem of GAN is ‘mode collapse’, where the generator does not produce points from certain modes, or for at least one mode the generator only partially covers the mode.

## Wasserstein distance

$$ W_c (\mu, \nu) = \min \limits_{\gamma \in \mathcal P(\mathcal X \times \mathcal X)} {\int_{\mathcal X \times \mathcal X}c(x, y)d\gamma(x,y)}\text{ }(2) $$ where $c(x,y)$: $\mathcal X \times \mathcal X \to R$ is the cost function , and for the joint distribution $\gamma(x,y)$ its first and second marginal distributions are $\mu$ and $\nu$.

## Wasserstein GAN

The discriminator uses Wasserstein distance between distributions.

The generator is given by the Minimum Kantorovitch Estimator $$ \min \limits_{\theta} E(\theta)=W_c (g^{f}_{\theta}\zeta, \nu)\text{ }(3) $$

The GAN model is interpreted by the optimal transport theory which concerns the problem of connecting two probability distributions vis transportation at a minimal cost.

Because (2) is a linear program, (3) has a dual formulation known as Kantorovich problem: $$ E(\theta)= \max \limits_{h,\tilde{h}} {\int_{\mathcal Z}h(g_{\theta}(z))d\zeta(z)+ \int_{\mathcal X}\tilde{h}(y)d\nu(y)}\text{ }(4) $$ where $(h,\tilde{h})$ are continuous functions on $\mathcal X$ called Kantorovich potentials satisfying $h(x)+\tilde{h}\leq c(x,y)$.

## Some results

- The cost of any pair $(h,\tilde{h})$ can always be improved by replacing $\tilde{h}$ by the $c$-transform $h^c$ of $h$ defined as $$ h^c(y)=\max \limits_{x} (c(x,y)-h(x)) $$ Therefore (4) can parameterize as depending on only one potential function.
- For $L^1$ transportation cost $c(x,y)=|x-y|$ in $R^p$, if $h$ is 1-Lipschitz (a deep neural network made of ReLU units whose Lipschitz constant is upper-bounded by 1), then $h^c=-h$.
- For $L^2$ transportation cost $c(x,y)=\frac{||x-y||^2}{2}$ in $R^p$, then $$ h^c(y)=\frac{||y||^2}{2}-\sup\limits_{x} \left(<x,y> - \left(\frac{||x||^2}{2}-h(x)\right)\right). $$
- If for the transportation cost $c(x,y)=h(x-y)$, where $h$ is a strictly convex function, then once the optimal discriminator is obtained, the generator can be written down in an explicit formula.

# Computation

**Approach 1** Stochastic gradient descent for large-scale optimal transport. Since $\nu$ is discrete, the continuous potential $\tilde{h}$ can be replaced by the discrete vector $(\tilde{h}(y_j))$, and $h=(\tilde{h})^c$. The optimization over $\tilde{h}$ can then be achieved using Stochastic gradient descent. ([12,13])

**Approach 2** The dual potential $h$ is restricted to have a parametric form $h_\xi$:
$\mathcal X\to R$, where $h_\xi$ is represented by a discriminative deep neural network. (4) is computed by (c.f. [6])
$$
\min \limits_{\theta} \max \limits_{\xi}
{\int_{\mathcal Z}h_\xi(g_{\theta}(z))d\zeta(z)+
\frac{1}{n}\sum_{j=1}^n{h_{\xi}^c(y_j)}}
$$

# Conclusion

WGAN is established on solid mathematics. Research of fast and practical algorithms computing Wasserstein distance is an important problem.

# References

[1] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” International Conference on Neural Information Processing Systems (NeurIPS), 2014.

[2] He Huang, Philip S. Yu, Changhu Wang, “An introduction to image synthesis with generative adversarial nets,” arXiv:1803.04469v2, 2018.

[3] Lei Wang. Generative models for physicists. http://wangleiphy.github.io/

[4] Ying Nian Wu, Ruiqi Gao, Tian Han, Song-Chun Zhu, A tale of three probabilistic families: discriminative, descriptive, and generative models, Quarterly of Applied Mathematics, 77(2): 423-465, 2019.

[5] Augustus Odena, Open questions about generative adversarial networks ,http://distill.pub/2019/gan-open-problems/

[6] Martin Arjovsky, Soumith Chintala, Léon Bottou, “Wasserstein Generative Adversarial Networks,” Proceedings of the 34th International Conference on Machine Learning, PMLR 70:214-223, 2017.

[7] L. Ambrosio, “Optimal transport maps in Monge-Kantorovich problem”, ICM2002, arXiv: math/0304389v1, 2003.

[8] Cédric Villani, “Optimal transport: old and new, v.338, Springer, 2008.

[9] Bruno Lévy, Erica L. Schwindt, “Notions of optimal transport theory and how to implement them on a computer”, Computers & Graphics, 72: 135-148, 2018.

[10] Aude Genevay, Gabriel Peyré, Marco Cuturi, “GAN and VAE from an optimal transport point of view,” arXiv: 1706.01807, 2017.

[11] Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, Xianfeng David Gu, “A geometric view of optimal transportation and generative model”, Computer Aided Geometric Design, 68: 1-21, 2019.

[12] Aude Genevay, Marco Cuturi, Gabriel Peyré, Francis Bach “Stochastic optimization for large-scale optimal transport,” arXiv: 1605.08527v1, 2016.

[13] Marco Cuturi, Gabriel Peyré, Semidual regularized optimal transport. SIAM Review, 60(4): 941-965, 2018.

[14] Rui Gao, Xi Chen, Anton J. Kleywegt, Distributional robustness and regularization in statistical learning, arXiv:1712.06050v2, 2017.

[15] Léo Lebrat, Frédéric de Gournay, Jonas Kahn, Pierre Weiss, “Optimal transport approximation of 2-dimensional measures”, SIAM J. Imaging Sciences, 12(2): 762-787, 2019.

[16] Constantinos Daskalakis，Andrew Ilyas, Vasilis Syrgkanis, Haoyang Zeng, “Training GAN with optimism”, arXiv: 1711.00141, 2018.

Author Jian-Zhou Zhang

LastMod 2019-04-11

Comments powered by Talkyard.