Generative Adversarial Nets
Problems and Advances
Jian-Zhou Zhang
College of Computer
Sichuan University
China
Generative Model Discussion Group
Semester Program on Computer Vision
ICERM
April 11, 2019
Outline
▶ References
▶ Generative Adversarial Nets (GAN)
  Model
  Theoretical Results
  Algorithm and Convergence
  Problems
▶ Wasserstein GAN (WGAN) from an Optimal Transport Point of View
  GAN Problem
  Wasserstein GAN (WGAN)
  Some results
  Computation
  Remarks
▶ Wasserstein GAN from a Zero-Sum Game Point of View
  Model
  Optimistic Mirror Descent (OMD)
  Theoretical Results
  Experiment findings
  Remarks
  Future research
References
▶ Ian J. Goodfellow et al. Generative adversarial nets. NIPS, 2014, pp. 2672-2680. (arXiv: 1406.2661)
▶ He Huang et al. An introduction to image synthesis with generative adversarial nets. 2018, arXiv: 1803.04469v2
▶ Ying Nian Wu et al. A tale of three probabilistic families: discriminative, descriptive, and generative models. Quarterly of Applied Mathematics, 2019, 77(2): 423-465.
▶ Martin Arjovsky et al. Wasserstein GAN. 2017, arXiv: 1701.07875
▶ Aude Genevay et al. GAN and VAE from an optimal transport point of view. 2017, arXiv: 1706.01807
▶ Na Lei et al. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 2019, 68: 1-21.
▶ Marco Cuturi and Gabriel Peyré. Semidual regularized optimal transport. SIAM Review, 2018, 60(4): 941-965.
References
▶ Constantinos Daskalakis et al. Training GANs with optimism. 2018, arXiv: 1711.00141
▶ Thomas Unterthiner et al. Coulomb GANs: provably optimal Nash equilibria via potential fields. 2018, arXiv: 1708.08819v3
▶ Martin Heusel et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. 2018, arXiv: 1706.08500v6
▶ Frans A. Oliehoek et al. Beyond local Nash equilibria for adversarial networks. 2018, arXiv: 1806.07268v2
▶ Lei Wang. Generative models for physicists. http://wangleiphy.github.io/
Generative Adversarial Nets (GAN)
▶ Model
X denotes the data space in R^p (for example: image space). P(X) is the space of all probability distributions on X (e.g., the set of object images).
Assume the data distribution to be learned is ν ∈ P(X). In practice ν is replaced with its empirical distribution ν_n over samples {x_1, ..., x_n}, where ν_n(x) = (1/n) Σ_{i=1}^n δ(x − x_i), and δ is a Dirac function or a Gaussian Parzen window function.
A generator learns the data distribution ν. Its inputs are random noise variables z with distribution ζ, where z ∈ Z ⊂ R^m and ζ ∈ P(Z), and its outputs are g_θ(z) ∈ X. g_θ is realized by a deep neural network with parameters θ. Because g_θ is a continuous map, it induces a pushforward operator g_θ#, so that g_θ# ζ ∈ P(X) (for any set B ⊂ X, g_θ# ζ(B) = ζ({z : g_θ(z) ∈ B})).
A discriminator D(x; η) (x ∈ X) outputs a single scalar representing the probability that x came from the data rather than from g_θ. D(x; η) is realized by a deep neural network with parameters η.
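The pushforward g_θ# ζ can be illustrated by pushing noise samples through a fixed map and checking the empirical moments of the result. A minimal sketch (the affine generator here is our own toy example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise distribution zeta: standard Gaussian on Z = R
z = rng.normal(0.0, 1.0, 10000)

# A toy generator g_theta(z) = 2 z + 3; its pushforward g_theta# zeta is
# then N(3, 4), so the sample moments of g(z) should match mean 3, std 2.
g = lambda z: 2.0 * z + 3.0
samples = g(z)                        # draws from g_theta# zeta

print(samples.mean(), samples.std())  # close to 3 and 2
```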
Generative Adversarial Nets (GAN)
▶ Model
GAN is the following two-player minimax game with value function
V(θ, η) = E_{x∼ν}[log D(x; η)] + E_{z∼ζ}[log(1 − D(g_θ(z); η))],
min_θ max_η V(θ, η)   (1)
▶ Theoretical Results
The global minimum of Eq. (1) is achieved if and only if g_θ# ζ = ν.
At the global minimum of Eq. (1), the value is −log 4.
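A quick numeric check of the second result: at the global minimum the optimal discriminator is D* = 1/2 everywhere, so Eq. (1) evaluates to log(1/2) + log(1 − 1/2) = −log 4.

```python
import math

# At the global minimum g_theta# zeta = nu, the optimal discriminator is
# D*(x) = 1/2 everywhere, so the value of Eq. (1) is
# log(1/2) + log(1 - 1/2) = -2 log 2 = -log 4.
value = math.log(0.5) + math.log(1.0 - 0.5)
print(value)  # -1.3862943611198906, i.e. -log 4
```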
▶ Algorithm and Convergence
Minibatch stochastic gradient descent training of generative adversarial nets.
If the generator and discriminator have enough capacity, and at each step of the algorithm the discriminator is allowed to reach its optimum given the generator, and g_θ is updated so as to improve the criterion V(θ, η) = E_{x∼ν}[log D(x; η)] + E_{z∼ζ}[log(1 − D(g_θ(z); η))], then g_θ# ζ converges to ν.
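The alternating minibatch SGD scheme can be sketched on a toy problem. Everything below is an illustrative assumption of ours, not the slides' algorithm: a location-only generator g_θ(z) = z + θ, a logistic discriminator D(x; η) = σ(η₀x + η₁) with hand-derived gradients, and target ν = N(2, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

theta = 0.0              # generator g_theta(z) = z + theta, z ~ N(0, 1)
eta = np.zeros(2)        # discriminator D(x; eta) = sigmoid(eta[0]*x + eta[1])
lr, k, batch = 0.05, 5, 64

for step in range(2000):
    # k discriminator ascent steps on V = E[log D(x)] + E[log(1 - D(g(z)))]
    for _ in range(k):
        x = rng.normal(2.0, 1.0, batch)           # real samples, nu = N(2, 1)
        gz = rng.normal(0.0, 1.0, batch) + theta  # fake samples g_theta(z)
        dx = sigmoid(eta[0] * x + eta[1])
        dg = sigmoid(eta[0] * gz + eta[1])
        grad_slope = np.mean((1 - dx) * x) - np.mean(dg * gz)
        grad_bias = np.mean(1 - dx) - np.mean(dg)
        eta += lr * np.array([grad_slope, grad_bias])
    # one generator descent step on E[log(1 - D(g_theta(z)))]
    gz = rng.normal(0.0, 1.0, batch) + theta
    dg = sigmoid(eta[0] * gz + eta[1])
    theta -= lr * np.mean(-dg * eta[0])

print(theta)  # drifts toward the data mean 2.0
```

With the discriminator trained k = 5 times per generator step, θ should drift toward the real mean and the game should settle near D ≈ 1/2, matching the convergence statement above.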
Generative Adversarial Nets (GAN)
▶ Problems
To avoid "the Helvetica scenario", in which the generator collapses too many values of z to the same value of x for g_θ# ζ to have enough diversity to model ν, the discriminator must be trained more frequently than the generator.
If the discriminator learns to classify all samples accurately, the generator suffers vanishing gradients and cannot be trained in a stable manner.
The training process of GAN is not stable.
There is no explicit representation of g_θ# ζ.
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ GAN Problem
If the discriminator learns to classify all samples accurately, the generator suffers vanishing gradients and cannot be trained in a stable manner.
▶ Wasserstein GAN (WGAN)
Idea: rather than acting as a classifier, the discriminator instead tries to approximate a metric between the true distribution and the generative distribution in the distribution space P(X).
Wasserstein distance:
W_c(μ, ν) = min_{γ ∈ P(X×X)} ∫_{X×X} c(x, y) dγ(x, y)   (2)
where the cost function c(x, y): X × X → R, and the joint distribution γ(x, y) has first and second marginal distributions μ and ν.
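For discrete μ and ν, Eq. (2) is a finite linear program over the transport plan γ, and can be solved directly. A minimal sketch (function name and example data are ours) using SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(mu, nu, C):
    """Solve min_gamma <C, gamma> s.t. gamma has marginals mu and nu."""
    m, n = C.shape
    A_eq = []
    for i in range(m):                      # row sums of gamma equal mu
        row = np.zeros((m, n)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n):                      # column sums of gamma equal nu
        col = np.zeros((m, n)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([mu, nu])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Two uniform two-point distributions on the line, cost c(x, y) = |x - y|
x = np.array([0.0, 1.0]); mu = np.array([0.5, 0.5])
y = np.array([2.0, 3.0]); nu = np.array([0.5, 0.5])
C = np.abs(x[:, None] - y[None, :])
W = wasserstein_discrete(mu, nu, C)  # each half-unit of mass moves 2 units
```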
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ Wasserstein GAN (WGAN)
The discriminator uses the Wasserstein distance between distributions. The generator is the minimum Kantorovich estimator
min_θ E(θ) = W_c(g_θ# ζ, ν)   (3)
This interprets the GAN model through optimal transport theory, which concerns the problem of connecting two probability distributions via transportation at minimal cost.
Because (2) is a linear program, (3) has a dual formulation known as the Kantorovich problem:
E(θ) = max_{h, h̃} ∫_Z h(g_θ(z)) dζ(z) + ∫_X h̃(y) dν(y)   (4)
where h and h̃ are continuous functions on X, called Kantorovich potentials, satisfying h(x) + h̃(y) ≤ c(x, y).
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ Some results
The cost of any pair (h, h̃) can always be improved by replacing h̃ with the c-transform h^c of h, defined as
h^c(y) = min_x (c(x, y) − h(x))   (5)
Therefore (4) can be parameterized as depending on only one potential function.
For the L^1 transportation cost c(x, y) = |x − y| in R^p, if h is 1-Lipschitz (a deep neural network made of ReLU units whose Lipschitz constant is upper-bounded by 1), then h^c = −h.
For the L^2 transportation cost c(x, y) = ‖x − y‖²/2 in R^p,
h^c(y) = ‖y‖²/2 − sup_x (⟨x, y⟩ − ‖x‖²/2 + h(x))   (6)
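The L^1 case can be checked numerically: with c(x, y) = |x − y| and the convention h^c(y) = min_x (c(x, y) − h(x)), a 1-Lipschitz h satisfies h^c = −h. A sketch on a grid (the choice h(x) = |x| is ours):

```python
import numpy as np

# Grid on [-3, 3]; h(x) = |x| is 1-Lipschitz; cost c(x, y) = |x - y|
xs = np.linspace(-3.0, 3.0, 601)
h = np.abs(xs)

# Discrete c-transform: h^c(y) = min_x (c(x, y) - h(x))
C = np.abs(xs[:, None] - xs[None, :])   # C[i, j] = |x_i - y_j|
h_c = np.min(C - h[:, None], axis=0)    # minimize over the x index

print(np.max(np.abs(h_c + h)))  # ~0: the c-transform recovers -h
```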
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ Some results
For a transportation cost c(x, y) = h(x − y), where h is a strictly convex function, once the optimal discriminator is obtained, the generator can be written down in an explicit formula.
▶ Computation
Approach 1: stochastic gradient descent for large-scale optimal transport. Since ν is discrete, the continuous potential h̃ can be replaced by the discrete vector (h̃(y_j)), with h = (h̃)^c. The optimization over h̃ can then be carried out by stochastic gradient descent. (A. Genevay et al., arXiv: 1605.08527v1)
Approach 2: the dual potential h is restricted to a parametric form h_ξ: X → R, where h_ξ is represented by a discriminative deep neural network. (4) is then computed by
min_θ max_ξ ∫_Z h_ξ(g_θ(z)) dζ(z) + (1/n) Σ_{j=1}^n h_ξ^c(y_j)   (7)
(Martin Arjovsky et al., arXiv: 1701.07875)
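Approach 2 can be sketched in one dimension with the critic restricted to linear functions h_ξ(x) = ξx, so that the 1-Lipschitz constraint becomes |ξ| ≤ 1 (enforced by weight clipping, as in Arjovsky et al.) and h_ξ^c = −h_ξ. The generator is held fixed for clarity; all names and data below are our own toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(3.0, 1.0, 1000)   # samples from nu
fake = rng.normal(0.0, 1.0, 1000)   # samples from g_theta# zeta (fixed)

# Linear critic h_xi(x) = xi * x; clipping xi to [-1, 1] enforces the
# 1-Lipschitz constraint, so the inner problem maximizes
# xi * (E[real] - E[fake]) over |xi| <= 1.
xi, lr = 0.0, 0.05
for _ in range(500):
    grad = real.mean() - fake.mean()            # d/dxi of the dual objective
    xi = float(np.clip(xi + lr * grad, -1.0, 1.0))

w1_estimate = xi * (real.mean() - fake.mean())  # ~ |E[real] - E[fake]| ~ 3
```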
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ Computation
Approach 3: semidual regularized optimal transport via entropic regularization. (SIAM Review, 2018, 60(4): 941-965)
Approach 4: the optimal transport problem can be solved via convex geometry. (Computer Aided Geometric Design, 2019, 68: 1-21)
▶ Remarks
Remark 1: the goal of generative modeling is to capture the probability distribution of high-dimensional data and to generate new samples according to the learned distribution.
Remark 2: the optimal transport problem fixes both end probability distributions and aims at minimizing the transportation cost.
Wasserstein GAN (WGAN)
from a Zero-Sum Game Point of View
▶ Model
(previously) For the L^1 transportation cost c(x, y) = |x − y| in R^p, if h is 1-Lipschitz (a deep neural network made of ReLU units whose Lipschitz constant is upper-bounded by 1), then h^c = −h.
WGAN is then the following two-player minimax game with value function
L(θ, η) = E_{x∼ν}[D(x; η)] − E_{z∼ζ}[D(g_θ(z); η)],
min_θ max_η L(θ, η)   (8)
Examples show that gradient descent (GD) dynamics always lead to a limit cycle around the equilibrium, irrespective of the step size or other modifications (for example gradient penalty, momentum, Nesterov momentum).
Wasserstein GAN (WGAN)
from a Zero-Sum Game Point of View
▶ Optimistic Mirror Descent (OMD) (arXiv: 1711.00141)
The update rule for OMD given (8):
η_{t+1} = η_t + 2α ∇̂_{η,t} − α ∇̂_{η,t−1}
θ_{t+1} = θ_t − 2α ∇̂_{θ,t} + α ∇̂_{θ,t−1}   (9)
where α is the learning rate (step size), and at step t, for the learned empirical distribution ν_{n_t} over samples {x_1^t, ..., x_{n_t}^t} and the random-noise empirical distribution ζ_{n_t} over samples {z_1^t, ..., z_{n_t}^t}, the gradient estimates on a batch with index set B_t = {1, ..., n_t} are
∇̂_{η,t} = (1/|B_t|) Σ_{i∈B_t} (∇_η D(x_i; η_t) − ∇_η D(g_{θ_t}(z_i); η_t))
∇̂_{θ,t} = (1/|B_t|) Σ_{i∈B_t} ∇_θ D(g_{θ_t}(z_i); η_t)   (10)
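The contrast between plain gradient descent-ascent (which spirals away on bilinear games) and the optimistic update (9) can be seen on the simplest zero-sum game min_x max_y xy. A self-contained sketch with illustrative step sizes:

```python
def gda(x, y, lr, steps):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y."""
    for _ in range(steps):
        gx, gy = y, x                    # grad_x f = y, grad_y f = x
        x, y = x - lr * gx, y + lr * gy
    return x, y

def omd(x, y, lr, steps):
    """Optimistic update (9): step with 2 * current gradient minus previous."""
    gx_prev, gy_prev = 0.0, 0.0
    for _ in range(steps):
        gx, gy = y, x
        x = x - lr * (2.0 * gx - gx_prev)
        y = y + lr * (2.0 * gy - gy_prev)
        gx_prev, gy_prev = gx, gy
    return x, y

xg, yg = gda(1.0, 1.0, 0.1, 2000)  # GD spirals away from the equilibrium (0, 0)
xo, yo = omd(1.0, 1.0, 0.1, 2000)  # OMD converges toward (0, 0)
```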
Wasserstein GAN (WGAN)
from a Zero-Sum Game Point of View
▶ Theoretical Results
For a large class of zero-sum games, for example bilinear games
min_x max_y x^T A y,   (11)
and
min_x max_y (x^T A y + b^T x + c^T y + d),   (12)
OMD actually converges to an equilibrium under suitable initialization conditions.
▶ Experiment findings
OMD with a 1:1 generator-discriminator training ratio yields better KL divergence than the alternative training scheme (1:5 ratio).
▶ Remarks
GAN convergence points are local Nash equilibria, which may be arbitrarily far from an actual Nash equilibrium. At these local Nash equilibria, neither the discriminator nor the generator can locally improve its objective.
Wasserstein GAN (WGAN)
from a Zero-Sum Game Point of View
▶ Remarks
Coulomb GAN uses a potential field created by point charges, analogous to the electric field in physics. It has only one Nash equilibrium, which is optimal, i.e., the model distribution matches the target distribution. (arXiv: 1708.08819v3)
GANs converge to a local Nash equilibrium when trained by a two time-scale update rule, i.e., when the discriminator and generator have separate learning rates. (arXiv: 1706.08500v6)
In the space of mixed strategies, viewing GAN as a finite game ensures that a local Nash equilibrium is a global Nash equilibrium. (arXiv: 1806.07268v2)
▶ Future research
Because the learned distribution is only partially known through its empirical distribution, algorithms for imperfect-information games could be used to approximate the Nash equilibrium of GAN viewed as a two-player zero-sum game.