Generative Adversarial Nets
Problems and Advances
Jian-Zhou Zhang
College of Computer
Sichuan University
China
Generative Model Discussion Group
Semester Program on Computer Vision
ICERM
April 11, 2019
Outline
▶ References
▶ Generative Adversarial Nets (GAN)
  Model
  Theoretical Results
  Algorithm and Convergence
  Problems
▶ Wasserstein GAN (WGAN) from an Optimal Transport Point of View
  GAN Problem
  Wasserstein GAN (WGAN)
  Some results
  Computation
  Remarks
▶ Wasserstein GAN from a Zero-Sum Game Point of View
  Model
  Optimistic Mirror Descent (OMD)
  Theoretical Results
  Experiment findings
  Remarks
  Future research
References
▶ Ian J. Goodfellow et al. Generative adversarial nets. NIPS, 2014, pp. 2672-2680. (arXiv: 1406.2661)
▶ He Huang et al. An introduction to image synthesis with generative adversarial nets. 2018, arXiv: 1803.04469v2
▶ Ying Nian Wu et al. A tale of three probabilistic families: discriminative, descriptive, and generative models. Quarterly of Applied Mathematics, 2019, 77(2): 423-465.
▶ Martin Arjovsky et al. Wasserstein GAN. 2017, arXiv: 1701.07875
▶ Aude Genevay et al. GAN and VAE from an optimal transport point of view. 2017, arXiv: 1706.01807
▶ Na Lei et al. A geometric view of optimal transportation and generative model. Computer Aided Geometric Design, 2019, 68: 1-21.
▶ Marco Cuturi and Gabriel Peyré. Semidual regularized optimal transport. SIAM Review, 2018, 60(4): 941-965.
References
▶ Constantinos Daskalakis et al. Training GANs with optimism. 2018, arXiv: 1711.00141
▶ Thomas Unterthiner et al. Coulomb GANs: provably optimal Nash equilibria via potential fields. 2018, arXiv: 1708.08819v3
▶ Martin Heusel et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. 2018, arXiv: 1706.08500v6
▶ Frans A. Oliehoek et al. Beyond local Nash equilibria for adversarial networks. 2018, arXiv: 1806.07268v2
▶ Lei Wang. Generative models for physicists. http://wangleiphy.github.io/
Generative Adversarial Nets (GAN)
▶ Model
X denotes the data space in R^p (for example: image space). P(X) is the space of all probability distributions on X (e.g., the set of object images).
Assume the data distribution to be learned is ν ∈ P(X). In practice ν is replaced with its empirical distribution ν_n over samples {x_1, ..., x_n}, where ν_n(x) = (1/n) Σ_{i=1}^n δ(x − x_i), and δ is a Dirac function or a Gaussian Parzen window function.
A generator learns the data distribution ν. Its inputs are random noise variables z with distribution ζ, where z ∈ Z ⊂ R^m and ζ ∈ P(Z), and its outputs are g_θ(z) ∈ X. g_θ is realized by a deep neural network with parameters θ. Because g_θ is a continuous map, it induces a pushforward operator g_θ#, so that g_θ# ζ ∈ P(X) (for any set B ⊂ X, g_θ# ζ(B) = ζ({z : g_θ(z) ∈ B})).
A discriminator D(x; η) (x ∈ X) outputs a single scalar representing the probability that x came from the data rather than from g_θ. D(x; η) is realized by a deep neural network with parameters η.
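The pushforward g_θ# ζ can be illustrated by pushing noise samples through a fixed map and checking the empirical moments of the result. A minimal sketch (the affine generator here is our own toy example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise distribution zeta: standard Gaussian on Z = R
z = rng.normal(0.0, 1.0, 10000)

# A toy generator g_theta(z) = 2 z + 3; its pushforward g_theta# zeta is
# then N(3, 4), so the sample moments of g(z) should match mean 3, std 2.
g = lambda z: 2.0 * z + 3.0
samples = g(z)                        # draws from g_theta# zeta

print(samples.mean(), samples.std())  # close to 3 and 2
```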
Generative Adversarial Nets (GAN)
▶ Model
GAN is the following two-player minimax game with value function
V(θ, η) = E_{x∼ν}[log D(x; η)] + E_{z∼ζ}[log(1 − D(g_θ(z); η))],
min_θ max_η V(θ, η)   (1)
▶ Theoretical Results
The global minimum of Eq. (1) is achieved if and only if g_θ# ζ = ν.
At the global minimum of Eq. (1), the value is −log 4.
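A quick numeric check of the second result: at the global minimum the optimal discriminator is D* = 1/2 everywhere, so Eq. (1) evaluates to log(1/2) + log(1 − 1/2) = −log 4.

```python
import math

# At the global minimum g_theta# zeta = nu, the optimal discriminator is
# D*(x) = 1/2 everywhere, so the value of Eq. (1) is
# log(1/2) + log(1 - 1/2) = -2 log 2 = -log 4.
value = math.log(0.5) + math.log(1.0 - 0.5)
print(value)  # -1.3862943611198906, i.e. -log 4
```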
▶ Algorithm and Convergence
Minibatch stochastic gradient descent training of generative adversarial nets.
If the generator and discriminator have enough capacity, and at each step of the algorithm the discriminator is allowed to reach its optimum given the generator, and g_θ is updated so as to improve the criterion V(θ, η) = E_{x∼ν}[log D(x; η)] + E_{z∼ζ}[log(1 − D(g_θ(z); η))], then g_θ# ζ converges to ν.
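The alternating minibatch SGD scheme can be sketched on a toy problem. Everything below is an illustrative assumption of ours, not the slides' algorithm: a location-only generator g_θ(z) = z + θ, a logistic discriminator D(x; η) = σ(η₀x + η₁) with hand-derived gradients, and target ν = N(2, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

theta = 0.0              # generator g_theta(z) = z + theta, z ~ N(0, 1)
eta = np.zeros(2)        # discriminator D(x; eta) = sigmoid(eta[0]*x + eta[1])
lr, k, batch = 0.05, 5, 64

for step in range(2000):
    # k discriminator ascent steps on V = E[log D(x)] + E[log(1 - D(g(z)))]
    for _ in range(k):
        x = rng.normal(2.0, 1.0, batch)           # real samples, nu = N(2, 1)
        gz = rng.normal(0.0, 1.0, batch) + theta  # fake samples g_theta(z)
        dx = sigmoid(eta[0] * x + eta[1])
        dg = sigmoid(eta[0] * gz + eta[1])
        grad_slope = np.mean((1 - dx) * x) - np.mean(dg * gz)
        grad_bias = np.mean(1 - dx) - np.mean(dg)
        eta += lr * np.array([grad_slope, grad_bias])
    # one generator descent step on E[log(1 - D(g_theta(z)))]
    gz = rng.normal(0.0, 1.0, batch) + theta
    dg = sigmoid(eta[0] * gz + eta[1])
    theta -= lr * np.mean(-dg * eta[0])

print(theta)  # drifts toward the data mean 2.0
```

With the discriminator trained k = 5 times per generator step, θ should drift toward the real mean and the game should settle near D ≈ 1/2, matching the convergence statement above.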
Generative Adversarial Nets (GAN)
▶ Problems
To avoid "the Helvetica scenario", in which the generator collapses too many values of z to the same value of x for g_θ# ζ to have enough diversity to model ν, the discriminator must be trained more frequently than the generator.
If the discriminator learns to classify all samples accurately, the generator suffers vanishing gradients and cannot be trained in a stable manner.
The training process of GAN is not stable.
There is no explicit representation of g_θ# ζ.
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ GAN Problem
If the discriminator learns to classify all samples accurately, the generator suffers vanishing gradients and cannot be trained in a stable manner.
▶ Wasserstein GAN (WGAN)
Idea: rather than acting as a classifier, the discriminator instead tries to approximate a metric between the true distribution and the generative distribution in the distribution space P(X).
Wasserstein distance:
W_c(μ, ν) = min_{γ ∈ P(X×X)} ∫_{X×X} c(x, y) dγ(x, y)   (2)
where the cost function c(x, y): X × X → R, and the joint distribution γ(x, y) has first and second marginal distributions μ and ν.
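For discrete μ and ν, Eq. (2) is a finite linear program over the transport plan γ, and can be solved directly. A minimal sketch (function name and example data are ours) using SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(mu, nu, C):
    """Solve min_gamma <C, gamma> s.t. gamma has marginals mu and nu."""
    m, n = C.shape
    A_eq = []
    for i in range(m):                      # row sums of gamma equal mu
        row = np.zeros((m, n)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n):                      # column sums of gamma equal nu
        col = np.zeros((m, n)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([mu, nu])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun

# Two uniform two-point distributions on the line, cost c(x, y) = |x - y|
x = np.array([0.0, 1.0]); mu = np.array([0.5, 0.5])
y = np.array([2.0, 3.0]); nu = np.array([0.5, 0.5])
C = np.abs(x[:, None] - y[None, :])
W = wasserstein_discrete(mu, nu, C)  # each half-unit of mass moves 2 units
```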
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ Wasserstein GAN (WGAN)
The discriminator uses the Wasserstein distance between distributions. The generator is the minimum Kantorovich estimator
min_θ E(θ) = W_c(g_θ# ζ, ν)   (3)
This interprets the GAN model through optimal transport theory, which concerns the problem of connecting two probability distributions via transportation at minimal cost.
Because (2) is a linear program, (3) has a dual formulation known as the Kantorovich problem:
E(θ) = max_{h, h̃} ∫_Z h(g_θ(z)) dζ(z) + ∫_X h̃(y) dν(y)   (4)
where h and h̃ are continuous functions on X, called Kantorovich potentials, satisfying h(x) + h̃(y) ≤ c(x, y).
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ Some results
The cost of any pair (h, h̃) can always be improved by replacing h̃ with the c-transform h^c of h, defined as
h^c(y) = min_x (c(x, y) − h(x))   (5)
Therefore (4) can be parameterized as depending on only one potential function.
For the L^1 transportation cost c(x, y) = |x − y| in R^p, if h is 1-Lipschitz (a deep neural network made of ReLU units whose Lipschitz constant is upper-bounded by 1), then h^c = −h.
For the L^2 transportation cost c(x, y) = ‖x − y‖²/2 in R^p,
h^c(y) = ‖y‖²/2 − sup_x (⟨x, y⟩ − ‖x‖²/2 + h(x))   (6)
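The L^1 case can be checked numerically: with c(x, y) = |x − y| and the convention h^c(y) = min_x (c(x, y) − h(x)), a 1-Lipschitz h satisfies h^c = −h. A sketch on a grid (the choice h(x) = |x| is ours):

```python
import numpy as np

# Grid on [-3, 3]; h(x) = |x| is 1-Lipschitz; cost c(x, y) = |x - y|
xs = np.linspace(-3.0, 3.0, 601)
h = np.abs(xs)

# Discrete c-transform: h^c(y) = min_x (c(x, y) - h(x))
C = np.abs(xs[:, None] - xs[None, :])   # C[i, j] = |x_i - y_j|
h_c = np.min(C - h[:, None], axis=0)    # minimize over the x index

print(np.max(np.abs(h_c + h)))  # ~0: the c-transform recovers -h
```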
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ Some results
For a transportation cost c(x, y) = h(x − y), where h is a strictly convex function, once the optimal discriminator is obtained, the generator can be written down in an explicit formula.
▶ Computation
Approach 1: stochastic gradient descent for large-scale optimal transport. Since ν is discrete, the continuous potential h̃ can be replaced by the discrete vector (h̃(y_j)), with h = (h̃)^c. The optimization over h̃ can then be carried out by stochastic gradient descent. (A. Genevay et al., arXiv: 1605.08527v1)
Approach 2: the dual potential h is restricted to a parametric form h_ξ: X → R, where h_ξ is represented by a discriminative deep neural network. (4) is then computed by
min_θ max_ξ ∫_Z h_ξ(g_θ(z)) dζ(z) + (1/n) Σ_{j=1}^n h_ξ^c(y_j)   (7)
(Martin Arjovsky et al., arXiv: 1701.07875)
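Approach 2 can be sketched in one dimension with the critic restricted to linear functions h_ξ(x) = ξx, so that the 1-Lipschitz constraint becomes |ξ| ≤ 1 (enforced by weight clipping, as in Arjovsky et al.) and h_ξ^c = −h_ξ. The generator is held fixed for clarity; all names and data below are our own toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(3.0, 1.0, 1000)   # samples from nu
fake = rng.normal(0.0, 1.0, 1000)   # samples from g_theta# zeta (fixed)

# Linear critic h_xi(x) = xi * x; clipping xi to [-1, 1] enforces the
# 1-Lipschitz constraint, so the inner problem maximizes
# xi * (E[real] - E[fake]) over |xi| <= 1.
xi, lr = 0.0, 0.05
for _ in range(500):
    grad = real.mean() - fake.mean()            # d/dxi of the dual objective
    xi = float(np.clip(xi + lr * grad, -1.0, 1.0))

w1_estimate = xi * (real.mean() - fake.mean())  # ~ |E[real] - E[fake]| ~ 3
```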
Wasserstein GAN (WGAN)
from an Optimal Transport Point of View
▶ Computation
Approach 3: semidual regularized optimal transport via entropic regularization. (SIAM Review, 2018, 60(4): 941-965)
Approach 4: the optimal transport problem can be solved via convex geometry. (Computer Aided Geometric Design, 2019, 68: 1-21)
▶ Remarks
Remark 1: the goal of generative modeling is to capture the probability distribution of high-dimensional data and to generate new samples according to the learned distribution.
Remark 2: the optimal transport problem fixes both end probability distributions and aims at minimizing the transportation cost.
Wasserstein GAN (WGAN)
from a Zero-Sum Game Point of View
▶ Model
(previously) For the L^1 transportation cost c(x, y) = |x − y| in R^p, if h is 1-Lipschitz (a deep neural network made of ReLU units whose Lipschitz constant is upper-bounded by 1), then h^c = −h.
WGAN is then the following two-player minimax game with value function
L(θ, η) = E_{x∼ν}[D(x; η)] − E_{z∼ζ}[D(g_θ(z); η)],
min_θ max_η L(θ, η)   (8)
Examples show that gradient descent (GD) dynamics always lead to a limit cycle around the equilibrium, irrespective of the step size or other modifications (for example gradient penalty, momentum, Nesterov momentum).
Wasserstein GAN (WGAN)
from a Zero-Sum Game Point of View
▶ Optimistic Mirror Descent (OMD) (arXiv: 1711.00141)
The update rule for OMD given (8):
η_{t+1} = η_t + 2α ∇̂_{η,t} − α ∇̂_{η,t−1}
θ_{t+1} = θ_t − 2α ∇̂_{θ,t} + α ∇̂_{θ,t−1}   (9)
where α is the learning rate (step size), and at step t, for the learned empirical distribution ν_{n_t} over samples {x_1^t, ..., x_{n_t}^t} and the random-noise empirical distribution ζ_{n_t} over samples {z_1^t, ..., z_{n_t}^t}, the gradient estimates on a batch with index set B_t = {1, ..., n_t} are
∇̂_{η,t} = (1/|B_t|) Σ_{i∈B_t} (∇_η D(x_i; η_t) − ∇_η D(g_{θ_t}(z_i); η_t))
∇̂_{θ,t} = (1/|B_t|) Σ_{i∈B_t} ∇_θ D(g_{θ_t}(z_i); η_t)   (10)
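The contrast between plain gradient descent-ascent (which spirals away on bilinear games) and the optimistic update (9) can be seen on the simplest zero-sum game min_x max_y xy. A self-contained sketch with illustrative step sizes:

```python
def gda(x, y, lr, steps):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y."""
    for _ in range(steps):
        gx, gy = y, x                    # grad_x f = y, grad_y f = x
        x, y = x - lr * gx, y + lr * gy
    return x, y

def omd(x, y, lr, steps):
    """Optimistic update (9): step with 2 * current gradient minus previous."""
    gx_prev, gy_prev = 0.0, 0.0
    for _ in range(steps):
        gx, gy = y, x
        x = x - lr * (2.0 * gx - gx_prev)
        y = y + lr * (2.0 * gy - gy_prev)
        gx_prev, gy_prev = gx, gy
    return x, y

xg, yg = gda(1.0, 1.0, 0.1, 2000)  # GD spirals away from the equilibrium (0, 0)
xo, yo = omd(1.0, 1.0, 0.1, 2000)  # OMD converges toward (0, 0)
```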
Wasserstein GAN (WGAN)
from a Zero-Sum Game Point of View
▶ Theoretical Results
For a large class of zero-sum games, for example bilinear games
min_x max_y x^T A y,   (11)
and
min_x max_y (x^T A y + b^T x + c^T y + d),   (12)
OMD actually converges to an equilibrium under suitable initialization conditions.
▶ Experiment findings
OMD with a 1:1 generator-discriminator training ratio yields better KL divergence than the alternative training scheme (1:5 ratio).
▶ Remarks
GAN convergence points are local Nash equilibria, which may be arbitrarily far from an actual Nash equilibrium. At these local Nash equilibria, neither the discriminator nor the generator can locally improve its objective.
Wasserstein GAN (WGAN)
from a Zero-Sum Game Point of View
▶ Remarks
Coulomb GAN uses a potential field created by point charges, analogous to the electric field in physics. It has only one Nash equilibrium, which is optimal, i.e., the model distribution matches the target distribution. (arXiv: 1708.08819v3)
GANs converge to a local Nash equilibrium when trained by a two time-scale update rule, i.e., when the discriminator and generator have separate learning rates. (arXiv: 1706.08500v6)
In the space of mixed strategies, viewing GAN as a finite game ensures that a local Nash equilibrium is a global Nash equilibrium. (arXiv: 1806.07268v2)
▶ Future research
Because the learned distribution is only partially known through its empirical distribution, algorithms for imperfect-information games could be used to approximate the Nash equilibrium of GAN viewed as a two-player zero-sum game.