Probabilistic Programming
Daniel Ritchie
Computer Science
What Is a Probabilistic Program?
“Show, don’t tell”: WebPPL demo
WebPPL is a universal probabilistic language
It can express any computable probability distribution(!)
Non-Universal Languages
Universal Languages
What Probabilistic Programming Languages (PPLs) Are Out There Today?
Church
Venture
Why Were Universal PPLs Developed?
Church: developed for computational cognitive science
Complex hierarchical Bayesian models
Recursive inference: “I’m thinking that you’re thinking that…” (a fuller sketch follows after this slide)
var f = function() {
  Infer(g)
}
var g = function() {
  Infer(f)
}
Pragmatic language understanding
Multi-agent games
https://cocolab.stanford.edu/
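To make the recursive-inference idea concrete, here is a small sketch of bounded nested inference in WebPPL, in the spirit of the Schelling-coordination examples from this line of work. The scenario (two agents trying to meet), all names, and the Infer option names are this write-up's illustrative assumptions based on the WebPPL documentation, not material from the original slides.

// Two agents try to meet; each reasons about the other down to a fixed depth.
var locations = ['cafe', 'park']
var alice = function(depth) {
  return Infer({method: 'enumerate'}, function() {
    var myLoc = uniformDraw(locations)
    if (depth > 0) {
      condition(myLoc === sample(bob(depth - 1)))  // Alice conditions on meeting Bob
    }
    return myLoc
  })
}
var bob = function(depth) {
  return Infer({method: 'enumerate'}, function() {
    var myLoc = flip(0.6) ? 'cafe' : 'park'        // Bob mildly prefers the cafe
    if (depth > 0) {
      condition(myLoc === sample(alice(depth - 1)))
    }
    return myLoc
  })
}
alice(3)  // a distribution that increasingly favors 'cafe' as the depth grows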
Why Have They Become “Popular”?
Inference Engine: written by the language designer (i.e. an inference expert)
The Model (i.e. the program): written by the user (i.e. a domain expert)
Abstraction Barrier between the two
PPL design philosophy*: separate modeling from inference, so that domain experts can focus on the former and inference experts can focus on the latter (a small sketch follows below the footnote)
* See e.g. programmable inference in Venture
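To make the abstraction barrier concrete, here is a hedged WebPPL sketch: the domain expert writes only the model, and swapping inference engines is a change to the options passed to Infer. The coin model and the option names are this write-up's assumptions (based on the WebPPL documentation), not taken from the original slides.

// The "model" side of the barrier: a domain expert's program.
var coinModel = function() {
  var p = beta(1, 1)                      // unknown coin weight
  observe(Binomial({p: p, n: 10}), 7)     // observed 7 heads out of 10 flips
  return p
}
// The "inference" side: engines written by the language designer,
// selected via options rather than by changing the model.
Infer({method: 'MCMC', samples: 1000}, coinModel)
Infer({method: 'SMC', particles: 1000}, coinModel)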
WHAT ELSE CAN YOU USE PPLS FOR?
Remember: Any computable probability distribution
Bayesian Networks
http://forestdb.org/models/burglary.html
Linear Regression (a small sketch follows after this list)
https://github.com/probmods/webppl/blob/dev/examples/linearRegression.wppl
Hidden Markov Model
https://github.com/probmods/webppl/blob/dev/examples/hmm.wppl
Latent Dirichlet Allocation (LDA)
https://github.com/probmods/webppl/blob/dev/examples/lda.wppl
Grammars (PCFG)
https://github.com/probmods/webppl/blob/dev/examples/pcfg.wppl
Vision as Inverse Graphics
Kulkarni et al. 2015. Picture: A Probabilistic Programming Language for Scene Perception.
Procedural Graphics
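As a flavor of what such programs look like, here is a small Bayesian linear regression sketch written for this document in the spirit of the linked linearRegression.wppl example (not copied from it; the data and priors are made up, and the API usage should be treated as an assumption based on the WebPPL documentation).

var xs = [0, 1, 2, 3]
var ys = [0.1, 1.2, 1.9, 3.2]
var regressionModel = function() {
  var slope = gaussian(0, 2)              // priors over the line's parameters
  var intercept = gaussian(0, 2)
  map2(function(x, y) {
    observe(Gaussian({mu: slope * x + intercept, sigma: 0.5}), y)  // noisy observations
  }, xs, ys)
  return {slope: slope, intercept: intercept}
}
Infer({method: 'MCMC', samples: 5000}, regressionModel)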
HOW DOES PPL INFERENCE WORK?
(i.e. how does one implement Infer?)
Exact Inference: Enumeration
var model = function() {
  var x = uniformDraw([1, 2, 3, 4])
  var y = uniformDraw([1, 2, 3, 4])
  var z = x + y
  condition(z > 3)
  return z
}
Enumerate all possible ways of returning e.g. 5, and sum their probabilities.
After conditioning on z > 3, 13 of the 16 equally likely (x, y) pairs remain, so each surviving pair has posterior probability 1/13 ≈ 0.0769…
(1, 4): 0.0769…
(2, 3): 0.0769…
(3, 2): 0.0769…
(4, 1): 0.0769…
4 × 0.0769… = 0.307…
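For reference, running exact enumeration on this model in WebPPL looks roughly like the following; the method name and the score function are taken from the WebPPL documentation as I recall it, so treat them as assumptions.

var dist = Infer({method: 'enumerate'}, model)
Math.exp(dist.score(5))   // ≈ 4/13 ≈ 0.307…, matching the hand computation above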
Exact Inference: Enumeration
Can be implemented with any language construct that allows pausing and resuming a computation: continuations, coroutines, threads, …
WebPPL: creates a continuation at each random choice, then repeatedly continues the computation from that point with a different choice value until all possible values are exhausted.
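To show how creating a continuation at each random choice can drive enumeration, here is a minimal plain-JavaScript sketch of the idea for the dice model above. It is hand-written continuation-passing style for illustration only; the actual WebPPL engine performs this transformation automatically and handles much more.

// Enumerating the dice model above with explicit continuations.
var results = {}                                  // return value -> accumulated probability
var total = 0
var uniformDrawK = function(values, prob, k) {    // call k once per possible value
  values.forEach(function(v) { k(v, prob / values.length) })
}
var conditionK = function(pred, prob, k) {        // drop paths that violate the condition
  if (pred) { k(prob) }
}
// The model, rewritten in continuation-passing style by hand:
var modelK = function(k) {
  uniformDrawK([1, 2, 3, 4], 1, function(x, p1) {
    uniformDrawK([1, 2, 3, 4], p1, function(y, p2) {
      var z = x + y
      conditionK(z > 3, p2, function(p3) { k(z, p3) })
    })
  })
}
modelK(function(value, p) {                       // "return" continuation: record the result
  results[value] = (results[value] || 0) + p
  total += p
})
Object.keys(results).forEach(function(v) { results[v] /= total })  // normalize
// results[5] ≈ 0.307…, as computed by hand above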
Approximate Inference: SMC
SMC = Sequential Monte Carlo
The continuous generalization of Enumeration
Continuous random choices have an uncountably infinite set of possible values, so we cannot explore all paths. Instead, maintain a set of N sampled paths (“particles”), biased toward high-probability paths.
Each time a factor is encountered, reweight and resample the current set of paths (i.e. discard low-probability partial paths in favor of high-probability ones).
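As a usage-level sketch of the above, here is a toy random-walk model with noisy observations, written for this document (the model, data, and SMC option names are assumptions based on the WebPPL documentation).

// A random walk scored against noisy observations. Each observe acts as a
// scoring point: with method 'SMC', the particle set is reweighted and
// resampled there, pruning low-probability partial paths.
var obs = [1.1, 1.9, 3.2]
var walkModel = function() {
  var step = function(pos, i) {
    if (i === obs.length) { return pos }
    var next = gaussian(pos, 1)                             // continuous transition
    observe(Gaussian({mu: next, sigma: 0.5}), obs[i])       // reweight particles here
    return step(next, i + 1)
  }
  return step(0, 0)
}
Infer({method: 'SMC', particles: 100}, walkModel)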
Approximate Inference: SMC
Example: SMC for a semi-Markov random walk:
http://dippl.org/chapters/05-particlefilter.html
Approximate Inference: MCMC
Review: the Metropolis-Hastings (MH) algorithm
Given an un-normalized target density $p(x)$
And a proposal distribution $q(x' \mid x)$
Sample a new state $x' \sim q(x' \mid x)$
Accept the new state with probability $\alpha = \min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$
i.e. if accepted, then $x_{t+1} = x'$; otherwise $x_{t+1} = x_t$
Repeat N times (for large N)
Can be shown that the probability of visiting state $x$ is proportional to $p(x)$
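As a worked sketch of this loop in plain JavaScript (written for this document, not any particular library): a symmetric Gaussian random-walk proposal is used, so the q terms cancel in the acceptance ratio.

// Metropolis-Hastings for an un-normalized log-density logp, starting at x0.
var metropolisHastings = function(logp, x0, numSteps, stepSize) {
  var samples = []
  var x = x0
  for (var i = 0; i < numSteps; i++) {
    var xNew = x + stepSize * standardNormal()            // symmetric proposal: q terms cancel
    if (Math.log(Math.random()) < logp(xNew) - logp(x)) {
      x = xNew                                            // accept with prob min(1, p(x')/p(x))
    }                                                     // otherwise keep the current state
    samples.push(x)
  }
  return samples
}
// Box-Muller draw from a standard normal.
var standardNormal = function() {
  return Math.sqrt(-2 * Math.log(1 - Math.random())) * Math.cos(2 * Math.PI * Math.random())
}
// Example target: a standard normal up to a constant; the samples' histogram approximates it.
metropolisHastings(function(x) { return -0.5 * x * x }, 0, 10000, 0.5)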
Approximate Inference: MCMC
MH for programs
The state $x$ is a trace through the program (i.e. an assignment of values to all of its random choices)
$p(x)$ is the un-normalized density of this trace
Typical implementation of $q(x' \mid x)$: randomly change the value of one random choice
May lead to the creation/deletion of other random choices…
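At the usage level, running single-site MH over a program's traces in WebPPL looks roughly like this; the model is a toy example for this document, and the option names are assumptions based on the WebPPL documentation.

// Each MCMC step proposes a change to one random choice in the current trace,
// re-executes the affected part of the program, and accepts or rejects via MH.
var noisyModel = function() {
  var mu = gaussian(0, 1)                        // latent quantity
  observe(Gaussian({mu: mu, sigma: 0.5}), 1.3)   // one noisy observation of it
  return mu
}
Infer({method: 'MCMC', kernel: 'MH', samples: 10000, burn: 1000}, noisyModel)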
Approximate Inference: MCMC
Structural naming of random choices for maximum trace re-use (i.e. proposals that are “as local as possible”)
Wingate et al. 2011. Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation.
Approximate Inference: MCMC
Simple vision-as-inverse-graphics demo:
http://dippl.org/examples/vision.html
Approximate Inference: Variational
Review: Variational inference
Suppose we have a generative model $p(x, y) = p(x)\,p(y \mid x)$ over latent variables $x$ and observations $y$.
We observe the values of $y$ and would like to compute the posterior $p(x \mid y)$: usually complex, intractable.
For a given $y$, we can approximate this posterior with a tractable $q_\theta(x)$; we just need to find the right value of $\theta$.
Optimization objective: maximize the evidence lower bound (ELBO)
$\mathcal{L}(\theta) = \mathbb{E}_{q_\theta(x)}\left[\log p(x, y) - \log q_\theta(x)\right]$,
which is equivalent to minimizing $\mathrm{KL}\big(q_\theta(x) \,\|\, p(x \mid y)\big)$.
Approximate Inference: Variational
Simple strategy: Mean-field variational inference
The distribution defined by a probabilistic program factorizes as
$p(x) = \prod_i p(x_i \mid x_{<i})$, one factor per random choice $x_i$.
Mean-field approximation: $q_\theta(x) = \prod_i q_{\theta_i}(x_i)$,
i.e. approximate the posterior as a product of independent marginals.
A probabilistic programming language can automatically derive the mean-field program of any input program.
Use automatic differentiation (AD) to compute gradients through the $q_{\theta_i}$ sub-programs.
Approximate Inference: Variational
Example automatic mean-field program transformation
Wingate & Weber 2013. Automated Variational Inference in Probabilistic Programming.
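Since the figure for this transformation is not reproduced here, the following WebPPL-style pseudocode sketches the idea. The theta.* parameter names are hypothetical placeholders for this write-up, and the real transformation also wires these parameters into an AD-based ELBO optimizer.

// Original program: choices depend on one another.
var model = function() {
  var a = gaussian(0, 1)
  var b = gaussian(a, 1)                        // b depends on a
  observe(Gaussian({mu: b, sigma: 0.5}), 2.3)
  return a
}
// Derived mean-field program: one independent variational distribution per
// choice, with free parameters theta.* optimized by gradient ascent on the ELBO.
var meanFieldGuide = function(theta) {
  var a = gaussian(theta.mu_a, theta.sigma_a)   // q_{theta_a}(a)
  var b = gaussian(theta.mu_b, theta.sigma_b)   // q_{theta_b}(b), independent of a
  return {a: a, b: b}
}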
“Amortized” Inference
Mean-field VI optimizes a posterior approximation for one particular observation $y$.
But we typically want to use the same model to perform many inference tasks, with many different $y$'s.
Can we learn from past inferences so that future inferences (on new, unseen $y$'s) are more efficient?
Amortized Variational Inference
Mean-field variational approximation: $q_\theta(x) = \prod_i q_{\theta_i}(x_i)$
Amortized variational approximation: $q_\phi(x \mid y) = \prod_i q_{f_{\phi_i}(y)}(x_i)$,
where the $f_{\phi_i}$ are neural networks mapping the observation $y$ to the parameters of each marginal.
Aside: Amortized Variational Learning
It is possible to jointly optimize the parameters $\psi$ of the generative model $p_\psi(x, y)$ (i.e. learning) while simultaneously optimizing the parameters $\phi$ of the approximate posterior $q_\phi(x \mid y)$.
This is how Variational Autoencoders (VAEs) are trained, which we will see next time…
Amortized Variational Inference
Amortized variational approximation: $q_\phi(x \mid y) = \prod_i q_{f_{\phi_i}(y)}(x_i)$, where the $f_{\phi_i}$ are neural networks.
What form should these $f_{\phi_i}$ take?
Grand vision: automatically derive this for any program, using e.g. a recurrent network backbone.
In practice: the program author typically writes a domain-specific “guide program” that specifies the form of $q_\phi(x \mid y)$.
E.g. use CNNs for image-valued $y$.
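As a WebPPL-style pseudocode sketch of such a guide program: the cnn helper, its parameters, and the particular latent choices are hypothetical placeholders for illustration, not real WebPPL builtins or anything from the original slides.

// A domain-specific guide for image-valued observations: a CNN maps the image
// to the parameters of each latent choice, mirroring the model's structure.
var guideProgram = function(image, nnParams) {
  var features = cnn(image, nnParams)                // hypothetical CNN, trained on the ELBO
  var objectCount = poisson(features.rate)           // q for the model's "how many objects" choice
  var positions = repeat(objectCount, function() {
    return gaussian(features.mu, features.sigma)     // q for each object's position
  })
  return {count: objectCount, positions: positions}
}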
Fun Amortized Inference Example