Probabilistic Programming
Daniel Ritchie
Computer Science
What Is a Probabilistic Program?
“Show, don’t tell”: WebPPL demo
WebPPL is a universal probabilistic language
It can express any computable probability distribution(!)
Non-Universal Languages
Universal Languages
What Probabilistic Programming Languages (PPLs) Are Out There Today?
Church
Venture
Why Were Universal PPLs Developed?
Church: developed for computational cognitive science
Complex hierarchical Bayesian models
Recursive inference: “I’m thinking that you’re thinking that…” (a fuller sketch follows after this slide)
var f = function() {
  Infer(g)
}
var g = function() {
  Infer(f)
}
Pragmatic language understanding
Multi-agent games
https://cocolab.stanford.edu/
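To make the recursive-inference idea concrete, here is a small sketch of bounded nested inference in WebPPL, in the spirit of the Schelling-coordination examples from this line of work. The scenario (two agents trying to meet), all names, and the Infer option names are this write-up's illustrative assumptions based on the WebPPL documentation, not material from the original slides.

// Two agents try to meet; each reasons about the other down to a fixed depth.
var locations = ['cafe', 'park']
var alice = function(depth) {
  return Infer({method: 'enumerate'}, function() {
    var myLoc = uniformDraw(locations)
    if (depth > 0) {
      condition(myLoc === sample(bob(depth - 1)))  // Alice conditions on meeting Bob
    }
    return myLoc
  })
}
var bob = function(depth) {
  return Infer({method: 'enumerate'}, function() {
    var myLoc = flip(0.6) ? 'cafe' : 'park'        // Bob mildly prefers the cafe
    if (depth > 0) {
      condition(myLoc === sample(alice(depth - 1)))
    }
    return myLoc
  })
}
alice(3)  // a distribution that increasingly favors 'cafe' as the depth grows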
Why Have They Become “Popular”?
Inference Engine: written by the language designer (i.e. an inference expert)
The Model (i.e. the program): written by the user (i.e. a domain expert)
Abstraction Barrier between the two
PPL design philosophy*: separate modeling from inference, so that domain experts can focus on the former and inference experts can focus on the latter (a small sketch follows below the footnote)
* See e.g. programmable inference in Venture
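To make the abstraction barrier concrete, here is a hedged WebPPL sketch: the domain expert writes only the model, and swapping inference engines is a change to the options passed to Infer. The coin model and the option names are this write-up's assumptions (based on the WebPPL documentation), not taken from the original slides.

// The "model" side of the barrier: a domain expert's program.
var coinModel = function() {
  var p = beta(1, 1)                      // unknown coin weight
  observe(Binomial({p: p, n: 10}), 7)     // observed 7 heads out of 10 flips
  return p
}
// The "inference" side: engines written by the language designer,
// selected via options rather than by changing the model.
Infer({method: 'MCMC', samples: 1000}, coinModel)
Infer({method: 'SMC', particles: 1000}, coinModel)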
WHAT ELSE CAN YOU USE PPLS FOR?
Remember: Any computable probability distribution
Bayesian Networks
http://forestdb.org/models/burglary.html
Linear Regression (a small sketch follows after this list)
https://github.com/probmods/webppl/blob/dev/examples/linearRegression.wppl
Hidden Markov Model
https://github.com/probmods/webppl/blob/dev/examples/hmm.wppl
Latent Dirichlet Allocation (LDA)
https://github.com/probmods/webppl/blob/dev/examples/lda.wppl
Grammars (PCFG)
https://github.com/probmods/webppl/blob/dev/examples/pcfg.wppl
Vision as Inverse Graphics
Kulkarni et al. 2015. Picture: A Probabilistic Programming Language for Scene Perception.
Procedural Graphics
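As a flavor of what such programs look like, here is a small Bayesian linear regression sketch written for this document in the spirit of the linked linearRegression.wppl example (not copied from it; the data and priors are made up, and the API usage should be treated as an assumption based on the WebPPL documentation).

var xs = [0, 1, 2, 3]
var ys = [0.1, 1.2, 1.9, 3.2]
var regressionModel = function() {
  var slope = gaussian(0, 2)              // priors over the line's parameters
  var intercept = gaussian(0, 2)
  map2(function(x, y) {
    observe(Gaussian({mu: slope * x + intercept, sigma: 0.5}), y)  // noisy observations
  }, xs, ys)
  return {slope: slope, intercept: intercept}
}
Infer({method: 'MCMC', samples: 5000}, regressionModel)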
HOW DOES PPL INFERENCE WORK?
(i.e. how does one implement Infer?)
Exact Inference: Enumeration
var model = function() {
  var x = uniformDraw([1, 2, 3, 4])
  var y = uniformDraw([1, 2, 3, 4])
  var z = x + y
  condition(z > 3)
  return z
}
Enumerate all possible ways of returning e.g. 5, and sum their probabilities.
After conditioning on z > 3, 13 of the 16 equally likely (x, y) pairs remain, so each surviving pair has posterior probability 1/13 ≈ 0.0769…
(1, 4): 0.0769…
(2, 3): 0.0769…
(3, 2): 0.0769…
(4, 1): 0.0769…
4 × 0.0769… = 0.307…
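For reference, running exact enumeration on this model in WebPPL looks roughly like the following; the method name and the score function are taken from the WebPPL documentation as I recall it, so treat them as assumptions.

var dist = Infer({method: 'enumerate'}, model)
Math.exp(dist.score(5))   // ≈ 4/13 ≈ 0.307…, matching the hand computation above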
Exact Inference: Enumeration
Can be implemented with any language construct that allows pausing and resuming a computation: continuations, coroutines, threads, …
WebPPL: creates a continuation at each random choice, then repeatedly continues the computation from that point with a different choice value until all possible values are exhausted.
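To show how creating a continuation at each random choice can drive enumeration, here is a minimal plain-JavaScript sketch of the idea for the dice model above. It is hand-written continuation-passing style for illustration only; the actual WebPPL engine performs this transformation automatically and handles much more.

// Enumerating the dice model above with explicit continuations.
var results = {}                                  // return value -> accumulated probability
var total = 0
var uniformDrawK = function(values, prob, k) {    // call k once per possible value
  values.forEach(function(v) { k(v, prob / values.length) })
}
var conditionK = function(pred, prob, k) {        // drop paths that violate the condition
  if (pred) { k(prob) }
}
// The model, rewritten in continuation-passing style by hand:
var modelK = function(k) {
  uniformDrawK([1, 2, 3, 4], 1, function(x, p1) {
    uniformDrawK([1, 2, 3, 4], p1, function(y, p2) {
      var z = x + y
      conditionK(z > 3, p2, function(p3) { k(z, p3) })
    })
  })
}
modelK(function(value, p) {                       // "return" continuation: record the result
  results[value] = (results[value] || 0) + p
  total += p
})
Object.keys(results).forEach(function(v) { results[v] /= total })  // normalize
// results[5] ≈ 0.307…, as computed by hand above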
Approximate Inference: SMC
SMC = Sequential Monte Carlo
The continuous generalization of Enumeration
Continuous random choices have an uncountably infinite set of possible values, so we cannot explore all paths. Instead, maintain a set of N sampled paths (“particles”), biased toward high-probability paths.
Each time a factor is encountered, reweight and resample the current set of paths (i.e. discard low-probability partial paths in favor of high-probability ones).
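As a usage-level sketch of the above, here is a toy random-walk model with noisy observations, written for this document (the model, data, and SMC option names are assumptions based on the WebPPL documentation).

// A random walk scored against noisy observations. Each observe acts as a
// scoring point: with method 'SMC', the particle set is reweighted and
// resampled there, pruning low-probability partial paths.
var obs = [1.1, 1.9, 3.2]
var walkModel = function() {
  var step = function(pos, i) {
    if (i === obs.length) { return pos }
    var next = gaussian(pos, 1)                             // continuous transition
    observe(Gaussian({mu: next, sigma: 0.5}), obs[i])       // reweight particles here
    return step(next, i + 1)
  }
  return step(0, 0)
}
Infer({method: 'SMC', particles: 100}, walkModel)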
Approximate Inference: SMC
Example: SMC for a semi-Markov random walk:
http://dippl.org/chapters/05-particlefilter.html
Approximate Inference: MCMC
Review: the Metropolis-Hastings (MH) algorithm
Given an un-normalized target density $p(x)$
And a proposal distribution $q(x' \mid x)$
Sample a new state $x' \sim q(x' \mid x)$
Accept the new state with probability $\alpha = \min\left(1, \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)}\right)$
i.e. if accepted, then $x_{t+1} = x'$; otherwise $x_{t+1} = x_t$
Repeat N times (for large N)
Can be shown that the probability of visiting state $x$ is proportional to $p(x)$
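As a worked sketch of this loop in plain JavaScript (written for this document, not any particular library): a symmetric Gaussian random-walk proposal is used, so the q terms cancel in the acceptance ratio.

// Metropolis-Hastings for an un-normalized log-density logp, starting at x0.
var metropolisHastings = function(logp, x0, numSteps, stepSize) {
  var samples = []
  var x = x0
  for (var i = 0; i < numSteps; i++) {
    var xNew = x + stepSize * standardNormal()            // symmetric proposal: q terms cancel
    if (Math.log(Math.random()) < logp(xNew) - logp(x)) {
      x = xNew                                            // accept with prob min(1, p(x')/p(x))
    }                                                     // otherwise keep the current state
    samples.push(x)
  }
  return samples
}
// Box-Muller draw from a standard normal.
var standardNormal = function() {
  return Math.sqrt(-2 * Math.log(1 - Math.random())) * Math.cos(2 * Math.PI * Math.random())
}
// Example target: a standard normal up to a constant; the samples' histogram approximates it.
metropolisHastings(function(x) { return -0.5 * x * x }, 0, 10000, 0.5)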
Approximate Inference: MCMC
MH for programs
The state $x$ is a trace through the program (i.e. an assignment of values to all of its random choices)
$p(x)$ is the un-normalized density of this trace
Typical implementation of $q(x' \mid x)$: randomly change the value of one random choice
May lead to the creation/deletion of other random choices…
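At the usage level, running single-site MH over a program's traces in WebPPL looks roughly like this; the model is a toy example for this document, and the option names are assumptions based on the WebPPL documentation.

// Each MCMC step proposes a change to one random choice in the current trace,
// re-executes the affected part of the program, and accepts or rejects via MH.
var noisyModel = function() {
  var mu = gaussian(0, 1)                        // latent quantity
  observe(Gaussian({mu: mu, sigma: 0.5}), 1.3)   // one noisy observation of it
  return mu
}
Infer({method: 'MCMC', kernel: 'MH', samples: 10000, burn: 1000}, noisyModel)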
Approximate Inference: MCMC
Structural naming of random choices for maximum trace re-use (i.e. proposals that are “as local as possible”)
Wingate et al. 2011. Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation.
Approximate Inference: MCMC
Simple vision-as-inverse-graphics demo:
http://dippl.org/examples/vision.html
Approximate Inference: Variational
Review: Variational inference
Suppose we have a generative model $p(x, y) = p(x)\,p(y \mid x)$ over latent variables $x$ and observations $y$.
We observe the values of $y$ and would like to compute the posterior $p(x \mid y)$: usually complex, intractable.
For a given $y$, we can approximate this posterior with a tractable $q_\theta(x)$; we just need to find the right value of $\theta$.
Optimization objective: maximize the evidence lower bound (ELBO)
$\mathcal{L}(\theta) = \mathbb{E}_{q_\theta(x)}\left[\log p(x, y) - \log q_\theta(x)\right]$,
which is equivalent to minimizing $\mathrm{KL}\big(q_\theta(x) \,\|\, p(x \mid y)\big)$.
Approximate Inference: Variational
Simple strategy: Mean-field variational inference
The distribution defined by a probabilistic program factorizes as
$p(x) = \prod_i p(x_i \mid x_{<i})$, one factor per random choice $x_i$.
Mean-field approximation: $q_\theta(x) = \prod_i q_{\theta_i}(x_i)$,
i.e. approximate the posterior as a product of independent marginals.
A probabilistic programming language can automatically derive the mean-field program of any input program.
Use automatic differentiation (AD) to compute gradients through the $q_{\theta_i}$ sub-programs.
Approximate Inference: Variational
Example automatic mean-field program transformation
Wingate & Weber 2013. Automated Variational Inference in Probabilistic Programming.
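Since the figure for this transformation is not reproduced here, the following WebPPL-style pseudocode sketches the idea. The theta.* parameter names are hypothetical placeholders for this write-up, and the real transformation also wires these parameters into an AD-based ELBO optimizer.

// Original program: choices depend on one another.
var model = function() {
  var a = gaussian(0, 1)
  var b = gaussian(a, 1)                        // b depends on a
  observe(Gaussian({mu: b, sigma: 0.5}), 2.3)
  return a
}
// Derived mean-field program: one independent variational distribution per
// choice, with free parameters theta.* optimized by gradient ascent on the ELBO.
var meanFieldGuide = function(theta) {
  var a = gaussian(theta.mu_a, theta.sigma_a)   // q_{theta_a}(a)
  var b = gaussian(theta.mu_b, theta.sigma_b)   // q_{theta_b}(b), independent of a
  return {a: a, b: b}
}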
“Amortized” Inference
Mean-field VI optimizes a posterior approximation for one particular observation $y$.
But we typically want to use the same model to perform many inference tasks, with many different $y$'s.
Can we learn from past inferences so that future inferences (on new, unseen $y$'s) are more efficient?
Amortized Variational Inference
Mean-field variational approximation: $q_\theta(x) = \prod_i q_{\theta_i}(x_i)$
Amortized variational approximation: $q_\phi(x \mid y) = \prod_i q_{f_{\phi_i}(y)}(x_i)$,
where the $f_{\phi_i}$ are neural networks mapping the observation $y$ to the parameters of each marginal.
Aside: Amortized Variational Learning
It is possible to jointly optimize the parameters $\psi$ of the generative model $p_\psi(x, y)$ (i.e. learning) while simultaneously optimizing the parameters $\phi$ of the approximate posterior $q_\phi(x \mid y)$.
This is how Variational Autoencoders (VAEs) are trained, which we will see next time…
Amortized Variational Inference
Amortized variational approximation: $q_\phi(x \mid y) = \prod_i q_{f_{\phi_i}(y)}(x_i)$, where the $f_{\phi_i}$ are neural networks.
What form should these $f_{\phi_i}$ take?
Grand vision: automatically derive this for any program, using e.g. a recurrent network backbone.
In practice: the program author typically writes a domain-specific “guide program” that specifies the form of $q_\phi(x \mid y)$.
E.g. use CNNs for image-valued $y$.
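As a WebPPL-style pseudocode sketch of such a guide program: the cnn helper, its parameters, and the particular latent choices are hypothetical placeholders for illustration, not real WebPPL builtins or anything from the original slides.

// A domain-specific guide for image-valued observations: a CNN maps the image
// to the parameters of each latent choice, mirroring the model's structure.
var guideProgram = function(image, nnParams) {
  var features = cnn(image, nnParams)                // hypothetical CNN, trained on the ELBO
  var objectCount = poisson(features.rate)           // q for the model's "how many objects" choice
  var positions = repeat(objectCount, function() {
    return gaussian(features.mu, features.sigma)     // q for each object's position
  })
  return {count: objectCount, positions: positions}
}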
Fun Amortized Inference Example