Tutorial on Probabilistic Programming
Contents
Summary
This week, Daniel gave a tutorial on probabilistic programming and its use in generative modeling.
What is a PPL?
Probabilistic programming languages (PPLs) leverage powerful programming concepts such as recursion, abstraction and modularity to define and sample from user-specified distributions and perform inference on statistical models.
For example, here is a program written in WebPPL:
|
|
In this program, geometric()
computes a geometric distribution from samples of a Bernoulli distribution, conditionedGeometric()
constrains these samples with a condition, and the enumerate()
method computes the conditional distribution of geometric samples conditioned on the constraint (x > 2)
.
Types of PPLs
Probabilistic programming languages fall into two categories – universal and non-universal. Universal PPLs such as Church were developed to model complex hierarchical Bayesian models, such as those used in cognitive science.
Universal PPLs can sample from any computable probability distribution.
A computable probability distribution is a distribution that a probabilistic Turing machine can sample correctly to an arbitrary precision. As models of computational complexity, probabilistic Turing machines are similar to nondeterministic Turing machines in that they allow for multiple valid paths of execution. However, rather than computing all paths simultaneously, probabilistic Turing machines branch according to a random decision probability. The different paths may lead to disparate outcomes (some runs accepting, others rejecting identical input tokens) so rather than definitively deciding whether an input is in the language, a probabilistic Turing Machine decides with some pre-defined error.
Not all probability distributions are computable even under these assumptions, and even if the joint distribution of two random variables is computable, the conditional distribution of these variables might not be [1].
How are PPLs used?
The design philosophy of PPLs is to separate the model construction from the inference engine, so that domain experts such as cognitive scientists can define complex models without having to compute the conditional distributions required to perform inference on their model.
|
|
This Bayes net computes the probability of an earthquake
conditioned on a call from John
. Here are some more examples of different statistical models expressed in PPLs:
- Hidden Markov Model
- Linear Regression
- Grammars
- Latent Dirichlet Allocation for topic modeling in documents
- Vision as Inverse Graphics
How does PPL Inference Work
How the posterior calculation, or
infer
, step is computed depends on the design of the PPL, the performance requirements, and the type of events being sampled.
Exact Enumeration
Infer
can be computed through exact enumeration when the event space is discrete. Exact enumeration is only possible in languages such as WebPPL that have constructs that allow for pausing/resuming computation such as threads, coroutines or continuations. For the following program:
|
|
The probability of generating $z=5$ is computed as $$P(x=4, y=1) + P(x=2, y=3) + … P(x=1, y=4)$$
Sequential Monte Carlo (SMC)
In the continuous case, inference via exact enumeration is impossible: since each branch is a sample from a continuous distribution, there are infinite possible paths. Sequential Monte Carlo is the approximate generalization of enumeration – the $n$ highest probability execution paths are executed. Here is a particle filter example of SMC.
Markov Chain Monte Carlo (MCMC)
Metropolis Hastings Algorithm
Given: A function $f(x)$, and a proposal distribution $Q(x’|x)$,
- $x$ is a deterministic trace through a probabilistic program – resulting from filling in all the random choices
- $f(x)$ is the density of trace $x$
- Transition function $Q(x’|x)$ is changing one of the random choices in the program trace
Sample a new state $x’\sim Q(\cdot|x)$
- Accept with probability $A(x’|x) = \min(1, \frac{f(x’)Q(x|x’)}{f(x)Q(x’|x)})$
- Repeat $N$ times
- The probability of visiting state $x$ is proportional to $f(x)$
Evaluating the transition model $Q(x’|x)$ is nontrivial – random choices upstream impact program execution downstream, including the selection of further random variables. This can be avoided by choosing local proposals [2].
Variational Inference
Approximating the infer
function via variational inference utilizes the optimization of an easy-to-sample guide distribution to approximate the hard-to-sample posterior. Given a generative model $p(x,y)$, in our case a program written in a PPL, the goal of infer
is to compute the posterior $p(x|y)$. This conditional distribution is often intractable.
However for any given $y$ there is a tractable $q_0(x)$ that can be used to approximate the posterior. The optimal $q_0$ can be found by minimizing the KL divergence between $q_0(x)$ and $p(x|y)$.
References
[1] Ackerman, Nathanael L., Cameron E. Freer, and Daniel M. Roy. 2010. “On the Computability of Conditional Probability.” arXiv [math.LO]. arXiv. link
[2] Wingate, David, Andreas Stuhlmueller, and Noah Goodman. 2011. “Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation.” In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, edited by Geoffrey Gordon, David Dunson, and Miroslav Dudík, 15:770–78. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA: PMLR. link
Author Theresa Barton
LastMod 2019-02-28
Comments powered by Talkyard.