Bayesian Inference Recap

We discussed Bayesian inference in the last two lectures. Being Bayesian offers certain advantages. First, we get to use domain knowledge. Second, under certain assumptions, Bayesian inference is optimal, although those assumptions may not hold in particular use cases. Third, once you have specified the model there is, in principle, a single direct way to answer most questions: we pose most of our queries in the form of posterior expectations.
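For concreteness, a typical query of this kind is a posterior expectation of some function $f$ of the unknowns $z$ given the data $x$ (this is the standard generic form, not tied to any one example from the lectures):

$$ \mathbb{E}_{p(z \mid x)}[f(z)] = \int f(z) \, p(z \mid x) \, dz. $$

Predictions, event probabilities, and credible intervals can all be written this way by choosing $f$ appropriately.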

However, as with most things, Bayesian inference also has some disadvantages. The first mirrors one of its advantages: you have to supply domain knowledge, and coming up with accurate domain information can be very difficult.

Second, inference is computationally difficult. Being Bayesian often requires more computation than non-Bayesian alternatives. Consider being Bayesian over all the weights in a neural network: state-of-the-art networks now have more than a billion weights, and working at that scale while being Bayesian remains an open research problem.

Third, Bayesian optimality requires access to the correct model. This is a very strong assumption to satisfy in real life.

Bayesian Inference needs strong assumptions and hard inference, but gets strong results.

Variational Inference (VI)

Variational inference is an alternative to MCMC for inference. Unlike MCMC, variational inference is an approximate inference method in the sense that it will not converge to the true posterior even if you run it forever (remember, MCMC chains eventually converge to their stationary distribution, the posterior).
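To preview the idea in its standard form (stated here ahead of the detailed development), VI turns inference into optimization: we pick a tractable family of distributions $\mathcal{Q}$ and search for the member closest to the posterior, typically in KL divergence:

$$ q^*(z) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big). $$

Because the true posterior usually lies outside $\mathcal{Q}$, the best achievable $q^*$ is only an approximation, which is why VI does not recover the true posterior no matter how long it runs.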

Let's consider a few examples before we start discussing variational inference.

High Tree-Width Markov Random Fields (MRFs)

[Figure: a grid-structured MRF (PCM_lect_22_VI_grid.svg)]

Consider the MRF above. We can represent it as

$$ p(z) = \frac{1}{Z} \bar{p}(z), $$

where $\bar{p}(z)$ is the unnormalized probability and $Z$ is the normalization constant. Often, we will know $\bar p (z)$ but will not know $Z$.
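As a concrete illustration of why $Z$ is the hard part (a minimal sketch in Python, not from the lecture; the Ising-style pairwise potentials and the coupling strength `J` are assumptions for illustration only), evaluating $\bar{p}(z)$ for a grid MRF is cheap, while computing $Z$ requires summing over every joint configuration:

```python
# Minimal sketch (assumed example): an Ising-style grid MRF where the
# unnormalized probability p_bar(z) is cheap to evaluate, but the
# normalization constant Z requires summing over all configurations.
import itertools
import numpy as np

def p_bar(z, J=1.0):
    """Unnormalized probability of a grid of spins z with entries in {-1, +1}."""
    # Sum agreements between horizontal and vertical neighbors.
    energy = np.sum(z[:, :-1] * z[:, 1:]) + np.sum(z[:-1, :] * z[1:, :])
    return np.exp(J * energy)

def partition_function(n, J=1.0):
    """Brute-force Z over all 2**(n*n) spin configurations (tiny n only)."""
    total = 0.0
    for spins in itertools.product([-1, 1], repeat=n * n):
        total += p_bar(np.array(spins, dtype=float).reshape(n, n), J)
    return total

z = np.ones((3, 3))        # one particular configuration (all spins +1)
Z = partition_function(3)  # 2**9 = 512 terms in the sum
print(p_bar(z) / Z)        # normalized probability p(z) = p_bar(z) / Z
```

For this 3x3 grid the brute-force sum has $2^9 = 512$ terms, but for a 30x30 grid it would have $2^{900}$, which is why exact normalization is out of reach for high tree-width MRFs.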

Bayesian Model

[Figure: a generic Bayesian model with latent variables and observed data (PGM_lect_22_1.svg)]

Consider the model above. We can represent it using the very generic notation $p(z, x)$, where $z$ are the latent variables and $x$ is the data. In such models, we are interested in the posterior $p(z \mid x)$.
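Writing this out with Bayes' rule (a standard identity) makes the analogy with the MRF case explicit: the joint $p(z, x)$ is typically easy to evaluate, while the marginal in the denominator plays the role of the normalization constant $Z$ and is typically intractable:

$$ p(z \mid x) = \frac{p(z, x)}{p(x)}, \qquad p(x) = \int p(z, x) \, dz. $$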