In this lecture, we will continue our discussion on exponential families. We will discuss the equivalence of MRFs and exponential families. Then, we will extend that result to CRFs. Thereafter, we will look at the maximum likelihood solutions for MRFs (and CRFs) through the lens of exponential families and draw some very generic counting results.

Exponential Family Recap

Let's start by remembering what we saw last time. First, an exponential family is given by

$$ \begin{aligned} p_{\theta}(x) = h(x) \exp(\theta^{\top}T(x) - A(\theta)), \end{aligned} $$

where $\theta$ are the "natural parameters", $T(x)$ is the vector of "sufficient statistics", $A(\theta)$ is the "log partition function", and $h(x)$ is a scaling constant (almost always equal to $1$ in this course).
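As a concrete (illustrative, not from the lecture) instance, the Bernoulli distribution can be written in this form with $T(x) = x$, $h(x) = 1$, natural parameter $\theta = \log\frac{p}{1-p}$, and $A(\theta) = \log(1 + e^{\theta})$. A minimal sketch:

```python
import math

def bernoulli_exp_family(x, theta):
    """p_theta(x) = h(x) * exp(theta * T(x) - A(theta)),
    with h(x) = 1 and T(x) = x for the Bernoulli."""
    A = math.log(1.0 + math.exp(theta))  # log-partition function
    return math.exp(theta * x - A)

p = 0.3
theta = math.log(p / (1 - p))  # natural parameter for Bernoulli(0.3)
print(bernoulli_exp_family(1, theta))  # recovers p = 0.3
print(bernoulli_exp_family(0, theta))  # recovers 1 - p = 0.7
```

The function name `bernoulli_exp_family` is just a hypothetical label for this sketch; the point is that the familiar parameterization and the natural parameterization describe the same distribution.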

Next, we saw some interesting properties of exponential family distributions.

Property #1

For any $\theta$, the gradient of the log-partition function is the expected value of the sufficient statistics, i.e.

$$ \begin{aligned} \frac{\partial A(\theta) }{\partial\theta} = \mathbb{E}_{p_{\theta}(\mathsf{x})} [T(\mathsf{x})] \end{aligned}. $$

It may not be immediately obvious why this equation is important. Think of it like this: If you wanted to maximize $p_\theta(x)$ (or $\log p_\theta (x)$), you would need this derivative. Having it in closed form is thus immensely useful in helping us understand what happens when you maximize the likelihood with exponential families.
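We can sanity-check this property numerically on the Bernoulli example (an assumed instance, not from the lecture): for $A(\theta) = \log(1 + e^{\theta})$, the gradient should equal $\mathbb{E}[T(\mathsf{x})] = P(\mathsf{x} = 1) = \sigma(\theta)$, the sigmoid of $\theta$.

```python
import math

def A(theta):
    # Log-partition function for the Bernoulli in natural parameters.
    return math.log(1.0 + math.exp(theta))

theta = 0.5
eps = 1e-6
# Central finite-difference approximation of dA/dtheta.
grad = (A(theta + eps) - A(theta - eps)) / (2 * eps)

# E[T(x)] under p_theta: since T(x) = x, this is P(x = 1) = sigmoid(theta).
expected_T = 1.0 / (1.0 + math.exp(-theta))
print(abs(grad - expected_T))  # should be tiny
```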

Mean log-likelihood

Next, we presented an elegant-looking equation for the mean log-likelihood over a dataset distributed as an exponential family. It can be written as

$$ \begin{aligned} L(\theta)= \theta^{\top} \overline{T} - A(\theta)+C \end{aligned}, $$

where, $\overline{T} = \frac{1}{N} \sum_{n=1}^{N} T(x^{(n)})$ is the mean value of the sufficient statistics in the data, and
$C= \frac{1}{N} \sum_{n=1}^{N} \log h(x^{(n)})$ is independent of $\theta$.

This equation has a nice interpretation. To maximize the likelihood, we generally want $\theta$ to be well-aligned with $\overline T$ while keeping the $A(\theta)$ term from becoming too large. This intuition is nicely captured in the next property of exponential families.
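To make this concrete, here is a small sketch (continuing the assumed Bernoulli example, where $C = 0$ because $h(x) = 1$) that computes $L(\theta)$ both from the summary statistic $\overline{T}$ and by directly averaging $\log p_\theta(x^{(n)})$ over the data; the two agree.

```python
import math

data = [1, 0, 1, 1, 0, 1]  # toy Bernoulli samples, purely illustrative
T_bar = sum(data) / len(data)  # mean sufficient statistic, since T(x) = x

def mean_log_likelihood(theta):
    # L(theta) = theta * T_bar - A(theta); C = 0 because h(x) = 1.
    return theta * T_bar - math.log(1.0 + math.exp(theta))

theta = 0.8
# Direct computation: average of log p_theta(x) over the dataset.
direct = sum(theta * x - math.log(1.0 + math.exp(theta)) for x in data) / len(data)
print(mean_log_likelihood(theta), direct)  # the two computations agree
```

The design point is that the whole dataset enters the likelihood only through $\overline{T}$, so after one pass over the data the likelihood (and its gradient) can be evaluated at any $\theta$ without revisiting the samples.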

Maximum Likelihood Estimate (MLE)

At the maximum likelihood parameters $\theta$,

$$ \begin{aligned} \mathbb{E}_{p_{\theta}(\mathsf{x})} [T(\mathsf{x})] = \overline{T} = \hat{\mathbb{E}}\, [T(\mathsf{x})]\end{aligned}. $$

This equation says that at the maximum likelihood parameters, the model's expectation of $T$ equals the empirical (data) expectation. This is sometimes called "moment matching."
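Moment matching is easy to verify in the assumed Bernoulli example: the MLE in natural parameters is $\hat\theta = \log\frac{\overline{T}}{1 - \overline{T}}$ (the logit of the sample mean), and plugging it back in makes the model expectation $\sigma(\hat\theta)$ equal to $\overline{T}$ exactly.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 0]  # toy samples, purely illustrative
T_bar = sum(data) / len(data)    # data expectation of T(x) = x

# MLE for the Bernoulli in natural parameters: logit of the sample mean.
theta_hat = math.log(T_bar / (1 - T_bar))

# Model expectation of T under theta_hat: sigmoid(theta_hat).
model_E_T = 1.0 / (1.0 + math.exp(-theta_hat))
print(model_E_T, T_bar)  # the model moment matches the data moment
```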