1. Recap of the Exponential Family and Conditional Exponential Family

Exponential family table

$$ \tiny \begin{array}{c c c} & \mathrm{Exponential\ Family} & \mathrm{Conditional\ Exponential\ Family} \\ \hline
\rule{0pt}{5ex} \mathrm{Definition} & p_\theta(x) = h(x) \exp (\theta^\top T(x) - A(\theta)) & p_\theta(y \vert x) = h(x, y) \exp (\theta^\top T(x, y) - A(x, \theta)) \\
\rule{0pt}{8ex} \begin{array}{c} \mathrm{Log\text{-}partition}\\ \mathrm{function} \end{array} & A(\theta)=\log\sum_x h(x)\exp\theta^\top T(x) & A(x, \theta)=\log\sum_y h(x,y)\exp\theta^\top T(x,y) \\
\rule{0pt}{8ex} \begin{array}{c} \mathrm{Gradient\ of}\\ \mathrm{log\text{-}partition}\\ \mathrm{function} \end{array} & \frac{\partial A(\theta)}{\partial \theta} = \mathbb{E}_{p_\theta(\mathsf{x})} [T(\mathsf{x})] & \frac{\partial A(x, \theta)}{\partial \theta} = \mathbb{E}_{p_\theta(\mathsf{y} \vert x)} [T(x, \mathsf{y})] \\
\rule{0pt}{8ex} \begin{array}{c} \mathrm{Objective\ for}\\ \mathrm{a\ single\ datum} \end{array} & \log p_\theta(x) = \log h(x) + \theta^\top T(x) - A(\theta) & \log p_\theta(y \vert x) = \log h(x, y) + \theta^\top T(x, y) - A(x, \theta) \\
\rule{0pt}{8ex} \begin{array}{c} \mathrm{Form\ of\ learning}\\ \mathrm{objective} \end{array} & L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x^{(i)}) & L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(y^{(i)} \vert x^{(i)}) \\
\rule{0pt}{5ex} \mathrm{Alternative\ form} & L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \log h(x^{(i)}) + \theta^\top T(x^{(i)}) \right) - A(\theta) & L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \log h(x^{(i)}, y^{(i)}) + \theta^\top T(x^{(i)}, y^{(i)}) - A(x^{(i)}, \theta) \right) \\
\rule{0pt}{5ex} \mathrm{Alternative\ form} & L(\theta) = \hat{\mathbb{E}}_{\mathsf{x}} [\log h(\mathsf{x}) + \theta^\top T(\mathsf{x})] - A(\theta) & L(\theta) = \hat{\mathbb{E}}_{\mathsf{x}, \mathsf{y}} [\log h(\mathsf{x}, \mathsf{y}) + \theta^\top T(\mathsf{x}, \mathsf{y})] - \hat{\mathbb{E}}_{\mathsf{x}} [A(\mathsf{x}, \theta)] \\
\rule{0pt}{5ex} \begin{array}{c} \mathrm{Condition\ at}\\ \mathrm{optimum} \end{array} & \hat{\mathbb{E}}_{\mathsf{x}} [T(\mathsf{x})] = \mathbb{E}_{p_\theta (\mathsf{x})} [T(\mathsf{x})] & \hat{\mathbb{E}}_{\mathsf{x},\mathsf{y}}[T(\mathsf{x}, \mathsf{y})] = \hat{\mathbb{E}}_{\mathsf{x}} \mathbb{E}_{p_\theta (\mathsf{y} \vert \mathsf{x})}[T(\mathsf{x}, \mathsf{y})] \\ \end{array} $$

In the previous lecture, we learned about the exponential family and the conditional exponential family. Similar to how you can create a conditional random field (CRF) by taking a Markov random field (MRF) and conditioning it on observed variables, you can create a conditional exponential family by taking an exponential family and conditioning it on an input $x$.

Everything we have learned so far is summarized in the table above. The important things to remember about the exponential family are that the gradient of the log-partition function equals the expected sufficient statistics, and that at the maximum likelihood optimum the empirical expectation of the sufficient statistics matches the model's expectation: $\hat{\mathbb{E}}_{\mathsf{x}} [T(\mathsf{x})] = \mathbb{E}_{p_\theta (\mathsf{x})} [T(\mathsf{x})]$.
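
As a minimal numerical sanity check of the gradient identity (an illustrative sketch, not from the lecture), take a Bernoulli variable written as an exponential family with $h(x) = 1$ and $T(x) = x$, and compare a finite-difference estimate of $\partial A(\theta) / \partial \theta$ against $\mathbb{E}_{p_\theta(\mathsf{x})}[T(\mathsf{x})]$:

```python
import numpy as np

# Bernoulli as an exponential family: h(x) = 1, T(x) = x, x in {0, 1}.
xs = np.array([0.0, 1.0])  # support of x

def A(theta):
    # Log-partition function: A(theta) = log sum_x h(x) exp(theta * T(x))
    return np.log(np.sum(np.exp(theta * xs)))

def expected_T(theta):
    # E_{p_theta}[T(x)] under p_theta(x) = h(x) exp(theta * T(x) - A(theta))
    probs = np.exp(theta * xs - A(theta))
    return np.sum(probs * xs)

theta = 0.7
eps = 1e-6
finite_diff = (A(theta + eps) - A(theta - eps)) / (2 * eps)
print(finite_diff, expected_T(theta))  # both ~0.668, i.e. sigmoid(0.7)
```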

The important things to note about the conditional exponential family are that the log-partition function $A(x, \theta)$ now depends on the input $x$, and that the condition at the optimum becomes a moment-matching condition in which the model's expectation over $\mathsf{y}$ is averaged over the empirical distribution of inputs:

$$ \hat{\mathbb{E}}_{\mathsf{x},\mathsf{y}}[T(\mathsf{x},\mathsf{y})] = \hat{\mathbb{E}}_{\mathsf{x}} \left[ \mathbb{E}_{p_\theta(\mathsf{y} \vert \mathsf{x})} T(\mathsf{x},\mathsf{y}) \right]. $$
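
To make this condition concrete, here is a small sketch (an illustrative example, not from the lecture): logistic regression is a conditional exponential family with $h(x, y) = 1$, $T(x, y) = y\,x$, and $A(x, \theta) = \log(1 + \exp(\theta^\top x))$. Fitting $\theta$ by gradient ascent on $L(\theta)$, the empirical and model-averaged expectations of $T$ coincide at the optimum:

```python
import numpy as np

# Logistic regression as a conditional exponential family (illustrative sketch):
# h(x, y) = 1, T(x, y) = y * x, A(x, theta) = log(1 + exp(theta^T x)).

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
true_theta = np.array([1.0, -2.0, 0.5])  # arbitrary ground-truth parameter
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ true_theta)))).astype(float)

# Gradient ascent: grad L = E-hat_{x,y}[T] - E-hat_x E_{p_theta(y|x)}[T]
theta = np.zeros(d)
for _ in range(5000):
    model_p = 1 / (1 + np.exp(-(X @ theta)))  # p_theta(y = 1 | x) per datum
    theta += 0.5 * (X.T @ (y - model_p)) / n

empirical = X.T @ y / n                             # E-hat_{x,y}[T(x, y)]
model = X.T @ (1 / (1 + np.exp(-(X @ theta)))) / n  # E-hat_x E_{p_theta(y|x)}[T]
print(np.allclose(empirical, model, atol=1e-5))     # True: moments match at the MLE
```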

Question: Is the maximum likelihood objective in exponential families always concave?

Someone asked this last time. The answer is Yes.

First, note that, up to an additive constant that does not depend on $\theta$, the log-likelihood is a sum of two terms:

$$ L(\theta)=\frac{1}{n}\sum_{i=1}^n \theta^\top T(x^{(i)})-A(\theta). $$

Since the first term is a linear function of $\theta$ (and hence both convex and concave), proving that the log-likelihood is concave is equivalent to proving that the log-partition function $A(\theta)$ is convex.
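
One standard way to see this convexity (a sketch of the usual argument): differentiating the gradient identity from the table once more shows that the Hessian of the log-partition function is the covariance matrix of the sufficient statistics,

$$ \frac{\partial^2 A(\theta)}{\partial \theta \, \partial \theta^\top} = \mathbb{E}_{p_\theta(\mathsf{x})} \left[ T(\mathsf{x}) T(\mathsf{x})^\top \right] - \mathbb{E}_{p_\theta(\mathsf{x})} [T(\mathsf{x})] \, \mathbb{E}_{p_\theta(\mathsf{x})} [T(\mathsf{x})]^\top = \mathrm{Cov}_{p_\theta(\mathsf{x})} [T(\mathsf{x})], $$

and covariance matrices are always positive semidefinite, so $A(\theta)$ is convex and the log-likelihood $L(\theta)$ is concave.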