1. Recap of the Exponential Family and Conditional Exponential Family

Exponential family table

$$ \tiny \begin{array}{c c c} & \mathrm{Exponential\ Family} & \mathrm{Conditional\ Exponential\ Family} \\ \hline
\rule{0pt}{5ex} \mathrm{Definition} & p_\theta(x) = h(x) \exp (\theta^\top T(x) - A(\theta)) & p_\theta(y \vert x) = h(x, y) \exp (\theta^\top T(x, y) - A(x, \theta)) \\
\rule{0pt}{8ex} \begin{array}{c} \mathrm{Log\text{-}partition}\\ \mathrm{function} \end{array} & A(\theta)=\log\sum_x h(x)\exp\theta^\top T(x) & A(x, \theta)=\log\sum_y h(x,y)\exp\theta^\top T(x,y) \\
\rule{0pt}{8ex} \begin{array}{c} \mathrm{Gradient\ of}\\ \mathrm{log\text{-}partition}\\ \mathrm{function} \end{array} & \frac{\partial A(\theta)}{\partial \theta} = \mathbb{E}_{p_\theta(\mathsf{x})} [T(\mathsf{x})] & \frac{\partial A(x, \theta)}{\partial \theta} = \mathbb{E}_{p_\theta(\mathsf{y} \vert x)} [T(x, \mathsf{y})] \\
\rule{0pt}{8ex} \begin{array}{c} \mathrm{Objective\ for}\\ \mathrm{a\ single\ datum} \end{array} & \log p_\theta(x) = \log h(x) + \theta^\top T(x) - A(\theta) & \log p_\theta(y \vert x) = \log h(x, y) + \theta^\top T(x, y) - A(x, \theta) \\
\rule{0pt}{8ex} \begin{array}{c} \mathrm{Form\ of\ learning}\\ \mathrm{objective} \end{array} & L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x^{(i)}) & L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(y^{(i)} \vert x^{(i)}) \\
\rule{0pt}{5ex} \mathrm{Alternative\ form} & L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \log h(x^{(i)}) + \theta^\top T(x^{(i)}) \right) - A(\theta) & L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \log h(x^{(i)}, y^{(i)}) + \theta^\top T(x^{(i)}, y^{(i)}) - A(x^{(i)}, \theta) \right) \\
\rule{0pt}{5ex} \mathrm{Alternative\ form} & L(\theta) = \hat{\mathbb{E}}_{\mathsf{x}} [\log h(\mathsf{x}) + \theta^\top T(\mathsf{x})] - A(\theta) & L(\theta) = \hat{\mathbb{E}}_{\mathsf{x}, \mathsf{y}} [\log h(\mathsf{x}, \mathsf{y}) + \theta^\top T(\mathsf{x}, \mathsf{y})] - \hat{\mathbb{E}}_{\mathsf{x}} [A(\mathsf{x}, \theta)] \\
\rule{0pt}{5ex} \begin{array}{c} \mathrm{Condition\ at}\\ \mathrm{optimum} \end{array} & \hat{\mathbb{E}}_{\mathsf{x}} [T(\mathsf{x})] = \mathbb{E}_{p_\theta (\mathsf{x})} [T(\mathsf{x})] & \hat{\mathbb{E}}_{\mathsf{x},\mathsf{y}}[T(\mathsf{x}, \mathsf{y})] = \hat{\mathbb{E}}_{\mathsf{x}} \mathbb{E}_{p_\theta (\mathsf{y} \vert \mathsf{x})}[T(\mathsf{x}, \mathsf{y})] \\ \end{array} $$

In the previous lecture, we learned about the exponential family and the conditional exponential family. Similar to how you can create a conditional random field (CRF) by taking a Markov random field (MRF) and conditioning it on observed variables, you can create a conditional exponential family by taking an exponential family and conditioning it on an input $x$.

Everything we have learned so far is summarized in the table above. The important things to remember about the exponential family are that the gradient of the log-partition function equals the expected sufficient statistics, and that at the maximum likelihood optimum the empirical expectation of the sufficient statistics matches the model's expectation: $\hat{\mathbb{E}}_{\mathsf{x}} [T(\mathsf{x})] = \mathbb{E}_{p_\theta (\mathsf{x})} [T(\mathsf{x})]$.
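
As a minimal numerical sanity check of the gradient identity (an illustrative sketch, not from the lecture), take a Bernoulli variable written as an exponential family with $h(x) = 1$ and $T(x) = x$, and compare a finite-difference estimate of $\partial A(\theta) / \partial \theta$ against $\mathbb{E}_{p_\theta(\mathsf{x})}[T(\mathsf{x})]$:

```python
import numpy as np

# Bernoulli as an exponential family: h(x) = 1, T(x) = x, x in {0, 1}.
xs = np.array([0.0, 1.0])  # support of x

def A(theta):
    # Log-partition function: A(theta) = log sum_x h(x) exp(theta * T(x))
    return np.log(np.sum(np.exp(theta * xs)))

def expected_T(theta):
    # E_{p_theta}[T(x)] under p_theta(x) = h(x) exp(theta * T(x) - A(theta))
    probs = np.exp(theta * xs - A(theta))
    return np.sum(probs * xs)

theta = 0.7
eps = 1e-6
finite_diff = (A(theta + eps) - A(theta - eps)) / (2 * eps)
print(finite_diff, expected_T(theta))  # both ~0.668, i.e. sigmoid(0.7)
```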

The important things to note about the conditional exponential family are that the log-partition function $A(x, \theta)$ now depends on the input $x$, and that the condition at the optimum becomes a moment-matching condition in which the model's expectation over $\mathsf{y}$ is averaged over the empirical distribution of inputs:

$$ \hat{\mathbb{E}}_{\mathsf{x},\mathsf{y}}[T(\mathsf{x},\mathsf{y})] = \hat{\mathbb{E}}_{\mathsf{x}} \left[ \mathbb{E}_{p_\theta(\mathsf{y} \vert \mathsf{x})} T(\mathsf{x},\mathsf{y}) \right]. $$
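
To make this condition concrete, here is a small sketch (an illustrative example, not from the lecture): logistic regression is a conditional exponential family with $h(x, y) = 1$, $T(x, y) = y\,x$, and $A(x, \theta) = \log(1 + \exp(\theta^\top x))$. Fitting $\theta$ by gradient ascent on $L(\theta)$, the empirical and model-averaged expectations of $T$ coincide at the optimum:

```python
import numpy as np

# Logistic regression as a conditional exponential family (illustrative sketch):
# h(x, y) = 1, T(x, y) = y * x, A(x, theta) = log(1 + exp(theta^T x)).

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
true_theta = np.array([1.0, -2.0, 0.5])  # arbitrary ground-truth parameter
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ true_theta)))).astype(float)

# Gradient ascent: grad L = E-hat_{x,y}[T] - E-hat_x E_{p_theta(y|x)}[T]
theta = np.zeros(d)
for _ in range(5000):
    model_p = 1 / (1 + np.exp(-(X @ theta)))  # p_theta(y = 1 | x) per datum
    theta += 0.5 * (X.T @ (y - model_p)) / n

empirical = X.T @ y / n                             # E-hat_{x,y}[T(x, y)]
model = X.T @ (1 / (1 + np.exp(-(X @ theta)))) / n  # E-hat_x E_{p_theta(y|x)}[T]
print(np.allclose(empirical, model, atol=1e-5))     # True: moments match at the MLE
```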

Question: Is the maximum likelihood objective in exponential families always concave?

Someone asked this last time. The answer is Yes.

First, note that, up to an additive constant that does not depend on $\theta$, the log-likelihood is a sum of two terms:

$$ L(\theta)=\frac{1}{n}\sum_{i=1}^n \theta^\top T(x^{(i)})-A(\theta). $$

Since the first term is a linear function of $\theta$ (and hence both convex and concave), proving that the log-likelihood is concave is equivalent to proving that the log-partition function $A(\theta)$ is convex.
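
One standard way to see this convexity (a sketch of the usual argument): differentiating the gradient identity from the table once more shows that the Hessian of the log-partition function is the covariance matrix of the sufficient statistics,

$$ \frac{\partial^2 A(\theta)}{\partial \theta \, \partial \theta^\top} = \mathbb{E}_{p_\theta(\mathsf{x})} \left[ T(\mathsf{x}) T(\mathsf{x})^\top \right] - \mathbb{E}_{p_\theta(\mathsf{x})} [T(\mathsf{x})] \, \mathbb{E}_{p_\theta(\mathsf{x})} [T(\mathsf{x})]^\top = \mathrm{Cov}_{p_\theta(\mathsf{x})} [T(\mathsf{x})], $$

and covariance matrices are always positive semidefinite, so $A(\theta)$ is convex and the log-likelihood $L(\theta)$ is concave.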