The exponential family is a powerful tool that lets us reason about many probabilistic models in a general way. In particular, undirected models are exponential families, which is why we are interested in learning about them. In this lecture, we will introduce the definition of the exponential family, give examples of exponential families, discuss their key properties, and derive maximum likelihood learning in exponential families.
An exponential family is a set of distributions that can be written in the following form:
$$ \begin{aligned} p_{\theta}(x) = h(x) \exp(\theta^\top T(x) - A(\theta)). \end{aligned} $$
The components are defined as follows:
For continuous distributions, the log partition function $A(\theta)$ is defined as
$$ \begin{aligned} A(\theta) = \log\int_x h(x) \exp(\theta^\top T(x))dx. \end{aligned} $$
For discrete distributions, the log partition function $A(\theta)$ is defined as
$$ \begin{aligned} A(\theta) = \log\sum_x h(x) \exp(\theta^\top T(x)). \end{aligned} $$
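As a quick sanity check on the discrete definition, here is a minimal Python sketch. The specific setup is our own choice for illustration: $x \in \{0, 1\}$, $h(x) = 1$, and $T(x) = x$ (a Bernoulli in natural-parameter form), for which $A(\theta) = \log(1 + e^{\theta})$.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical discrete example: x in {0, 1}, h(x) = 1, T(x) = x.
# Then A(theta) = log(1 + exp(theta)).
xs = np.array([0.0, 1.0])          # support of x
theta = 0.5                        # a natural parameter value

def T(x):
    return x                       # sufficient statistic T(x) = x

def h(x):
    return np.ones_like(x)         # scaling constant h(x) = 1

# A(theta) = log sum_x h(x) exp(theta * T(x)), computed stably with logsumexp.
A = logsumexp(theta * T(xs), b=h(xs))
print(A, np.log1p(np.exp(theta)))  # the two values agree
```

Computing the sum in log space with `logsumexp` avoids overflow when the scores $\theta^\top T(x)$ are large.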
$T(x)$ is called the sufficient statistic because the average value of $T(x)$ over a dataset is "sufficient" for maximum likelihood learning. We will see this in detail in the maximum likelihood learning section.
$h(x)$ is a function of the variable $x$, but is called the scaling constant because it does not depend on the parameter $\theta$. In this course it will almost always be $1$, but we keep it in derivations to be consistent with standard presentations of the exponential family.
$A(\theta)$ is called the log partition function, and exponentiating it gives the normalizer of the distribution. This is easy to see once we rewrite the density (using the continuous-case definition of $A(\theta)$) as
$$ \begin{aligned}
p_{\theta}(x) &= h(x) \exp(\theta^\top T(x) - A(\theta))\\ &=\frac{h(x) \exp(\theta^\top T(x))}{\exp(A(\theta))}\\ &= \frac{h(x) \exp(\theta^\top T(x))}{\int_x h(x) \exp(\theta^\top T(x))dx}.
\end{aligned} $$
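To illustrate this numerically, here is a short sketch under assumptions of our own choosing (not fixed by the notes): take $h(x) = e^{-x^2/2}/\sqrt{2\pi}$ and $T(x) = x$. Completing the square shows the integral in the denominator equals $\exp(\theta^2/2)$, so $A(\theta) = \theta^2/2$ and $p_\theta$ is a unit-variance Gaussian with mean $\theta$.

```python
import numpy as np

theta = 1.3
xs = np.linspace(-10.0, 10.0, 20001)          # fine grid covering effectively all of R
dx = xs[1] - xs[0]

h = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)   # scaling "constant" h(x)
T = xs                                        # sufficient statistic T(x) = x

# Normalizer: integral of h(x) exp(theta * T(x)) dx, approximated by a Riemann sum.
Z = np.sum(h * np.exp(theta * T)) * dx
A = np.log(Z)
print(A, theta**2 / 2)                        # both approximately 0.845

# The resulting density integrates to 1, as the rewritten form above shows it must.
p = h * np.exp(theta * T - A)
print(np.sum(p) * dx)                         # approximately 1.0
```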
To understand the exponential family intuitively, suppose we have some "features" $T(x)$ and we would like to use a set of linear "scores" $\theta^\top T(x)$ to define a distribution. The simplest way to do this is to exponentiate the scores, which guarantees they are positive, and then divide by the normalizer $\exp(A(\theta))$ so that the probabilities of all possible values sum (or integrate) to $1$. From this perspective, the exponential family is the simplest way to turn a set of linear scores into a probability distribution. A short sketch of this construction follows.
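The following sketch makes the intuition concrete for a finite support; the particular features and parameter values are made up for illustration. We compute linear scores $\theta^\top T(x)$, exponentiate them, and normalize.

```python
import numpy as np

# Hypothetical finite support with a two-dimensional feature vector T(x) per value of x.
T = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])          # rows are T(x) for x = 0, 1, 2
theta = np.array([0.2, -0.5])       # natural parameters

scores = T @ theta                  # linear scores theta^T T(x)
A = np.log(np.sum(np.exp(scores)))  # log partition function (with h(x) = 1)
p = np.exp(scores - A)              # p_theta(x) = exp(theta^T T(x) - A(theta))

print(p, p.sum())                   # a valid distribution: nonnegative, sums to 1
```

Note that this is exactly a softmax over the scores; changing $\theta$ reweights the same features into a different member of the family.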