In the previous lectures on maximum likelihood learning, we discussed how it is equivalent to minimizing the KL-divergence between the true distribution and our model. Today, we will have a more nuanced discussion of the maximum likelihood estimator. In particular, we will discuss its efficiency; by efficiency we mean how much error the estimator tends to have for a given amount of data. It turns out that maximum likelihood is pretty good from this perspective: in a certain sense, there cannot be any estimator that's much better than it.

Recap of the last lecture

In the last lecture, we considered the important question of "to condition or not to condition?" We used two different lenses to look at this question. First, we looked at the statistical implications. Second, we discussed the computational arguments.

Statistically, if we have a perfect model, then joint maximum likelihood has lower variance than the conditional alternative. However, if we do not have a perfect model but have access to infinite data, conditional maximum likelihood finds a better (or equally good) solution.

Computationally, in each iteration, joint maximum likelihood solves a harder inference problem (as it spans both $\mathsf{x}$ and $\mathsf{y}$.) Meanwhile, conditional maximum likelihood solves an easier inference problem (over $\mathsf{y}$ only) more times; in particular, it solves that easier problem $N$ times in each iteration.

In a utopian world, we would hope to have a much more concrete answer to our initial question. In practice, however, there is often no easy way to decide whether to condition or not. One needs to analyze the problem at hand and make the decision accordingly.

We will now shift the focus to today's lecture.

How good is maximum likelihood?

Let's suppose that the true distribution is $p_{\theta^*}(x)$. This might look innocent but it's a strong assumption — we are assuming there is no model error, so we can represent the true distribution. Now suppose you gather data $x^{(1)},\dots,x^{(n)}$ generated by the true distribution $p_{\theta^*}(x)$. (Of course, you don't know $\theta^*$.) Given this data, you maximize the likelihood and get an estimate $\hat{\theta}_n$.
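As a concrete example to keep in mind (this particular model is our illustration, not one the lecture specifies): if the model is a Gaussian with unknown mean $\theta$ and known variance $\sigma^2$, maximizing the log-likelihood gives the sample mean as the estimate:

$$ p_{\theta}(x) = \mathcal{N}(x \mid \theta, \sigma^2), \qquad \hat{\theta}_n = \arg\max_{\theta} \sum_{i=1}^{n} \log p_{\theta}(x^{(i)}) = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}. $$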

It is important that we think of this $\hat{\theta}_n$ as a random variable: we think of the data as being randomly generated, and since our estimator depends on the data, it is also random. We get different estimates for different datasets. Our goal is to understand whether maximum likelihood learning tends to find an estimate $\hat{\theta}_n$ that is close to $\theta^*$.

Now $\hat{\theta}_n$ will almost never be exactly equal to $\theta^*$ (we'd have to be really lucky.) What we want to quantify is: how different is $\hat{\theta}_n$ from $\theta^*$, usually? We will look at the average performance, averaged over all possible datasets. If $\hat{\theta}_n$ tends to be close to $\theta^*$, then we'd think that maximum likelihood is a good estimator.
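To make the randomness of $\hat{\theta}_n$ tangible, here is a minimal simulation sketch (the Gaussian-mean setup and all variable names are illustrative assumptions, not from the lecture): we draw many datasets from $p_{\theta^*}$, compute the maximum likelihood estimate on each, and look at how the estimates scatter around $\theta^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0       # true mean (plays the role of theta*); variance fixed at 1
n = 100                # size of each dataset
num_datasets = 10_000  # how many independent datasets we simulate

# For a Gaussian with known variance, the MLE of the mean is the sample mean,
# so each simulated dataset yields one estimate theta_hat_n.
estimates = np.array([
    rng.normal(loc=theta_star, scale=1.0, size=n).mean()
    for _ in range(num_datasets)
])

print("average estimate:  ", estimates.mean())                        # ~ theta_star
print("std of estimates:  ", estimates.std())                         # ~ 1/sqrt(n) = 0.1
print("mean squared error:", np.mean((estimates - theta_star) ** 2))  # ~ 1/n = 0.01
```

The histogram of `estimates` concentrates around $\theta^*$ as $n$ grows, which is exactly the averaged-over-datasets notion of performance described above.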

In general, this is hard to say, but we can make excellent approximations when $n$ is large (in the asymptotic sense.) Before introducing the main result, we will go over some ingredients we'll need.

Ingredient #1: Central Limit Theorem (CLT)

Let $\mathsf{x}_1, \mathsf{x}_2, \dots, \mathsf{x}_n$ be i.i.d. variables with mean $\mu$ and covariance matrix $\Sigma$. Define the sample mean as

$$ \bar{\mathsf{x}}_n = \frac{1}{n}\sum_{i=1}^{n} \mathsf{x}_i. $$

Then, the central limit theorem says that

$$ \sqrt{n}(\bar{\mathsf{x}}_n - \mu) \xrightarrow{\text{D}} \mathcal{N}(0, \Sigma). $$

In words, the CLT says that, as $n$ increases, the sample mean gets closer and closer to the true mean $\mu$; the distribution of $\bar{\mathsf{x}}_n$ will start to look like a Gaussian centered around the true mean $\mu$ with a covariance that shrinks like $1/n$.

Informally, you can think of this result as saying that for large $n$, $\bar{\mathsf{x}}_n$ is approximately sampled from $\mathcal{N}(\mu, \frac{1}{n} \Sigma)$. We put the $\mu$ and $\sqrt{n}$ on the left in the main result so that the right-hand side does not depend on $n$; convergence in distribution only makes sense when the limit is a fixed distribution.
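As a quick numerical check of this informal statement (an illustrative sketch under assumed inputs, not from the lecture): even for draws from a very non-Gaussian distribution, $\sqrt{n}(\bar{\mathsf{x}}_n - \mu)$ should look like $\mathcal{N}(0, \sigma^2)$ once $n$ is large.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential(1) draws: mean mu = 1, variance sigma^2 = 1, and clearly non-Gaussian.
mu = 1.0
num_trials = 100_000

for n in [10, 100, 1000]:
    samples = rng.exponential(scale=1.0, size=(num_trials, n))
    z = np.sqrt(n) * (samples.mean(axis=1) - mu)  # sqrt(n) * (sample mean - mu)
    # var(z) equals sigma^2 = 1 exactly for every n; what improves with n is the
    # *shape*: the third moment shrinks toward 0, its value for a Gaussian.
    print(f"n={n:5d}   var(z)={z.var():.3f}   third moment={np.mean(z**3):.3f}")
```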