1. Revisiting KL Divergence and Monte Carlo

In the last lecture we discussed two important tools. First, we talked about the Kullback-Leibler (KL) divergence, which is a way to measure how different two distributions are. Formally,

$$ \begin{aligned}

KL(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)} .

\end{aligned} $$

Recall that KL Divergence is not symmetric. That is, in general $KL(p||q) \neq KL(q||p)$. (They might be equal in special cases, for example if $p=q$ then both divergences are zero.)
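The asymmetry is easy to verify numerically. Below is a minimal sketch with two made-up discrete distributions over a three-element support (the specific probability values are chosen only for illustration):

```python
import numpy as np

def kl(p, q):
    """KL(p||q) = sum_x p(x) * log(p(x) / q(x)) for discrete p, q."""
    return float(np.sum(p * np.log(p / q)))

# Two toy distributions over the same three-element support.
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])

print(kl(p, q))  # KL(p||q)
print(kl(q, p))  # KL(q||p) -- in general a different value
print(kl(p, p))  # zero, since the distributions are identical
```

Swapping the arguments changes the answer, while `kl(p, p)` is exactly zero, matching the special case noted above.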

Second, we talked about Monte-Carlo (MC) approximations. The idea is that an expectation $\mathbb{E}[f(x)]$ under a distribution $p$ can be approximated by drawing $N$ samples from $p$ and averaging the function over them. That is, if $x_n \sim p$, then

$$ \begin{aligned} \forall f,\ \ \mathbb{E}[f(x)] \approx \frac{1}{N} \sum_{n=1}^N f(x_n). \end{aligned} $$
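As a concrete sketch, take $x \sim \mathcal{N}(0, 1)$ and $f(x) = x^2$, so the true expectation is $\mathbb{E}[x^2] = 1$ (the variance). The sample size and seed below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate E[f(x)] for x ~ N(0, 1) with f(x) = x**2; the true value is 1.
N = 100_000
samples = rng.standard_normal(N)
mc_estimate = (samples**2).mean()

print(mc_estimate)  # close to 1 for large N
```

The estimate concentrates around the true value as $N$ grows, at the usual $O(1/\sqrt{N})$ Monte-Carlo rate.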

Next, we will discuss why counting is still a reasonable way to approximate conditional distributions even if some of the asserted conditional independencies (CIs) do not hold.

2. Maximum Likelihood

Setup: Suppose we are given a dataset $x^{(1)}, x^{(2)}, \cdots, x^{(N)}$ sampled from a true distribution $p^*$ and we would like to minimize $KL(p^*||p_w)$. Formally, we would like to minimize

$$ \begin{aligned}

KL(p^*||p_w) = \sum_x p^*(x) \log \frac{p^*(x)}{p_w(x)} .

\end{aligned} $$

We cannot directly minimize this because we don't know $p^*$. However, we do have independent and identically distributed (i.i.d.) data points sampled from $p^*$. Thus, with the dataset $x^{(1)}, x^{(2)}, \cdots, x^{(N)}$ we can approximate expectations w.r.t. $p^*$ using the Monte-Carlo approximation we discussed in the last lecture. The approximated KL divergence can then be written as

$$ \begin{aligned}

KL(p^*||p_w) \approx& \frac{1}{N} \sum_{n=1}^N \log \frac{p^*(x^{(n)})}{p_w(x^{(n)})} \\ =& \frac{1}{N} \sum_{n=1}^N \log p^*(x^{(n)}) - \frac{1}{N} \sum_{n=1}^N \log p_w(x^{(n)}). \end{aligned} $$

We still can't compute the above approximation, because we don't know $p^*$. Nevertheless, we can still minimize it over $w$, because the term involving $p^*$ does not depend on $w$. Formally, we can write that

$$ \begin{alignedat}{2} \argmin_w \frac{1}{N} \sum_{n=1}^N \log \frac{p^*(x^{(n)})}{p_w(x^{(n)})} =& \argmin_w \frac{1}{N} \sum_{n=1}^N \log p^*(x^{(n)}) - \frac{1}{N} \sum_{n=1}^N \log p_w(x^{(n)}) \\ =& \argmax_w \frac{1}{N} \sum_{n=1}^N \log p_w(x^{(n)}). \end{alignedat} $$

This last expression is a maximization of the log-likelihood.

In general, given a dataset $x^{(1)}, x^{(2)}, \cdots, x^{(N)}$, the likelihood of parameters $w$ is $\prod_{n=1}^N p_w(x^{(n)})$.
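To make this concrete, here is a small sketch for a Bernoulli model, where the dataset, true parameter, and grid search are made-up choices for illustration. Maximizing the log-likelihood over a grid of parameter values recovers the empirical frequency, which is exactly the "counting" estimate mentioned earlier:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Bernoulli data with true parameter 0.7 (unknown to the learner).
data = rng.random(10_000) < 0.7

def log_likelihood(w, data):
    """Log-likelihood of Bernoulli parameter w: sum_n log p_w(x^(n))."""
    k = data.sum()          # number of ones
    n = len(data)
    return k * np.log(w) + (n - k) * np.log(1 - w)

# Maximize over a coarse grid of candidate parameters.
grid = np.linspace(0.01, 0.99, 99)
w_hat = grid[np.argmax([log_likelihood(w, data) for w in grid])]

print(w_hat)         # maximizer of the log-likelihood
print(data.mean())   # empirical frequency -- essentially the same value
```

For the Bernoulli model the maximum-likelihood estimate has the closed form $\hat{w} = \frac{1}{N}\sum_n x^{(n)}$, so the grid search is only there to make the maximization explicit.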

3. Conclusion

Therefore, maximizing the log-likelihood of the data is equivalent to minimizing the Monte-Carlo approximation of KL divergence.

Takeaway: The important takeaway from this discussion is that given the dataset $x^{(1)}, x^{(2)}, \cdots, x^{(N)}$ it's reasonable to choose $w$ by the following criterion:

$$ \begin{aligned}

\argmax_w \frac{1}{N} \sum_{n=1}^N \log p_w(x^{(n)}).

\end{aligned} $$
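The equivalence can also be checked in simulation, where (unlike in practice) we know $p^*$ and can evaluate both objectives. The discrete support, the true distribution, and the one-parameter model family below are all invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Known true distribution over {0, 1, 2} -- only available in simulation.
p_star = np.array([0.5, 0.3, 0.2])
data = rng.choice(3, size=5_000, p=p_star)

def p_w(w):
    """Toy one-parameter model family: p_w = (w, 0.6*(1-w), 0.4*(1-w))."""
    return np.array([w, (1 - w) * 0.6, (1 - w) * 0.4])

grid = np.linspace(0.05, 0.95, 91)

# Monte-Carlo approximation of KL(p*||p_w), and the average log-likelihood.
kl_mc = [np.mean(np.log(p_star[data] / p_w(w)[data])) for w in grid]
avg_ll = [np.mean(np.log(p_w(w)[data])) for w in grid]

# The two objectives differ by a constant in w, so the KL minimizer
# and the log-likelihood maximizer are the same parameter.
print(grid[np.argmin(kl_mc)], grid[np.argmax(avg_ll)])
```

Since the approximated KL and the negative average log-likelihood differ only by the constant $\frac{1}{N}\sum_n \log p^*(x^{(n)})$, the minimizer of one is exactly the maximizer of the other, as derived above.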