1. Revisiting learning in directed models

In this lecture, we finish maximum likelihood (ML) learning in directed models. Then we move on to ML learning in undirected models. Finally, we discuss the Hammersley-Clifford theorem.

In the last class, we showed that maximum likelihood learning in discrete, fully observed directed models reduces to counting. The final result we derived was

$$ w_{x_i | x_{\mathrm{pa}(i)}}^{\mathsf{x}_i | \mathsf{x}_{\mathrm{pa}(i)}} = \frac{\#(x_i, x_{\mathrm{pa}(i)})}{\#(x_{\mathrm{pa}(i)})}. $$
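As a quick sanity check on this counting rule, here is a minimal Python sketch (not from the lecture) of the estimator for a single node given its parents. The function name `mle_cpt` and the dictionary-of-counts representation are illustrative choices, not the course's code.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Maximum-likelihood CPT by counting: w_{x | pa} = #(x, pa) / #(pa).

    data    : list of dicts mapping variable name -> observed value
    child   : name of the child variable x_i
    parents : list of parent variable names pa(i)
    """
    joint = Counter()   # counts of (parent configuration, child value)
    parent = Counter()  # counts of parent configuration alone
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        parent[pa] += 1
    # normalize each parent configuration separately
    return {key: cnt / parent[key[0]] for key, cnt in joint.items()}
```

Each parent configuration is normalized separately, which mirrors the parameterization $w_{x_i | x_{\mathrm{pa}(i)}}$ above.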

Learning Example

Setup: Consider another simple example of learning in directed models. We have three variables in the graph shown in Fig 1, connected in a "collider" structure: $x_1$ and $x_2$ are both parents of $x_3$.

![Fig 1: Simple DAG](https://s3-us-west-2.amazonaws.com/secure.notion-static.com/7fe1c13c-f917-48a6-9ebb-3377dc1e6239/lect6_directed.svg)

**Fig 1**: Simple DAG

Dataset: We have 8 training data points, shown in the table below.

$$ \begin{array}{|c|c|c|c|}
\hline
\text{n} & x_1^{(n)} & x_2^{(n)} & x_3^{(n)} \\ \hline
1 & 0 & 0 & 0 \\ \hline
2 & 0 & 1 & 0 \\ \hline
3 & 1 & 0 & 0 \\ \hline
4 & 0 & 1 & 1 \\ \hline
5 & 1 & 0 & 1 \\ \hline
6 & 1 & 1 & 0 \\ \hline
7 & 1 & 1 & 1 \\ \hline
8 & 1 & 0 & 0 \\ \hline
\end{array} $$

Solution: Let's apply the counting principle to find the probability distributions.

The marginal distributions $p_w(x_1)$ and $p_w(x_2)$ are

$$ \begin{array}{ll}
p_w(\mathsf{x}_{1}=0) = w_0^{\mathsf{x}_{1}} = \frac{3}{8} \qquad & p_w(\mathsf{x}_{2}=0) = w_0^{\mathsf{x}_{2}} = \frac{1}{2} \\
p_w(\mathsf{x}_{1}=1) = w_1^{\mathsf{x}_{1}} = \frac{5}{8} \qquad & p_w(\mathsf{x}_{2}=1) = w_1^{\mathsf{x}_{2}} = \frac{1}{2} \\
\end{array} $$

The conditional distribution $p_w(x_3|x_1,x_2)$ is

$$ \begin{array}{l}
p_w(\mathsf{x}_{3}=0 | \mathsf{x}_{1}=0, \mathsf{x}_{2}=0) = w_{0|0,0}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = 1 \\
p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=0, \mathsf{x}_{2}=0) = w_{1|0,0}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = 0 \\
p_w(\mathsf{x}_{3}=0 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=0) = w_{0|1,0}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{2}{3} \\
p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=0) = w_{1|1,0}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{3} \\
p_w(\mathsf{x}_{3}=0 | \mathsf{x}_{1}=0, \mathsf{x}_{2}=1) = w_{0|0,1}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{2} \\
p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=0, \mathsf{x}_{2}=1) = w_{1|0,1}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{2} \\
p_w(\mathsf{x}_{3}=0 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=1) = w_{0|1,1}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{2} \\
p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=1) = w_{1|1,1}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{2}. \\
\end{array} $$

Note: e.g. $p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=1) = \frac{\#(\mathsf{x}_{3}=1, \mathsf{x}_{1}=1, \mathsf{x}_{2}=1)}{\#(\mathsf{x}_{1}=1, \mathsf{x}_{2}=1)}.$
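Continuing the `mle_cpt` sketch from above, applying it to the eight training points reproduces these numbers (expected values are noted in the comments); the variable names `x1`, `x2`, `x3` are just labels for this example.

```python
# the 8 training points (x1, x2, x3) from the table above
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (0, 1, 1),
        (1, 0, 1), (1, 1, 0), (1, 1, 1), (1, 0, 0)]
data = [{"x1": a, "x2": b, "x3": c} for a, b, c in rows]

print(mle_cpt(data, "x1", []))            # {((), 0): 0.375, ((), 1): 0.625}  i.e. 3/8 and 5/8
print(mle_cpt(data, "x2", []))            # {((), 0): 0.5, ((), 1): 0.5}
print(mle_cpt(data, "x3", ["x1", "x2"]))  # e.g. ((1, 0), 0): 0.667 and ((1, 0), 1): 0.333
```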

Summary of Learning in directed models

Let's summarize learning in directed models.