In this lecture, we are going to finish maximum likelihood (ML) learning in directed models. Then we will discuss ML learning in undirected models. Finally, we will cover the Hammersley-Clifford theorem.
In the last class, we derived that maximum likelihood learning in discrete, fully observed directed models reduces to counting. The final result we derived was as follows:
$$ w_{x_i | x_{\mathrm{pa}(i)}}^{\mathsf{x}_i | \mathsf{x}_{\mathrm{pa}(i)}} = \frac{\#(x_i, x_{\mathrm{pa}(i)})}{\#(x_{\mathrm{pa}(i)})}. $$
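To make the counting result concrete, here is a minimal Python sketch of this estimator. The function name `cpt_mle` and the list-of-tuples data layout are my own choices for illustration, not part of the lecture.

```python
from collections import Counter

def cpt_mle(data, child, parents):
    """ML estimate of p(x_child | x_parents) by counting.

    data    : list of tuples of discrete values, one tuple per data point
    child   : index of the child variable
    parents : list of indices of the parent variables
    Returns a dict mapping (child_value, parent_values) -> probability.
    """
    # Count joint occurrences of (child value, parent configuration)
    joint = Counter((x[child], tuple(x[p] for p in parents)) for x in data)
    # Count occurrences of each parent configuration
    parent_counts = Counter(tuple(x[p] for p in parents) for x in data)
    # Divide joint counts by parent counts, exactly as in the formula above
    return {(xc, xpa): n / parent_counts[xpa] for (xc, xpa), n in joint.items()}
```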
Setup: Consider another simple example of learning a directed model. We have 3 variables in the graph shown in Fig 1, arranged in a "collider" relationship.
**Fig 1**: Simple DAG
Dataset: We have 8 training data points, as shown in the table below.
$$ \begin{array}{|c|c|c|c|}\hline \text{n} & x_1^{(n)}& x_2^{(n)} & x_3^{(n)} \\\hline 1&0&0&0\\\hline 2&0&1&0\\\hline 3&1&0&0\\\hline 4&0&1&1\\\hline 5&1&0&1\\\hline 6&1&1&0\\\hline 7&1&1&1\\\hline 8&1&0&0\\\hline
\end{array} $$
Solution: Let's apply our counting principle to estimate the distributions.
The marginal distributions $p_w(x_1)$ and $p_w(x_2)$ are
$$ \begin{array}{l c l}
p_w(\mathsf{x}_{1}=0) = w_0^{\mathsf{x}_{1}} = \frac{3}{8} & \qquad & p_w(\mathsf{x}_{2}=0) = w_0^{\mathsf{x}_{2}} = \frac{1}{2} \\
p_w(\mathsf{x}_{1}=1) = w_1^{\mathsf{x}_{1}} = \frac{5}{8} & \qquad & p_w(\mathsf{x}_{2}=1) = w_1^{\mathsf{x}_{2}} = \frac{1}{2} \\
\end{array} $$
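As a quick sanity check (my own addition, not part of the lecture), these marginals can be recovered by counting in Python; the lists below simply transcribe the $x_1$ and $x_2$ columns of the table.

```python
from collections import Counter

# Columns x1 and x2 transcribed from the table of 8 training points
x1 = [0, 0, 1, 0, 1, 1, 1, 1]
x2 = [0, 1, 0, 1, 0, 1, 1, 0]

N = len(x1)
print({v: c / N for v, c in sorted(Counter(x1).items())})  # {0: 0.375, 1: 0.625} = 3/8, 5/8
print({v: c / N for v, c in sorted(Counter(x2).items())})  # {0: 0.5, 1: 0.5}
```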
The conditional distribution $p_w(x_3|x_1,x_2)$ is
$$ \begin{array}{l}
p_w(\mathsf{x}_{3}=0 | \mathsf{x}_{1}=0, \mathsf{x}_{2}=0) = w_{0|0,0}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = 1 \\
p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=0, \mathsf{x}_{2}=0) = w_{1|0,0}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = 0 \\
p_w(\mathsf{x}_{3}=0 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=0) = w_{0|1,0}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{2}{3} \\
p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=0) = w_{1|1,0}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{3} \\
p_w(\mathsf{x}_{3}=0 | \mathsf{x}_{1}=0, \mathsf{x}_{2}=1) = w_{0|0,1}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{2} \\
p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=0, \mathsf{x}_{2}=1) = w_{1|0,1}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{2} \\
p_w(\mathsf{x}_{3}=0 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=1) = w_{0|1,1}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{2} \\
p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=1) = w_{1|1,1}^{\mathsf{x}_{3}|\mathsf{x}_{1}, \mathsf{x}_{2}} = \frac{1}{2}. \\
\end{array}
$$
Note: e.g. $p_w(\mathsf{x}_{3}=1 | \mathsf{x}_{1}=1, \mathsf{x}_{2}=1) = \frac{\#(\mathsf{x}_{3}=1, \mathsf{x}_{1}=1, \mathsf{x}_{2}=1)}{\#(\mathsf{x}_{1}=1, \mathsf{x}_{2}=1)}.$
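The same counting can be done programmatically. Below is a short sketch (my own, assuming a list-of-tuples transcription of the table) that reproduces the conditional table for $p_w(x_3 | x_1, x_2)$.

```python
from collections import Counter

# The 8 training points (x1, x2, x3) transcribed from the table above
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (0, 1, 1),
        (1, 0, 1), (1, 1, 0), (1, 1, 1), (1, 0, 0)]

joint = Counter(((x1, x2), x3) for x1, x2, x3 in data)   # counts of (parent config, child value)
pa = Counter((x1, x2) for x1, x2, _ in data)             # counts of parent configs only
cpt = {(x3, parents): n / pa[parents] for (parents, x3), n in joint.items()}

print(cpt[(1, (1, 1))])  # 0.5 = #(x3=1, x1=1, x2=1) / #(x1=1, x2=1)
print(cpt[(0, (1, 0))])  # 2/3 = #(x3=0, x1=1, x2=0) / #(x1=1, x2=0)
```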
Let's summarize learning in directed models.