In this lecture, we will continue our discussion on Markov chain Monte Carlo (MCMC). We will start by discussing Markov chains and detailing their properties. Then, we will discuss the idea behind Markov chain Monte Carlo, and lastly, we will discuss conditions under which MCMC can be used to solve interesting problems.
Last time, we talked about Monte Carlo (MC) methods. The idea behind MC methods is very simple: to approximate an expectation, we draw a bunch of samples and use the plug-in estimator.
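To make this concrete, here is a minimal sketch of the plug-in estimator; the target distribution and test function below are arbitrary choices for illustration, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target (our choice): estimate E[f(x)] for f(x) = x^2
# with x ~ N(0, 1); the true value is 1, the variance of the standard normal.
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
plug_in_estimate = np.mean(samples**2)  # average of f over the samples

print(plug_in_estimate)  # close to 1.0
```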
A Markov chain refers to a chain of transitions of random variables defined over the same set of states. We start with some distribution $p_0$ over the set of states. Then, as the Markov chain proceeds, the initial distribution $p_0$ transitions to $p_t$ at time $t$. The power of Markov chains lies in the fact that we can control the final distribution the chain converges to by controlling the transition probabilities. Here, we discuss the main elements of a Markov chain.
States. For brevity, we assume a finite state space: the set of states is $\{1, 2, \dots, D\}$; however, our discussion will hold more generally.
Variables. The chain is a sequence of random variables $\{\mathsf{x}_1,\mathsf{x}_2,\dots\}$, where each $\mathsf{x}_t \in \{1,2,\dots,D\}$. Each random variable is associated with its own distribution: $\mathsf x_t$ is distributed as $p_t$. Further, note that the chain need not be of finite length; that is, the chain can go on forever.
Transition Probabilities. The probabilities governing the transition from $\mathsf x_t$ to $\mathsf x_{t+1}$ at any time step $t$ are represented as
$$ p(\mathsf{x}_{t+1} = j \vert \mathsf{x}_t = i) = T_{ij}, $$
where $T_{ij}$ is the entry at row $i$ and column $j$ of the two-dimensional transition matrix $T$. Since we are working with finite states, we will often use the $T_{ij}$ notation; however, the left-hand side of the above equation is a more generic representation for the transition probabilities.
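To make the notation concrete, here is a minimal sketch of simulating such a chain; the $3 \times 3$ matrix below is an arbitrary example (not from the lecture), and states are 0-indexed in the code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example transition matrix for D = 3 states; row i holds
# p(x_{t+1} = j | x_t = i), so each row sums to 1.
T = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])

def simulate_chain(T, x0, num_steps, rng):
    """Sample a trajectory x_0, x_1, ..., x_{num_steps}."""
    states = [x0]
    for _ in range(num_steps):
        # The next state is drawn from the row of T indexed by the current state.
        states.append(rng.choice(len(T), p=T[states[-1]]))
    return states

print(simulate_chain(T, x0=0, num_steps=10, rng=rng))
```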
We say that a Markov chain has the Markov property if we only have a single-step dependency; that is, each variable depends on the past only through the immediately preceding variable. Formally, this means that we can factorize the conditional over states $x_1, x_2, \dots, x_N$ given $x_0$ as
$$ \begin{aligned}
p(x_1, x_2, \dots, x_N \vert x_0) &= p(x_1 \vert x_0) p(x_2 \vert x_1) \dots p(x_N \vert x_{N-1}) \\ &= T_{x_0 x_1}\times T_{x_1 x_2} \times \dots \times T_{x_{N-1} x_N}, \end{aligned} $$
where the second equality follows from the definition of the transition probabilities.
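Given this factorization, the conditional probability of an entire trajectory is just a product of entries of $T$. A minimal sketch, reusing the example matrix from above:

```python
import numpy as np

def path_prob(T, path):
    """p(x_1, ..., x_N | x_0) for path = [x_0, x_1, ..., x_N]."""
    prob = 1.0
    for i, j in zip(path[:-1], path[1:]):
        prob *= T[i, j]  # T_{x_t x_{t+1}}
    return prob

# Same illustrative 3-state transition matrix as before.
T = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])
print(path_prob(T, [0, 0, 1, 2]))  # 0.9 * 0.1 * 0.2 = 0.018
```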
Finally, we discussed the following recursive expression for the case where $x_0$ is fixed:
$$ \begin{aligned} p(x_t \vert x_0) &= \sum_{x_{t-1}} p(x_{t-1} \vert x_0)T_{x_{t-1} x_t}. \end{aligned} $$
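Note that this recursion is exactly a vector-matrix product: storing $p(x_{t-1} \vert x_0)$ as a row vector and multiplying by $T$ on the right yields $p(x_t \vert x_0)$. A minimal sketch, again with the example matrix:

```python
import numpy as np

def conditional_dist(T, x0, t):
    """Compute p(x_t | x_0) by unrolling the recursion t times."""
    D = len(T)
    p = np.zeros(D)
    p[x0] = 1.0  # at t = 0 the chain sits at x_0 with probability 1
    for _ in range(t):
        # p(x_t | x_0) = sum_{x_{t-1}} p(x_{t-1} | x_0) T_{x_{t-1} x_t}
        p = p @ T
    return p

# Same illustrative 3-state transition matrix as before.
T = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
])
print(conditional_dist(T, x0=0, t=50))
```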
The reason we are interested in this conditional distribution is simple: we want to find the distributions of the later variables in our chain (and, hopefully, of $x_{t_{\text{max}}}$). In today's lecture, we will continue from here and expand the above discussion to the marginal distribution over $x_t$.
Last time, we discussed a recursive expression for the conditional $p(x_t \vert x_0)$. In practice, $x_0$ will not be fixed; rather, we will sample it from some initial distribution $p_0$, so we are interested in the marginal