1. Typo Corrector

To begin, let's consider a slightly contrived problem — to build a typo corrector.

Setting. Suppose we have a database of words with at most $D$ letters, such as

$$ \mathrm{duck},\ \mathrm{pile},\ \mathrm{an⋆⋆},\ \mathrm{dive},\ \cdots $$

where $\mathrm{⋆}$ is used to pad words with fewer than $D$ letters (in this example $D=4$).

Problem. We see "noisy" words where each letter has a 25% chance of being corrupted. If a letter is corrupted, it is replaced by a new random letter (which has a small chance of being the original one). For example, if the original word was $\mathrm{frog}$, we might instead see $\mathrm{fort}$, or if the original word was $\mathrm{nice}$, we might see $\mathrm{n⋆ce}$.
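This noise process is easy to simulate. Below is a minimal sketch (the function name `corrupt` and the use of `*` as the padding symbol are my own choices, not from the setup above): each letter is independently replaced, with probability 0.25, by a symbol drawn uniformly from the 27 options.

```python
import random
import string

ALPHABET = string.ascii_lowercase + "*"  # 26 letters plus the padding symbol

def corrupt(word, p_corrupt=0.25, rng=random):
    """Independently corrupt each letter with probability p_corrupt,
    replacing it by a uniformly random symbol from the 27 options
    (which may happen to equal the original letter)."""
    noisy = []
    for ch in word:
        if rng.random() < p_corrupt:
            noisy.append(rng.choice(ALPHABET))
        else:
            noisy.append(ch)
    return "".join(noisy)
```

Note that the replacement is drawn from all 27 symbols, so a "corrupted" letter can coincide with the original; this matches the small chance mentioned above.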

Objective: Given a noisy word, we'd like to predict what the original clean word was. What would a probabilistic approach to this problem look like?

A probabilistic approach consists of three steps.

Step 1

We build a distribution $p(x)$ over all length-$D$ sequences in the database. Each word or sequence is represented by a variable $x=(x_1, x_2, \cdots, x_D)$ where each letter is $x_i \in \{\mathrm{a}, \mathrm{b}, \cdots, \mathrm{z}, \mathrm{⋆}\}$. Once built, $p(x)$ measures how likely $x$ is to occur as a word in a random English text.

For example, we could have something like this:

$$ \begin{aligned} p(\mathsf{x}&=(\mathrm{a},\mathrm{a},\mathrm{a},\mathrm{a}))&=&\ \ 0.000001 \\ p(\mathsf{x}&=(\mathrm{a},\mathrm{a},\mathrm{a},\mathrm{b}))&=&\ \ 0.000002\\ &&\vdots&\\ p(\mathsf{x}&=(\mathrm{t},\mathrm{a},\mathrm{c},\mathrm{o}))&=&\ \ 0.00531\\ &&\vdots&\\ \end{aligned} $$

This reflects that it's (very) rare to use a sequence like $\mathrm{aaaa}$, or $\mathrm{aaab}$ in English, while a sequence like $\mathrm{taco}$ occurs frequently.
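One simple way to build such a distribution is to estimate $p(x)$ from word frequencies in a corpus. This is a minimal sketch under that assumption (the function name `build_prior` and the padding convention with `*` are hypothetical choices for illustration):

```python
from collections import Counter

def build_prior(corpus, D=4):
    """Estimate p(x) from word counts in a corpus, padding words
    shorter than D letters with '*' and dropping longer words."""
    padded = [w.ljust(D, "*") for w in corpus if len(w) <= D]
    counts = Counter(padded)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}
```

For example, `build_prior(["taco", "taco", "an"])` assigns $2/3$ to $(\mathrm{t},\mathrm{a},\mathrm{c},\mathrm{o})$ and $1/3$ to $(\mathrm{a},\mathrm{n},\mathrm{⋆},\mathrm{⋆})$.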

Step 2

We build a conditional distribution $p(y|x)$ of the "noisy" sequences $y$ given "clean" sequences $x$. In this case, the conditional distribution is

$$ p(y|x) = \prod_{i=1}^D \left(0.75 \times \mathbb{I}[y_i = x_i] + 0.25 \times \frac{1}{27}\right). $$

We have a product of $D$ factors because noise is applied (or not) independently to each letter. Here $0.75$ represents the probability that a letter is unchanged, while $0.25$ represents the probability that it is corrupted. In that case, there is a uniform probability over each of the $27$ options (the $26$ English letters plus the $\mathrm{⋆}$ character used to pad the sequences).

Step 3

We use the two distributions from steps 1 and 2 to solve the problem: given a new "noisy" sequence $\mathsf{y}$, we want to predict the "clean" sequence $\mathsf{x}$. Bayes' rule tells us that

$$ p(x|y) = \frac{p(x)p(y|x)}{p(y)}. $$

So we might predict the most likely $x$ using

$$ \argmax_x p(x|y) = \argmax_x p(x)p(y|x), $$
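Putting the three steps together, a brute-force decoder can score every word in the database by $p(x)\,p(y|x)$ and return the best one. A minimal sketch, assuming the prior is a dictionary mapping padded words to probabilities (the function name `predict` is hypothetical):

```python
def predict(y, prior, p_corrupt=0.25, n_symbols=27):
    """Return argmax_x p(x) p(y|x) over the words in the prior."""
    def score(x):
        p = prior[x]  # prior probability p(x)
        for yi, xi in zip(y, x):
            # per-letter noise model: 0.75 * [y_i == x_i] + 0.25/27
            p *= (1 - p_corrupt) * (yi == xi) + p_corrupt / n_symbols
        return p
    return max(prior, key=score)
```

With a uniform prior over `{"frog", "fort"}`, the noisy observation `"frog"` is decoded back to `"frog"`, since the likelihood rewards matching letters; a sufficiently skewed prior can overturn the likelihood, which is exactly the tradeoff the product $p(x)\,p(y|x)$ encodes.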