Recall linear least-squares

Recall from last time our solution to the least-squares linear regression problem. Suppose we want to minimize the residual sum of squares

$$ \mathrm{RSS} (w) = \sum_{n=1}^{N} \left( y^{(n)} - w^T x^{(n)} \right)^{2}. $$

Now, we define a vector $y=(y^{(1)}, y^{(2)}, \cdots, y^{(N)})$ that contains all of the training outputs. We also define a matrix $X$ that has $x^{(n)}$ on the $n$-th row. (So $X$ has shape $N \times D$.) Then we can rewrite $\mathrm{RSS}$ as

$$ \mathrm{RSS} (w) = \Vert y - Xw \Vert^{2}. $$
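As a quick sanity check, the two forms of $\mathrm{RSS}$ agree. Here is a minimal numpy sketch with randomly generated placeholder data (the shapes, not the values, are what matter):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3                 # placeholder dataset size and input dimension
X = rng.normal(size=(N, D))   # row n is the input x^(n)
y = rng.normal(size=N)        # training outputs y^(n)
w = rng.normal(size=D)        # an arbitrary candidate weight vector

# RSS as an explicit sum over training points
rss_sum = sum((y[n] - w @ X[n]) ** 2 for n in range(N))

# RSS in matrix form, ||y - Xw||^2
rss_matrix = np.linalg.norm(y - X @ w) ** 2

assert np.isclose(rss_sum, rss_matrix)
```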

Then, we observed that at the optimum, $\nabla \mathrm{RSS}(w^*)=0$. We solved this for $w^*$ to get

$$ w^* = (X^\top X)^{-1}(X^\top y). $$

Now, how would we compute $w^*$ in practice? An algorithm would have the following steps:

  1. Compute $X^\top X$. (This is a $D \times D$ matrix.)
  2. Compute $X^\top y$. (This is a length $D$ vector.)
  3. Solve $(X^\top X)w = (X^\top y)$ for $w$. (This is a length $D$ vector.)

The time complexity of these steps is

  1. $N D^2$ to compute $X^\top X$. (That's because this is a multiplication of a $D \times N$ matrix and an $N \times D$ matrix.)
  2. $ND$ to compute $X^\top y$. (That's because this is a multiplication of a $D \times N$ matrix and a length $N$ vector.)
  3. $D^3$ to solve for $w$. (That's because this is solving a $D$-dimensional linear system of equations.)

Note that these complexities assume the standard (naive) algorithms for matrix multiplication and for solving linear systems.
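Here is a minimal numpy sketch of these three steps (the function name and the placeholder data are illustrative, not from the original notes):

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y for w."""
    A = X.T @ X                   # step 1: D x D matrix, ~ N D^2 operations
    b = X.T @ y                   # step 2: length-D vector, ~ N D operations
    return np.linalg.solve(A, b)  # step 3: D-dimensional linear solve, ~ D^3 operations

# Usage with placeholder data
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w_star = least_squares(X, y)
```

Note that the sketch calls `np.linalg.solve` rather than forming $(X^\top X)^{-1}$ explicitly; solving the linear system directly is cheaper and more numerically stable than computing the inverse and then multiplying.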

Question: Why not simply solve $X w = y$ for $w$?

If you can, by all means do so! In general, though, this is impossible: since $N > D$, there are more equations than unknowns, so the system is overdetermined and typically has no exact solution.
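In practice, library routines handle the overdetermined case directly by minimizing the same $\mathrm{RSS}$. For instance, numpy's `np.linalg.lstsq` returns the least-squares solution, which matches the normal-equations solution above (again with placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# X is 100 x 3: more equations than unknowns, so Xw = y generally has no
# exact solution.  lstsq returns the w minimizing ||y - Xw||^2 instead.
w_lstsq, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Agrees with the normal-equations solution
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(w_lstsq, w_normal)
```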

Polynomial Example

Consider the following dataset with one-dimensional inputs and one-dimensional outputs. For reference, the true function generating the data is also shown.