Recall from last time our solution to the least-squares linear regression problem. Suppose we want to minimize the residual sum of squares
$$ \mathrm{RSS} (w) = \sum_{n=1}^{N} \left( y^{(n)} - w^T x^{(n)} \right)^{2}. $$
Now, we define a vector $y = (y^{(1)}, y^{(2)}, \ldots, y^{(N)})$ that contains all of the training outputs. We also define a matrix $X$ that has $x^{(n)}$ as its $n$-th row. (So $X$ has shape $N \times D$.) Then we can rewrite $\mathrm{RSS}$ as
$$ \mathrm{RSS} (w) = \Vert y - Xw \Vert^{2}. $$
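As a quick sanity check, here is a minimal NumPy sketch confirming that the summation form and the matrix form of $\mathrm{RSS}$ agree. The data here is hypothetical random toy data chosen only for illustration, not any dataset from these notes.

```python
import numpy as np

# Hypothetical toy problem: N = 5 examples with D = 3 features (illustration only).
rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))   # row n of X is the input x^(n)
y = rng.normal(size=N)        # entry n of y is the output y^(n)
w = rng.normal(size=D)        # an arbitrary weight vector

# Summation form: sum_n (y^(n) - w^T x^(n))^2
rss_sum = sum((y[n] - w @ X[n]) ** 2 for n in range(N))

# Matrix form: ||y - Xw||^2
rss_matrix = np.linalg.norm(y - X @ w) ** 2

assert np.isclose(rss_sum, rss_matrix)
```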
Then, we observed that at the optimum, $\nabla \mathrm{RSS}(w^*) = 0$. We solved this for $w^*$ to get
$$ w^* = (X^\top X)^{-1}(X^\top y). $$
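To connect this back to the optimality condition, the sketch below (run in the same session as the toy example above) computes $w^*$ from the formula and checks numerically that the gradient vanishes there. The gradient expression $\nabla \mathrm{RSS}(w) = -2\, X^\top (y - Xw)$ is the standard one; it is stated here as an assumption, since the notes do not write it out.

```python
# Closed-form least-squares solution: w* = (X^T X)^{-1} X^T y.
w_star = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Gradient of RSS (standard result, assumed here): grad = -2 X^T (y - Xw).
grad_at_w_star = -2 * X.T @ (y - X @ w_star)

# At the optimum the gradient should be (numerically) zero.
assert np.allclose(grad_at_w_star, 0.0)
```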
Now, how would we compute this solution in practice? An algorithm would have the following steps:

1. Form the $D \times D$ matrix $X^\top X$.
2. Form the $D$-dimensional vector $X^\top y$.
3. Solve the linear system $(X^\top X)\, w^* = X^\top y$ (for instance, by inverting $X^\top X$).

The time complexity of these steps is

1. $O(ND^2)$ for $X^\top X$,
2. $O(ND)$ for $X^\top y$,
3. $O(D^3)$ for solving (or inverting) the $D \times D$ system.

Note that these complexities assume the simplest implementations of the matrix operations; asymptotically faster algorithms exist.
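One way to read those costs off concretely is to break the computation into the named steps. The sketch below does so, continuing with the toy $X$, $y$, and $w^*$ from above, with the cost of each step noted in a comment. Using `np.linalg.solve` rather than an explicit inverse is a numerical-practice choice on my part, not something the notes prescribe.

```python
# Step 1: form the D x D Gram matrix A = X^T X.      Cost: O(N D^2)
A = X.T @ X

# Step 2: form the D-dimensional vector b = X^T y.   Cost: O(N D)
b = X.T @ y

# Step 3: solve the D x D linear system A w = b.     Cost: O(D^3)
# (Solving the system is preferable to computing inv(A) explicitly,
#  but both cost O(D^3) with the simplest algorithms.)
w_star_steps = np.linalg.solve(A, b)

assert np.allclose(w_star_steps, w_star)
```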
Question: Why not simply solve $X w = y$ for $w$?
If you can, by all means do so! In general, though, this is impossible: since $N > D$, the system $Xw = y$ has more equations than unknowns, and typically no $w$ satisfies all of them exactly.
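To see the overdetermined case concretely: with the toy data above ($N = 5 > D = 3$), the square-system solver rejects $Xw = y$ outright, while `np.linalg.lstsq` returns the least-squares solution, which matches $w^*$ from the normal equations.

```python
# X is 5 x 3: the system Xw = y has N = 5 equations but only D = 3 unknowns.
try:
    np.linalg.solve(X, y)          # only defined for square systems
except np.linalg.LinAlgError:
    print("np.linalg.solve rejects the non-square system Xw = y.")

# With generic y, no exact solution exists; least squares minimizes ||y - Xw||^2 instead.
w_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_lstsq, w_star)
```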
Consider the following dataset with one-dimensional inputs and one-dimensional outputs. For reference, the true function generating the data is also shown.