Linear regression and tree visualizations

Ridge Regression

Recall that the equation for ridge regression is

$$ \hat{w}^\mathrm{ridge} = \argmin_w \sum_{n=1}^N \left( y^{(n)} - w^\top x^{(n)} \right)^2 + \lambda \Vert w \Vert^2, $$

where $y^{(n)}$ is the output for the $n^\text{th}$ data point, $x^{(n)}$ is the $n^\text{th}$ input, $w$ are the weights, and $\lambda$ is the regularizer coefficient.

Consider a very simple dataset with a 1D input space and a single datapoint, such that $x^{(1)} = 1$ and $y^{(1)} = 2$. We can visualize each component of the loss for different values of $w$ and $\lambda$ as follows.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2c3cf03d-e6ac-4c3c-98d9-6f04dbafd0cc/ridge.svg

The red line represents the squared error $R(w)$ for different values of $w$, with a minimum at $w = 2$. Notice that this component of the loss does not depend on $\lambda$. Each green line represents the regularizer $\lambda h(w) = \lambda w^2$ for a different value of $\lambda$. Finally, the grey lines represent the sum of these two curves, with the values of $\lambda$ shown on the right.
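The figure can be reproduced numerically. Below is a minimal sketch (assuming NumPy and Matplotlib, and using the single datapoint $x^{(1)} = 1$, $y^{(1)} = 2$ from above) that evaluates the squared error, the regularizer, and their sum on a grid of $w$ values for a few illustrative choices of $\lambda$.

```python
import numpy as np
import matplotlib.pyplot as plt

# Single datapoint from the example: x = 1, y = 2.
x, y = 1.0, 2.0

w = np.linspace(-1.0, 4.0, 500)           # grid of candidate weights
squared_error = (y - w * x) ** 2          # R(w), independent of lambda

plt.plot(w, squared_error, "r", label="squared error $R(w)$")

for lam in [0.5, 1.0, 2.0, 4.0]:          # illustrative lambda values
    regularizer = lam * w ** 2            # lambda * h(w) = lambda * w^2
    total = squared_error + regularizer   # ridge objective
    w_hat = w[np.argmin(total)]           # numerical minimizer for this lambda
    plt.plot(w, regularizer, "g", alpha=0.4)
    plt.plot(w, total, "grey")
    print(f"lambda = {lam}: w_hat ~= {w_hat:.3f}")

plt.xlabel("$w$")
plt.ylabel("loss")
plt.legend()
plt.show()
```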

The minimum of each grey curve represents the optimal solution $\hat{w}^\mathrm{ridge}$ for a different value of $\lambda$. The key observation is that higher values of $\lambda$ draw $\hat{w}^\mathrm{ridge}$ closer to 0.
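For this single-datapoint example, the effect can be verified in closed form. Setting the derivative of the objective to zero gives

$$ \frac{d}{dw}\left[ (2 - w)^2 + \lambda w^2 \right] = -2(2 - w) + 2\lambda w = 0 \quad\Longrightarrow\quad \hat{w}^\mathrm{ridge} = \frac{2}{1 + \lambda}, $$

which equals $2$ when $\lambda = 0$ and shrinks towards $0$ as $\lambda$ grows.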

Now consider a 2D dataset, with one input dimension and one output dimension. As in the previous lecture, $f(x)$ is a linear function fit on top of a polynomial basis expansion of $x$. For a very small value of $\lambda$, ridge regression produces the following solution:

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/765d2047-3740-4b11-9b23-ce56a02ad714/Mar_2-4.svg

Recall that a small value of $\lambda$ corresponds to a model with high capacity, and a large value of $\lambda$ corresponds to a model with low capacity. On one hand, we are able to achieve low training error. On the other hand, we see some signs of overfitting. What happens as we increase $\lambda$?

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/449b77fa-e833-4e56-8455-70e51d62b044/Mar_2-5.svg

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/6878d028-92c7-44a1-9ae3-020cee05d123/Mar_2-6.svg

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/b5fdedb7-cc56-4c88-a5fd-f3dbd6d9a65f/Mar_2-7.svg

The training error has doubled compared to our lowest value of $\lambda$, but our solution is much smoother and less likely to be overfit. What if we keep going?

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/02f609ea-cbe0-4270-9406-f6b87688029e/Mar_2-8.svg

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/25526c58-fe77-4d57-b63b-6c19477c1916/Mar_2-9.svg

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/02669ab4-ff05-4c80-a4bd-97a4156aaa5d/Mar_2-10.svg

The coefficients go towards zero due to the large regularizer value, so the solution approaches a constant. Note that in this example the bias term is not regularized, which is why the line is not close to 0. This is common practice, as a constant offset does not really contribute to the "complexity" of the solution.
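A minimal sketch of this setup (assuming NumPy and a small synthetic 1D dataset, since the lecture's data is not shown here): ridge regression on a polynomial basis expansion, solved in closed form with the bias column excluded from the penalty. As $\lambda$ grows, the non-bias weights shrink towards zero while the bias is free to absorb the constant offset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D dataset (illustrative stand-in for the lecture's data).
N = 30
x = rng.uniform(-1.0, 1.0, size=N)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(N)

def poly_features(x, degree):
    """Polynomial basis expansion [1, x, x^2, ..., x^degree]; column 0 is the bias."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(X, y, lam):
    """Closed-form ridge solution; the bias (first column) is left unregularized."""
    D = X.shape[1]
    penalty = lam * np.eye(D)
    penalty[0, 0] = 0.0                      # do not penalize the constant offset
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

X = poly_features(x, degree=9)
for lam in [1e-6, 1e-2, 1.0, 100.0]:
    w = fit_ridge(X, y, lam)
    train_err = np.mean((X @ w - y) ** 2)
    print(f"lambda = {lam:g}: training MSE = {train_err:.4f}, "
          f"max |non-bias weight| = {np.abs(w[1:]).max():.3f}")
```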