Recall that the equation for ridge regression is
$$ \begin{aligned} \hat{w}^\mathrm{ridge} &= \argmin_w \sum_{n=1}^N \left( y^{(n)} - w^\top x^{(n)} \right)^2 + \lambda \Vert w \Vert^2,
\end{aligned} $$
where $y^{(n)}$ is the output for the $n^\text{th}$ data point, $x^{(n)}$ is the $n^\text{th}$ input, $w$ are the weights, and $\lambda$ is the regularizer coefficient.
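For reference, this objective has the well-known closed-form minimizer $\hat{w}^\mathrm{ridge} = (X^\top X + \lambda I)^{-1} X^\top y$, where $X$ stacks the inputs as rows. Below is a minimal NumPy sketch of that formula; the function name and the toy data are purely illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Illustrative toy data: 50 points, 3 input dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

print(ridge_fit(X, y, lam=0.1))   # close to [1.0, -2.0, 0.5] for small lambda
```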
Consider a very simple dataset with a 1D input space and a single datapoint, such that $x^{(1)} = 1$ and $y^{(1)} = 2$. We can visualize each component of the loss for different values of $w$ and $\lambda$ as follows.
The red line represents the squared error $R(w)$ for different values of $w$, with a minimum at $w = 2$. Notice that this component of the loss does not depend on $\lambda$. Each green line represents the regularizer $\lambda h(w) = \lambda w^2$ for a different value of $\lambda$. Finally, the grey lines represent the sum of these two curves, with the values of $\lambda$ shown on the right.
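A plot of these three components can be reproduced with a few lines of NumPy and Matplotlib. The sketch below uses the single datapoint $x^{(1)} = 1$, $y^{(1)} = 2$ from above; the particular $\lambda$ values are just illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

w = np.linspace(-1, 4, 200)
R = (2 - w) ** 2                              # squared error for x = 1, y = 2 (red)
plt.plot(w, R, color="red", label="R(w)")
for lam in [0.5, 1.0, 2.0, 4.0]:              # illustrative values of lambda
    plt.plot(w, lam * w ** 2, color="green", alpha=0.5)   # regularizer (green)
    plt.plot(w, R + lam * w ** 2, color="grey")           # total ridge loss (grey)
plt.xlabel("w")
plt.ylabel("loss")
plt.legend()
plt.show()
```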
The minimum of each grey curve represents the optimal solution $\hat{w}^\mathrm{ridge}$ for a different value of $\lambda$. The key observation is that higher values of $\lambda$ draw $\hat{w}^\mathrm{ridge}$ closer to 0.
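For this single-datapoint example, the shrinkage can be verified in closed form by setting the derivative of the total loss to zero:
$$ \begin{aligned} \frac{d}{dw}\left[ (2 - w)^2 + \lambda w^2 \right] = -2(2 - w) + 2\lambda w = 0 \quad\Longrightarrow\quad \hat{w}^\mathrm{ridge} = \frac{2}{1 + \lambda},
\end{aligned} $$
which equals $2$ when $\lambda = 0$ and tends to $0$ as $\lambda \to \infty$.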
Next, we consider a 2D dataset with one input dimension and one output dimension. As in the previous lecture, $f(x)$ is a linear function fit on top of a polynomial basis expansion of $x$. For a very small value of $\lambda$, ridge regression produces the following solution:
Recall that a small value of $\lambda$ corresponds to a model with high capacity, and a large value of $\lambda$ corresponds to a model with low capacity. On the one hand, we are able to achieve low training error. On the other hand, we see some signs of overfitting. What happens as we increase $\lambda$?
The training error has doubled compared to our lowest value of $\lambda$, but our solution is much smoother and less likely to be overfit. What if we keep going?
As the coefficients are driven towards zero by the large regularizer, the solution approaches a constant. Note that in this example the bias term is not regularized, which is why the line does not sit near 0. This is common practice, as a constant offset does not really contribute to the "complexity" of the solution.
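As a rough sketch of this experiment (the polynomial degree, $\lambda$ values, and toy dataset below are illustrative assumptions, not the ones used in the figures), ridge regression on a polynomial basis expansion with an unregularized bias might look like this:

```python
import numpy as np

def poly_features(x, degree):
    """Basis expansion: a column of ones (bias) followed by x, x^2, ..., x^degree."""
    return np.column_stack([x ** k for k in range(degree + 1)])

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution, leaving the bias (first column) unregularized."""
    penalty = lam * np.eye(Phi.shape[1])
    penalty[0, 0] = 0.0                        # do not penalize the constant offset
    return np.linalg.solve(Phi.T @ Phi + penalty, Phi.T @ y)

# Illustrative toy 1D dataset.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=20))
y = np.sin(3 * x) + 0.2 * rng.normal(size=20)

Phi = poly_features(x, degree=10)
for lam in [1e-4, 1e-1, 1e2]:                  # small, moderate, large regularization
    w = ridge_fit(Phi, y, lam)
    print(f"lambda={lam:g}  training MSE={np.mean((Phi @ w - y) ** 2):.4f}")
```

As $\lambda$ grows, every coefficient except the bias is shrunk towards zero, so the fitted curve flattens towards a constant, matching the behaviour described above.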