This lecture will continue our discussion on linear regression. We will talk about Ridge and Lasso regression and discuss their advantages and disadvantages. We will then start to talk about regression trees.

Ridge Regression

Recall that ridge regression is a method for linear regression that uses the least-squares loss while penalizing the squared norm of the weights. We can write it as

$$ \hat{w}^{\mathrm{ridge}} = \argmin_{w} \Vert y - Xw \Vert^{2} + \lambda \Vert w \Vert^{2}, $$

where $\lambda$ is the regularization constant. Note that the above equation is represented in the vectorized form.

As $\lambda$ gets larger, the values of the resulting weights will tend to get smaller. That's because our objective is a combination of one part ($\Vert y - Xw \Vert^{2}$) that measures training error and another part ($\Vert w \Vert^2$) that measures how big $w$ is. As you change $\lambda$, you change the relative priorities of these two goals.
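
To see this trade-off concretely, here is a minimal sketch (on made-up synthetic data) that fits ridge regression for a few values of $\lambda$ using scikit-learn's `Ridge`, whose `alpha` parameter plays the role of $\lambda$, and prints the norm of the fitted weights. The norm shrinks as $\lambda$ grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # synthetic design matrix (made up for illustration)
true_w = np.array([3.0, -2.0, 1.5, 0.0, 4.0])
y = X @ true_w + rng.normal(scale=0.5, size=100)    # noisy targets

for lam in [0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
    print(f"lambda = {lam:6.1f}   ||w|| = {np.linalg.norm(model.coef_):.3f}")
```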

Now that we have defined the form of ridge regression, we would like to get a closed-form solution. We start by writing down the Regularized Residual Sum of Squares (RRSS) for the above equation, which is a function of $w$. We can write this as

$$ \mathrm{RRSS}(w) = (y - Xw)^\top (y - Xw) + \lambda w^\top w. $$

Essentially, we have expanded each Euclidean norm in the original equation for ridge regression by writing it out as the inner product of the corresponding vector with itself.
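
As a quick sanity check, the sketch below (on arbitrary toy data) evaluates the objective both ways: once with squared norms and once with the expanded inner products. The two values agree, as expected.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))       # arbitrary toy data
y = rng.normal(size=20)
w = rng.normal(size=3)
lam = 0.5

# Norm form: ||y - Xw||^2 + lambda * ||w||^2
rrss_norm = np.linalg.norm(y - X @ w) ** 2 + lam * np.linalg.norm(w) ** 2

# Expanded form: (y - Xw)^T (y - Xw) + lambda * w^T w
r = y - X @ w
rrss_expanded = r @ r + lam * (w @ w)

print(np.isclose(rrss_norm, rrss_expanded))   # True
```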

Using matrix calculus, we can take the gradient of $\mathrm{RRSS}(w)$ with respect to $w$, which gives

$$ \nabla \mathrm{RRSS}(w) = -2 X^\top(y - Xw) +2\lambda w. $$
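
If you want to verify this formula without redoing the matrix calculus, a finite-difference check is a handy trick. The sketch below (again on arbitrary synthetic data) compares the analytic gradient to a numerical approximation of $\nabla \mathrm{RRSS}(w)$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))       # arbitrary synthetic data
y = rng.normal(size=30)
w = rng.normal(size=4)
lam = 2.0

def rrss(w):
    r = y - X @ w
    return r @ r + lam * (w @ w)

# Analytic gradient: -2 X^T (y - Xw) + 2 lambda w
grad_analytic = -2 * X.T @ (y - X @ w) + 2 * lam * w

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    grad_numeric[i] = (rrss(w + e) - rrss(w - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # True
```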

Note that the only difference between the gradient above and the one we derived earlier for ordinary least squares is the second term, $2\lambda w$. To find the optimal value of the weights $w^{*}$, we set the gradient to 0.

$$ \nabla \mathrm{RRSS}(w^{*}) = -2 X^\top(y - Xw^{*}) + 2\lambda w^{*} = 0 $$

Rearranging the above, we get the following equation

$$ X^\top y = X^\top X w^* + \lambda w^*. $$

Note that in the above equation, the first term on the right-hand side is a matrix multiplication between $X^\top X$ and $w^*$, while in the second term we are multiplying $w^*$ by the scalar $\lambda$. To solve for $w^*$, we rewrite the equation in an equivalent form: multiplying the regularization constant by the identity matrix $I$ gives

$$ X^\top y = X^\top X w^* + (\lambda I) w^*. $$

Finally, factoring out $w^*$ on the right-hand side gives $(X^\top X + \lambda I)\, w^* = X^\top y$; multiplying both sides by $(X^\top X + \lambda I)^{-1}$ yields the following form for the optimal weights,

$$ \boxed{\hat{w}^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} (X^\top y)}. $$
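
The boxed formula translates directly into a few lines of numpy. The sketch below (on synthetic data) solves the linear system $(X^\top X + \lambda I) w = X^\top y$ rather than forming the inverse explicitly, which is the numerically preferable way to apply the formula, and then checks that the gradient of the penalized objective vanishes at the returned weights.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Return w solving (X^T X + lambda I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))       # synthetic data for illustration
y = rng.normal(size=50)
lam = 1.0

w_hat = ridge_closed_form(X, y, lam)

# At the minimizer, the gradient -2 X^T (y - Xw) + 2 lambda w should be (numerically) zero.
grad = -2 * X.T @ (y - X @ w_hat) + 2 * lam * w_hat
print(np.allclose(grad, 0.0))   # True
```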

Recall that our estimate for the optimal weights for regular least squares (without the regularization term) was