In today's lecture, we will continue where we left off with cross validation. We will go over a few different methods and then discuss which ones work best. Then, we will introduce another kind of supervised learning: classification.
Last time, we discussed the idea of model selection. Throughout the course, we have found that our models often have free parameters that we do not know how to select: the K in K-NN, the depth of a regression tree, $\lambda$ in ridge regression, and so on. Each choice of these parameters produces a different model or "predictor." Model selection refers to the method we use for selecting these parameters.
In many cases, this choice reflects a choice about the capacity of the model—for example, as $\lambda$ increases, the capacity of a ridge regression model typically decreases. We discussed how choices about capacity, in turn, correspond to a tradeoff between overfitting and underfitting. This tradeoff turns out to be extremely important—so important that it is worth developing another set of algorithms just for making this decision.
Our basic strategy is to hold out some of the data, which we call a "validation set." We then train each candidate model on the remaining data, evaluate it on the validation set, and select the one that performs best. We called this "Recipe 1: Train-Validation." While this idea is very simple, it is also very effective.
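As a concrete illustration, here is a minimal sketch of this recipe in Python. The data, the candidate values of $\lambda$, and the 80/20 split are all made up for illustration, and the scikit-learn ridge model simply stands in for whatever base learner we are tuning.

```python
# Recipe 1: Train-Validation (illustrative sketch).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Made-up regression data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Hold out 20% of the data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Each choice of lambda (called alpha in scikit-learn) is a different predictor.
candidate_lambdas = [0.01, 0.1, 1.0, 10.0]
val_errors = {}
for lam in candidate_lambdas:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    val_errors[lam] = mean_squared_error(y_val, model.predict(X_val))

# Pick the predictor with the lowest validation error, then retrain it on all the data.
best_lambda = min(val_errors, key=val_errors.get)
final_model = Ridge(alpha=best_lambda).fit(X, y)
```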
Evaluating our model on data that it was not trained on allows us to capture the effects of both underfitting and overfitting, and thus to choose the model capacity appropriately. However, as we discussed in the previous lecture, this recipe introduces some bias: the model we select is eventually retrained on both the training set and the validation set, so it ultimately sees more data than it did during model selection, and as a result we will often end up choosing a model with lower-than-optimal capacity. Further, because the process is innately stochastic, we suffer from some variance due to the finite size of the training data and the randomness in how we generate the split.
Therefore, we would like to improve on this strategy. One potential improvement is to split the data into $K$ disjoint "folds." For each fold, we hold out that fold, train each model on the remaining folds, and then evaluate the trained model on the held-out fold, measuring the error. Then, for each model, we average the error over all of the folds. Let us call this strategy "Recipe 2: K-Fold Validation." (See the Lecture 11 notes for more details.)
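A sketch of this recipe, under the same illustrative assumptions as the previous one ($K = 5$, the data, and the candidate values of $\lambda$ are arbitrary choices):

```python
# Recipe 2: K-Fold Validation (illustrative sketch).
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Made-up regression data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

def kfold_error(model, X, y, K=5):
    """Average held-out error of `model` over K disjoint folds."""
    kf = KFold(n_splits=K, shuffle=True, random_state=0)
    errors = []
    for train_idx, val_idx in kf.split(X):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        errors.append(mean_squared_error(y[val_idx], fitted.predict(X[val_idx])))
    return float(np.mean(errors))

# Score every candidate predictor, keep the one with the lowest average error,
# and then retrain the winner on all of the data.
candidate_lambdas = [0.01, 0.1, 1.0, 10.0]
cv_errors = {lam: kfold_error(Ridge(alpha=lam), X, y, K=5) for lam in candidate_lambdas}
best_lambda = min(cv_errors, key=cv_errors.get)
final_model = Ridge(alpha=best_lambda).fit(X, y)
```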
One important thing to note here is that it is always "statistically" better to carry out K-fold cross validation than to use a single train-validation split. However, it is significantly more expensive computationally than the train-validation strategy: if we have $M$ different models, the train-validation strategy costs $O(M)$ model fits, whereas K-fold cross validation costs $O(MK)$. Still, if you have the computational resources available, you should always prefer K-fold.
Student Question: What do you mean by a "predictor?"
By "predictor" or "model" we mean a fully-specified algorithm with no free parameters. For example, "ridge regression" isn't fully-specified because you need to define how you are representing the function (e.g., a polynomial of a specific order) and $\lambda$.
In other words, a "predictor" or a "model" is an algorithm for approximately solving the equation
$$ \min_{f \in \mathcal{H}} \sum_{n=1}^{N} L(y^{(n)}, f(x^{(n)})) + \lambda J(f). $$
Everything in the above equation is fixed: the hypothesis space $\mathcal{H}$, the loss function $L$, the regularization parameter $\lambda$, and the regularization function $J$, as well as the algorithm that generates the predictions.
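For example, here is one fully specified predictor written in scikit-learn style; the degree-3 polynomial features and $\lambda = 0.1$ are arbitrary illustrative choices, and changing either of them would give a different predictor.

```python
# One fully specified predictor: ridge regression over degree-3 polynomial
# features with lambda = 0.1 (illustrative choices). Nothing is left free.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

predictor = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=0.1))

# A different lambda (or polynomial degree) defines a *different* predictor.
another_predictor = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1.0))
```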
In the previous lecture, we left off with a riddle. If we use K-fold cross validation with $K=N$, we call this leave-one-out cross validation. This is expensive because the computational cost of K-fold cross validation scales with $K$. However, recall that cross validation introduces some bias because we train on less data during cross validation than we do during the final step, where we retrain the chosen predictor on all of the available training data (i.e., all folds). By using leave-one-out cross validation, we minimize this bias. The question is: if we can afford the computational cost, should we always use leave-one-out cross validation instead of some other $K$?
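As a sketch, leave-one-out validation can be written directly with scikit-learn's `LeaveOneOut` splitter, which is equivalent to K-fold with $K = N$; the data and the value of $\lambda$ below are again made up for illustration.

```python
# Leave-one-out cross validation: K-fold validation with K = N (one datum per fold).
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.linear_model import Ridge

# Made-up regression data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# N = 200 model fits, one per held-out datum, which is why the cost scales with K.
scores = cross_val_score(Ridge(alpha=0.1), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
loo_error = -scores.mean()
```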
The answer here is, surprisingly, "no," but the reasoning is subtle. The problem is this: all of the training sets of size $N-1$ are almost identical to one another, so every fold produces nearly the same fitted predictor. This means the results depend heavily on the vagaries of the one particular dataset we happen to have; we are not really testing different scenarios, since we generate almost the same solution every time. For example, if we had a million data points, holding out a single datum would not change the result very much.
One way of looking at this is to consider the system a "two-level" learner. We have the "base learner" (e.g., ridge regression), and the "high-level learner," which here is K-fold cross validation. The high-level learner faces some of the same types of tradeoffs as the base learner, e.g., the tradeoff between overfitting and underfitting.
In general, choosing $K$ represents a tradeoff between the bias and the randomness of the cross-validation procedure. We have the following result: