Today we're going to talk more about cross-validation, which is probably the most important concept in the course! By cross-validation we mean any algorithm that does model selection by splitting data into training and validation sets. This includes train-test validation, K-fold cross-validation, and other variants.

Cross-Validation

Continuing from last time, we take some number of coins, flip each of them a number of times, and then pick the coin which had the most heads.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/fdf0ef95-3dde-46ae-86dd-2e372c48189e/coin_flips.svg

The critical observation is that if you do this, your estimate of the chosen coin's heads probability is biased. By chance, some coins will do better than average, and by picking the best coin you are likely to pick one that over-performed. Its performance in the future will, on average, be somewhat worse. So there's a bias here. The bias is larger when you have a larger number of coins, and smaller when you have more flips per coin.
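Here is a minimal simulation sketch of this selection effect (the coin counts, flip counts, and function name are illustrative choices, not anything fixed above):

```python
# Flip several fair coins, pick the one with the most heads, and compare its
# empirical heads rate to its true probability of 0.5.
import numpy as np

rng = np.random.default_rng(0)

def selection_bias(num_coins, num_flips, true_p=0.5, trials=10_000):
    """Average overestimate of the selected coin's heads probability."""
    gaps = []
    for _ in range(trials):
        heads = rng.binomial(num_flips, true_p, size=num_coins)
        best = heads.max() / num_flips          # empirical rate of the "best" coin
        gaps.append(best - true_p)              # how much we overestimate it
    return np.mean(gaps)

# Bias grows with more coins and shrinks with more flips per coin.
print(selection_bias(num_coins=1,   num_flips=10))    # ~0: no selection, no bias
print(selection_bias(num_coins=100, num_flips=10))    # noticeably positive
print(selection_bias(num_coins=100, num_flips=1000))  # much smaller
```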

Q: Could you go over why it would increase with more coins?

If you only had a single coin, then the process is unbiased because there is no selection. When you have more and more coins, there are more chances that some coin will do much better than its true heads probability just by chance. Suppose each coin is fair and flipped 10 times. If you have enough coins, eventually one of them will come up all heads, maximizing the amount of bias (the empirical probability of heads is 1, the true probability is 0.5). That's the intuitive reason: with more coins, there are more chances to find a coin that looks good even though it isn't.
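To make "eventually" concrete, here is a quick back-of-the-envelope calculation (10 flips per coin and 1000 coins are illustrative numbers, not part of the lecture setup):

$$
P(\text{one fair coin gives 10/10 heads}) = \left(\tfrac{1}{2}\right)^{10} = \tfrac{1}{1024},
\qquad
P(\text{at least one of } m \text{ coins does}) = 1 - \left(1 - \tfrac{1}{1024}\right)^{m} \approx 0.62 \ \text{ for } m = 1000.
$$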

Q: What if all coins have the same true probabilities?

Then this is still true: we will overestimate the heads probability of the coin we select.

Q: Does it increase the bias as well?

That is actually a very subtle issue. I think the answer is that when all coins have similar true probabilities, the bias does tend to be a bit higher. Imagine all of the coins except one had a heads probability of 0.1, and one had a probability of 0.9. It would be unlikely for any of the bad coins to outperform the good coin, so there is little selection effect. If every coin is close, there are more chances for random fluctuations to determine which coin wins. Anyway, this is a subtle issue that we don't need to focus on too much.
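If you want to check this claim numerically, here is one possible sketch (the specific probabilities and coin counts are my own illustrative choices):

```python
# Compare the overestimate of the selected coin when all coins share the same
# true probability versus when one coin clearly dominates the rest.
import numpy as np

rng = np.random.default_rng(1)

def selection_gap(true_ps, num_flips=10, trials=20_000):
    """Average (empirical rate - true rate) for the coin with the most heads."""
    true_ps = np.asarray(true_ps)
    gaps = []
    for _ in range(trials):
        heads = rng.binomial(num_flips, true_ps)
        i = heads.argmax()
        gaps.append(heads[i] / num_flips - true_ps[i])
    return np.mean(gaps)

similar  = [0.5] * 20                  # every coin identical
dominant = [0.1] * 19 + [0.9]          # one clearly better coin
print(selection_gap(similar))          # larger overestimate
print(selection_gap(dominant))         # close to zero
```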

Analogy

To apply this to machine learning, we can think about the following analogy:

Coins → $f$ in the hypothesis space $\mathcal{H}$

Coin flips → data in the training set

In supervised learning, we want a function $f$ that generalizes well to test data. We see how well each function in the hypothesis space does on the training data, and then pick the best. (In this analogy we ignore regularization.) Every element of the training set is another test of how well our functions work.
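Here is a small sketch of the analogy in code (the dataset and the hypothesis space of random linear classifiers are toy choices of mine, not from the lecture). The labels are pure noise, so no classifier can truly do better than 50%, yet the hypothesis picked for its training accuracy looks better than chance on the training set:

```python
import numpy as np

rng = np.random.default_rng(2)

n_train, n_test, d, n_hypotheses = 30, 10_000, 5, 500

X_train = rng.normal(size=(n_train, d))
X_test  = rng.normal(size=(n_test, d))
y_train = rng.integers(0, 2, size=n_train)   # random labels: nothing to learn
y_test  = rng.integers(0, 2, size=n_test)

# Hypothesis space: random linear classifiers (the "coins").
W = rng.normal(size=(n_hypotheses, d))

def accuracy(w, X, y):
    return np.mean((X @ w > 0).astype(int) == y)

train_acc = np.array([accuracy(w, X_train, y_train) for w in W])
best = train_acc.argmax()                    # "pick the coin with the most heads"

print("train accuracy of selected hypothesis:", train_acc[best])                     # well above 0.5
print("test accuracy of selected hypothesis: ", accuracy(W[best], X_test, y_test))   # ~0.5
```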

The difference between the training and test performance tends to:

Increase when the hypothesis space $\mathcal{H}$ is larger

Decrease when the training set is larger

You should be able to see this by analogy. Having a larger hypothesis space is analogous to having more coins. Having more training data is analogous to having more coin flips. When we have a larger hypothesis space, there are more functions that might do well on the training data just by chance. When we have more training data, all of the estimates become more accurate, so the gap shrinks.
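Reusing the same kind of toy setup as the sketch above (random labels and random linear classifiers, with all numbers illustrative), one possible way to see both trends is:

```python
# The train/test gap of the selected hypothesis grows with the size of the
# hypothesis space and shrinks with the size of the training set.
import numpy as np

rng = np.random.default_rng(3)

def selected_gap(n_train, n_hypotheses, d=5, n_test=5000, trials=200):
    gaps = []
    for _ in range(trials):
        X_tr = rng.normal(size=(n_train, d)); y_tr = rng.integers(0, 2, size=n_train)
        X_te = rng.normal(size=(n_test, d));  y_te = rng.integers(0, 2, size=n_test)
        W = rng.normal(size=(n_hypotheses, d))        # random linear classifiers
        train_acc = np.mean((X_tr @ W.T > 0).astype(int) == y_tr[:, None], axis=0)
        best = train_acc.argmax()
        test_acc = np.mean((X_te @ W[best] > 0).astype(int) == y_te)
        gaps.append(train_acc[best] - test_acc)
    return np.mean(gaps)

print(selected_gap(n_train=30,  n_hypotheses=10))    # small space  -> smaller gap
print(selected_gap(n_train=30,  n_hypotheses=1000))  # larger space -> larger gap
print(selected_gap(n_train=300, n_hypotheses=1000))  # more data    -> gap shrinks
```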

If bias increases when we make $\mathcal{H}$ larger, why not just make $\mathcal{H}$ small?

The students made the following guesses: