We will start this lecture by finishing the discussion of broad concepts. Then, we will begin the first module of the course, on regression. In particular, we will talk about using K-nearest neighbors for regression.

First, let's continue our conversation from last time about broad concepts.

6. Model specification

We often do some curve fitting with the data. For example, we might fit a straight line through it. But there's a question: is the best possible curve actually a straight line? Would you still want to use a straight line if you had access to 100 times as much data?

[Figure: lec3_illustration.svg]

Let $f^*$ be the true target function, the best possible predictor. We say that a model is well-specified if this function is contained in the hypothesis space, i.e. if $f^* \in \mathcal{H}$.

Last time we talked about overfitting and underfitting. In general, using a larger hypothesis class is better, because it allows you to capture more of the true complexity of the data. Of course, a larger hypothesis space also leads to more potential overfitting. Thus, we often use larger and larger hypothesis spaces as we get more and more data. The idea of model specification is that at some point this whole process might stop! In principle, at some point, we might make our hypothesis space large enough that the best function is contained in it. If so, the advantage of further enlarging the hypothesis space disappears. This point is when the model has become "well-specified".
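As a rough illustration (not from the lecture), here is a small NumPy sketch. The data are generated from a quadratic, so the degree-1 hypothesis space is misspecified, while every space of degree 2 or higher is well-specified; the held-out error typically stops improving much once the true function is inside the hypothesis space.

```python
# Sketch (illustrative, not from the lecture): polynomial regression on data
# generated from a quadratic. The degree-1 model is misspecified; models of
# degree >= 2 are well-specified, so enlarging the hypothesis space beyond
# degree 2 tends to give little further benefit.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return 1.0 + 2.0 * x - 3.0 * x**2   # the "true" quadratic target

# training and held-out data with a bit of noise
x_train = rng.uniform(-1, 1, size=200)
y_train = true_f(x_train) + 0.1 * rng.normal(size=x_train.shape)
x_test = rng.uniform(-1, 1, size=1000)
y_test = true_f(x_test) + 0.1 * rng.normal(size=x_test.shape)

for degree in [1, 2, 3, 5, 10]:
    coeffs = np.polyfit(x_train, y_train, deg=degree)   # least-squares fit
    y_pred = np.polyval(coeffs, x_test)
    mse = np.mean((y_test - y_pred) ** 2)
    print(f"degree {degree:2d}: test MSE = {mse:.4f}")
```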

The terminology of "well-specified" is a bit misleading because it sounds like not being well-specified is some sort of sin. Well-specified models are reasonable to expect in, say, physics, where we have a deep understanding of the underlying phenomenon. In machine learning, the reality is that it's pretty rare for our models to actually be well-specified. Usually, even if a straight line is a pretty good fit, there's some quadratic with a little bit of curvature that would be at least slightly better. And if you have a quadratic, there's probably some cubic that's better. Or if you fit a neural network with 1000 hidden units, there's probably one with 1100 that's still better. We usually restrict ourselves to simpler models because we're worried about overfitting or computational complexity, not because we've actually captured the true function.

7. Computational and statistical trade-offs

In supervised learning, we basically pick some function $f \in \mathcal{H}$ that fits the training data well. The hypothesis space $\mathcal{H}$ was chosen by us. How did we pick it?

Whenever we choose a hypothesis space, we need to worry about two things:

- Statistical: do the functions in $\mathcal{H}$ actually predict well on new data?
- Computational: can we efficiently find a good $f \in \mathcal{H}$ from the training data?

These are the two different facets of machine learning: getting good predictions, and doing so while consuming a reasonable amount of electricity. There's often a tension between these two goals!

Take some of the common hypothesis spaces in machine learning, say a set of neural networks, or a set of classification trees, or a set of linear functions. It's important to remember that these were invented because they seem to perform well along both axes on some problems.

It's very likely the case that there are other hypothesis spaces out there that would give great predictions statistically, but we never even think about them because we don't have good algorithms. We need to make predictions with the tools we're actually able to wield.
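To make the tension a bit more tangible, here is a hypothetical sketch (my own, not from the lecture) using random cosine features: a richer feature set tends to predict a nonlinear target better, which is the statistical side, while the least-squares solve takes longer, which is the computational side.

```python
# Sketch (illustrative, not from the lecture): statistical vs. computational
# trade-off. We expand a 1-d input into an increasing number of random cosine
# features; more features tend to fit the nonlinear target better, but the
# least-squares solve costs more time.
import time
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(2 * x[:, 0]) + 0.1 * rng.normal(size=n)   # nonlinear target
    return x, y

x_train, y_train = make_data(2000)
x_test, y_test = make_data(2000)

for n_features in [10, 50, 500]:
    w = rng.normal(size=(1, n_features))             # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=n_features)   # random phases
    phi_train = np.cos(x_train @ w + b)
    phi_test = np.cos(x_test @ w + b)

    start = time.perf_counter()
    theta, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    elapsed = time.perf_counter() - start

    test_mse = np.mean((phi_test @ theta - y_test) ** 2)
    print(f"{n_features:4d} features: test MSE = {test_mse:.4f}, "
          f"fit time = {elapsed:.4f} s")
```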

8. "That's user specified"

Doing machine learning involves many choices, such as