In today's lecture, we will continue talking about K-Nearest Neighbors (K-NN) and also discuss the mathematical background for matrix calculus.

First we recap the formula for K-NN. Then we go over three examples of where K-NN can be used, and the different issues we come across when using this algorithm.

We then discuss data augmentation, which is used to improve performance on test data. Finally, we discuss the idea of time complexity and how it influences which algorithm we use.

1. Recap

As you may remember, to make a prediction for an input $x$, we find the $K$ closest points in the training data and average their output values, that is

$$ f(x)=\frac{1}{K}\sum_{n\in \mathrm{Neigh}_{K}(x)}{y^{(n)}}, $$

where $\mathrm{Neigh}_{K}(x)$ is the set of indices of the $K$ nearest neighbors of the input point $x$ in the training data.

To find the nearest neighbors of a new input, we use a distance function. Euclidean distance is the usual choice, but other distance functions can be used too.
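To make the formula concrete, here is a minimal sketch of a K-NN regression predictor in NumPy, using Euclidean distance. The function name `knn_predict` and the array shapes are our own choices for illustration, not code from the lecture.

```python
import numpy as np

def knn_predict(x, X_train, y_train, K):
    """Predict f(x) by averaging the targets of the K nearest training points.

    x       : array of shape (d,)   -- query point
    X_train : array of shape (N, d) -- training inputs
    y_train : array of shape (N,)   -- training targets
    K       : number of neighbors
    """
    # Euclidean distances from x to every training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K closest points, i.e. Neigh_K(x).
    neighbors = np.argsort(dists)[:K]
    # Average their targets.
    return y_train[neighbors].mean()
```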

2. K-NN - Example 1

Imagine we have a set of datapoints with a one-dimensional input, plotted on the x-axis, and a one-dimensional output, plotted on the y-axis.

Below, we show some datapoints along with the prediction function $f(x)$ for various values of $K$. The datapoints are plotted as dots, and the prediction function is drawn as the black regression line.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/da968d3a-a3a9-4bea-8af6-20113658bd95/knn_1d.svg
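As a rough sketch of how a figure like this could be produced (the data below is a synthetic stand-in, not the dataset from the plot), one can evaluate the hypothetical `knn_predict` from the sketch above on a grid of inputs for a few values of $K$ and plot the resulting curves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in 1D dataset: a noisy sine curve (not the dataset from the figure).
X_train = rng.uniform(0.0, 10.0, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.3 * rng.standard_normal(40)

# Evaluate the prediction function on a dense grid for several values of K.
grid = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
for K in (1, 5, 11):
    preds = np.array([knn_predict(x, X_train, y_train, K) for x in grid])
    print(K, preds[:3])  # inspect, or plot (grid, preds) with matplotlib
```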

Now ask yourself: what value of $K$ do you think would generalize best to new data? After a breakout room discussion, where students debated what the ideal value of $K$ would be and how to pick one, we ran a poll to find the student consensus. Most students agreed that a value of $11$ or $12$ would perform well on the test data.

Here are the actual results for the test error, plotted against the different values of $K$.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/9d0dd3e5-4f7e-41b9-908d-4c1135898ef7/knn_test_err.svg

We see that $K=5$ gives the best generalization performance. Yet, in the figures above, the prediction curve for $K=5$ looks rather ragged. This shows that it's not easy to guess what will generalize best.

By the way, we do have pretty good techniques to find a good value of $K$ automatically. The simplest version is to just hold out a part of the data to see which value performs best. We'll talk about this in a few lectures.
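Here is a minimal sketch of that hold-out idea, assuming the hypothetical `knn_predict` from earlier: split off part of the training data as a validation set and pick the $K$ with the lowest validation error. The split fraction and the candidate values of $K$ below are arbitrary illustrative choices.

```python
import numpy as np

def select_k(X, y, candidate_ks, val_fraction=0.2, seed=0):
    """Pick K by mean-squared error on a held-out validation split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_fraction * len(X))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    best_k, best_err = None, np.inf
    for K in candidate_ks:
        # Validation predictions using only the reduced training split.
        preds = np.array([knn_predict(x, X_tr, y_tr, K) for x in X_val])
        err = np.mean((preds - y_val) ** 2)
        if err < best_err:
            best_k, best_err = K, err
    return best_k

# Example usage: select_k(X_train, y_train, candidate_ks=range(1, 21))
```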

Another issue worth pondering is whether the optimal value of $K$ changes as we increase the amount of training data. Generally, the optimal $K$ tends to get larger as we have more training data. Intuitively, when the number of datapoints is small, the neighbors can be far away, so K-NN is "smoothing" over very large regions, which tends to cause underfitting. With more and more datapoints, the neighbors for any given value of $K$ tend to be closer, so it typically becomes beneficial to increase $K$ to better combat overfitting.
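One rough way to probe this claim empirically (a sketch of our own, reusing the hypothetical `select_k` and `knn_predict` helpers from the earlier sketches on synthetic data) is to run the hold-out selection on datasets of increasing size and watch which $K$ gets picked:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch: how the selected K changes as the training set grows
# (reuses select_k from the previous sketch; data is a synthetic noisy sine).
for n in (25, 100, 400, 1600):
    X_n = rng.uniform(0.0, 10.0, size=(n, 1))
    y_n = np.sin(X_n[:, 0]) + 0.3 * rng.standard_normal(n)
    print(n, select_k(X_n, y_n, candidate_ks=range(1, 51)))
```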

Another issue is what happens as the dimensionality of the data increases. The answer is that the performance of K-NN typically degrades as the number of dimensions increases, due to the curse of dimensionality that we saw before. This means that as the dimension increases, more and more data tends to be necessary to get good predictions.
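To build some intuition for why neighbors stop being "near" in high dimensions, here is a small numerical illustration of our own (not from the lecture): for points drawn uniformly in the unit cube, the distance to the nearest point gets closer and closer to the average distance as the dimension grows, so the "nearest" neighbor is barely closer than a random point.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (1, 2, 10, 100, 1000):
    X = rng.uniform(size=(500, d))          # 500 points in the d-dimensional unit cube
    x = rng.uniform(size=d)                 # a query point
    dists = np.linalg.norm(X - x, axis=1)   # distances to every point
    # In high dimensions the nearest distance creeps up toward the mean distance.
    print(f"d={d:5d}  nearest/mean distance = {dists.min() / dists.mean():.3f}")
```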