In this lecture, we will discuss some of the broad concepts common across all machine learning algorithms. These concepts provide a general framework for understanding different methods. Specifically, we will discuss:

  1. Loss functions

  2. Overfitting/Underfitting

  3. Feature Engineering

  4. Regularization

  5. The curse of dimensionality

The goal of this lecture is to explain how these concepts can be used to evaluate a given machine learning pipeline and compare different methodologies.

1. Loss Functions

In a general learning setup, we are provided with some input data that we want to use to predict some specified output. Any predictor we choose for this task will make errors in its predictions. This prediction error can be seen as a measure of the performance of our machine learning algorithm. Furthermore, since our objective is to obtain a predictor with the least prediction error on unseen data samples, this error can be used as a signal to 'train' our machine learning algorithm. Hence, estimating the prediction error plays a key role in defining our machine learning pipeline.

The error between the predicted values and the true outputs can be measured in different ways. For example, suppose we are provided with a dataset

$$ \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}, $$

where each $x^{(n)}$ is an input and $y^{(n)}$ is the corresponding true output. Assume for now that the outputs $y^{(n)}$ are real numbers.

Now suppose we have some predictor $f(x)$. How can we measure how well it matches the data? Two natural ways to do this are the "absolute error"

$$ \sum_{n=1}^{N}\vert y^{(n)} - f(x^{(n)}) \vert, $$

and the "squared error"

$$ \sum_{n=1}^{N} \left(y^{(n)} - f(x^{(n)}) \right) ^{2}. $$

Both of these are valid ways to measure our prediction error, and depending on the application, either might be preferable. For instance, the squared error penalizes large errors much more heavily than small ones, so it is more sensitive to outliers than the absolute error. There is no single best way to measure prediction error; selecting a method depends on our priorities with respect to the learning problem.
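To make this concrete, here is a minimal sketch in Python using NumPy that computes both errors for a toy predictor. The dataset and the predictor $f(x) = x$ are made up purely for illustration:

```python
import numpy as np

# Toy dataset: inputs x^(n) and true outputs y^(n) (illustrative values).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])

# A simple candidate predictor; here we just guess f(x) = x.
def f(x):
    return x

predictions = f(x)

# Absolute error: sum over n of |y^(n) - f(x^(n))|
absolute_error = np.sum(np.abs(y - predictions))

# Squared error: sum over n of (y^(n) - f(x^(n)))^2
squared_error = np.sum((y - predictions) ** 2)

print(f"absolute error: {absolute_error:.3f}")  # 0.600
print(f"squared error:  {squared_error:.3f}")   # 0.100
```

Note how the two numbers differ: residuals smaller than 1 contribute less to the squared error than to the absolute error, while a single large residual would dominate the squared error.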

Formally, we call such a measure of error between the predicted value and the true output a 'loss function'. Notationally, for true output $y$ and predicted value $f(x)$, the loss function is written as

$$ L(y, f(x)). $$
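Because every loss function shares this signature, the loss can be treated as a pluggable component of the pipeline. Below is a minimal sketch of this idea; the names `absolute_loss`, `squared_loss`, and `total_loss` are our own for illustration, not from any particular library:

```python
import numpy as np

# Two loss functions sharing the common signature L(y, f(x)).
def absolute_loss(y, y_hat):
    return np.abs(y - y_hat)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

# Total loss over a dataset, for any loss with this signature.
def total_loss(loss_fn, y, y_hat):
    return np.sum(loss_fn(y, y_hat))

y = np.array([0.1, 0.9, 2.2, 2.8])      # true outputs y^(n)
y_hat = np.array([0.0, 1.0, 2.0, 3.0])  # predictions f(x^(n))

print(total_loss(absolute_loss, y, y_hat))  # sum_n |y^(n) - f(x^(n))|
print(total_loss(squared_loss, y, y_hat))   # sum_n (y^(n) - f(x^(n)))^2
```

Swapping one loss for another changes the training signal without changing the rest of the pipeline, which is why the choice of loss is treated as a design decision in its own right.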