Today, we will finish our discussion of linear classification, which wraps up our study of classification itself. (It went quickly because so much of it works just like regression!) Then, we introduce the concept of automatic differentiation, or autodiff. Autodiff is the algorithmic basis of neural networks.

Concepts in Classification: Recap

So far, we have discussed classification under the assumption that the output is binary, i.e., $y \in \{ -1, 1 \}$. Under this assumption, we observed that we don't want to force our functions $f(x)= w^\top x$ to output binary values. Linear functions have a very difficult time doing that! Instead, we allow $f(x)$ to output real numbers and interpret the output as follows: the sign of $f(x)$ is the predicted label, and the magnitude $|f(x)|$ reflects how confident that prediction is.

This is a major difference between how linear functions are used for classification and regression. With regression, we want the output of $f(x)$ to be as close to the true output $y$ as possible. With classification, we interpret the output differently. This isn't a big problem. We just need to define loss functions for classification that look at the true output along with the sign and magnitude of $f(x)$ to determine whether the prediction is good. Broadly speaking, more confident predictions are better when they are right, and worse when they are wrong.

The two loss functions we have looked at so far are the hinge loss and the logistic loss. A toy visualization of both is shown below (compared for reference against the 0-1 loss, which isn't practical to optimize directly in most cases).

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/8f2dbac9-9fd1-4218-a9d1-9654a1e67f19/all_losses.svg
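For concreteness, here is a minimal numerical sketch of the three losses plotted above. It assumes the standard forms $\max(0, 1 - y f(x))$ for the hinge loss and $\log(1 + e^{-y f(x)})$ for the logistic loss, which may differ in scaling from the exact versions used in lecture:

```python
import numpy as np

def zero_one_loss(y, fx):
    # 1 when the sign of f(x) disagrees with the label y in {-1, +1}, else 0
    # (a tie at f(x) = 0 is counted as an error here)
    return (y * fx <= 0).astype(float)

def hinge_loss(y, fx):
    # max(0, 1 - y * f(x)); zero only for confident, correct predictions
    return np.maximum(0.0, 1.0 - y * fx)

def logistic_loss(y, fx):
    # log(1 + exp(-y * f(x))); smooth and always positive
    return np.log1p(np.exp(-y * fx))

fx = np.linspace(-3.0, 3.0, 7)   # a few candidate scores f(x)
y = 1.0                          # suppose the true label is +1
for name, loss in [("0-1", zero_one_loss), ("hinge", hinge_loss), ("logistic", logistic_loss)]:
    print(name, loss(y, fx))
```

Running this reproduces the qualitative picture in the figure: all three losses are small (or zero) when $y f(x)$ is large and positive, and grow as the prediction becomes confidently wrong.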

Most machine learning methods for classification work like this: the model outputs real numbers, and we interpret them as confidence measures. Classification trees and k-NN are somewhat anomalous in that they can directly predict discrete outputs.

Student questions:

1). Does the logistic loss upper-bound the 0-1 loss?

No, that is not the case: with the natural logarithm, the logistic loss at $f(x) = 0$ is $\ln 2 \approx 0.69$, which is below the 0-1 loss of 1. Of course, you could scale it by a constant (equivalently, use the base-2 logarithm) so that it does upper-bound the 0-1 loss, but we won't do this.

2). Is it better practice to use output labels $y \in \{ -1, 1 \}$ rather than $y \in \{ 0, 1 \}$?

It doesn't matter. We chose $y\in\{-1,+1\}$ because it allows the intuitive interpretation of the prediction as the sign of $f(x)$. But you could redefine these losses to work with $y \in \{0,1\}$, and the performance would be the same.
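To illustrate the point (this is a standard identity, not something we derived in lecture): if a label $y \in \{0,1\}$ is mapped to $\tilde{y} = 2y - 1 \in \{-1,+1\}$, the logistic loss can be written equivalently in either convention,

$$\log\!\left(1 + e^{-\tilde{y}\, f(x)}\right) \;=\; -\,y \log \sigma\!\left(f(x)\right) - (1 - y)\log\!\left(1 - \sigma\!\left(f(x)\right)\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

so switching label conventions only changes how the loss is written, not which function is being minimized.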

Multiclass Classification

From the previous discussion, a question naturally emerges: "How do we perform the classification task when we have more than two categories?" To answer this, we will discuss how to perform multiclass classification.

First, an aside: sometimes for multiclass classification, people use strategies called "one-vs-all" or "one-vs-one", which reduce multiclass classification to a collection of binary classification problems. We won't be covering them in this course. While these methods can work OK, your instructor feels that the methods below are usually better, so there's rarely a reason to use one-vs-all or one-vs-one. (This opinion was expressed somewhat less politely in lecture.)

Now, let us move to the question of multiple classes. That is, what if $y \in \{ 1, 2, 3, \dots V\}$? How can we do linear classification?

Intuitively, we can just learn $V$ different linear functions, one per class. This gives us $V$ different outputs for a given data sample $x$. We interpret these outputs as follows: the class whose function produces the largest value is the prediction, while the magnitudes reflect confidences.
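As a small sketch of this prediction rule (the variable names are made up for illustration; the per-class weight vectors $w_1, \dots, w_V$ are stacked into a matrix $W$, and classes are 0-indexed here, whereas the notes use $1, \dots, V$):

```python
import numpy as np

def predict_multiclass(W, x):
    """Predict a class in {0, ..., V-1} from V linear scoring functions.

    W : array of shape (V, d), one weight vector per class
    x : array of shape (d,), a single feature vector
    """
    scores = W @ x                 # V real-valued scores, one per class
    return int(np.argmax(scores))  # the largest score wins; its size reflects confidence

# Toy usage with made-up numbers: V = 3 classes, d = 2 features
W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.8],
              [-1.0,  0.3]])
x = np.array([0.5, 1.0])
print(predict_multiclass(W, x))    # prints the index of the highest-scoring class
```

The open question, which the loss functions for multiclass classification address, is how to train $W$ so that the correct class tends to receive the largest score.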