Linear Supervised Learning Series

Part 5: One-versus-All multi-class classification

In practice many classification problems have more than two classes we wish to distinguish, e.g., face recognition, hand gesture recognition, general object detection, speech recognition, and more. Here we discuss one way to solve classification problems with a general number $C$ of classes.

Press the button 'Toggle code' below to toggle code on and off for this entire presentation.

In [1]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)

1. One-versus-All multi-class classification

  • We know how to perform two-class classification.
  • Here we learn $C$ linear classifiers (one per class), each distinguishing one class from the rest of the data.
  • The crux of the matter is how we should combine these individual classifiers to create a reasonable multi-class decision boundary.

1.1 Multi-class data

A multiclass dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ consists of $C$ distinct classes of data with label values $y_{p}\in\left\{ 1,2,...,C\right\}$.

In [2]:
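# numerical library needed for np.loadtxt below
# (`superlearn` is assumed to have been imported already from the mlrefined libraries)
import numpy as np

# load in dataset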
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3class_data.csv',delimiter = ',')

# create an instance of the ova demo
demo1 = superlearn.ova_illustrator.Visualizer(data)

# visualize dataset
demo1.show_dataset()

1.2 Training $C$ One-versus-All classifiers

  • The first step of OvA classification is to learn $C$ two-class classifiers, with the $c^{th}$ classifier trained to distinguish the $c^{th}$ class from the remainder of the data.
  • For the $c^{th}$ two-class subproblem we simply assign temporary labels $\tilde y_p$ to the entire dataset, giving $+1$ labels to the $c^{th}$ class and $-1$ labels to the remainder of the dataset.
\begin{equation} \tilde y_p = \begin{cases} +1 \,\,\,\,\,\,\text{if}\,\, y_p = c \\ -1 \,\,\,\,\,\,\text{if}\,\, y_p \neq c \end{cases} \end{equation}
  • Now run the two-class classifier of your choice $C$ times!
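
As a minimal sketch of the temporary-label step (not the code used by the demo; the helper name `make_temporary_labels` and the toy labels are assumptions made for illustration), the relabeling for one subproblem can be done with a single NumPy call:

```python
import numpy as np

def make_temporary_labels(y, c):
    """Return +1 where y equals c and -1 everywhere else."""
    y = np.asarray(y)
    return np.where(y == c, 1, -1)

# hypothetical labels for a small dataset with C = 3 classes
y = np.array([1, 2, 3, 2, 1])

# temporary labels for the c = 2 subproblem
y_tilde = make_temporary_labels(y, c=2)
print(y_tilde.tolist())  # [-1, 1, -1, 1, -1]
```

The original labels `y` are left untouched, so the same call can be repeated for each $c = 1,...,C$.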
In [4]:
# solve the 2-class subproblems
demo1.solve_2class_subproblems()

# illustrate dataset with each subproblem and learned decision boundary
demo1.plot_data_and_subproblem_separators()

With OvA we learn $C$ two-class classifiers - with the bias/slope weights denoted as $\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)} \right),\,\left(w_0^{(2)},\,\mathbf{w}_{\mathstrut}^{(2)}\right),\,\ldots,\,\left(w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)}\right)$ - the $c^{th}$ of which can be written as

\begin{equation} w_0^{(c)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(c)} = 0 \end{equation}

When each subproblem is perfectly linearly separable - because of our choice of temporary labels - we always have

\begin{equation} w_0^{(c)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(c)} \begin{cases} > 0 \,\,\,\,\,\,\text{if}\,\,\, y_p = c \\ < 0 \,\,\,\,\,\, \text{if} \,\,\, y_p \neq c \end{cases} \end{equation}

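implying that, for every point $\mathbf{x}_p$ with label $y_p = c$, the $c^{th}$ classifier is the only one that evaluates to a positive value and therefore attains the maximum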

\begin{equation} w_0^{(c)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(c)} = \underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}

So we know how to classify the points that we have, but what about those we do not? How do we classify arbitrary points in the space of our example?

There are three possible cases for an arbitrary point in the space:

I. It lies on the positive side of a single classifier

II. It lies on the positive side of more than one classifier

III. It lies on the positive side of no classifier

1.3 Points on the positive side of a single classifier

Those points that lie on the positive side of the $c^{th}$ classifier only should clearly belong to the $c^{th}$ class.

Such a point $\mathbf{x}$ satisfies the condition

\begin{equation} w_0^{(c)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(c)} = \underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}

Therefore, to get the associated label $y$ we want the index $j$ that attains the maximum on the right hand side

\begin{equation} y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
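
The rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `fusion_rule`, the array layout (one row of `W` per classifier), and the weight values are all hypothetical and are not taken from the demo library.

```python
import numpy as np

def fusion_rule(X, w0, W):
    """Assign labels in {1, ..., C} by the argmax over the C classifiers.

    X  : (P, N) array of points (a single point of shape (N,) also works)
    w0 : (C,) array of biases
    W  : (C, N) array of slope weights, one row per classifier
    """
    X = np.atleast_2d(X)
    scores = w0 + X @ W.T                  # (P, C): classifier j evaluated at point p
    return np.argmax(scores, axis=1) + 1   # +1 because labels start at 1

# hypothetical weights for C = 3 classifiers in two dimensions
w0 = np.array([0.0, -1.0, 0.5])
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])

# this point is on the positive side of classifier 1 only (scores 2.0, -0.5, -2.0)
x = np.array([2.0, 0.5])
print(fusion_rule(x, w0, W).tolist())  # [1]
```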
In [5]:
# color those points clearly belonging to each individual class - those lying near the points we already have on the positive side of only one classifier.
demo1.show_fusion(region = 1)

1.4 Points on the positive side of more than one classifier

In [6]:
# try examining a point and its distance to relevant decision boundaries
demo1.point_and_projection(point1 = [0.4,1] ,point2 = [0.6,1])
  • We think of a classifier as being 'more confident' of the class identity of a given point the farther the point lies from the classifier's decision boundary.
  • The larger a point's distance to the boundary, the deeper into one of the classifier's half-spaces it lies, and thus we can be much more confident in its class identity than in that of a point closer to the boundary.
  • Another way we can think about it: if we slightly perturbed the decision boundary those points originally close to its boundary might end up on the other side of the perturbed hyperplane, changing classes.
In [7]:
# color points lying on the positive side of two or more classifiers
demo1.show_fusion(region = 2)

Recall the formula to find the signed distance of a point to a hyperplane

\begin{equation} \text{signed distance of $\mathbf{x}$ to $c^{th}$ boundary} = \frac{w_0^{(c)} + \mathbf{x}^T \mathbf{w}^{(c)}}{\left\Vert \mathbf{w}^{(c)} \right\Vert_2} \end{equation}

If we normalize the weights of each linear classifier by the length of the normal vector

\begin{equation} w_0^{(c)} \longleftarrow \frac{w_0^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \mathbf{w}_{\mathstrut}^{(c)} \longleftarrow \frac{\mathbf{w}^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2} \end{equation}

then this distance is simply written as

\begin{equation} \text{signed distance of $\mathbf{x}$ to $c^{th}$ boundary} = w_0^{(c)} + \mathbf{x}_{\mathstrut}^T \mathbf{w}_{\mathstrut}^{(c)} \end{equation}

To assign a point in one of our current regions we seek out the classifier which maximizes this quantity

\begin{equation} y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}

This is the same rule we saw previously!
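
A small numerical check of the normalization step, written as a sketch with made-up weights (the helper `normalize_classifier` and the values of `w0`, `w`, and `x` are assumptions for illustration): after dividing by the length of the normal vector, evaluating the classifier at a point returns the signed distance directly.

```python
import numpy as np

def normalize_classifier(w0, w):
    """Divide the bias and the normal vector by the length of the normal vector."""
    norm = np.linalg.norm(w)
    return w0 / norm, w / norm

# hypothetical classifier whose normal vector has length 5
w0, w = -2.0, np.array([3.0, 4.0])
x = np.array([2.0, 1.0])

# signed distance computed from the unnormalized weights: (-2 + 6 + 4) / 5 = 1.6
signed_distance = (w0 + x @ w) / np.linalg.norm(w)

# after normalization the classifier's raw output is the same number
w0_n, w_n = normalize_classifier(w0, w)
print(np.isclose(w0_n + x @ w_n, signed_distance))  # True
```

Without this step the outputs of different classifiers are on different scales, so comparing them in the argmax would not compare distances.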

1.5 Points on the negative side of all classifiers

In [8]:
# try examining a point and its distance to relevant decision boundaries
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])
  • Here we cannot argue - as we did before - that one classifier is more 'confident' than the others.

  • But we can find the classifier that is the least 'unsure' about the point: we assign the point not to the class whose boundary it is furthest from, but to the class whose boundary it is closest to.

In [9]:
# color the region on which all classifiers are negative
demo1.show_fusion(region = 3)

We can formalize this rule by noting that - once again - our reasoning has led us to assign a point to the class whose boundary is at the largest signed distance from it. Every point in the region lies on the negative side of all of our classifiers, so all of its signed distances are negative. Hence the shortest distance in magnitude is the largest signed distance, being the smallest (in magnitude) negative number.

\begin{equation} y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
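
To make this concrete, here is a sketch with three hypothetical, already normalized classifiers (the weights and the point are made up for illustration). The point lies on the negative side of all three, and the argmax picks the class whose boundary is closest.

```python
import numpy as np

# hypothetical normalized classifiers (each row of W has unit length)
w0 = np.array([-1.0, -1.0, -1.0])
W = np.array([[ 1.0, 0.0],    # positive side: x1 > 1
              [ 0.0, 1.0],    # positive side: x2 > 1
              [-1.0, 0.0]])   # positive side: x1 < -1

# a point on the negative side of every classifier
x = np.array([0.5, 0.2])
signed_distances = w0 + W @ x          # -0.5, -0.8, -1.5

# largest signed distance ...
label = int(np.argmax(signed_distances)) + 1
# ... coincides with the smallest distance in magnitude
closest = int(np.argmin(np.abs(signed_distances))) + 1
print(label, closest)  # 1 1
```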

1.6 Putting it all together

We have now deduced that the following single rule for assigning a label $y$ to a point $\mathbf{x}$ applies to the entire space of our problem

\begin{equation} y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}

We call this the fusion rule - since it tells us precisely how to fuse our $C$ individual classifiers together to make a unified and consistent classification across the entire space of any dataset.

One-versus-All multi-class classification


1:   Input: multiclass dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ where $y_{p}\in\left\{ 1,2,...,C\right\}$, two-class classification scheme and optimizer
2:   for $\,\,c = 1...C$
3:           form temporary labels $\tilde y_p = \begin{cases} +1 \,\,\,\,\,\,\text{if}\,\, y_p = c \\ -1 \,\,\,\,\,\,\text{if}\,\, y_p \neq c \end{cases}$

4:           solve two-class subproblem on $\left\{ \left(\mathbf{x}_{p},\,\tilde y_{p}\right)\right\} _{p=1}^{P}$ to find weights $w_0^{(c)}$ and $\mathbf{w}_{\mathstrut}^{(c)}$

5:           normalize classifier weights as $w_0^{(c)} \longleftarrow \frac{w_0^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2}$ and $\mathbf{w}_{\mathstrut}^{(c)} \longleftarrow \frac{\mathbf{w}^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2}$
6:   end for
7:   To assign a label $y$ to a point $\mathbf{x}$, apply the fusion rule: $y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)}$
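
The algorithm above can be sketched end to end in NumPy. Everything here is an assumption made for illustration and is not the implementation behind `superlearn.ova_illustrator`: the two-class scheme is taken to be the logistic (softmax) cost minimized by plain gradient descent, the function names, step count, and step size are hypothetical, and the nine-point dataset is made up.

```python
import numpy as np

def train_two_class(X, y_tilde, steps=1000, lr=0.5):
    """Assumed two-class scheme: logistic (softmax) cost with gradient descent."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 so w[0] is the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        margins = y_tilde * (Xb @ w)
        # gradient of mean(log(1 + exp(-margins))) with respect to w
        grad = -(Xb * (y_tilde / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w[0], w[1:]

def train_ova(X, y, C):
    """Learn and normalize one classifier per class (the loop over c)."""
    w0_all, W_all = np.zeros(C), np.zeros((C, X.shape[1]))
    for c in range(1, C + 1):
        y_tilde = np.where(y == c, 1.0, -1.0)    # form temporary labels
        w0, w = train_two_class(X, y_tilde)      # solve two-class subproblem
        norm = np.linalg.norm(w)                 # length of the normal vector
        w0_all[c - 1], W_all[c - 1] = w0 / norm, w / norm
    return w0_all, W_all

def predict(X, w0_all, W_all):
    """Fusion rule: label of the classifier with the largest output."""
    return np.argmax(w0_all + X @ W_all.T, axis=1) + 1

# made-up dataset: three well-separated clusters, C = 3
X = np.array([[ 0.0,  2.0], [ 0.3,  2.2], [-0.3,  1.8],
              [-2.0, -1.0], [-2.2, -0.8], [-1.8, -1.2],
              [ 2.0, -1.0], [ 2.2, -1.2], [ 1.8, -0.8]])
y = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])

w0_all, W_all = train_ova(X, y, C=3)
y_hat = predict(X, w0_all, W_all)
print((y_hat == y).all())
```

Any other two-class classifier could be swapped in for `train_two_class`, as long as it returns a bias and a normal vector.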


Plotting the decision boundary

In [10]:
# classify and color the entire space using our individual classifiers and the fusion rule
demo1.show_complete_coloring()
  • The boundary resulting from the fusion rule is always piecewise-linear.
  • There is no straightforward closed-form formula for this boundary.
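
One way to see why each piece is linear, written as a short worked step in the notation used above: a point $\mathbf{x}$ lies on the piece of the boundary between the regions assigned to classes $j$ and $k$ exactly when these two classifiers tie for the maximum in the fusion rule, that is

\begin{equation} w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} = w_0^{(k)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(k)} \,\,\,\,\,\,\Longleftrightarrow\,\,\,\,\,\, \left(w_0^{(\,j)} - w_0^{(k)}\right) + \mathbf{x}_{\mathstrut}^T\left(\mathbf{w}_{\mathstrut}^{(\,j)} - \mathbf{w}_{\mathstrut}^{(k)}\right) = 0 \end{equation}

This is the equation of a hyperplane in $\mathbf{x}$. Which pair $(j,k)$ attains the maximum changes from one part of the space to another, so the full boundary is stitched together from pieces of several such hyperplanes rather than described by one formula.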

Example 1: Classifying a dataset with $C = 4$ classes using OvA

In [18]:
# load in dataset
data3 = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/4class_data.csv',delimiter = ',')

# create an instance of the ova demo
demo3 = superlearn.ova_illustrator.Visualizer(data3)

# visualize dataset
demo3.show_dataset()

Note that with this dataset each class is not linearly separable from the remainder of the data. OvA works nonetheless.

In [20]:
# solve the 2-class subproblems
demo3.solve_2class_subproblems()

# classify and color the entire space using our individual classifiers and the fusion rule
demo3.show_complete_coloring()

1.7 Counting misclassifications and the accuracy of a multi-class classifier

Taking the input of the $p^{th}$ point $\left(\mathbf{x}_p,\,y_p\right)$ we use the fusion rule to produce a predicted output $\hat y_p$ as

\begin{equation} \hat y_p = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}

We can then write the total number of misclassifications on our training set as

\begin{equation} \text{number of misclassifications on training set } = \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}

with the accuracy then calculated as

\begin{equation} \text{accuracy of learned classifier} = 1 - \frac{1}{P} \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}
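
Both formulas translate directly to NumPy. The sketch below uses made-up true and predicted labels, and the function names are assumptions for illustration; note that $\left|\text{sign}\left(\hat y_p - y_p\right)\right|$ equals $1$ exactly when the two labels differ and $0$ otherwise.

```python
import numpy as np

def count_misclassifications(y_hat, y):
    """Sum of |sign(y_hat - y)|: each wrong prediction contributes exactly 1."""
    return int(np.sum(np.abs(np.sign(y_hat - y))))

def accuracy(y_hat, y):
    """One minus the fraction of misclassified points."""
    return 1 - count_misclassifications(y_hat, y) / len(y)

# hypothetical true and predicted labels for P = 4 points
y = np.array([1, 2, 3, 3])
y_hat = np.array([1, 3, 3, 3])

print(count_misclassifications(y_hat, y))  # 1
print(accuracy(y_hat, y))                  # 0.75
```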