In practice many classification problems have more than two classes we wish to distinguish, e.g., face recognition, hand gesture recognition, general object detection, speech recognition, and more. Here we discuss one way to solve classification with general $C$ classes.
A multiclass dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ consists of $C$ distinct classes of data with label values $y_{p}\in\left\{ 1,2,...,C\right\}$.
# load in dataset (np.loadtxt requires numpy; superlearn is assumed to be imported already)
import numpy as np
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3class_data.csv',delimiter = ',')
# create an instance of the ova demo
demo1 = superlearn.ova_illustrator.Visualizer(data)
# visualize dataset
demo1.show_dataset()
# solve the 2-class subproblems
demo1.solve_2class_subproblems()
# illustrate dataset with each subproblem and learned decision boundary
demo1.plot_data_and_subproblem_separators()
With One-versus-All (OvA) we learn $C$ two-class classifiers - with the bias/slope weights denoted as $\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)} \right),\,\left(w_0^{(2)},\,\mathbf{w}_{\mathstrut}^{(2)}\right),\,\ldots,\,\left(w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)}\right)$ - where the decision boundary of the $c^{th}$ classifier can be written as
\begin{equation} w_0^{(c)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(c)} = 0 \end{equation}
When each subproblem is perfectly linearly separable - because of our choice of temporary labels ($+1$ for points in class $c$, $-1$ for all others) - we always have
\begin{equation} w_0^{(c)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(c)} \begin{cases} > 0 \,\,\,\,\,\,\text{if}\,\,\, y_p = c \\ < 0 \,\,\,\,\,\, \text{if} \,\,\, y_p \neq c \end{cases} \end{equation}
implying that, when $c = y_p$ is the true class of the $p^{th}$ point,
\begin{equation} w_0^{(c)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(c)} = \underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
So we know how to classify the points that we have, but what about those we do not? How do we classify arbitrary points in the space of our example?
There are three possible cases for an arbitrary point in the space:
I. It lies on the positive side of a single classifier
II. It lies on the positive side of more than one classifier
III. It lies on the positive side of no classifier
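The three cases above can be told apart by counting how many classifiers are positive at a point. The following is a minimal sketch, not the notebook's own code: the function name `positive_side_count` and the weight matrix `W` (one row `[w_0, w_1, ..., w_N]` per classifier) are hypothetical and chosen only for illustration.

```python
import numpy as np

def positive_side_count(x, W):
    """Count how many of the C linear classifiers are positive at x.

    W is a hypothetical (C, N+1) array: row c holds [w_0, w_1, ..., w_N].
    """
    scores = W[:, 0] + W[:, 1:] @ x   # w_0^(c) + x^T w^(c) for every c
    return int(np.sum(scores > 0))

# three hypothetical classifiers in two dimensions
W = np.array([[ 0.0,  1.0,  0.0],    # positive where x1 > 0
              [ 0.0,  0.0,  1.0],    # positive where x2 > 0
              [-1.0, -1.0, -1.0]])   # positive where x1 + x2 < -1

print(positive_side_count(np.array([ 1.0, -0.5]), W))  # case I: one classifier
print(positive_side_count(np.array([ 1.0,  1.0]), W))  # case II: more than one
print(positive_side_count(np.array([-0.2, -0.2]), W))  # case III: none
```

A count of one, more than one, or zero corresponds to cases I, II, and III respectively.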
Those points that lie on the positive side of the $c^{th}$ classifier only should clearly belong to the $c^{th}$ class.
Such a point $\mathbf{x}$ satisfies the condition
\begin{equation} w_0^{(c)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(c)} = \underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
Therefore, to get the associated label $y$ we want the argument that maximizes the right hand side
\begin{equation} y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
# color those points clearly belonging to each individual class - those lying near the points we already have on the positive side of only one classifier
demo1.show_fusion(region = 1)
# try examining a point and its distance to relevant decision boundaries
demo1.point_and_projection(point1 = [0.4,1] ,point2 = [0.6,1])
# color points lying on the positive side of two or more classifiers
demo1.show_fusion(region = 2)
Recall the formula to find the signed distance of a point to a hyperplane
\begin{equation} \text{signed distance of $\mathbf{x}$ to $c^{th}$ boundary} = \frac{w_0^{(c)} + \mathbf{x}^T \mathbf{w}^{(c)}}{\left\Vert \mathbf{w}^{(c)} \right\Vert_2} \end{equation}
If we normalize the weights of each linear classifier by the length of the normal vector
\begin{equation} w_0^{(c)} \longleftarrow \frac{w_0^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2} \,\,\,\,\,\,\,\,\,\,\,\,\,\,\, \mathbf{w}_{\mathstrut}^{(c)} \longleftarrow \frac{\mathbf{w}^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2} \end{equation}
then this distance is simply written as
\begin{equation} \text{signed distance of $\mathbf{x}$ to $c^{th}$ boundary} = w_0^{(c)} + \mathbf{x}_{\mathstrut}^T \mathbf{w}_{\mathstrut}^{(c)} \end{equation}
To assign a point that lies on the positive side of more than one classifier, we seek out the classifier which maximizes this quantity, i.e., the one whose boundary the point is furthest from on its positive side
\begin{equation} y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
This is the same rule we saw previously!
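To make the normalization step concrete, here is a small sketch in NumPy. It assumes the classifier weights are stored in a hypothetical `(C, N+1)` array `W` whose rows are `[w_0, w_1, ..., w_N]`; the helper names `normalize_weights` and `signed_distances` are illustrative and not part of the notebook's library.

```python
import numpy as np

def normalize_weights(W):
    """Divide each row [w_0, w] by the length of its normal vector w."""
    norms = np.linalg.norm(W[:, 1:], axis=1, keepdims=True)
    return W / norms

def signed_distances(x, W_normalized):
    """Signed distance of x to each of the C boundaries (normalized weights)."""
    return W_normalized[:, 0] + W_normalized[:, 1:] @ x

# two hypothetical classifiers in two dimensions
W = np.array([[ 1.0, 3.0, 4.0],    # normal vector (3, 4) has length 5
              [-2.0, 0.0, 2.0]])   # normal vector (0, 2) has length 2
W_n = normalize_weights(W)

x = np.array([1.0, 1.0])
d = signed_distances(x, W_n)
print(d)  # (1 + 3 + 4) / 5 = 1.6 and (-2 + 0 + 2) / 2 = 0.0
```

After normalization the plain linear output of each classifier is already its signed distance, so comparing outputs across classifiers compares distances.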
# try examining a point and its distance to relevant decision boundaries
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])
For a point lying on the positive side of no classifier we cannot argue - as we did before - that one classifier is more 'confident' than the others.
But we can find the classifier that is the least 'unsure' about the point: we assign the point not to the class whose boundary it is furthest from, but to the one whose boundary it is closest to.
# color the region on which all classifiers are negative
demo1.show_fusion(region = 3)
We can formalize this rule by noting that - once again - our reasoning has led us to assign a point to the class whose boundary is at the largest signed distance from it. Every point in this region lies on the negative side of all of our classifiers, so all signed distances are negative. Hence the shortest distance in magnitude is the largest signed distance, being the smallest (in magnitude) negative number.
\begin{equation} y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
We have now deduced that the following single rule for assigning a label $y$ to a point $\mathbf{x}$ applies to the entire space of our problem
\begin{equation} y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
We call this the fusion rule - since it tells us precisely how to fuse our $C$ individual classifiers together to make a unified and consistent classification across the entire space of any dataset.
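The fusion rule amounts to a single `argmax` over the classifier outputs. The sketch below is one possible NumPy rendering, assuming already-normalized weights stored in a hypothetical `(C, N+1)` array `W`; the function name `fusion_rule` and the example weights are made up for illustration.

```python
import numpy as np

def fusion_rule(X, W):
    """Assign labels in {1,...,C} to the rows of X via
    argmax_j  w_0^(j) + x^T w^(j)."""
    scores = W[:, 0] + X @ W[:, 1:].T       # shape (P, C)
    return np.argmax(scores, axis=1) + 1    # +1 because classes start at 1

s = 1.0 / np.sqrt(2.0)
# three hypothetical classifiers with unit-length normal vectors
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [ -s,  -s,  -s]])

X = np.array([[ 1.0, -0.5],   # positive side of classifier 1 only
              [ 1.0,  2.0],   # positive side of classifiers 1 and 2
              [-0.2, -0.3]])  # positive side of no classifier

labels = fusion_rule(X, W)
print(labels)  # [1 2 1]
```

The three test points cover the three cases discussed earlier: the second point goes to the class whose boundary it is furthest from on the positive side, and the third to the class whose boundary it is closest to.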
1: Input: multiclass dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ where $y_{p}\in\left\{ 1,2,...,C\right\}$, two-class classification scheme and optimizer
2: for $c = 1,\ldots,C$
3: form temporary labels $\tilde y_p = \begin{cases} +1 \,\,\,\,\,\,\text{if}\,\, y_p = c \\ -1 \,\,\,\,\,\,\text{if}\,\, y_p \neq c \end{cases}$
4: solve two-class subproblem on $\left\{ \left(\mathbf{x}_{p},\,\tilde y_{p}\right)\right\} _{p=1}^{P}$ to find weights $w_0^{(c)}$ and $\mathbf{w}_{\mathstrut}^{(c)}$
5: normalize classifier weights as $w_0^{(c)} \longleftarrow \frac{w_0^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2}$ and $\mathbf{w}_{\mathstrut}^{(c)} \longleftarrow \frac{\mathbf{w}_{\mathstrut}^{(c)}}{\left\Vert \mathbf{w}_{\mathstrut}^{(c)} \right\Vert_2}$
6: end for
7: To assign a label $y$ to a point $\mathbf{x}$, apply the fusion rule: $y = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{\mathstrut}^T\mathbf{w}_{\mathstrut}^{(\,j)}$
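The steps of the algorithm above can be sketched end to end in NumPy. This is only an illustration under stated assumptions: the two-class scheme is taken to be logistic regression (softmax cost) minimized by plain gradient descent, which is one possible choice and not necessarily the one used by the demo library; the names `fit_two_class`, `train_ova`, and `predict`, the step size, and the synthetic three-cluster data are all hypothetical.

```python
import numpy as np

def fit_two_class(X, y_tilde, steps=2000, lr=0.5):
    """Assumed two-class solver: gradient descent on the softmax (logistic)
    cost with labels in {-1, +1}. Returns the weights [w_0, w]."""
    P, N = X.shape
    Xb = np.hstack([np.ones((P, 1)), X])        # prepend 1 for the bias
    w = np.zeros(N + 1)
    for _ in range(steps):
        margins = y_tilde * (Xb @ w)
        # gradient of (1/P) * sum_p log(1 + exp(-margin_p))
        coeff = y_tilde / (1.0 + np.exp(margins))
        grad = -(Xb * coeff[:, None]).mean(axis=0)
        w -= lr * grad
    return w

def train_ova(X, y, C):
    """Steps 2-6: learn and normalize one classifier per class."""
    W = np.zeros((C, X.shape[1] + 1))
    for c in range(1, C + 1):
        y_tilde = np.where(y == c, 1.0, -1.0)    # step 3: temporary labels
        w = fit_two_class(X, y_tilde)            # step 4: two-class subproblem
        W[c - 1] = w / np.linalg.norm(w[1:])     # step 5: normalize by ||w||_2
    return W

def predict(X, W):
    """Step 7: fusion rule."""
    return np.argmax(W[:, 0] + X @ W[:, 1:].T, axis=1) + 1

# synthetic data: three well-separated clusters (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
y = np.repeat([1, 2, 3], 20)

W = train_ova(X, y, C=3)
acc = np.mean(predict(X, W) == y)
print(acc)
```

Any other two-class scheme, for example a perceptron or a support vector machine, could be substituted for `fit_two_class` without changing the rest of the sketch.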
# classify and color the entire space using our individual classifiers and the fusion rule
demo1.show_complete_coloring()
# load in dataset
data3 = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/4class_data.csv',delimiter = ',')
# create an instance of the ova demo
demo3 = superlearn.ova_illustrator.Visualizer(data3)
# visualize dataset
demo3.show_dataset()
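Note that with this dataset each class is not linearly separable from the remainder of the data. OvA works nonetheless, since the fusion rule only compares the outputs of the $C$ classifiers and does not require any subproblem to be solved perfectly.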
# solve the 2-class subproblems
demo3.solve_2class_subproblems()
# classify and color the entire space using our individual classifiers and the fusion rule
demo3.show_complete_coloring()
Taking the input of the $p^{th}$ point $\left(\mathbf{x}_p,\,y_p\right)$ we use the fusion rule to produce a predicted output $\hat y_p$ as
\begin{equation} \hat y_p = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}
We can then write the total number of misclassifications on our training set as
\begin{equation} \text{number of misclassifications on training set } = \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}
with the accuracy then calculated as
\begin{equation} \text{accuracy of learned classifier} = 1 - \frac{1}{P} \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}
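The two counting formulas translate directly into NumPy. The following is a minimal sketch with made-up labels and predictions; the function names `count_misclassifications` and `accuracy` are illustrative.

```python
import numpy as np

def count_misclassifications(y_hat, y):
    """Sum of |sign(y_hat_p - y_p)|: each wrong prediction contributes 1."""
    return int(np.sum(np.abs(np.sign(y_hat - y))))

def accuracy(y_hat, y):
    """1 - (number of misclassifications) / P."""
    return 1.0 - count_misclassifications(y_hat, y) / y.size

# hypothetical true labels and predictions for P = 5 points
y     = np.array([1, 2, 3, 3, 2])
y_hat = np.array([1, 2, 1, 3, 3])

print(count_misclassifications(y_hat, y))  # 2 (third and fifth points)
print(accuracy(y_hat, y))                  # 0.6
```

Because $\left|\text{sign}(\cdot)\right|$ is $0$ for a correct prediction and $1$ otherwise, the result is the same as counting the entries where `y_hat != y`.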