In this post we discuss a popular alternative to OvA multi-class classification - detailed in the previous post - where we also learn $C$ two-class classifiers (and also employ the fusion rule) but train them simultaneously instead of independently as with OvA. This approach is the direct generalization of two-class classification to the multi-class setting, and so like two-class classification can be derived from a number of perspectives. We will first use the perceptron perspective in our derivations, followed by a discussion of the logistic regression perspective. These discussions culminate in the description of a single cost function for multi-class classification that - when minimized properly - provides similar results to OvA. This cost function is widely used and goes by many names, e.g., multi-class perceptron, multi-class softmax classification, softmax regression, and multi-class logistic regression.
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
from mlrefined_libraries import superlearn_library as superlearn
import autograd.numpy as np
import pandas as pd
from matplotlib import gridspec
%matplotlib notebook
# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True
%load_ext autoreload
%autoreload 2
As we have just seen, in the OvA framework we learn $C$ linear classifiers separately and fuse them afterwards to create a final assignment rule for the entire space. The popular alternative we describe here - a direct generalization of the two-class perceptron classifier to the multi-class setting - determines the $C$ classifiers jointly by learning all of their parameters simultaneously using a single function based on the fusion rule. Given our previous discussion of OvA this is the most natural perspective to take (i.e., examining the classification problem 'from above' as opposed to thinking of it equivalently as a surface fitting nonlinear regression problem), although fundamentally one may equally well begin from the surface fitting logistic regression perspective to derive precisely what we end up with here (just as in the two-class case).
Because we have already done the work in deriving the fusion rule we can derive the basic concept and cost function almost immediately. We will then see - as with our two-class classification perceptron - that it is commonplace to 'soften' the initial cost function using the softmax approximation to allow for more stable optimization. This leads to a cost function that also arises in multi-class logistic regression.
Once again we deal with an arbitrary multi-class dataset $\left\{ \left(\mathbf{x}_{p},\,y_{p}\right)\right\} _{p=1}^{P}$ consisting of $C$ distinct classes of data with label values $y_{p}\in\left\{ 1,2,...,C\right\} $. In our previous post on OvA multi-class classification we saw how the fusion rule rightfully defined class ownership, partitioning the input space of a dataset given its classes in a fair way. In particular the fusion rule gives us predicted labels for our dataset, the $p^{th}$ of which, $\hat y_p$, is given as
\begin{equation} \hat y_p = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}where the normal vectors here are assumed to all have unit length, i.e., $\left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 = 1$ for all $j$.
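As a concrete illustration, here is a minimal sketch of this fusion rule in code. It assumes (as in the implementations later in this post) that the weights of all $C$ classifiers are stored in an array whose $j^{th}$ column stacks the bias $w_0^{(\,j)}$ on top of the normal vector $\mathbf{w}^{(\,j)}$; the helper `fusion_rule` itself is purely illustrative and not part of the library used here.
# purely illustrative sketch of the fusion rule
# W is a hypothetical (N+1) x C weight array: row 0 holds the biases w_0^(j),
# the remaining rows hold the normal vectors w^(j) as columns
def fusion_rule(x_p,W):
    # evaluate all C linear classifiers at the length-N point x_p
    evals = W[0,:] + np.dot(x_p,W[1:,:])
    # return the label (1,...,C) of the classifier giving the largest evaluation
    return np.argmax(evals) + 1
# example usage on a random point and random weights with N = 2 inputs and C = 3 classes
x_sample = np.random.randn(2)
W_sample = np.random.randn(3,3)
print(fusion_rule(x_sample,W_sample))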
For the $p^{th}$ point, ideally we want this prediction to match its true label, i.e., $\hat y_p = y_p$, so that we have
\begin{equation} w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)} = \underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \end{equation}Remember, geometrically this simply says that the (signed) distance from the point $\mathbf{x}_p$ to its own class's decision boundary is at least as large as its distance to every other class's. This is what we want for all of our training datapoints - to have the prediction provided by the fusion rule match our given labels, and hence the above to hold for every point $\mathbf{x}_p$.
Now, suppose we have not yet tuned our $C$ two-class classifiers optimally but keep the length of all $C$ normal vectors fixed at $1$ so that we can reliably compute and compare distances to each boundary across the input space. We still know the sort of relationship that we want our set of $C$ two-class classifiers to have when optimally tuned.
Regardless of how the weights are tuned, the right hand side of equation (2) must always be greater than or equal to the left, since the $y_p^{th}$ classifier is itself included in the maximum on the right hand side. This means that their difference - the right hand side minus $ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}$ - is always greater than or equal to zero
\begin{equation} \left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \geq 0 \end{equation}Note here how strict equality occurs when we have what we want - where the weights $w_0^{(y_p)}$ and $\mathbf{w}_{\mathstrut}^{(y_p)}$ have been tuned correctly so that indeed the $y_p^{th}$ classifier provides the largest evaluation of $\mathbf{x}_p$. And we want equality to occur for all of our datapoints, that is we want their sum - written as follows - to equal zero (provided the weights of all $C$ classifiers are tuned correctly)
\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] \end{equation}We call this the multi-class perceptron cost due to the fact that it is a direct generalization of the two-class perceptron. We show in the appendix that when $C = 2$, this cost function reduces to the two-class perceptron. We also prove its convexity in the appendix section of this post.
The multiclass perceptron cost is a direct generalization of the two-class perceptron, and is convex likewise.
Note how in minimizing this cost we should - at least formally - subject it to the constraints that all normal vectors are unit length, i.e., $\left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 = 1$ for all $j$. Remember - we need these normal vectors to be unit length if we are to fairly compare the distance of each input $\mathbf{x}_p$ to our two-class decision boundaries.
Thus, we have a proper constrained optimization problem on our hands that formally looks like
\begin{equation} \begin{aligned} \underset{w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},\,...,\,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)}}{\,\,\,\,\,\,\,\,\,\,\,\,\mbox{minimize}\,\,\,} & \,\,\,\, \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C} {\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right]\\ \mbox{subject to}\,\,\, & \,\,\,\,\, \left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 = 1, \,\,\,\,\,\, j=1,...,C \end{aligned} \end{equation}This is a valid cost function for determining the weights of our $C$ classifiers: it is always nonnegative, we want to find weights making its value small, and it is precisely zero when all training points are classified correctly. We can solve such problems directly in a variety of ways - e.g., by using projected gradient descent - but it is more commonplace to see this problem approximately solved by relaxing the constraints.
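For instance, the projection step in a projected gradient descent scheme could amount to simply re-scaling each normal vector back to unit length after every descent step. The sketch below is illustrative only - the helper `project_normal_vectors` is not part of the library used in this post - and assumes the same $\left(N+1\right) \times C$ weight-array layout used in the implementations later on.
# illustrative projection step enforcing the unit-length constraints
def project_normal_vectors(W):
    # re-scale each column's normal vector (rows 1 onward) to unit length,
    # leaving the biases in row 0 untouched
    W = W.copy()
    norms = np.sqrt(np.sum(W[1:,:]**2,axis = 0))
    W[1:,:] = W[1:,:]/norms
    return W
# inside a projected gradient descent loop one would take a standard gradient step
# on the multi-class perceptron cost and then apply this projection, e.g.,
# W = project_normal_vectors(W - alpha*gradient)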
This means we can phrase this constrained problem in an unconstrained regularized form by relaxing the constraints but penalizing their magnitude. Because we have $C$ constraints we could in theory provide a distinct penalty (or regularization parameter) $\lambda_j$ for each constraint. However for simplicity one can choose a single regularization parameter $\lambda \geq 0$ that is used to penalize the magnitude of all normal vectors simultaneously. This way we need only provide one regularization value instead of $C$ distinct regularization parameters, giving the regularized form of the above problem as
\begin{equation} \underset{w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)}}{\text{minimize}} \,\, \sum_{p = 1}^P \left[\,\left(\underset{j=1,...,C} {\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)} \right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] + \lambda \sum_{j = 1}^{C} \left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 \end{equation}whose objective is once again convex. This regularized form does not quite match the original constrained formulation, as regularizing all normal vectors together will not necessarily guarantee that $\left \Vert \mathbf{w}_{\mathstrut}^{(\,j)} \right \Vert_2^2 = 1$ for all $j$. However it will generally force the lengths of all normal vectors to behave well, e.g., disallowing one normal vector to grow arbitrarily large while another shrinks to almost nothing. As we see many times in machine learning, it is commonplace to make such compromises to get something 'close enough' to the original as long as it works well in practice. This is indeed the case here, with $\lambda$ typically set to a small value (e.g., $10^{-3}$ and smaller).
In this example we minimize the regularized multi-class perceptron cost defined above over a toy dataset with $C=3$ classes, the same dataset used in deriving OvA in the previous post.
# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3class_data.csv',delimiter = ',')
# create an instance of the ova demo
demo = superlearn.multiclass_illustrator.Visualizer(data)
# visualize dataset
demo.show_dataset()
Next, we parse out the input from output, and form the regularized cost function.
# define the input and output of our dataset
x = np.asarray(data[:,:-1])
x.shape = (len(x),np.shape(data)[1]-1); x = x.T;
y = data[:,-1]
y.shape = (len(y),1)
One is free to implement the cost function here in a number of ways. Note however that in the particular implementation shown here the weights from all $C$ classifiers are input as an $N + 1$ by $C$ array of the form
\begin{equation} \mathbf{W}=\left[\begin{array}{cccc} w_{0}^{(1)} & w_{0}^{(2)} & \cdots & w_{0}^{(C)}\\ \mathbf{w}^{(1)} & \mathbf{w}^{(2)} & \cdots & \mathbf{w}^{(C)} \end{array}\right] \end{equation}where the bias and normal vector of the $c^{th}$ classifier have been stacked on top of one another and made the array's $c^{th}$ column. Also note that the quantity $w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}$ is pre-computed for all points prior to the summation and stored in a single array.
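For instance (purely illustrative, with made-up per-classifier weights), such an array can be assembled column-by-column as follows.
# hypothetical biases and normal vectors for C = 3 classifiers in N = 2 dimensions
biases = [0.1,-0.3,0.2]                            # w_0^(1), w_0^(2), w_0^(3)
normals = [np.random.randn(2) for _ in range(3)]   # w^(1), w^(2), w^(3)
# stack each bias on top of its normal vector and place the result in the c-th column
W_example = np.array([np.hstack((b,n)) for b,n in zip(biases,normals)]).T
print(np.shape(W_example))    # (N + 1, C) = (3, 3)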
# multiclass perceptron regularized by the summed length of all normal vectors
lam = 10**-4  # our regularization parameter
def multiclass_perceptron(W):
    # pre-compute the evaluation of every classifier on every point - a P x C array
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])
    # compute the multi-class perceptron cost
    cost = 0
    for p in range(len(y)):
        # pluck out current true label
        y_p = int(y[p][0]) - 1    # subtract one due to python's zero-based indexing
        # update cost summand
        cost += np.max(all_evals[p,:]) - all_evals[p,y_p]
    # return cost with regularizer added
    return cost + lam*np.linalg.norm(W[1:,:],'fro')**2
Since the cost is convex we can use the unnormalized form of gradient descent to minimize it.
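As an aside, because the cost above is written using autograd.numpy, its gradient can be computed via automatic differentiation. The short sketch below - which calls autograd's `grad` function directly rather than the library optimizer employed in this post - illustrates a single hand-rolled descent step.
from autograd import grad
# form the gradient function of the multi-class perceptron cost via automatic differentiation
gradient = grad(multiclass_perceptron)
# a single illustrative gradient descent step with steplength alpha
alpha = 10**-2
W_step = np.random.randn(3,3)
W_step = W_step - alpha*gradient(W_step)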
# declare an instance of our optimizers
opt = superlearn.optimimzers.MyOptimizers()
# run desired algo with initial point, max number of iterations, etc.
W_init = np.random.randn(3,3)   # an (N+1) x C array for N = 2 inputs and C = 3 classes
w_hist = opt.gradient_descent(g = multiclass_perceptron,w = W_init,version = 'unnormalized',max_its = 200, alpha = 10**-2)
With our multi-class classifier trained we now show how it classifies the entire input space. In the left panel we plot each individual two-class classifier. In the middle panel we show the fused multi-class decision boundary formed by combining these individual classifiers via the fusion rule. In the right panel we plot the cost function value over $200$ iterations of gradient descent.
Note in the left panel that because we did not train each individual classifier in an OvA sense - but trained them all together at once - each individual learned two-class classifier performs quite poorly on its own. This is fine, as our cost function aims at minimizing the errors of all classes simultaneously rather than separating one class from the rest at a time as with OvA - so we should not expect each individual classifier to cut the space well. However, because they were learned together, their combination - using the fusion rule - provides a multi-class decision boundary with zero errors.
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right panel
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_perceptron)
As we saw previously, for instance with the two-class perceptron, we are often willing to sacrifice a small amount of modeling precision - forming a closely matching smoother cost function to the one we already have - in order to make optimization easier or expand the optimization tools we can bring to bear.
Recall from our discussion of the two-class perceptron that the softmax function
\begin{equation} \text{soft}\left(s_{1},...,s_{C}\right)=\text{log}\left(\sum_{j = 1}^{C} e^{s_{j}}\right) \end{equation}is a close and smooth approximation to the maximum of $C$ scalar numbers $s_{1},...,s_{C}$, i.e.,
\begin{equation} \text{max}\left(s_{1},...,s_{C}\right) \approx \text{soft}\left(s_{1},...,s_{C}\right) \end{equation}Replacing the max function in each summand of the multi-class perceptron cost in equation (4), we have the softmax multi-class cost function written as

\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = \sum_{p = 1}^P \left[\,\text{log}\left(\sum_{j = 1}^{C} e^{ w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] \end{equation}
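As a quick numerical sanity check (purely illustrative) of how closely the softmax function approximates the max, we can compare the two on a few arbitrary values.
# compare the max of a few scalars to its softmax (log-sum-exp) approximation
s = np.array([1.0,2.0,5.0])
print(np.max(s))                    # 5.0
print(np.log(np.sum(np.exp(s))))    # roughly 5.07 - close to, and never smaller than, the true max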
As with the multi-class perceptron, the multi-class softmax cost also reduces to the two-class version when $C = 2$ (see the appendix of this post for further details). Furthermore, not only is the multi-class softmax cost function convex but - unlike the multi-class perceptron - it has infinitely many smooth derivatives, hence we can use Newton's method (in addition to gradient descent) in order to properly minimize it.
The multi-class softmax cost reduces to the two-class version when $C = 2$. Furthermore, not only is the multi-class softmax cost function convex but - unlike the multi-class perceptron - it has infinitely many smooth derivatives, hence we can use Newton's method (in addition to gradient descent) in order to properly minimize it.
One often sees this cost function go by many names - e.g., softmax regression and multi-class logistic regression - for, as with two-class classification, the softmax cost can be interpreted from a logistic regression (surface fitting) perspective. The cost is also written in various ways in practice - sometimes due to the perspective taken in deriving it, sometimes for reasons of numerical stability. A very common alternative way of writing the softmax multi-class cost (natural when thinking of it from the logistic regression perspective) can be derived using basic properties of the log function. Using the following properties
\begin{equation} \text{log}\left(s\cdot t\right) = \text{log}\left(s\right) + \text{log}\left(t\right) \\ -\text{log}\left(s\right) = \text{log}\left(\frac{1}{s}\right) \end{equation}one can express the softmax multi-class cost in equation (10) equivalently as
\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},...,w_0^{(C)},\,\mathbf{w}_{\mathstrut}^{(C)} \right) = -\sum_{p = 1}^P \text{log}\left(\frac{ e^{ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}} }{ \sum_{j = 1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}} }\right) \end{equation}Care must be taken when implementing any cost function - or mathematical expression in general - involving the exponential function $e^{\left(\cdot\right)}$ in Python. By nature the exponential function grows large very rapidly, causing undesired 'overflow' issues even with moderate-sized exponents, e.g., $e^{1000}$. Large numbers like this cannot be stored explicitly on the computer and so are represented symbolically as $\infty$. Additionally, the division of two such large numbers - a real possibility when evaluating the summands of the multi-class cost in equation (12) - is computed and stored as a NaN (not a number), causing severe numerical stability issues.
One way we can guard against this issue is by normalizing the data - so that it falls within a relatively small range - and regularizing the weights, e.g., via their $\ell_2$ norm, to prevent large weight values. While a workable solution, in practice we can still run into overflow issues immediately after initialization, especially when the input is high-dimensional. It is therefore good practice to implement one's own version of the exponential function that caps the size of its argument, as shown below where $G$ is set to a relatively large value that does not send $e^G$ to $\infty$.
G = 500   # cap on the exponent - large, but e^G does not overflow
def my_exp(x, G):
    # exponential with its argument capped at G to avoid overflow (works elementwise on arrays too)
    return np.exp(np.minimum(x, G))
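For example, an argument that would overflow the standard exponential simply saturates at $e^G$ with the capped version (the particular numbers below are just for illustration).
print(np.exp(1000.))       # overflows - returns inf (with a runtime warning)
print(my_exp(1000.,G))     # capped - returns e^500, a large but representable number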
In this example we run the multi-class softmax classifier on the same dataset used in the previous example, first using unnormalized gradient descent and then Newton's method. In the next Python cell we implement a version of the multi-class softmax cost function complete with regularizer. The weights are formatted precisely as in our implementation of the multi-class perceptron, discussed in Example 1.
# multiclass softmax regularized by the summed length of all normal vectors
lam = 10**-3  # our regularization parameter
def multiclass_softmax(W):
    # pre-compute the evaluation of every classifier on every point - a P x C array
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])
    # compute the multi-class softmax cost
    cost = 0
    for p in range(len(y)):
        # pluck out current true label
        y_p = int(y[p][0]) - 1    # subtract one due to python's zero-based indexing
        # update cost summand
        cost += np.log(np.sum(np.exp(all_evals[p,:]))) - all_evals[p,y_p]
    # return cost with regularizer added
    return cost + lam*np.linalg.norm(W[1:,:],'fro')**2
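As an aside, the same cost can also be written without the explicit loop over points. The vectorized variant below is only a sketch - it assumes the same `x`, `y`, and weight-array layout as above - and should return the same value as `multiclass_softmax`.
def multiclass_softmax_vectorized(W):
    # evaluate all C classifiers on all P points at once - a P x C array
    all_evals = W[0,:] + np.dot(x.T,W[1:,:])
    # log-sum-exp across the classifiers for every point
    softmax_evals = np.log(np.sum(np.exp(all_evals),axis = 1))
    # pick out each point's evaluation under its own class (labels run 1,...,C)
    y_ind = y.astype(int).flatten() - 1
    true_evals = all_evals[np.arange(len(y_ind)),y_ind]
    # sum the differences over all points and add the regularizer
    return np.sum(softmax_evals - true_evals) + lam*np.linalg.norm(W[1:,:],'fro')**2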
Now we minimize this cost function using gradient descent - for $500$ iterations using a fixed steplength value $\alpha = 10^{-2}$.
# declare an instance of our optimizers
opt = superlearn.optimimzers.MyOptimizers()
# run desired algo with initial point, max number of iterations, etc.,
W_init = np.random.randn(3,3)
w_hist = opt.gradient_descent(g = multiclass_softmax,w = W_init,version = 'unnormalized',max_its = 500, alpha = 10**-2)
In the next cell we plot the final classification over the entire space in the left and middle panels, while the cost function history from our run of gradient descent is plotted in the right panel. The left panel shows the final learned two-class classifiers individually; the middle panel shows the multi-class boundary created from these two-class boundaries via the fusion rule. As with the multi-class perceptron, since the multi-class softmax cost focuses on optimizing the parameters of all $C$ two-class classifiers simultaneously to get the best multi-class fit, each one of the two-class decision boundaries need not perfectly distinguish its class from the rest of the data.
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right panel
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_softmax)
Optimizing via Newton's method takes just a few steps: in the next cell we re-run the above experiment using only 5 Newton steps.
# declare an instance of our optimizers
opt = superlearn.optimimzers.MyOptimizers()
# run desired algo with initial point, max number of iterations, etc.,
W_init = np.random.randn(3,3)
w_hist = opt.newtons_method(g = multiclass_softmax,w = W_init,max_its = 5)
Below we then print out the same panels as previously, only displaying the results of Newton's method. Using just a few steps we reach a far lower point on the multi-class softmax function - as can be seen by comparing the right panel below with the one shown previously with gradient descent.
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right panel
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_softmax)
As we saw previously in our post on logistic regression, two-class classification can be viewed from two different but equally important perspectives. The first perspective is one of regression or surface fitting where by keeping the label dimension we fit a step-like function to the data in the problem's input-output space. A second perspective on classification is when we discard the label dimension and try to separate out the points solely in their input space. The latter, which we called the perceptron view, is what we have adopted so far to study multi-class classification.
It should come as little surprise that one can also view multi-class classification from a regression (or surface fitting) perspective - only this time the step-like function to be fit to the data has more than two levels or steps, each defining one of the classes. In the Python cell below we show both perspectives side-by-side using our toy $C=3$ class dataset.
# plot the learned multi-class classifier from the regression (surface fitting) perspective alongside the 'from above' classification perspective
demo.show_surface_fit(w_hist,view = [15,115])
Here we quickly apply Newton's method to fit the multi-class softmax cost to our toy dataset with $C=4$ classes.
# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/4class_data.csv',delimiter = ',')
# create an instance of the ova demo
demo = superlearn.multiclass_illustrator.Visualizer(data)
# visualize dataset
demo.show_dataset()
# define the input and output of our dataset
x = np.asarray(data[:,:-1])
x.shape = (len(x),np.shape(data)[1]-1); x = x.T;
y = data[:,-1]
y.shape = (len(y),1)
In the next cell we run Newton's method for 5 iterations.
# declare an instance of our optimizers
opt = superlearn.optimimzers.MyOptimizers()
# run desired algo with initial point, max number of iterations, etc.,
W_init = np.random.randn(3,4)
w_hist = opt.newtons_method(g = multiclass_softmax,w = W_init,max_its = 5)
Finally, we plot our results, as in the previous Example.
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right panel
demo.show_complete_coloring(w_hist,show_cost = True, cost = multiclass_softmax)
Below we compare the number of misclassifications to the regularized multi-class softmax cost over three runs of unnormalized gradient descent, using $\lambda = 10^{-3}$ and a steplength parameter $\alpha = 10^{-2}$ for each run.
# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3class_data.csv',delimiter = ',')
# create an instance of the ova demo
demo = superlearn.multiclass_illustrator.Visualizer(data)
# run demo
demo.compare_to_counting(num_runs = 3)
We count misclassifications the same way we do with OvA.
Once trained we can compute predicted labels for our training set by simply evaluating each input via the fusion rule. Taking the input of the $p^{th}$ point $\left(\mathbf{x}_p,\,y_p\right)$ we use the fusion rule to produce a predicted output $\hat y_p$ as
\begin{equation} \hat y_p = \underset{j=1,...,C}{\text{argmax}} \,\,\,w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)} \end{equation}and then simply compare the actual and predicted labels to understand if our prediction is correct. One way to write this comparison is via $\left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p} \right) \right |$, which is $0$ if the prediction matches the true label and $+1$ otherwise. We can then write the total number of misclassifications on our training set as
\begin{equation} \text{number of misclassifications on training set } = \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}with the accuracy being then calculated as
\begin{equation} \text{accuracy of learned classifier} = 1 - \frac{1}{P} \sum_{p = 1}^{P} \left | \text{sign}\left(\hat y_p - \overset{\mathstrut}{y_p}\right) \right | \end{equation}

The probabilistic derivation of the multi-class softmax cost is very similar to that of the two-class softmax, discussed previously in our post on logistic regression. Once again, the data independence assumption allows us to simplify the joint likelihood as
\begin{equation} {\cal L}=\prod_{p=1}^{P}{\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},\mathbf{W}\right) \end{equation}where the matrix $\mathbf{W}$ is the collection of all classifier weights as defined in equation (7). Even though $y_p$ here takes one of $C>2$ classes, we can still use the same geometric argument made when we had only two classes in order to propose a functional form for ${\cal P}\left(y=y_{p}\,|\,\mathbf{x}_{p},\mathbf{W}\right)$. Specifically, we again connect the probability of a point $\mathbf{x}_p$ belonging to a certain class to the signed distance from $\mathbf{x}_p$ to that class' decision boundary. That is, the larger the (signed) distance the deeper into that class' subspace $\mathbf{x}_p$ lies, and hence the more confident we are about $\mathbf{x}_p$ belonging to that class.
Assuming all classifiers' normal vectors are normalized to have unit length, the signed distance from $\mathbf{x}_p$ to the $c^{th}$ decision boundary $w_0^{(c)}+\mathbf{x}^T\mathbf{w}^{(c)}=0$ is simply given by the evaluation of the decision boundary at $\mathbf{x}_p$, i.e., $w_0^{(c)}+\mathbf{x}_p^T\mathbf{w}^{(c)}$.
Adopting our geometric argument, we want to have

\begin{equation} {\cal P}\left(y=c\,|\,\mathbf{x}_{p},\mathbf{W}\right) \propto w_0^{(c)}+\mathbf{x}_p^T\mathbf{w}^{(c)} \end{equation}
or, put into words, that the probability of $\mathbf{x}_p$ belonging to class $c$ is proportional to its signed distance to the $c^{th}$ decision boundary. These signed distances however can be negative and hence cannot be used immediately as class probabilities. We can resolve this issue by passing them through an always-positive and monotonically-increasing function such as $e^{\left(\cdot \right)}$ to get

\begin{equation} {\cal P}\left(y=c\,|\,\mathbf{x}_{p},\mathbf{W}\right) \propto e^{w_0^{(c)}+\mathbf{x}_p^T\mathbf{w}^{(c)}} \end{equation}
One issue still remains: $e^{w_0^{(c)}+\mathbf{x}_p^T\mathbf{w}^{(c)}}$ can potentially be larger than one and hence is not a valid probability. Luckily there is a simple fix for this issue as well: divide all values by $\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}}$, giving

\begin{equation} {\cal P}\left(y=c\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(c)}+\mathbf{x}_p^T\mathbf{w}^{(c)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}^{(j)}}} \end{equation}
Notice that all right hand side values now lie between $0$ and $1$, and moreover they all add up to $1$, making them mathematically valid values to represent the class probabilities. Plugging

\begin{equation} {\cal P}\left(y=y_p\,|\,\mathbf{x}_{p},\mathbf{W}\right) = \frac{e^{w_0^{(y_p)}+\mathbf{x}_p^T\mathbf{w}^{(y_p)}}}{\sum_{j=1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}^{(j)}}} \end{equation}into the likelihood in equation (16) and taking the negative logarithm of the resulting function gives the multi-class negative log-likelihood loss
\begin{equation} g\left(\mathbf{W} \right) = -\sum_{p = 1}^P \text{log}\left(\frac{ e^{ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}} }{ \sum_{j = 1}^{C} e^{ w_0^{(j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(j)}} }\right) \end{equation}to be minimized. It should come as no surprise that this is precisely the multi-class softmax cost we derived previously in equation (12).
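A by-product of this probabilistic view is that a trained classifier also provides predicted class probabilities for any input. The sketch below is illustrative only - the helper `class_probabilities` is not part of the library used in this post, and it assumes that `w_hist[-1]` holds the final $\left(N+1\right) \times C$ weight array produced by the most recent run above.
def class_probabilities(x_p,W):
    # evaluate all C classifiers at the length-N point x_p
    evals = W[0,:] + np.dot(x_p,W[1:,:])
    # subtract the max before exponentiating for numerical stability, then normalize
    exps = np.exp(evals - np.max(evals))
    return exps/np.sum(exps)
# example usage: class probabilities of the first training point under the final trained weights
probs = class_probabilities(x[:,0],w_hist[-1])
print(probs,np.sum(probs))    # the entries are nonnegative and sum to 1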
As we have now seen, both OvA and multi-class softmax approaches are built using the fusion rule given in equation (1). While the multi-class softmax approach more directly aims at optimizing this criterion, both OvA and multi-class softmax perform similarly well in practice.
One-versus-all (OvA) and multi-class softmax classifiers perform similarly well in practice, having both been built using the fusion rule in equation (1).
The two methods largely differ in how they are applied in practice as well as their computational burden. In learning each of the $C$ linear separators individually the computation required for the OvA classifier is naturally parallelizable, as each linear separator can be learned independently of the rest. On the other hand, while both OvA and multi-class softmax may be naturally extended for use with nonlinear multi-class classification (as we will see in a future post), the multi-class softmax scheme provides a more commonly used framework for performing nonlinear multi-class classification using neural networks.
When $C=2$ the multi-class perceptron cost in equation (4) reduces to

\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},w_0^{(2)},\,\mathbf{w}_{\mathstrut}^{(2)} \right) = \sum_{p = 1}^P \left[\,\left(\underset{j=1,2}{\text{max}} \,\,\,w_0^{(\,j)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(\,j)}\right) - \left(w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] \end{equation}which, using the following equality for any real values $a$, $b$, and $c$

\begin{equation} \text{max}\left(a,\,b\right) - c = \text{max}\left(a - c,\,b - c\right) \end{equation}can be written equivalently as

\begin{equation} g = \sum_{p = 1}^P \,\underset{j=1,2}{\text{max}} \left[\left(w_0^{(\,j)} - w_0^{(y_p)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(\,j)} - \mathbf{w}_{\mathstrut}^{(y_p)}\right) \right] \end{equation}Grouping the summands according to their labels, we have

\begin{equation} g = \sum_{p\,:\,y_p = 1} \text{max}\left(0,\,\left(w_0^{(2)} - w_0^{(1)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(2)} - \mathbf{w}_{\mathstrut}^{(1)}\right)\right) \,+ \sum_{p\,:\,y_p = 2} \text{max}\left(0,\,\left(w_0^{(1)} - w_0^{(2)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(1)} - \mathbf{w}_{\mathstrut}^{(2)}\right)\right) \end{equation}Re-assigning the labels $y_p=1 \rightarrow y_p=-1$ and $y_p=2 \rightarrow y_p=+1$ to match the label values we used in deriving the two-class perceptron, we can write

\begin{equation} g = \sum_{p\,:\,y_p = -1} \text{max}\left(0,\,\left(w_0^{(+1)} - w_0^{(-1)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(+1)} - \mathbf{w}_{\mathstrut}^{(-1)}\right)\right) \,+ \sum_{p\,:\,y_p = +1} \text{max}\left(0,\,\left(w_0^{(-1)} - w_0^{(+1)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(-1)} - \mathbf{w}_{\mathstrut}^{(+1)}\right)\right) \end{equation}Letting $w_0=w_0^{(+1)}-w_0^{(-1)}$ and $\mathbf{w}=\mathbf{w}^{(+1)}-\mathbf{w}^{(-1)}$, the above can be written as

\begin{equation} g = \sum_{p\,:\,y_p = -1} \text{max}\left(0,\,w_0 + \mathbf{x}_{p}^T\mathbf{w}\right) \,+ \sum_{p\,:\,y_p = +1} \text{max}\left(0,\,-\left(w_0 + \mathbf{x}_{p}^T\mathbf{w}\right)\right) \end{equation}This can be written more compactly to arrive at the familiar two-class perceptron cost function

\begin{equation} g\left(w_0,\,\mathbf{w}\right) = \sum_{p = 1}^P \text{max}\left(0,\,-y_p\left(w_0 + \mathbf{x}_{p}^T\mathbf{w}\right)\right) \end{equation}
When $C=2$ the multi-class softmax cost in equation (12) reduces to

\begin{equation} g\left(w_0^{(1)},\,\mathbf{w}_{\mathstrut}^{(1)},w_0^{(2)},\,\mathbf{w}_{\mathstrut}^{(2)} \right) = -\sum_{p = 1}^P \text{log}\left(\frac{ e^{ w_0^{(y_p)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(y_p)}} }{ e^{ w_0^{(1)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(1)}} + e^{ w_0^{(2)} + \mathbf{x}_{p}^T\mathbf{w}_{\mathstrut}^{(2)}} }\right) \end{equation}Grouping the summands according to their labels, we have

\begin{equation} g = \sum_{p\,:\,y_p = 1} \text{log}\left(1 + e^{\left(w_0^{(2)} - w_0^{(1)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(2)} - \mathbf{w}_{\mathstrut}^{(1)}\right)}\right) \,+ \sum_{p\,:\,y_p = 2} \text{log}\left(1 + e^{\left(w_0^{(1)} - w_0^{(2)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(1)} - \mathbf{w}_{\mathstrut}^{(2)}\right)}\right) \end{equation}Re-assigning the labels $y_p=1 \rightarrow y_p=-1$ and $y_p=2 \rightarrow y_p=+1$ to match the label values we used in deriving the two-class softmax cost previously, we can write

\begin{equation} g = \sum_{p\,:\,y_p = -1} \text{log}\left(1 + e^{\left(w_0^{(+1)} - w_0^{(-1)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(+1)} - \mathbf{w}_{\mathstrut}^{(-1)}\right)}\right) \,+ \sum_{p\,:\,y_p = +1} \text{log}\left(1 + e^{\left(w_0^{(-1)} - w_0^{(+1)}\right) + \mathbf{x}_{p}^T\left(\mathbf{w}_{\mathstrut}^{(-1)} - \mathbf{w}_{\mathstrut}^{(+1)}\right)}\right) \end{equation}Letting $w_0=w_0^{(+1)}-w_0^{(-1)}$ and $\mathbf{w}=\mathbf{w}^{(+1)}-\mathbf{w}^{(-1)}$, the above can be written as

\begin{equation} g = \sum_{p\,:\,y_p = -1} \text{log}\left(1 + e^{w_0 + \mathbf{x}_{p}^T\mathbf{w}}\right) \,+ \sum_{p\,:\,y_p = +1} \text{log}\left(1 + e^{-\left(w_0 + \mathbf{x}_{p}^T\mathbf{w}\right)}\right) \end{equation}This can be written more compactly to arrive at the familiar two-class softmax cost function

\begin{equation} g\left(w_0,\,\mathbf{w}\right) = \sum_{p = 1}^P \text{log}\left(1 + e^{-y_p\left(w_0 + \mathbf{x}_{p}^T\mathbf{w}\right)}\right) \end{equation}
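As a quick numerical check of this reduction (purely illustrative, using made-up points, labels, and random weights), one can evaluate both expressions directly and confirm they agree.
# small two-class problem: P points in N dimensions with labels in {1, 2}
P_check,N_check = 5,2
X_check = np.random.randn(P_check,N_check)
labels = np.array([1,2,1,2,2])               # original labels y_p in {1, 2}
W_check = np.random.randn(N_check + 1,2)     # column j stacks w_0^(j) on top of w^(j)
# multi-class softmax cost evaluated with C = 2
evals = W_check[0,:] + np.dot(X_check,W_check[1:,:])
multi = -np.sum(np.log(np.exp(evals[np.arange(P_check),labels - 1])/np.sum(np.exp(evals),axis = 1)))
# two-class softmax cost with re-assigned labels y_p in {-1, +1} and differenced weights
y_pm = 2*labels - 3                          # maps 1 -> -1 and 2 -> +1
w_0 = W_check[0,1] - W_check[0,0]
w = W_check[1:,1] - W_check[1:,0]
two = np.sum(np.log(1 + np.exp(-y_pm*(w_0 + np.dot(X_check,w)))))
print(multi,two)                             # the two values agree up to numerical precision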
Notice that the following facts always hold:
I. Addition of two (or more) convex functions is always convex.
II. Linear and affine functions are convex.
III. The max, exponential, and negative logarithm functions are all convex.
IV. The composition of a convex function with an affine function remains convex (more generally, the composition of a non-decreasing convex function with a convex function is convex).
Each of the statements above can be verified easily using the following definition of convexity:
A function g is convex if and only if for all $\mathbf{w}_1$ and $\mathbf{w}_2$ in the domain of g and all $\lambda \in \left[0, 1\right]$, we have
\begin{equation} g\left(\lambda \mathbf{w}_1+\left(1-\lambda\right) \mathbf{w}_2\right)\leq \lambda g\left(\mathbf{w}_1\right)+\left(1-\lambda\right) g\left(\mathbf{w}_2\right) \end{equation}With these four statements at hand, it is straightforward to prove the convexity of the multi-class perceptron and softmax cost functions.
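As an informal numerical illustration of this definition (not a proof), one can check the inequality for the multi-class softmax cost implemented earlier, evaluated at randomly drawn weight arrays. This assumes the `x` and `y` arrays above still hold the $C=4$ dataset from the previous example (hence the four-column weight arrays), and the mixing weight is named `lam_cvx` so as not to collide with the regularization parameter `lam` used inside the cost.
# informal check of the convexity inequality for the (regularized) multi-class softmax cost
W_1 = np.random.randn(3,4)    # (N+1) x C arrays for the C = 4 example above
W_2 = np.random.randn(3,4)
for lam_cvx in [0.0,0.25,0.5,0.75,1.0]:
    left = multiclass_softmax(lam_cvx*W_1 + (1 - lam_cvx)*W_2)
    right = lam_cvx*multiclass_softmax(W_1) + (1 - lam_cvx)*multiclass_softmax(W_2)
    print(left <= right + 10**-8)    # prints True for every value of lam_cvx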