Nonlinear supervised learning

Press the 'Toggle code' button below to toggle code on and off for this entire presentation.

In [11]:
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle code</button>''', raw=True)

1.1 From linear to non-linear modeling

  • Thus far we have covered the basics of linear modeling for regression and classification.
  • With regression we started with a desire that a dataset could be represented linearly via the model $w_0 + \mathbf{x}_p^T\mathbf{w}_{\,}^{\,} \approx y_p$
  • This desire led to the Least Squares cost function for linear regression, whose minimization forces this desire to hold as well as possible.
  • So, in other words, ideally we want all our data to lie on some hyperplane $w_0^{\,} + \mathbf{x}_{\,}^T\mathbf{w}_{\,}^{\,} = y_{\,}^{\,}$ in the input/output space
  • So our ideal dataset is a bunch of points - with no noise - from a hyperplane

  • With classification we started with a desire that a dataset could be represented by a step function with a linear boundary
  • A hyperplane in the input space $w_0 + \mathbf{x}_{\,}^T\mathbf{w}_{\,}^{\,} = 0$ where
\begin{equation} \begin{cases} w_0+\mathbf{x}_{p}^{T}\mathbf{w}^{\,} >0 & \,\,\,\,\text{if} \,\,\, y_{p}=+1 \\ w_0+\mathbf{x}_{p}^{T}\mathbf{w}^{\,} < 0 & \,\,\,\,\text{if} \,\,\, y_{p}=-1 \end{cases} \end{equation}
  • In other words, ideally we want data to lie perfectly on a step function $\text{sign}\left(w_0 + \mathbf{x}_{\,}^T\mathbf{w}_{\,}^{\,} \right) = y$

The generalization of this notion for regression

  • for regression: our data lies perfectly on some nonlinear surface
  • instead of hyperplane $w_0^{\,} + \mathbf{x}_{\,}^T\mathbf{w}_{\,}^{\,} = y_{\,}^{\,}$ we have some general nonlinear function $f\left(\mathbf{x},\mathbf{w}\right) = y$
  • as with linear regression we do not expect perfection!

The generalization of this notion for classification

  • for classification: our data lies perfectly on a step function with a nonlinear boundary
  • instead of a separating hyperplane $w_0^{\,} + \mathbf{x}_{\,}^T\mathbf{w}_{\,}^{\,} = 0$ we have some general nonlinear separator $f\left(\mathbf{x},\mathbf{w}\right) = 0$
  • as with linear classification we do not expect perfection!

1.2 'Simple' examples of nonlinear regression and classification

  • For some low-dimensional datasets we could visualize and reason out a proper nonlinear function / separator by eye
  • In some classical cases from e.g., physics and economics one can reason out the form of a regressor via philosophical arguments
  • (This is what differential equations and dynamical systems are all about)
  • In general we cannot rely on either our eyes or our understanding to provide the nonlinear form
  • Most datasets are too high dimensional to visualize, and their mechanics lie beyond our comprehension
  • So in general we need something more robust - this is where machine learning comes in!
  • Tools like neural networks help automate the search for proper nonlinear regressors / boundaries (as do trees and kernels)
  • First let's examine some 'easy' cases where we can make a good guess ourselves
  • Some important principles, formulations, jargon, etc., from these simple cases carry over to neural networks, trees, kernels, and so on

'Simple' nonlinear regression examples

Example 1. A sinusoidal dataset

In [20]:
# create instance of the nonlinear regression demo, used below and in the next examples
demo1 = nonlib.nonlinear_regression_visualizer.Visualizer(csvname = datapath + 'noisy_sin_sample.csv')
demo1.plot_data(xlabel = 'x',ylabel = 'y')
  • Looks sinusoidal, i.e., like
\begin{equation} w_0 + w_1\text{sin}\left(w_2 + w_3x_p\right) \approx y_p \end{equation}

could fit the data well if parameters are tuned

  • This would give us a nonlinear regressor or predictor of the form
\begin{equation} \text{predict}(x) = w_0 + w_1\text{sin}\left(w_2 + w_3x\right) \end{equation}
  • How do we fit? As in the linear case we want the above to hold on our data, so we square the differences and minimize a Least Squares cost
  • Summing over all $P$ datapoints gives the Least Squares cost function
\begin{equation} g\left(w_0, w_1, w_2, w_3\right) = \sum_{p=1}^P \left(w_0 + w_1\text{sin}\left(w_2 + w_3x_p\right) - y_p\right)^2 = \sum_{p = 1}^P \left(\text{predict}\left(x_p\right) - y_p\right)^2 \end{equation}
  • Minimize this by (normalized) gradient descent
  • In Python implementation we break the cost into modular pieces
    • the sine nonlinearity --> the prediction --> the cost function
In [21]:
# nonlinearity
def f(x_val,w):
    # create feature
    f_val = np.sin(w[2] + w[3]*x_val)
    return f_val

# prediction
def predict(x_val,w):
    # linear combo
    val = w[0] + w[1]*f(x_val,w)
    return val

# least squares
def least_squares(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p]
        y_p = y[p]
        cost +=(predict(x_p,w) - y_p)**2
    return cost

Now we can minimize the cost using gradient descent
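Shown below is a minimal sketch of such a (normalized) gradient descent routine; it is illustrative rather than the notebook's actual optimizer. It assumes the autograd library is available (importing autograd.numpy as np so that the least_squares cost above becomes automatically differentiable), and the step length and iteration count are arbitrary choices.

In [ ]:
# illustrative sketch of normalized gradient descent (not the original optimizer)
# assumes autograd is installed; rebinding np to autograd.numpy makes the
# least_squares cost above automatically differentiable
import autograd.numpy as np
from autograd import grad

def gradient_descent(g, w_init, alpha=0.1, max_its=1000):
    gradient = grad(g)                        # automatic gradient of the scalar cost g
    w = w_init
    for _ in range(max_its):
        grad_eval = gradient(w)
        grad_norm = np.linalg.norm(grad_eval)
        if grad_norm == 0:                    # stop at a stationary point (avoid division by zero)
            break
        w = w - alpha*grad_eval/grad_norm     # normalized descent step
    return w

# e.g., tune the four sinusoid weights from a random (illustrative) initialization
w_best = gradient_descent(least_squares, np.random.randn(4))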

We can then plot the resulting fit to the data

In [25]:
# plot the resulting fit to the data
demo1.static_img(w_best,least_squares,predict)
  • Some common jargon to familiarize yourself with: we have already seen the predictor (now with all weights tuned)
\begin{equation} \text{predict}(x) = w_0 + w_1\text{sin}\left(w_2 + w_3x\right) \end{equation}
  • In particular the nonlinear transformation of $x$ is called a feature transformation (we wrote this explicitly in code)
\begin{equation} \text{feature transformation}: f(x) = \text{sin}\left(w_2 + w_3x\right) \end{equation}
  • We can write the predictor as
\begin{equation} \text{predict}(x) = w_0 + w_1\,f(x) \end{equation}
  • In other words, the predictor is a linear combination of the feature transform $f(x)$
  • Thus in the transformed feature space with input horizontal axis $f(x)$ and vertical axis $y$ our sinusoidal fit is linear
In [26]:
# static transform image
demo1.static_img(w_best,least_squares,predict,f1_x = [f(v,w_best) for v in x])
  • This is true in general

A properly designed feature (or set of features) provides a good nonlinear fit in the original feature space and, simultaneously, a good linear fit in the transformed feature space.

Example 2. A population growth example

  • Next we examine a population growth dataset - of Yeast cells growing in a constrained environment (source).

  • Both input and output normalized

In [16]:
# create instance of the nonlinear regression demo, used below and in the next examples
demo2 = nonlib.nonlinear_regression_visualizer.Visualizer(csvname = datapath + 'yeast.csv')

# plot dataset
demo2.plot_data(xlabel = 'time', ylabel = 'yeast population')
  • What does this look like?
  • Certainly looks like a sigmoidal regressor would do the job well here
\begin{equation} \text{predict}(x) = w_0 + w_1\text{tanh}\left(w_2 + w_3x\right) \end{equation}
  • Here $f(x) = \text{tanh}\left(w_2 + w_3x\right)$ is our feature transformation, and we minimize the Least Squares error
\begin{equation} g\left(w_0, w_1, w_2, w_3\right) = \sum_{p = 1}^P \left(\text{predict}\left(x_p\right) - y_p\right)^2 \end{equation}
  • Once the weights are tuned properly this provides a linear fit in the transformed feature space
In [23]:
# nonlinearity
def f(x,w):
    # shove through nonlinearity
    c = np.tanh( w[2] + w[3]*x)
    return c

# prediction
def predict(x,w):
    # linear combo
    val = w[0] + w[1]*f(x,w)
    return val

# least squares
def least_squares(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p]
        y_p = y[p]
        cost +=(predict(x_p,w) - y_p)**2
    return cost
  • Center data and minimize via normalized gradient descent
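The centering step can be as simple as subtracting the mean from the input and output before minimizing; the sketch below is our own illustration (the names x_centered and y_centered are not from the original notebook), and it assumes x and y are 1-d numpy arrays holding this dataset.

In [ ]:
# illustrative centering step: subtract the mean from input and output
# (assumes x and y are 1-d numpy arrays holding the yeast dataset)
x_centered = x - np.mean(x)
y_centered = y - np.mean(y)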
In [26]:
# static transform image
demo2.static_img(w_best,least_squares,predict,f1_x = [f(v,w_best) for v in x])
  • On the right - data in transformed feature space (note again all weights are now tuned)

Example 3. Galileo's ramp experiment

  • In 1638 Galileo wanted to understand how gravity acts on objects, i.e., to quantify the relationship between distance fallen and time
  • In particular: how does an object's height above the ground affect the time it takes to hit the ground when dropped?
  • He set up the following experiment

Figure 1: Figurative illustration of Galileo's ramp experiment setup used for exploring the relationship between time and the distance an object falls due to gravity. To perform this experiment he repeatedly rolled a ball down a ramp and timed how long it took to get $\frac{1}{4}$,$\frac{1}{2}$, $\frac{2}{3}$, $\frac{3}{4}$, and all the way down the ramp.

  • Why didn't he just drop the ball and time how long it took to get $\frac{1}{4}$, $\frac{1}{2}$, etc., to the ground?
  • Reliable clocks didn't exist yet! (He used a water clock.) Galileo's later study of pendulums led to the first reliable clock
In [17]:
# create instance of the nonlinear regression demo, used below and in the next examples
demo3 = nonlib.nonlinear_regression_visualizer.Visualizer(csvname = datapath + 'galileo_ramp_data.csv')

# plot dataset
demo3.plot_data(xlabel = 'time (in seconds)',ylabel = 'portion of ramp traveled')
  • Clearly a nonlinear relationship - Galileo intuited a quadratic one
  • In other words, that for some $w_0$, $w_1$, and $w_2$ we have $\text{predict}(x_p) \approx y_p$ where
\begin{equation} \text{predict}(x) = w_0 + w_1x + w_2x^2 \end{equation}
  • Here we have two feature transformations, the identity $f_1(x) = x$ and the square $f_2(x) = x^2$, so we can write
\begin{equation} \text{predict}(x) = w_0 + w_1\,f_1(x) + w_2\,f_2(x) \end{equation}
  • Recover the weights by minimizing the Least Squares error
\begin{equation} g\left(w_0, w_1, w_2\right) = \sum_{p = 1}^P \left(\text{predict}\left(x_p\right) - y_p\right)^2 \end{equation}
In [28]:
# feature transformation
def f1(x):
    return x

def f2(x):
    return x**2
    
# prediction
def predict(x,w):
    # linear combo
    a = w[0] + w[1]*f1(x) + w[2]*f2(x)
    return a

# least squares
def least_squares(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p]
        y_p = y[p]
        cost +=(predict(x_p,w) - y_p)**2
    return cost

And we optimize using e.g., (unnormalized) gradient descent
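Because f1 and f2 contain no tunable internal weights, the Least Squares cost above is an ordinary linear least squares problem in $\left(w_0, w_1, w_2\right)$, so as an alternative (illustrative) route it could also be solved in closed form; the sketch below assumes x and y are numpy arrays holding the ramp data.

In [ ]:
# illustrative alternative to gradient descent: with fixed features the cost is
# linear least squares in (w_0, w_1, w_2) and can be solved in closed form
# (assumes x and y are numpy arrays holding the ramp data)
import numpy
x_flat = numpy.asarray(x).ravel()
y_flat = numpy.asarray(y).ravel()
A = numpy.stack([numpy.ones(len(x_flat)), f1(x_flat), f2(x_flat)], axis=1)  # columns: 1, x, x**2
w_closed_form, *_ = numpy.linalg.lstsq(A, y_flat, rcond=None)               # closed-form weights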

In [30]:
# static transform image
f1_x = [f1(v) for v in x]; f2_x = [f2(v) for v in x]
demo3.static_img(w_best,least_squares,predict,f1_x=f1_x,f2_x=f2_x,view = [35,100])
  • Here the linear fit is in one dimension higher than where we started!
  • True more generally speaking: the more feature transforms we use the higher up we go!

'Simple' nonlinear classification examples

Example 4. A one dimensional example

  • With linear regression we started with a desire that a dataset could be represented linearly via the model $w_0 + {w}_{1}^{\,}{x}_p \approx y_p$
  • With logistic regression we started with a desire that a dataset could be represented by passing a line through a sigmoid, as in $\text{tanh}\left(w_0 + {w}_{1}^{\,}{x}_p\right) \approx y_p$
  • Once tuned, $w_0 + w_1x = 0$ is our decision boundary
  • Our predictor is then
\begin{equation} \text{predict}(x) = w_0 + w_1x \end{equation}
  • If $\text{predict}(x) > 0$ then $x$ assigned to $+1$ class, if $\text{predict}(x) < 0$ assigned to $-1$ class

  • Tune by minimizing e.g., softmax cost
\begin{equation} g\left(w_0,w_1\right) = \sum_{p=1}^{P} \text{log}\left(1 + e^{-y_p\left(w_0 + w_1x_p\right)} \right) = \sum_{p=1}^{P} \text{log}\left(1 + e^{-y_p\, \text{predict}_{}\left(x_p\right)} \right) \end{equation}
  • What about this dataset?
In [18]:
# create instance of the nonlinear classification demo, used below and in the next examples
demo4 = nonlib.nonlinear_classification_visualizer.Visualizer(csvname = datapath + 'signed_projectile.csv')

# plot dataset
demo4.plot_data()
  • How about a quadratic decision boundary: $w_0 + w_1x^{\,} + w_2x^2 = 0$
  • If we denote
\begin{equation} \text{predict}(x) = w_0 + w_1x^{\,} + w_2x^2 \end{equation}

then once tuned, if $\text{predict}(x) > 0$ then $x$ is in class $+1$, and if $\text{predict}(x)<0$ then in class $-1$

  • Here we have two feature transformations (we will write explicitly in code)
\begin{equation} \text{feature transformations}: f_1(x) = x \,\,\,\,\,\, f_2(x) = x^2 \end{equation}
  • So we can write our predictor
\begin{equation} \text{predict}(x) = w_0 + w_1\,f_1(x) + w_2\,f_2(x) \end{equation}

  • Tune weights by minimizing e.g., the softmax cost: $\,g\left(w_0,w_1,w_2\right) = \sum_{p=1}^{P} \text{log}\left(1 + e^{-y_p\, \text{predict}_{}\left(x_p\right)} \right)$
In [31]:
# feature transforms
def f1(x):
    return x

def f2(x):
    return x**2
    
# prediction
def predict(x,w):
    # linear combo
    val = w[0] + w[1]*f1(x) + w[2]*f2(x)
    return val

# softmax cost
def softmax(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p]
        y_p = y[p]
        cost += np.log(1 + np.exp(-y_p*predict(x_p,w)))
    return cost
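The tuned weights w_best used in the plot below are assumed to come from minimizing this softmax cost, for example with the normalized gradient descent sketch from Example 1; the initialization and settings here are illustrative, not the notebook's actual values.

In [ ]:
# illustrative only: reuse the gradient descent sketch from Example 1 on the
# softmax cost above (assumes that helper and this dataset are loaded)
w_best = gradient_descent(softmax, np.random.randn(3), alpha=0.1, max_its=2000)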
In [35]:
# static transform image
demo4.static_N1_img(w_best,softmax,predict,f1_x = [f1(s) for s in x], f2_x = [f2(s) for s in x],view = [25,15])
  • A nonlinear decision boundary in the original space is linear in the transformed feature space!
  • Our original space had one feature ($x$), our new one has two features: so we go up a dimension!
  • This is true in general:

Properly designed features provide good nonlinear separation in the original feature space and, simultaneously, good linear separation in the transformed feature space.

Example 5. A two-dimensional example

  • How about this dataset?
In [19]:
# create instance of the nonlinear classification demo, used below and in the next examples
demo5 = nonlib.nonlinear_classification_visualizer.Visualizer(datapath + 'ellipse_2class_data.csv')

# plot dataset
demo5.plot_data()
  • How about an elliptical decision boundary?
\begin{equation} \text{predict}(x_1,x_2) = w_0^{\,} + w_1^{\,} x_1^2 + w_2^{\,}x_2^2 \end{equation}
  • Here we use two feature transformations $f_1(x_1,x_2)=x_1^2$ and $f_2(x_1,x_2) = x_2^2$
In [37]:
# features
def f1(x):
    return (x[0])**2

def f2(x):
    return (x[1])**2
    
# prediction
def predict(x,w):
    # linear combo
    a = w[0] + w[1]*f1(x) + w[2]*f2(x)
    return a

# softmax cost
def softmax(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p,:]
        y_p = y[p]
        cost += np.log(1 + np.exp(-y_p*predict(x_p,w)))
    return cost
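As a quick sanity check (our own illustrative addition, not part of the original notebook), the class-assignment rule from earlier - positive prediction means class $+1$, negative means class $-1$ - can be applied with the tuned weights to count misclassifications; this assumes x is a $P \times 2$ array of inputs and y holds labels in $\{-1,+1\}$.

In [ ]:
# illustrative sanity check: apply the sign-based class assignment rule with the
# tuned weights and count misclassified points (assumes x is a P x 2 array and
# y holds labels in {-1,+1})
predicted_labels = np.array([np.sign(predict(x[p,:], w_best)) for p in range(len(y))])
num_misclassified = int(np.sum(predicted_labels != np.asarray(y).ravel()))
print(num_misclassified, 'points misclassified')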
In [40]:
# illustrate results
demo5.static_N2_img(w_best,softmax,predict,f1,f2,view1 = [20,45],view2 = [20,30])

1.3 Conclusions

  • We have seen that when we can identify a candidate nonlinearity we can quickly swap out linearity for nonlinearity in both our regression and classification paradigms
  • Each nonlinearity is called a 'transformed feature' or simply a feature
  • For regression:

A properly designed feature (or set of features) provides a good nonlinear fit in the original feature space and, simultaneously, a good linear fit in the transformed feature space.

  • For classification:

Properly designed features provide good nonlinear separation in the original feature space and, simultaneously, good linear separation in the transformed feature space.

  • But rarely can we identify a complete nonlinearity / all features of a dataset the way we have here ('by eye')
  • For example, can you identify the right features for these two datasets?
  • Pretty hard! And we can visualize these!
  • And we have no chance of doing this when we have more than $N = 2$ inputs (we can't visualize!)
  • This is where neural networks (as well as kernels and trees) come in!