► Computational Calculus Series

Part 6: Taylor series

In this post we use everything we have learned about higher order derivatives to define the Taylor Series of a function - a fundamental tool in mathematical optimization.

In [2]:
# imports from custom library
import sys
sys.path.append('../../')
import autograd.numpy as np
import matplotlib.pyplot as plt
from mlrefined_libraries import calculus_library as calclib
%load_ext autoreload
%autoreload 2

In the last few posts we described in significant detail how you can build your own Automatic Differentiator (and we will have more to say about them in future posts as well). However, in the interest of making these posts as modular as possible, starting in this post we will often use autograd - a free, professionally built and maintained derivative calculator - to work examples, so that posts like this one can be read and played with independently of the posts on automatic differentiation. In short, we will be using the autograd derivative calculator for many examples going forward to make learning higher level concepts easier.

Example 1. A simple example illustrating how to use autograd

Autograd is an automatic derivative calculator built to differentiate general numpy code, and mathematical functions defined by numpy code in particular. Here we show off the basic usage of the calculator.

First we can define any math function we like - for example

\begin{equation} g(w) = \text{tanh}(w) \end{equation}

We express this function using numpy - or more specifically a thinly wrapped version of numpy corresponding to the autograd differentiator.

In [3]:
# import thinly wrapped numpy
import autograd.numpy as np

# define a math function
g = lambda w: np.tanh(w)

Autograd is an Automatic Differentiator - meaning that we can use it to produce a programmatic function for the derivative of the above (i.e., not an algebraic formula for the derivative). We can do this by importing the Automatic Differentiator from the autograd library and then shoving the function defined above through it.

In [4]:
# import autograd Automatic Differentiator to compute the derivatives
from autograd import grad   

# compute the derivative of our input function
dgdw = grad(g)

This derivative is itself a function that we can call just as we call the original function g.

In [5]:
# define set of points over which to plot function and derivative
w = np.linspace(-3,3,2000)

# evaluate the input function g and derivative dgdw over the input points
gvals = [g(v) for v in w]
dgvals = [dgdw(v) for v in w]

# plot the function and derivative
fig = plt.figure(figsize = (7,3))
plt.plot(w,gvals,linewidth=2)
plt.plot(w,dgvals,linewidth=2)
plt.legend(['$g(w)$',r'$\frac{\mathrm{d}}{\mathrm{d}w}g(w)$'],loc='center left', bbox_to_anchor=(1, 0.5),fontsize = 13)
plt.show()

We can compute further derivatives of this input function by using the same autograd function, only this time plugging in the derivative dgdw. Doing this once gives us the second derivative.

In [7]:
# compute the second derivative of our input function
dgdw2 = grad(dgdw)

We can then plot this along with the first derivative and original function.

In [8]:
# define set of points over which to plot function and first two derivatives
w = np.linspace(-3,3,2000)

# evaluate the input function g, first derivative dgdw, and second derivative dgdw2 over the input points
gvals = [g(v) for v in w]
dgvals = [dgdw(v) for v in w]
dg2vals = [dgdw2(v) for v in w]

# plot the function and derivative
fig = plt.figure(figsize = (7,3))
plt.plot(w,gvals,linewidth=2)
plt.plot(w,dgvals,linewidth=2)
plt.plot(w,dg2vals,linewidth=2)
plt.legend(['$g(w)$',r'$\frac{\mathrm{d}}{\mathrm{d}w}g(w)$',r'$\frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w)$'],loc='center left', bbox_to_anchor=(1, 0.5),fontsize = 13)
plt.show()

And we can keep going - computing higher and higher order derivatives - using the same process.
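
For instance, here is a minimal sketch of this process - using only the pattern above, with variable names of our own choosing - that collects the function and its first four derivatives in a list by repeatedly applying grad.

In [ ]:
# repeatedly apply grad to collect a list of derivative functions of g,
# starting with the function itself and going up to fourth order
derivatives = [g]
for n in range(4):
    derivatives.append(grad(derivatives[-1]))

# evaluate the function and each derivative at a single sample point
w_sample = 1.0
print([float(d(w_sample)) for d in derivatives])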

1. Linear approximation is only the beginning

In this post we go back to where it all began with the derivative - the tangent line - examining it through the lens of function approximation. This perspective allows us to generalize the notion of the tangent line, leading to the development and discussion of the Taylor Series.

1.1 A new perspective on the tangent line

We began our discussion of derivatives several posts ago by defining the derivative at a point as the slope of the tangent line to a given input function.

For a function $g(w)$ we then formally described the tangent line at a point $w^0$ as

\begin{equation} h(w) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0) \end{equation}

with the slope here given by the derivative $\frac{\mathrm{d}}{\mathrm{d}w}g(w^0)$. The justification for examining the tangent line / the derivative in the first place is fairly straightforward: locally (close to the point $w^0$) the tangent line looks awfully similar to the function, and so if we want to better understand $g$ near $w^0$ we can just as well look at the tangent line. This makes our lives a lot easier because a line is a fairly simple object - especially when compared to an arbitrary function $g$ - and so understanding the tangent line is always a simple affair.

Below we plot an example function with tangent line defined by the derivative at the point $w^0 = 1$.

In [11]:
# create area over which to evaluate everything
w = np.linspace(-3,3,2000); w_0 = 1.0; w_=np.linspace(-2+w_0,2+w_0,2000);

# define and evaluate the function, define derivative
g = lambda w: np.sin(w); dgdw = grad(g);
gvals = [g(v) for v in w]

# create tangent line at a point w_0
tangent = g(w_0) + dgdw(w_0)*(w_ - w_0)

# plot the function and derivative 
fig = plt.figure(figsize = (6,4))
plt.plot(w,gvals,c = 'k',linewidth=2,zorder = 1)
plt.plot(w_,tangent,c = [0,1,0.25],linewidth=2,zorder = 2)
plt.scatter(w_0,g(w_0),c = 'r',s=50,zorder = 3,edgecolor='k',linewidth=1)
plt.legend(['$g(w)$','tangent'],loc='center left', bbox_to_anchor=(1, 0.5),fontsize = 13)
plt.show()

If we study the form of our tangent line $h(w)$ closely, we can define in precise mathematical terms how it matches the function $g$. Notice first of all that the tangent line takes on the same value as the function $g$ at the point $w^0$. Plugging $w^0$ into $h$ we can see that

\begin{equation} h(w^0) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w^0 - w^0) = g(w^0) \end{equation}

Next notice that the first derivative values of these two functions match as well. That is, if we take the first derivative of $h$ with respect to $w$ - applying the derivative rules for elementary functions/operations we have seen in the last several posts - we can see that

\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = \frac{\mathrm{d}}{\mathrm{d}w}\left (g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0)\right) = \frac{\mathrm{d}}{\mathrm{d}w}\left ( \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0)\right) = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \end{equation}

In short, the tangent line $h$ matches $g$ at $w^0$ in that both the function value and the derivative value there are equal.

\begin{array} \ 1. \,\,\, h(w^0) = g(w^0) \\ 2. \,\,\, \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \\ \end{array}
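
We can also verify these two properties numerically by differentiating the tangent line itself with autograd. Below is a quick sanity check of our own - not part of any library - using the sine example and the point $w^0 = 1$ from the cell above.

In [ ]:
# build the tangent line h to g at w_0, as a function of w
w_0 = 1.0
h = lambda w: g(w_0) + dgdw(w_0)*(w - w_0)
dhdw = grad(h)

# both pairs of printed values should match
print(h(w_0), g(w_0))        # property 1: equal function values
print(dhdw(w_0), dgdw(w_0))  # property 2: equal derivative values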

What if we turned this around - say we wanted to find a line that satisfies these two properties? We start with a general line

\begin{equation} h(w) = a_0 + a_1(w - w^0) \end{equation}

with unknown coefficients $a_0$ and $a_1$ - and we want to determine the right values for these coefficients so that the line satisfies these two properties. What would we do? Well, since the two criteria above constitute a system of equations, we can compute the left hand side of both and solve for the correct values of $a_0$ and $a_1$. Computing the left hand side of each - where $h$ is our general line - we end up with a trivial system of equations to solve for both unknowns simultaneously

\begin{array} \ h(w^0) = a_0 = g(w^0) \\ \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = a_1 = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \end{array}

and behold, the coefficients are precisely those of the tangent line!

In other words, if we start with a general line and determine parameter values that satisfy the two criteria above we could have derived the tangent line from first principles.

1.2 From tangent line to tangent quadratic

Given that the function and derivative values of the tangent line match those of its underlying function, can we do better? Can we find a simple function that matches the function value, first derivative value, and second derivative value of $g$ at the point $w^0$? In other words, is it possible to determine a simple function $h$ that satisfies

\begin{array} \ 1. \,\,\, h(w^0) = g(w^0) \\ 2. \,\,\, \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \\ 3. \,\,\, \frac{\mathrm{d}^2}{\mathrm{d}w^2}h(w^0) = \frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0) \\ \end{array}

Notice how a (tangent) line $h$ can only satisfy the first two of these properties and never the third since, being a degree 1 polynomial, $\frac{\mathrm{d}^2}{\mathrm{d}w^2}h(w) = 0$ for all $w$. This fact implies that we need at least a degree 2 polynomial to satisfy all three criteria, since the second derivative of a degree 2 polynomial need not be equal to zero.

What sort of degree 2 polynomial could satisfy these three criteria? Starting with a general degree 2 polynomial

\begin{equation} h(w) = a_0 + a_1(w - w^0) + a_2(w - w^0)^2 \end{equation}

with unknown coefficients $a_0$, $a_1$, and $a_2$, we can evaluate the left hand side of each criterion forming a system of 3 equations and solve for these coefficients.

\begin{array} \ h(w^0) = a_0 = g(w^0) \\ \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = a_1 = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \\ \frac{\mathrm{d}^2}{\mathrm{d}w^2}h(w^0) = 2a_2 = \frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0)\\ \end{array}

With all of our coefficients solved we have a degree 2 polynomial that satisfies the three desired criteria

\begin{equation} h(w) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0) + \frac{1}{2}\frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0)(w - w^0)^2 \end{equation}

This is one step beyond the tangent line - a tangent quadratic function - and note that its first two terms are precisely the tangent line itself. We plot an example in the next Python cell. Notice how this degree 2 polynomial does a much better job of matching the underlying function around the point $w^0$ than the tangent line does, which is by design: its value, along with its first and second derivative values, match the underlying function's at $w^0$.

In [14]:
# create area over which to evaluate everything
w = np.linspace(-3,3,2000); w_0 = 1.0; w_=np.linspace(-2+w_0,2+w_0,2000);

# define and evaluate the function, define derivative
g = lambda w: np.sin(w); dgdw = grad(g); dgdw2 = grad(dgdw);
gvals = [g(v) for v in w]

# create tangent line and quadratic
tangent = g(w_0) + dgdw(w_0)*(w_ - w_0)
quadratic = g(w_0) + dgdw(w_0)*(w_ - w_0) + 0.5*dgdw2(w_0)*(w_ - w_0)**2

# plot the function and derivative 
fig = plt.figure(figsize = (7,4))
plt.plot(w,gvals,c = 'k',linewidth=2,zorder = 1)
plt.plot(w_,tangent,c = [0,1,0.25],linewidth=2,zorder = 2)
plt.plot(w_,quadratic,c = [0,0.75,1],linewidth=2,zorder = 2)
plt.scatter(w_0,g(w_0),c = 'r',s=50,zorder = 3,edgecolor='k',linewidth=1)
plt.legend(['$g(w)$','tangent line','tangent quadratic'],loc='center left', bbox_to_anchor=(1, 0.5),fontsize = 13)
plt.show()

1.3 Building better and better local approximations

Having derived this quadratic based on our reflection on the tangent line, one could think of going one step further. That is, we could find a simple function $h$ that satisfies one more condition than the quadratic

\begin{array} \ 1. \,\,\, h(w^0) = g(w^0) \\ 2. \,\,\, \frac{\mathrm{d}}{\mathrm{d}w}h(w^0) = \frac{\mathrm{d}}{\mathrm{d}w}g(w^0) \\ 3. \,\,\, \frac{\mathrm{d}^2}{\mathrm{d}w^2}h(w^0) = \frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0) \\ 4. \,\,\, \frac{\mathrm{d}^3}{\mathrm{d}w^3}h(w^0) = \frac{\mathrm{d}^3}{\mathrm{d}w^3}g(w^0) \\ \end{array}

Noting that no degree 2 polynomial could satisfy this last condition, since its third derivative is always equal to zero, we could seek out a degree 3 polynomial. Using the same analysis as above - setting up the corresponding system of equations based on a generic degree 3 polynomial - leads to the conclusion that the following does indeed satisfy all of the criteria above

\begin{equation} h(w) = g(w^0) + \frac{\mathrm{d}}{\mathrm{d}w}g(w^0)(w - w^0) + \frac{1}{2}\frac{\mathrm{d}^2}{\mathrm{d}w^2}g(w^0)(w - w^0)^2 + \frac{1}{3\times2}\frac{\mathrm{d}^3}{\mathrm{d}w^3}g(w^0)(w - w^0)^3 \end{equation}

This is an even better approximation of $g$ near the point $w^0$, since it contains more of the function's derivative information there.
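
As a sketch of how this looks in code, we can construct this degree 3 approximation for our running example $g(w) = \text{sin}(w)$ at $w^0 = 1$. Here dgdw3 is our own name for the third derivative function, produced by one more application of grad to the second derivative from the previous cell.

In [ ]:
# compute the third derivative of our input function
dgdw3 = grad(dgdw2)

# create the degree 3 (tangent cubic) approximation at w_0
cubic = g(w_0) + dgdw(w_0)*(w_ - w_0) + 0.5*dgdw2(w_0)*(w_ - w_0)**2 + (1/6)*dgdw3(w_0)*(w_ - w_0)**3

# plot the function and its degree 3 approximation
fig = plt.figure(figsize = (7,4))
plt.plot(w,gvals,c = 'k',linewidth=2,zorder = 1)
plt.plot(w_,cubic,c = [1,0.5,0],linewidth=2,zorder = 2)
plt.scatter(w_0,g(w_0),c = 'r',s=50,zorder = 3,edgecolor='k',linewidth=1)
plt.legend(['$g(w)$','tangent cubic'],loc='center left', bbox_to_anchor=(1, 0.5),fontsize = 13)
plt.show()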

And of course we could keep going. Setting up the corresponding set of $N+1$ criteria - the first demanding that $h(w^0) = g(w^0)$ and the remaining $N$ demanding that the first $N$ derivatives of $h$ match those of $g$ at $w^0$ - leads to the construction of the degree $N$ polynomial

\begin{equation} h(w) = g(w^0) + \sum_{n=1}^{N} \frac{1}{n!}\frac{\mathrm{d}^n}{\mathrm{d}w^n}g(w^0)(w - w^0)^n \end{equation}

Notice how setting $N=1$ recovers the tangent line, $N=2$ the tangent quadratic, and so on.

This general degree $N$ polynomial is called the Taylor Series of $g$ at the point $w^0$. It is the degree $N$ polynomial that matches $g$ as well as its first $N$ derivatives at the point $w^0$, and therefore approximates $g$ near this point better and better as we increase $N$.

The degree $N$ polynomial $h(w) = g(w^0) + \sum_{n=1}^{N} \frac{1}{n!}\frac{\mathrm{d}^n}{\mathrm{d}w^n}g(w^0)(w - w^0)^n$ is called the Taylor Series of $g$ at the point $w^0$.
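
This construction is easy to automate. Below is a general-purpose sketch - where taylor_approx is a helper function of our own, not part of autograd - that builds the degree $N$ Taylor Series of a function at a point by repeatedly applying grad.

In [ ]:
from math import factorial

def taylor_approx(g, w_0, N):
    # collect the function and its first N derivative functions
    derivs = [g]
    for n in range(N):
        derivs.append(grad(derivs[-1]))

    # pre-compute the Taylor coefficients at w_0
    coeffs = [derivs[n](w_0)/factorial(n) for n in range(N + 1)]

    # return the degree N polynomial as a function of w
    return lambda w: sum(c*(w - w_0)**n for n,c in enumerate(coeffs))

# e.g., a degree 4 approximation of sine at w_0 = 1, evaluated near that point
h = taylor_approx(np.sin, 1.0, 4)
print(h(1.2), np.sin(1.2))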

We illustrate the first four Taylor Series polynomials for a user-defined input function below, animated over a range of input values. You can use the slider to shift the point at which each approximation is made back and forth across the input range.

In [2]:
# what function should we play with?  Defined in the next line.
g = lambda w: np.sin(2*w)

# create an instance of the visualizer with this function 
taylor_viz = calclib.taylor_series_simultaneous_approximations.visualizer(g = g)

# run the visualizer for our chosen input function
taylor_viz.draw_it(num_frames = 200)
Out[2]: [interactive animation output: the first four Taylor Series approximations of the input function, drawn at a slider-controlled point]

Examining this figure we can clearly see that the approximation becomes better and better as we increase $N$. This makes sense, as each time we increase the degree the polynomial matches more of the underlying function's derivative information at the point. However we can never expect it to match the entire function everywhere: we build each polynomial to match $g$ at only a single point, so regardless of degree we can only expect it to match the underlying function near the point $w^0$.
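
We can quantify this local behavior with a small test of our own, using the taylor_approx helper sketched earlier: for each degree $N$ we measure the approximation error at a point near $w^0$ and at a point far from it. The exact numbers depend on the function chosen, but the error near $w^0$ shrinks rapidly with $N$ while the error far away stays comparatively large.

In [ ]:
# compare approximation error near to and far from the point w_0 = 1
g = lambda w: np.sin(2*w)
w_0 = 1.0
for N in [1,2,3,4]:
    h = taylor_approx(g, w_0, N)
    err_near = abs(h(w_0 + 0.1) - g(w_0 + 0.1))
    err_far = abs(h(w_0 + 2.0) - g(w_0 + 2.0))
    print('N = %d: error near w_0 = %.6f, error far from w_0 = %.6f' % (N, err_near, err_far))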

1.4 The Taylor Series as a local function approximator

This idea of using a simple function to better understand another (potentially more complex) function is the germ of a centuries-old area of mathematics called function approximation. In the parlance of function approximation, what we have done here with the Taylor Series is approximate a function locally at a point. Throughout our study of machine learning / deep learning we will see the notion of function approximation pop up in a number of contexts.

The content of this notebook is supplementary material for the textbook Machine Learning Refined (Cambridge University Press, 2016). Visit http://mlrefined.com for free chapter downloads and tutorials, and our Amazon site for details regarding a hard copy of the text.