In this post we discuss the foundational calculus-based concepts on which many practical optimization algorithms are built: the zero, first, and second order optimality conditions.
The mathematical problem of finding the smallest point(s) of a function - referred to as a function's global minimum (one point) or global minima (many) - is centuries old and has applications throughout the sciences and engineering.
The three conditions we discuss here reveal what basic calculus can tell us about how a function behaves near its global minima.
We also discuss the heuristic algorithm coordinate descent which - as a natural extension of the first order condition - provides a simple algorithmic approach to finding global minima by solving large systems of simple first order equations.
In many areas of science and engineering one is interested in finding the smallest points - or the global minima - of a particular function.
For a function $g(\mathbf{w})$ taking in a general $N$ dimensional input $\mathbf{w}$ this problem is formally phrased as
\begin{equation} \underset{\mathbf{w}}{\mbox{minimize}}\,\,\,\,g\left(\mathbf{w}\right) \end{equation}This says formally 'look over every possible input $\mathbf{w}$ and find the one that gives the smallest value of $g(\mathbf{w})$'.
Because, as human beings, we can only visualize functions with very low dimensional input, we need to create tools to help us find the global minima of functions in general.
Thankfully calculus - and in particular the notion of derivative - already provides us with the fundamental tools we need to build robust schemes for finding function minima, regardless of input dimension.
Let us examine some simple examples to see how to formally characterize the global minima of a function.
Below we plot the simple quadratic
\begin{equation} g(w) = w^2 \end{equation}over a short region of its input space.
Examining the left panel below, what can we say defines the smallest value(s) of the function here?
# import numpy, used throughout the cells below
import numpy as np
# specify function
func = lambda w: w**2
# use custom plotter to display function
calib.derivative_visualizer.show_stationary_1func(func=func)
The smallest value - the global minimum - seemingly occurs close to $w = 0$ (we mark this point $(0,g(0))$ in the right panel with a green dot). Formally to say this point $w^0 = 0$ gives the smallest point on the function, we say
$$ g(w^0) \leq g(w) \,\,\,\text{for all $w$} $$Let us look at the sinusoid function
$$ g(w) = \text{sin}(2w) $$plotted by the next Python cell.
# specify function
func = lambda w: np.sin(2*w)
# use custom plotter to display function
calib.derivative_visualizer.show_stationary_1func(func=func)
Here we can see that there are many global minima - marked green in the right panel - one at every point of the form $w = \left(4k+3\right)\frac{\pi}{4}$ for integer $k$. So, speaking more generally, we would say that $w^0$ is a global minimum if
$$ g(w^0) \leq g(w) \,\,\,\text{for all $w$} $$We likewise mark the largest points on the function - the global maxima - in the right panel, which occur at the points $w = \left(4k+1\right)\frac{\pi}{4}$ for integer $k$. The maxima can be formally defined as those points satisfying the inequality above, only with the $\geq$ sign instead of $\leq$.
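As a quick numerical check on these locations (a small sketch of our own, not part of the original presentation), the next cell evaluates $g(w) = \text{sin}(2w)$ at a few points of each form.
import numpy as np
# points of the form (4k+3)*pi/4 should be global minima (g = -1),
# and points of the form (4k+1)*pi/4 should be global maxima (g = +1)
g = lambda w: np.sin(2*w)
for k in range(-2,3):
    print(k, g((4*k + 3)*np.pi/4), g((4*k + 1)*np.pi/4))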
In the next Python cell we plot the function
$$ g(w) = \text{sin}(3w) + 0.1w^2 $$over a short region of its input space. Examining the left panel below, what can we say defines the smallest value(s) of the function here?
# specify function
func = lambda w: np.sin(3*w) + 0.1*w**2
# use custom plotter to display function
calib.derivative_visualizer.show_stationary_1func(func=func)
Here we have a global minimum around $w = -0.5$ and a global maximum around $w = 2.7$ over the range plotted. We also have minima and maxima that are only locally optimal - points taking on the smallest or largest value of the function only on a small neighborhood around them - called local minima and local maxima respectively.
From these examples we have seen how to formally define global minima/maxima as well as the local minima/maxima of a function. These formal definitions directly generalize to a function taking in $N$ inputs - and together constitute the zero order condition for optimality.
The zero order condition for optimality: A point $\mathbf{w}^0$ is
- a global minimum of $g(\mathbf{w})$ if and only if $g(\mathbf{w}^0) \leq g(\mathbf{w}) \,\,\,\text{for all $\mathbf{w}$}$
- a global maximum of $g(\mathbf{w})$ if and only if $g(\mathbf{w}^0) \geq g(\mathbf{w}) \,\,\,\text{for all $\mathbf{w}$}$
- a local minimum of $g(\mathbf{w})$ if and only if $g(\mathbf{w}^0) \leq g(\mathbf{w}) \,\,\,\text{for all $\mathbf{w}$ near $\mathbf{w}^0$}$
- a local maximum of $g(\mathbf{w})$ if and only if $g(\mathbf{w}^0) \geq g(\mathbf{w}) \,\,\,\text{for all $\mathbf{w}$ near $\mathbf{w}^0$}$
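Although these definitions do not by themselves give us an algorithm, with a single input we can apply the zero order condition directly by brute force. The next cell is a minimal sketch of this idea (the grid range and spacing are illustrative choices of ours): we evaluate $g(w) = \text{sin}(3w) + 0.1w^2$ over a dense grid and keep the smallest and largest values found.
import numpy as np
# evaluate g over a dense grid of candidate inputs and keep the best values found
g = lambda w: np.sin(3*w) + 0.1*w**2
w_grid = np.linspace(-3,3,10000)
g_vals = g(w_grid)
print('approximate global minimum near w =', w_grid[np.argmin(g_vals)])
print('approximate global maximum near w =', w_grid[np.argmax(g_vals)])
Of course this sort of exhaustive evaluation becomes infeasible as the input dimension $N$ grows, which is precisely why the calculus-based conditions discussed next are so valuable.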
In the next Python cell we plot two quadratic functions - one taking in a single input, shown in two dimensions, and one taking in two inputs, shown in three dimensions - and mark the minimum point of each - called a global minimum - with a green point.
Also in green in each panel we draw the first order Taylor Series approximation - a tangent line/hyperplane - generated by the first derivative(s) at the function's minimum value.
In terms of the behavior of the first order derivatives here we see - in both instances - that the tangent line/hyperplane is perfectly flat, indicating that the first derivative(s) is exactly zero at the minimum.
# specify a single-input quadratic and its two-input analog
func1 = lambda w: w**2 + 3
func2 = lambda w: w[0]**2 + w[1]**2 + 3
# use custom plotter to show both functions
calib.derivative_visualizer.compare_2d3d(func1 = func1,func2 = func2)
This finding is true in general - regardless of the dimension of its input, a differentiable function's first order derivatives are always zero at its global minima.
This is because minimum values of a function are naturally located at 'valley floors' where a tangent line or hyperplane tangent to the function is perfectly flat, and thus has zero-valued slope(s).
Because the derivative/gradient at a point gives precisely this slope information, first order derivatives provide a convenient way of finding the minimum values of a function $g$.
When $N=1$ any point $v$ where
\begin{equation} \frac{\mathrm{d}}{\mathrm{d}w}g\left(v\right)=0 \end{equation}is a potential minimum.
Analogously with general $N$ dimensional input, any $N$ dimensional point $\mathbf{v}$ where every partial derivative of $g$ is zero, that is
\begin{equation} \begin{array}{c} \frac{\partial}{\partial w_{1}}g(\mathbf{v})=0\\ \frac{\partial}{\partial w_{2}}g(\mathbf{v})=0\\ \,\,\,\,\,\,\,\,\,\,\vdots \\ \frac{\partial}{\partial w_{N}}g(\mathbf{v})=0 \end{array} \end{equation}is a potential minimum.
Notice how this is a system of $N$ equations, which can be written more compactly using the gradient - our convenient vectorized listing of these partial derivatives - as
\begin{equation} \nabla g\left(\mathbf{v}\right)=\mathbf{0}_{N\times1} \end{equation}This is an extremely useful characterization of minimum points, and is central to the foundational algorithms of mathematical optimization. However it is not perfect as a cursory check of other simple functions quickly reveals that other types of points have zero derivative(s) as well.
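Before looking at those other types of points, a quick numerical illustration: the cell below (our own sketch, with a finite difference step size chosen for illustration) approximates the gradient of the two-input quadratic $g(\mathbf{w}) = w_1^2 + w_2^2 + 3$ plotted above, confirming that it vanishes at the minimum $\mathbf{v} = (0,0)$ and not elsewhere.
import numpy as np
# the two-input quadratic plotted above
g = lambda w: w[0]**2 + w[1]**2 + 3
# approximate each partial derivative of g at w via central differences
def numerical_gradient(g, w, h=1e-6):
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for n in range(len(w)):
        e = np.zeros_like(w)
        e[n] = h
        grad[n] = (g(w + e) - g(w - e)) / (2*h)
    return grad
print(numerical_gradient(g, [0.0, 0.0]))    # approximately the zero vector at the minimum
print(numerical_gradient(g, [1.0, -2.0]))   # approximately [2, -4] away from the minimum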
Below we plot the three functions
\begin{equation} \begin{array}{c} g(w) = \text{sin}\left(2w\right) \\ g(w) = w^3 \\ g(w) = \text{sin}\left(3w\right) + 0.1w^2 \end{array} \end{equation}in the top row of the figure.
For each we mark all the zero derivative points in green and draw the first order Taylor Series approximations/tangent lines there in green as well.
Below each function we plot its first derivative, highlighting the points where it takes on the value zero as well (the horizontal axis in each case is drawn as a horizontal dashed black line).
# specify three single-input functions
func1 = lambda w: np.sin(2*w)
func2 = lambda w: w**3
func3 = lambda w: np.sin(3*w) + 0.1*w**2
# use custom plotter to show all three functions along with their derivatives
calib.derivative_visualizer.show_stationary(func1 = func1,func2 = func2,func3 = func3)
Examining these plots we can see that it is not only minima that have zero derivatives, but a variety of other points as well. These consist of local and global minima, local and global maxima, and saddle points like the one at $w=0$ of $g(w) = w^3$.
These three examples illustrate the full swath of points having zero-valued derivative(s) - and the same holds for multi-input functions, regardless of input dimension. Taken together all such points are collectively referred to as stationary points.
Together local/global minima and maxima, as well as saddle points are referred to as stationary points. These are points at which a function's derivative(s) take on zero value, i.e., $\frac{\partial}{\partial w}g(w) = 0$.
The first order condition for optimality: Stationary points of a function $g$ (including minima, maxima, and saddle points) satisfy the first order condition $\nabla g\left(\mathbf{v}\right)=\mathbf{0}_{N\times1}$. This allows us to translate the problem of finding global minima to the problem of solving a system of (typically nonlinear) equations, for which many algorithmic schemes have been designed.
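To give a small taste of handing the first order equation to such a scheme, the next cell (our own sketch using scipy's brentq root finder, which is not part of the original presentation) solves $\frac{\mathrm{d}}{\mathrm{d}w}g(w) = 3\text{cos}(3w) + 0.2w = 0$ for $g(w) = \text{sin}(3w) + 0.1w^2$ on a bracket chosen by inspecting the earlier plot.
import numpy as np
from scipy.optimize import brentq
# first derivative of g(w) = sin(3w) + 0.1*w**2
dg = lambda w: 3*np.cos(3*w) + 0.2*w
# solve dg(w) = 0 on a bracket containing the global minimum seen in the plot
w_star = brentq(dg, -0.7, -0.4)
print('stationary point near w =', w_star)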
Note: if a function is convex - as with the first example in this Section - then the first order condition completely defines its global minima, as a convex function has no maxima or saddle points.
The first order condition completely defines the global minima of convex functions, as they have no maxima or saddle points.
As mentioned, the primary practical benefit of using the first order condition is that it allows us to transform the task of seeking out global minima to that of solving a system of equations, for which a wide range of algorithmic methods have been designed.
The emphasis here on algorithmic schemes is key: generally speaking, the vast majority of (nonlinear) systems of equations cannot reasonably be solved by hand.
However there are a handful of relatively simple examples one can compute by hand, or at least one can show algebraically that they reduce to a linear system of equations which can be easily solved numerically.
By far the most important of these are the multi-input quadratic function and the highly related Rayleigh quotient.
We will see the former later on when discussing linear regression, and the latter in a number of instances where we use it as a tool for studying the properties of certain machine learning cost functions.
In this Example we use the first order condition for optimality to compute stationary points of the functions
\begin{equation} \begin{array}{c} g\left(w\right)=w^{3} \\ g\left(w\right)=e^{w} \\ g\left(w\right)=\textrm{sin}\left(w\right)\\ g\left(w\right)=a + bw + cw^{2}, \,\,\,c>0 \end{array} \end{equation}and distinguish the kind of stationary point in each instance. Setting each first derivative to zero: $3w^2 = 0$ gives the single stationary point $w = 0$, a saddle point of $w^3$; $e^w = 0$ has no solutions, so $e^w$ has no stationary points at all; $\text{cos}(w) = 0$ gives stationary points at $w = \frac{(2k+1)\pi}{2}$ for integer $k$, which alternate between maxima and minima of the sinusoid; and $b + 2cw = 0$ gives the single stationary point $w = -\frac{b}{2c}$, which is the global minimum of the quadratic since $c > 0$.
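These calculations can also be checked by handing each first order equation to a symbolic solver; the next cell is a minimal sketch using sympy (our choice of tool, not part of the original presentation).
import sympy as sp
w, a, b, c = sp.symbols('w a b c', real=True)
# the four example functions above
functions = [w**3, sp.exp(w), sp.sin(w), a + b*w + c*w**2]
for g in functions:
    # solve the first order equation dg/dw = 0 for w; e^w has no stationary
    # points, so its list of solutions comes back empty
    print(g, '-->', sp.solve(sp.diff(g, w), w))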
Take the simple degree four polynomial
\begin{equation} g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right) \end{equation}which is plotted over a short range of inputs containing its global minimum below.
# specify range of input for our function
w = np.linspace(-5,5,50)
g = lambda w: 1/50*(w**4 + w**2 + 10*w)
# make a table of values for our function
func_table = np.stack((w,g(w)), axis=1)
# use custom plotter to display function
baslib.basics_plotter.single_plot(table = func_table,xlabel = '$w$',ylabel = '$g(w)$',rotate_ylabel = 0)
The first order system here can be easily computed as
$$ \frac{\mathrm{d}}{\mathrm{d} w}g(w) = \frac{1}{50}\left(4w^3 + 2w + 10\right) = 0 $$which simplifies to
$$ 2w^3 + w + 5 = 0 $$This cubic has three solutions, only one of which is real; the real solution - which provides the global minimum of $g(w)$ - is
\begin{equation} w = \frac{\sqrt[3]{\sqrt{2031} - 45}}{6^{\frac{2}{3}}} - \frac{1}{\sqrt[3]{6\left(\sqrt{2031}-45\right)}} \approx -1.2347 \end{equation}which can be computed - after much toil - using centuries old tricks developed for just such problems.
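If we only care about this root numerically we can sidestep the algebra entirely; the next cell (a simple check of ours, not part of the original presentation) applies numpy's polynomial root finder to $2w^3 + w + 5$.
import numpy as np
# roots of 2w^3 + 0w^2 + w + 5 (coefficients listed from highest degree to lowest)
roots = np.roots([2.0, 0.0, 1.0, 5.0])
# keep the single real root - the stationary point and global minimum of g
print(roots[np.abs(roots.imag) < 1e-8].real)   # approximately -1.2347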
Take the general multi-input quadratic function
\begin{equation} g\left(\mathbf{w}\right)=a + \mathbf{b}^{T}\mathbf{w} + \mathbf{w}^{T}\mathbf{C}\mathbf{w} \end{equation}where $\mathbf{C}$ is an $N\times N$ symmetric matrix, $\mathbf{b}$ is an $N\times 1$ vector, and $a$ is a scalar.
Computing the first derivative (gradient) we have
\begin{equation} \nabla g\left(\mathbf{w}\right)=2\mathbf{C}\mathbf{w}+\mathbf{b} \end{equation}Setting this equal to zero gives a symmetric and linear system of equations of the form
\begin{equation} \mathbf{C}\mathbf{w}=-\frac{1}{2}\mathbf{b} \end{equation}whose solutions are stationary points of the original function.
Note here we have not explicitly solved for these stationary points, but have merely shown that the first order system of equations in this particular case is in fact one of the easiest to solve numerically.
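For instance, with $N = 2$ and some purely illustrative values of $\mathbf{b}$ and a symmetric positive definite $\mathbf{C}$ (our own choices), the stationary point - here the global minimum - can be computed numerically in a single line.
import numpy as np
# an illustrative two-input quadratic: b and C chosen arbitrarily, C symmetric positive definite
b = np.array([-1.0, 2.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
# solve the first order system C w = -(1/2) b
w_star = np.linalg.solve(C, -b/2)
print(w_star)
# check: the gradient 2 C w + b is (numerically) zero at the stationary point
print(2*C @ w_star + b)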
See the associated post for more examples.
We have just seen that the first order condition for optimality is a powerful calculus-based way of characterizing the minima of a function.
However we can rarely use it in practice to actually solve the first order systems of equations it entails 'by hand', and so recover a function's minima directly.
Why? First and foremost, even with a single variable $g(w)$ this system - which reduces to a single equation - can be difficult or even impossible to solve by hand (see e.g., the degree four polynomial example above).
Greatly compounding this issue is the fact that solving a system of $N$ simultaneous equations 'by hand' - even when each individual equation is extremely simple (e.g., a linear combination) - is virtually impossible.
Coordinate descent is a heuristic mathematical optimization algorithm that is specifically designed to deal with the latter problem (the simultaneous system of $N$ equations part).
It is perhaps the first thing one might try in order to salvage the first order condition as a way of directly finding local minima of large first order systems consisting of relatively simple equations.
Essentially we do something very lazy: instead of trying to solve the equations simultaneously in every input at once, we solve them sequentially, one equation and one input at a time.
That is, we cycle through the first order equations solving the $n^{th}$ equation
\begin{equation} \frac{\partial}{\partial w_n}g(\mathbf{w}) = 0 \end{equation}for the $n^{th}$ variable $w_n$ alone.
Cycling through the first order equations a number of times in this manner produces - for many simple first order systems - a solution that does indeed match the one derived by solving the system simultaneously.
1: Input: initial point $\mathbf{w}^0$, maximum number of sweeps $K$
2: for $\,k = 1...K$
3: $\,\,\,\,$ for $\,n=1...N$
4: $\,\,\,\,\,\,\,\,$ solve $\frac{\partial}{\partial w_n}g(\mathbf{w}) = 0$ for $w_n$ alone, with every other weight $w_j$, $j\neq n$, fixed at its most recently updated value, giving the update $w_n^{k}$
5: $\,\,\,\,\,\,\,\,$ update the $n^{th}$ weight $w_n \longleftarrow w_n^{k}$
6: $\,\,\,\,$ end for
7: end for
8: output: $\mathbf{w}^{K}$
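To make the algorithm concrete, the next cell gives a minimal coordinate descent sketch for the multi-input quadratic from the earlier example, where each first order equation $2\left(\mathbf{C}\mathbf{w}\right)_n + b_n = 0$ can be solved for $w_n$ in closed form (the function name and the particular values of $\mathbf{C}$ and $\mathbf{b}$ are our own illustrative choices).
import numpy as np
# coordinate descent for g(w) = a + b^T w + w^T C w with C symmetric positive definite
def coordinate_descent_quadratic(C, b, w0, K):
    w = np.array(w0, dtype=float)
    N = len(w)
    for k in range(K):
        for n in range(N):
            # solve 2*(C[n,:] @ w) + b[n] = 0 for w[n], holding the other
            # (most recently updated) coordinates fixed
            rest = C[n, :] @ w - C[n, n]*w[n]     # sum over j != n of C[n,j]*w[j]
            w[n] = -(b[n]/2 + rest) / C[n, n]
    return w
# the same illustrative C and b used above
b = np.array([-1.0, 2.0])
C = np.array([[2.0, 0.5],
              [0.5, 1.0]])
w_cd = coordinate_descent_quadratic(C, b, np.zeros(2), K=50)
print(w_cd)                          # matches (approximately) ...
print(np.linalg.solve(C, -b/2))      # ... the simultaneous solution of C w = -b/2
After enough sweeps the coordinate-wise updates settle on the same point as the simultaneous solve, illustrating the claim made above.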
As simple as this heuristic is, coordinate descent and simple extensions of it are extremely popular optimization methods for a number of machine learning problems.
Examples include linear regression, K-Means clustering, nonnegative matrix factorization problems, recommender systems and general matrix factorization problems, and boosting.
Read about a specific example in the corresponding post.