In this brief post we motivate the study of mathematical optimization, the collection of methods built on basic calculus by which we determine proper parameters for machine learning / deep learning models. Afterwards we describe two naive methods of mathematical optimization: random evaluation and random search. Both approaches are extremely simple, and both have fundamental flaws that make them unfit for most modern machine learning / deep learning applications. We discuss them primarily because they allow us to introduce a number of key concepts - concepts shared by almost all effective mathematical optimization algorithms - in a relatively simple setting, divorced from the complications that come with more sophisticated algorithms.
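As a taste of the random local search idea summarized above, here is a minimal sketch in Python. The function names, the candidate-sampling scheme, and the quadratic example cost are illustrative assumptions rather than the post's own code: at each step we sample a handful of random unit directions, evaluate the cost at each candidate point, and move only if the best candidate lowers the cost.

```python
# Minimal sketch of random local search (illustrative, not the post's exact code).
import numpy as np

def random_search(g, w, alpha=1.0, max_its=50, num_samples=100):
    """Try `num_samples` random directions per step; move to the best
    candidate only if it lowers the cost, otherwise stay put."""
    history = [w.copy()]
    for _ in range(max_its):
        # draw candidate directions uniformly on the unit sphere
        directions = np.random.randn(num_samples, w.size)
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        candidates = w + alpha * directions
        evals = np.array([g(c) for c in candidates])
        best = np.argmin(evals)
        if evals[best] < g(w):          # accept only if the cost decreases
            w = candidates[best]
        history.append(w.copy())
    return np.array(history)

# example: a simple quadratic cost in two dimensions
g = lambda w: np.dot(w, w)
path = random_search(g, w=np.array([3.0, 4.0]))
print(path[-1])   # ends near the minimizer at the origin
```

Note how the number of samples needed to find a descent direction by chance grows with the input dimension, which is precisely the scaling problem discussed next.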
In this post we describe our first major mathematical optimization algorithm: gradient descent. Many of the core ideas we encountered with random local search - steplengths and steplength rules, convergence issues, and so on - carry over almost entirely to the gradient descent scheme. Importantly, we will see that unlike random search, gradient descent scales very nicely with the input dimension of a function, and so can be (and is) used on modern machine learning / deep learning problems.
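The basic loop itself is short; below is a hedged sketch assuming the gradient is supplied directly as a function, with a fixed steplength alpha. The names and the example cost are illustrative choices, not the post's own implementation.

```python
# Minimal sketch of gradient descent with a fixed steplength.
import numpy as np

def gradient_descent(grad_g, w, alpha=0.1, max_its=100):
    """Repeat the update w <- w - alpha * grad_g(w) for max_its steps."""
    history = [w.copy()]
    for _ in range(max_its):
        w = w - alpha * grad_g(w)
        history.append(w.copy())
    return np.array(history)

# example: g(w) = w^T w has gradient 2w and is minimized at the origin
grad_g = lambda w: 2 * w
path = gradient_descent(grad_g, w=np.array([3.0, 4.0]))
print(path[-1])
```

Unlike random search, the cost of each step here is one gradient evaluation regardless of the input dimension, which is the scaling property noted above.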
In this post we discuss a mathematical framework that provides choices of the gradient descent steplength that are guaranteed to produce convergence. In particular we discuss backtracking line search, as well as the conservatively optimal fixed steplength defined by the Lipschitz constant of a function's gradient.
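To make the backtracking idea concrete, here is a minimal sketch under common assumptions: start each step from a large trial steplength and shrink it by a fixed factor until a simple sufficient-decrease condition holds. The condition, the shrinking factor t, and the example cost are standard illustrative choices, not necessarily the exact ones used in the post.

```python
# Hedged sketch of one gradient descent step with backtracking line search.
import numpy as np

def backtracking_step(g, grad_g, w, alpha0=1.0, t=0.5):
    """Shrink the trial steplength by a factor t until the sufficient-decrease
    condition g(w - alpha*grad) <= g(w) - (alpha/2)*||grad||^2 holds."""
    grad = grad_g(w)
    grad_norm_sq = np.dot(grad, grad)
    alpha = alpha0
    while g(w - alpha * grad) > g(w) - 0.5 * alpha * grad_norm_sq:
        alpha *= t
    return w - alpha * grad

# example usage on a simple quadratic
g = lambda w: np.dot(w, w)
grad_g = lambda w: 2 * w
w = np.array([3.0, 4.0])
for _ in range(20):
    w = backtracking_step(g, grad_g, w)
print(w)
```

The fixed-steplength alternative mentioned above plays the same role without a per-step search: if the gradient's Lipschitz constant L is known (or bounded), a constant steplength on the order of 1/L guarantees descent at every step.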
In this post we describe several variations of normalized gradient descent, known collectively as steepest descent algorithms. These variations arise out of a small but fundamental change to the gradient descent setup, a change that has a significant impact on the form of the descent direction but, importantly, not on the convergence of the methods. Because normalized gradient approaches are able to push through the flat regions of a non-convex function, many popular optimizers for deep neural networks are built on the basic methods presented here, e.g., the Resilient Backpropagation algorithm (Rprop), as well as variations like Root Mean Square Propagation (RMSprop).
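The two sketches below illustrate the kind of change meant here, under illustrative assumptions (the function names and the example gradient are not from the post): one step normalizes the full gradient so the step size is set by alpha alone, and one normalizes component-wise, moving each coordinate by the sign of its partial derivative, which is the flavor of step that Rprop-style methods build on.

```python
# Hedged sketches of two normalized descent steps.
import numpy as np

def normalized_step(grad_g, w, alpha=0.1, eps=1e-8):
    """Step of length (approximately) alpha along the negative gradient direction."""
    grad = grad_g(w)
    return w - alpha * grad / (np.linalg.norm(grad) + eps)

def componentwise_normalized_step(grad_g, w, alpha=0.1):
    """Move each coordinate by alpha times the sign of its partial derivative."""
    return w - alpha * np.sign(grad_g(w))

# example: the step size is controlled by alpha alone, not by the gradient's
# magnitude, which is what lets these methods crawl across flat regions
grad_g = lambda w: 2 * w
w = np.array([3.0, 4.0])
for _ in range(100):
    w = normalized_step(grad_g, w)
print(w)
```

Because the gradient's magnitude is stripped away, these steps do not shrink to zero in nearly flat regions; the trade-off is that the steplength rule alone must eventually drive the steps toward zero near a minimum.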
Since gradient descent is built on the first order Taylor series approximation, one might naturally ask: can we build a descent scheme based on the second order Taylor series? The answer is a resounding yes, and the resulting algorithm - called Newton's method - is the subject of this post.
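As a preview, here is a hedged sketch of the Newton step, assuming the gradient and Hessian are supplied as functions; the small regularization term, the example cost, and all names are illustrative assumptions rather than the post's own code. Each step solves a linear system built from the Hessian instead of moving along the negative gradient.

```python
# Minimal sketch of Newton's method for minimization (illustrative).
import numpy as np

def newtons_method(grad_g, hess_g, w, max_its=20, eps=1e-7):
    """Repeat the update w <- w - (hess_g(w) + eps*I)^{-1} grad_g(w).
    The small eps*I term guards against a singular Hessian."""
    for _ in range(max_its):
        H = hess_g(w) + eps * np.eye(w.size)
        w = w - np.linalg.solve(H, grad_g(w))
    return w

# example: g(w) = w1^4 + w2^2, a non-quadratic cost minimized at the origin
grad_g = lambda w: np.array([4 * w[0]**3, 2 * w[1]])
hess_g = lambda w: np.diag([12 * w[0]**2, 2.0])
print(newtons_method(grad_g, hess_g, w=np.array([3.0, 4.0])))
```

The payoff of using second order information is rapid convergence near a minimum; the price, taken up in the post, is the cost of forming and solving with the Hessian as the input dimension grows.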