Part 1: What are derivatives?

Mathematical optimization schemes are the workhorse of machine learning - and at their core lies the derivative. Because of this, an intuitive and rigorous understanding of the derivative - as well as other vital elements of calculus - serves one well in understanding mathematical optimization, and hence machine learning and deep learning more generally.

In this post we kick off our discussion of derivatives by exploring this idea in pictures before jumping into the math in future posts.





Part 2: Derivatives at a point and the Numerical Differentiator

In the previous post we described in words and pictures what the derivative at a point is - in this post we get more formal and describe these ideas mathematically.
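
As a rough preview of the kind of calculation involved, here is a minimal finite-difference sketch in Python (the function and step size are illustrative choices on our part, not taken from the post itself):

```python
import numpy as np

def numerical_derivative(g, w, h=1e-5):
    # central finite-difference approximation to the derivative of g at the point w
    return (g(w + h) - g(w - h)) / (2 * h)

# example: the derivative of sin at w = 0 should be close to cos(0) = 1
g = np.sin
print(numerical_derivative(g, 0.0))   # ~1.0
```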




Part 3: Derivative equations and Automatic Differentiation

In this post we explore how we can derive formulae for the derivatives of generic functions constructed from elementary functions and operations - a fact with far-reaching and very positive consequences for the effective computation of derivatives.
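
To make the idea concrete, here is a tiny illustration (using the symbolic library sympy purely for illustration - an assumption on our part, not necessarily the tool used in the post): the derivative of a function built from elementary functions/operations is itself built from elementary functions/operations.

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x**2) + sp.exp(3*x)      # built from elementary functions/operations

# the derivative is again a combination of elementary functions/operations
print(sp.diff(f, x))                # 2*x*cos(x**2) + 3*exp(3*x)
```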




Part 4: Automatic Differentiation

In the previous post we detailed how we can derive derivative formulae for any function constructed from elementary functions and operations, and how derivatives of such functions are themselves constructed from elementary functions/operations. These facts have far-reaching consequences for the practical computation of derivatives - allowing us to construct a very effective derivative calculator called Automatic Differentiation, which we detail in this post.
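
To give a rough flavor of the mechanism, below is a minimal forward-mode sketch built on "dual numbers" - a simplified, hypothetical illustration rather than the implementation developed in the post.

```python
class Dual:
    """A value paired with its derivative, propagated by elementary operation rules."""
    def __init__(self, val, der):
        self.val, self.der = val, der

    def __add__(self, other):
        # sum rule
        return Dual(self.val + other.val, self.der + other.der)

    def __mul__(self, other):
        # product rule
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

# derivative of f(w) = w*w + w at w = 3 is 2*3 + 1 = 7
w = Dual(3.0, 1.0)
f = w * w + w
print(f.val, f.der)   # 12.0 7.0
```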




Part 5: Higher order derivatives

In previous posts we have seen how we can compute a formula for the derivative of a generic function constructed out of elementary functions and operations, and that this derivative is itself a generic function constructed out of elementary functions / operations. Because the derivative is a generic function it is natural to ask - what happens if we take its derivative? By the same logic and rules we should be able to compute it in a similar manner, and for the same reasons it too should be a generic function with a known equation. In turn we should then be able to compute the derivative of this formula, and so on ad infinitum.
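
This repeated differentiation is easy to illustrate symbolically (again using sympy purely for illustration; the function chosen here is arbitrary):

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x**2)

first  = sp.diff(f, x)        # 2*x*cos(x**2)
second = sp.diff(first, x)    # the derivative of the derivative
third  = sp.diff(second, x)   # ...and so on, ad infinitum
print(first, second, third, sep='\n')
```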




Part 6: Taylor series

In this post we take everything we have learned about higher order derivatives to define the Taylor Series of a function, a fundamental tool for mathematical optimization.
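
As a small preview of where this leads (a hypothetical sympy sketch, not code from the post itself): the degree-N Taylor polynomial of a function f about a point a is built from the first N derivatives of f evaluated at a.

```python
import sympy as sp

x = sp.symbols('x')
f = sp.exp(x)

def taylor_polynomial(f, a, N):
    # sum_{k=0}^{N} f^(k)(a) / k! * (x - a)^k
    return sum(sp.diff(f, x, k).subs(x, a) / sp.factorial(k) * (x - a)**k
               for k in range(N + 1))

p3 = taylor_polynomial(f, 0, 3)
print(sp.expand(p3))   # 1 + x + x**2/2 + x**3/6
```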




Part 7: The anatomy of a hyperplane

In this post we describe important characteristics of the hyperplane, including the concept of the direction of steepest ascent. These concepts are fundamental to the notion of multi-input derivatives (the gradient), to gradient descent, as well as to linear regression and classification schemes.
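
To give one of these characteristics a concrete form (a hypothetical numpy sketch with an arbitrarily chosen hyperplane): for a hyperplane h(w) = a + b^T w, the direction of steepest ascent points along the vector b, which a quick numerical check suggests.

```python
import numpy as np

a, b = 1.0, np.array([3.0, -2.0])          # hyperplane h(w) = a + b^T w
h = lambda w: a + b.dot(w)

# among unit directions d, h(w + eps*d) grows fastest when d points along b
w0, eps = np.zeros(2), 1e-2
steepest = b / np.linalg.norm(b)
other    = np.array([1.0, 0.0])
print(h(w0 + eps * steepest) - h(w0))      # largest possible increase for this step size
print(h(w0 + eps * other) - h(w0))         # a smaller increase
```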




Part 8: Derivatives of multi-input functions

In this post we describe how derivatives are defined in higher dimensions, when dealing with multi-input functions. We explore these ideas first with two inputs only for visualization purposes, generalizing afterwards.
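
As a small numerical preview (a hypothetical numpy sketch; the post develops the definitions properly): each entry of the gradient of a multi-input function is an ordinary single-input derivative taken along one coordinate axis.

```python
import numpy as np

def numerical_gradient(g, w, h=1e-5):
    # approximate each partial derivative with a central finite difference
    grad = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = h
        grad[i] = (g(w + e) - g(w - e)) / (2 * h)
    return grad

# example: g(w1, w2) = w1**2 + 3*w2, whose gradient at (1, 2) is (2, 3)
g = lambda w: w[0]**2 + 3*w[1]
print(numerical_gradient(g, np.array([1.0, 2.0])))   # ~[2., 3.]
```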




Part 9: Effective gradient computing

Coming soon!




Part 10: Quadratic functions

Quadratic functions naturally arise when studying second order derivatives, and by extension second order Taylor series approximations. There is nothing complicated about a quadratic of a single input, but as we go up in dimension quadratics can become significantly more complex, both in terms of the variety of shapes they can take and in terms of their general algebraic form. In this post we aim to explain these complexities by discussing general quadratic functions, various ways of thinking about their construction, and the factors which control their shape.
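
To fix notation (a hypothetical numpy sketch): an N-dimensional quadratic can be written as q(w) = a + b^T w + w^T C w, and the matrix C largely controls its shape.

```python
import numpy as np

def quadratic(w, a, b, C):
    # general N-dimensional quadratic: q(w) = a + b^T w + w^T C w
    return a + b.dot(w) + w.dot(C).dot(w)

a = 0.0
b = np.array([1.0, -1.0])
C = np.array([[2.0,  0.0],
              [0.0, -3.0]])       # mixed-sign eigenvalues -> a saddle-shaped quadratic
print(quadratic(np.array([1.0, 1.0]), a, b, C))   # -1.0
```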




Part 11: The Hessian and multi-input Taylor Series

In this post we discuss higher order derivatives of a multi-input function, as well as the corresponding Taylor Series approximations.
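
As a rough preview (a hypothetical sympy sketch with an arbitrary example function): the Hessian collects all second order partial derivatives of a multi-input function, just as the gradient collects all first order ones.

```python
import sympy as sp

w1, w2 = sp.symbols('w1 w2')
g = w1**2 * w2 + sp.sin(w2)

gradient = sp.Matrix([sp.diff(g, v) for v in (w1, w2)])   # all first order partials
hessian  = sp.hessian(g, (w1, w2))                        # all second order partials
print(gradient)
print(hessian)
```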




Part 12: Second order derivatives and curvature

In this post we discuss the fundamental role second order derivatives play in describing the curvature of functions. In particular we describe how second order derivatives describe the convexity or concavity of a function locally and globally.
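
A small single-input illustration of the sign convention (a hypothetical sympy sketch with an arbitrary function): where the second derivative is positive the function curves upward (is locally convex), and where it is negative it curves downward (is locally concave).

```python
import sympy as sp

x = sp.symbols('x')
f = x**4 - 3*x**2

second = sp.diff(f, x, 2)            # 12*x**2 - 6
print(second.subs(x, 0))             # -6 -> curves downward (locally concave) near x = 0
print(second.subs(x, 2))             # 42 -> curves upward (locally convex) near x = 2
```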




Part 13: Unconstrained optimality conditions

In this post we discuss the foundational calculus-based concepts on which many practical optimization algorithms are built: the zero, first, and second order optimality conditions. The three conditions we discuss here reveal what basic calculus can tell us about how a function behaves near its global minima. We also discuss coordinate descent, a heuristic algorithm which - as a natural extension of the first order condition - provides a practical algorithmic approach to finding global minima by solving large systems of simple first order equations.
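
To give a rough sense of the flavor of coordinate descent (a hypothetical numpy sketch on a simple quadratic, not the post's own implementation): we cycle through the coordinates, and for each one solve its single first order equation exactly while holding the others fixed.

```python
import numpy as np

# minimize q(w) = 0.5 * w^T C w - b^T w by cycling through coordinates,
# at each step solving the single first order equation dq/dw_i = 0 exactly
C = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric, positive definite
b = np.array([1.0, 1.0])

w = np.zeros(2)
for _ in range(50):
    for i in range(len(w)):
        # dq/dw_i = C[i].w - b[i] = 0, solved for w_i with the other coordinates fixed
        w[i] = (b[i] - C[i].dot(w) + C[i, i] * w[i]) / C[i, i]

print(w)                            # close to the true minimizer, np.linalg.solve(C, b)
```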