Gradient Descent Optimization

Last Updated on January 6, 2023 by Editorial Team

Author(s): Rohan Jagtap


Gradient Descent Optimization Algorithms

Image of a bumpy non-convex surface
Source: Image by Khan Academy

Most deep learning algorithms train by optimizing some kind of objective (or loss) function. We typically try to either maximize or minimize these functions. Gradient-based optimization is the most extensively used technique for training neural networks.

In this article, we will see the well-known Gradient Descent algorithm. We will discuss what it is, why it works, and finally, implement a gradient descent optimizer for TensorFlow.

This article is for anyone who has just heard about Gradient Descent or someone who thinks of Gradient Descent as a black box and has a shallow understanding of what happens but wants to delve deeper into why things are the way they are.

Note: This article is inspired by 'The Deep Learning textbook: Chapter 4: Numerical Computation' and the 'Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition' notes.

Contents

This is a rather long article, and not all of these sections may interest or concern every reader at once. Please feel free to use the following list to navigate and skip to the parts that are relevant to you:

  1. The Loss Function
  2. Optimization
    1. Random Search
    2. Random Local Search
    3. Gradient-based Optimization
  3. Gradient Descent
    1. The Minima
    2. Why can’t we just equate the derivative to 0 and solve for x?
    3. Parameter Update Formula
    a. Why is the negative gradient the direction of the steepest descent?
    b. Why do we say 'going in the opposite direction of the gradient'?
  4. Jacobians and Hessians
    1. Second Derivative Test and Second-Order Optimization
    2. Why not use second-order optimization algorithms for neural nets?
  5. Computing the Gradient
    1. Using finite differences
    2. Using Calculus
  6. Implementation
  7. Conclusion
  8. References

The Loss Function

A loss function, which also goes by the names objective function, error function, cost function, or criterion, can be any (differentiable) function that helps evaluate an algorithm’s output against the ground truth. It can be anything from the Mean Squared Error for continuous values to the Cross-entropy loss for probability distributions. While loss functions are very interesting, today, we are only concerned with how to minimize them.
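For example (a quick illustration of ours, not from the original text), the Mean Squared Error over N predictions ŷᵢ against targets yᵢ is:

MSE = (1/N) Σᵢ (yᵢ − ŷᵢ)²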

To understand exactly what we want, let’s visualize a loss function. In practice, the loss varies with many parameters, making it hard to visualize. For simplicity, we will consider just 2.

Let’s define a loss: y = L(x, z), where x, y, z are Cartesian coordinates along the X, Y, and Z axes, respectively.
Let’s take random values wx and wz, for which we get the point in space: (wx, yw, wz), where yw = L(wx, wz).
Now, we can either vary some m randomly as ywₘ = L(wx + m, wz), which will give us a line plot of the loss values yw along one direction, or we can vary some m and n as ywₘₙ = L(wx + m, wz + n), which will give us a three-dimensional plot. Here’s what it looks like for the Multiclass SVM loss:

Loss function landscape for the Multiclass SVM
Loss function landscape for the Multiclass SVM. Left: one-dimensional loss by only varying m. Right: two-dimensional loss slice by varying m and n. Blue = low loss, red = high loss. Source: Image by Stanford CS231n notes

So what we ultimately want is to find some wx* and wz* such that yw (i.e., the loss) is the least (the darkest blue point). Starting from arbitrary values and working our way down to these 'optimal' values is called optimization.

Optimization

In the previous section, we saw what exactly optimization means in this context. Now let’s see how we can go about it, and understand what gradient-based optimization is and why it is used to train neural networks.

First, let’s think about some easy ways to do this:

Random Search

Let’s pick values randomly and keep track of the values that give the best loss:

import random

best_x, best_z = 0.0, 0.0
best_loss = float('inf')
for _ in range(1000):                 # try a fixed number of random guesses
    x, z = random.random(), random.random()
    y = L(x, z)                       # L is the loss function defined above
    if y < best_loss:                 # keep the best values seen so far
        best_loss, best_x, best_z = y, x, z

There are infinitely many points in the coordinate space, and the odds of ending up with the optimal values are very low. So this is a pretty bad approach. Let’s see if we can do better.

Random Local Search

We can do something similar to what we did while visualizing the loss function. We can start with random values, generate some small random Δx, Δz, and add them to x and z, respectively. If the new values improve the loss, we keep them:

import random

best_x, best_z = random.random(), random.random()
best_loss = float('inf')
for _ in range(1000):
    # probe a small random step from the best point found so far
    dx, dz = random.uniform(-0.01, 0.01), random.uniform(-0.01, 0.01)
    x, z = best_x + dx, best_z + dz
    y = L(x, z)
    if y < best_loss:                 # keep the step only if it improves the loss
        best_loss, best_x, best_z = y, x, z

Analogy: Think of this as a blindfolded hiker trying to climb down a hill, probing every step before advancing.

While it looks good, this approach is still very random and lacks structure. Moreover, an even bigger issue is that it’s very expensive. In the worst case, one might have to probe every point on the function, which might not even be possible.

Gradient-based Optimization

Let’s go back to high school and recall elementary calculus. For a function y = f(x), the derivative f'(x) or dy/dx is defined as the rate of change of y w.r.t. small changes in x. It is also the slope of f(x) at point x.

Numerical differentiation by finite differences method
Source: Image by the author
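For reference, the expression pictured above is the standard difference-quotient definition of the derivative:

f'(x) = lim (h → 0) [f(x + h) − f(x)] / h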

The interpretation and explanation of this expression are beyond the scope of today’s discussion, but if you look closely, you will find that it is pretty self-explanatory. It can also be seen as the standard slope formula.

We can further solve for f(x + h) and approximate it as:

Approximation of f(x + h) using the finite differences method
Source: Image by the author
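Written out, the approximation the image above depicts is the first-order one:

f(x + h) ≈ f(x) + h f'(x)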

Many textbooks and papers take the above expression as given. So this is where it comes from.

From the above expression, we can conclude that:

for a small enough ϵ, f(x − ϵ sign(f'(x))) is smaller than f(x)

So it looks like by using derivatives like the above, we can take a 'small enough' ϵ and reduce f(x) in small steps by updating x. This technique is called Gradient Descent.
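As a quick sanity check (an illustrative example of ours, not from the original): take f(x) = x², so f'(x) = 2x. At x = 2, f'(2) = 4 > 0, so with ϵ = 0.1 the update gives x − ϵ sign(f'(x)) = 1.9, and indeed f(1.9) = 3.61 < 4 = f(2).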

This example, however, was a univariate function. Typically, we will deal with multivariate functions (sometimes with millions of variables in the case of neural nets). Each variable represents the magnitude of the function in a given 'direction.' So for multivariate functions, we take partial derivatives, i.e., individual derivatives w.r.t. each direction (or variable).

The gradient ∇f is nothing but the vector of all the partial derivatives of a function
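In the notation used here (our addition for concreteness):

∇ₓf(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)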

We will discuss Gradient Descent in greater detail in the next section.

Gradient Descent

So far, we’ve seen how we can minimize a function. Now let’s think about where to stop, i.e., how do we know if we have arrived at the 'minimum' value of the function?

The Minima

f'(x) = 0 defines the points where the slope is 0. In two dimensions, these are the points where the tangent to the function is parallel to the x-axis, and thus the slope is 0. Similar ideas follow in higher dimensions. These points are called critical or stationary points. Since the derivative is 0, there is no information about which direction to move.

Graphs of minimum, maximum and saddle points in one dimension
The three types of critical points. Source: Image by deeplearningbook
  1. A point that gives the absolute lowest value of f(x) is a global minimum, whereas a local minimum is a point where f(x) is lower than at all the neighboring points.
    Since the derivative here is 0, we cannot make a move towards much lower values of f(x).
  2. The reverse holds for the global and local maximum.
  3. Finally, a saddle point is neither a minimum nor a maximum.

It is worthwhile to note that the Multiclass SVM loss function we saw earlier belongs to a class of functions called convex functions. There is a separate subfield of mathematical optimization dedicated to convex optimization.

In contrast, a neural network is highly non-linear, and its losses are highly non-convex. This means that the concepts we derive for convex functions may not be as easily applicable in the case of neural networks.

For instance, a strictly convex function has no more than one minimum. However, in the context of deep learning, we are dealing with very complex loss functions that might have multiple local and global minima and most likely form bumpy terrains, making gradient descent non-trivial.

Therefore, it is very difficult or, in some cases, even impossible to reach a global minimum. So we settle for a very low value of the loss, which might not even be a minimum:

Comparison between local minima and global minimum
Comparison of multiple minima. Source: Image by deeplearningbook

So, given that f'(x) = 0 when we are at the extrema of the function, one might have a very valid question:

Why can’t we just equate the derivative to 0 and solve for x?

Isn’t this what we were taught in high school? So what’s wrong with it? Is iterative gradient descent really necessary?

Well, no! We can indeed live without it.

But let’s consider this: in the case of classic linear regression, equating the derivative of the cost function to zero and solving for the parameters involves inverting a matrix (Here’s why). Even the most efficient techniques for this have a complexity of O(n³), where n is the number of parameters.
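Concretely (our addition; the linked derivation is not reproduced here), the closed-form least-squares solution is

θ* = (XᵀX)⁻¹ Xᵀ y

and forming and inverting XᵀX is what incurs the roughly cubic cost.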

Now with neural networks, we are talking about millions of parameters. Moreover, it is a much more complex function than linear regression. So it will be even more difficult to solve. To add to it, these losses may or may not be differentiable everywhere.

Iterative methods like Gradient Descent offer a much more practical and reasonably efficient solution to these problems.

Of course, if it’s any easier or faster, we can always equate the derivative to 0.

Parameter Update Formula

Till now, we’ve discussed what exactly we want, talked about various approaches and settled on one, and seen what the stopping condition looks like. Now let’s see how to actually update x at every step.

To start with, let’s first take a step back and recall what we deduced about f(x − ϵ sign(f'(x))). This value will be less than f(x), given ϵ is small enough. Let’s take a moment and think about x − ϵ sign(f'(x)): if f'(x) is positive, then our new x will be x − ϵ, and if f'(x) is negative, then x will be x + ϵ. So we are basically just using the derivative for its sign and reversing it.

In most textbooks/papers/articles on gradient descent, when they say 'going in the opposite direction of the gradient,' this is what they mean.

However, this is just one variable. In practice, we are dealing with millions of them. So does this apply in the multivariate case? If yes, how? Let’s find out.

Note that from now on, we will represent x as a collective set of multiple parameters (variables). So,

x = (x₁, x₂, x₃, …, xₙ) and
f(x) = f(x₁, x₂, x₃, …, xₙ)

Since we are dealing with vectors, we will need directional derivatives.

The components of the gradient vector ∇ₓf(x) represent the rates of change of the function f(x) w.r.t. each of its variables xᵢ. However, we wish to know how f(x) changes as its variables change along any direction from a given point.

So the directional derivative of a function f(x) in a direction υ (a unit vector), at a point x, is given by the partial derivative of f(x + αυ) w.r.t. α, i.e., ∂f(x + αυ)/∂α at α = 0. Basically, it is the rate of change of f(x + αυ) with small changes in α, because υ is just a direction vector and cannot change. Let’s solve this:

Solution for the directional derivative of a function in terms of gradient and the unit direction vector
Source: Image by the author
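Written out (our reconstruction of the step the image shows): since υ is a unit vector,

∂f(x + αυ)/∂α (at α = 0) = υᵀ∇ₓf(x) = ‖∇ₓf(x)‖ cos θ

where θ is the angle between υ and the gradient ∇ₓf(x).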

From the above, we can say that for any given direction υ, the directional derivative is proportional to the cosine of the angle between the gradient of the function f(x) at point x and the direction itself.

Now in the context of gradient descent, we can take a unit vector in any possible direction at a given point x and move in that direction. However, it is desirable to move in the direction where f decreases the fastest. For this, the directional derivative must be as large as possible in magnitude but negative, i.e., the function should change fastest downhill. This happens when cos θ = -1. So we have to pick the υ for which θ = π, which is nothing but the direction of the vector -∇ₓf(x).

Using this and the earlier idea, we can come up with the following update:

x' = x - ϵ ∇ₓf(x)     ... where ϵ is the learning rate

Generally, we pick a very small value for ϵ. Sometimes we can solve for ϵ such that υᵀ∇ₓf(x') = 0. Think of this as walking down a hill until you reach a point where you would start walking upwards again (since υᵀ∇ₓf(x') = 0 implies cos θ = 0, and hence υ and ∇ₓf(x') are orthogonal). Another strategy is to evaluate x' for a few arbitrary values of ϵ and pick the one that gives the smallest value of f(x'). This strategy is called line search.

This is known as the method of steepest descent or gradient descent.
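To make the update rule concrete, here is a minimal sketch in plain Python (our illustration, not code from the original article); grad_f is assumed to return the gradient of the loss at x as a list:

def gradient_descent(x, grad_f, epsilon=0.01, steps=1000):
    # Illustrative helper: x is the initial parameter list,
    # grad_f returns the gradient of the loss at x
    for _ in range(steps):
        g = grad_f(x)
        # move every parameter a small step against its partial derivative
        x = [xi - epsilon * gi for xi, gi in zip(x, g)]
    return x

# Example: minimize f(x, z) = x² + z², whose gradient is (2x, 2z)
x_star = gradient_descent([3.0, -4.0], lambda p: [2 * p[0], 2 * p[1]])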

Jacobians and Hessians

So far, we’ve used the gradient ∇ₓf(x) for multivariate functions that take vectors as input (x = (x₁, x₂, …, xₙ)) and output scalars (y = f(x)). However, while dealing with functions that input as well as output vectors,

Multivariate function that inputs as well as outputs vectors
Source: Image by the author

we use a Jacobian matrix. It is a matrix of all the partial derivatives:

Different ways of writing a Jacobian Matrix
Source: Image by the author
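In index form (our addition): for a function f: ℝⁿ → ℝᵐ, the Jacobian J is the m × n matrix with entries

Jᵢⱼ = ∂fᵢ/∂xⱼ

so each row is the (transposed) gradient of one output component.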

This is particularly useful while applying the Chain Rule.

Now, if we take a derivative of a derivative, we get a second derivative or a second-order derivative (f''(x)). This quantity will tell us how the first-order derivative changes as we vary the input. We can think of the second derivative as measuring curvature.

The use of second-order derivatives is beyond the scope of this article, so we will only briefly see what it is:

Second Derivative Test and Second-Order Optimization

If the function is a line, the gradient alone is sufficient to describe its behavior (i.e., if slope < 0, the function descends; if slope > 0, it ascends; and slope = 0 means a flat line, as discussed earlier). Otherwise, we can use the second derivative. For example, for a quadratic function, the derivative is a line, so we can take the derivative of the derivative and apply the same reasoning.

This is particularly useful for evaluating critical points, where f'(x) = 0. Here, if the second derivative f''(x) > 0, then f'(x) increases as we move to the right and decreases as we move to the left. This means we are at a local minimum. Similarly, when f''(x) < 0, we are at a local maximum. Unfortunately, when f''(x) = 0, it is inconclusive whether we are at a saddle point or on a flat region. In multiple dimensions, however, it is actually possible to find positive evidence of saddle points in some cases. This is known as the second derivative test.

In the case of multiple dimensions or multivariate functions, we have the Hessian matrix. It is like a Jacobian but with second derivatives. In fact, the Hessian is the Jacobian of the gradient.
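Explicitly (our addition): for f: ℝⁿ → ℝ, the Hessian H has entries

Hᵢⱼ = ∂²f/∂xᵢ∂xⱼ

and it is symmetric wherever the second partial derivatives are continuous.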

Now, if you notice, with the information about the curvature of the function, we can actually determine a correct step size for our parameter update, one that won’t overshoot our function value. We can do this by including the Hessian matrix in the optimization. One of the simplest methods that do this is called Newton’s Method:

x' = x - H⁻¹∇f

This belongs to a whole family of algorithms that use the Hessian matrix, called second-order optimization algorithms. The ones that use only the gradient, on the other hand, are called first-order optimization algorithms.
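As a rough sketch of one Newton step (our illustration, using NumPy; grad_f and hess_f are assumed to return the gradient vector and Hessian matrix of f at x), we solve a linear system rather than explicitly inverting H:

import numpy as np

def newton_step(x, grad_f, hess_f):
    # Solve H d = ∇f for the step d instead of forming H⁻¹ explicitly
    g = grad_f(x)
    H = hess_f(x)
    d = np.linalg.solve(H, g)
    return x - d

# Example: f(x) = x₀² + 3x₁² is quadratic, so a single step lands at the minimum
x = np.array([2.0, -1.0])
x_new = newton_step(x,
                    lambda p: np.array([2 * p[0], 6 * p[1]]),
                    lambda p: np.diag([2.0, 6.0]))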

Why not use second-order optimization algorithms for neural nets?

They are very expensive in terms of speed and memory. Constructing the Hessian matrix requires the derivatives of all combinations of variables, which is O(n²) entries.

Moreover, second derivatives are much more difficult and complex to implement. Here is a list of other issues with second-order optimization algorithms that make them infeasible for neural networks.

Computing the Gradient

We can do this in two ways: the finite differences formula and using calculus.

Using finite differences

We can directly apply the formula we saw earlier for all the partial derivatives. However, it is recommended to use a modified version instead:

f'(x) = (f(x + h) − f(x − h)) / 2h

This one explores both sides of the point x. To calculate the gradient:

h = 1e-5
gradient = [0.0] * len(x)

for i in range(len(x)):      # enumerate all parameters with i
    backup = x[i]

    x[i] = backup - h        # evaluate f slightly below the current value
    fx_minus_h = f(x)

    x[i] = backup + h        # evaluate f slightly above the current value
    fx_plus_h = f(x)

    x[i] = backup            # restore the original parameter

    gradient[i] = (fx_plus_h - fx_minus_h) / (2 * h)

The obvious downside of this method is that it is approximate. Moreover, the above loop will run for every single update. So basically, if we have a million parameters and one thousand steps, this will run for 1000 * 1000000 iterations. This is very expensive and is used only for gradient check sanity tests.

Using Calculus

Instead of using the approximate formula, we can apply the direct formula for the derivative of the function. After that, we can obtain the gradient using this derivative, which can be directly used in the update step.

However, this is not exactly how modern machine learning libraries work; instead, they use automatic differentiation. The idea is that all the operations we do are ultimately a sequence or a combination of elementary arithmetic operations (addition, subtraction, multiplication, division) and elementary functions (sin, cos, log, etc.). So one can record the sequence of execution and apply the Chain Rule.
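For instance (our illustration, not from the original article), TensorFlow’s tf.GradientTape records the operations executed inside its context and applies the chain rule in reverse to produce exact gradients:

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x * x + tf.sin(x)    # every op on x is recorded on the tape

# dy/dx = 2x + cos(x), evaluated at x = 3.0
dy_dx = tape.gradient(y, x)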

Implementation

Finally, let’s try to implement an optimizer for TensorFlow. For more on implementing custom optimizers in TensorFlow, see this.
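The optimizer code itself was embedded in the original post and isn’t reproduced here. As a stand-in, here is a minimal sketch of the core idea (ours, not the author’s exact implementation): compute gradients with tf.GradientTape and apply the plain update x' = x - ϵ ∇ₓf(x) to each trainable variable.

import tensorflow as tf

def sgd_step(model, loss_fn, x_batch, y_batch, epsilon=0.01):
    # Hypothetical helper for one training step, not the article's exact code.
    # Forward pass under the tape so gradients can be computed
    with tf.GradientTape() as tape:
        loss = loss_fn(y_batch, model(x_batch))
    grads = tape.gradient(loss, model.trainable_variables)
    # Parameter update: move each variable against its gradient
    for var, grad in zip(model.trainable_variables, grads):
        var.assign_sub(epsilon * grad)
    return loss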

To test this, I trained a model for a simple MNIST classification task, first with the built-in SGD optimizer and then with this one, and compared the results:

Comparison of model training performances between the in-built TensorFlow SGD optimizer and the one we implemented on the MNIST clothing data
Training MNIST classification. Left: built-in SGD. Right: our optimizer. Source: Image by the author

We can see that it works fine.

Conclusion

In this rather exhaustive article,

  1. We discussed the need for optimization. Then, we saw a few optimization strategies and introduced gradient-based optimization algorithms.
  2. We talked about the gradient descent algorithm and answered a few whys and whats w.r.t. gradient descent. We also derived the expression for the parameter update.
  3. We then briefly touched on Jacobians, Hessians, and the idea of second-order optimization algorithms.
  4. We saw how to calculate the gradient for any given function and compared a few approaches.
  5. And finally, we implemented a simple gradient descent optimizer for TensorFlow.

References

Why the gradient is the direction of steepest ascent: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/why-the-gradient-is-the-direction-of-steepest-ascent

