Gradient Descent Optimization
Author(s): Rohan Jagtap
Gradient Descent Optimization Algorithms
Most deep learning algorithms train by optimizing some kind of objective (or loss) function. We typically try to either maximize or minimize these functions. Gradient-based optimization is the most extensively used technique for training neural networks.
In this article, we will see the well-known Gradient Descent algorithm. We will discuss what it is, why it works, and finally, implement a gradient descent optimizer for TensorFlow.
This article is for anyone who has just heard about Gradient Descent, or anyone who thinks of Gradient Descent as a black box and has a shallow understanding of what happens but wants to delve deeper into why things are the way they are.
Note: This article is inspired by the Deep Learning textbook (Chapter 4: Numerical Computation) and the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition notes.
Contents
This is a rather long article, and not all of these sections may concern every reader at once. Please feel free to use the following list to navigate and skip to the parts that are relevant to you:
- The Loss Function
- Optimization
  1. Random Search
  2. Random Local Search
  3. Gradient-based Optimization
- Gradient Descent
  1. The Minima
  2. Why can't we just equate the derivative to 0 and solve for x?
  3. Parameter Update Formula
     a. Why is the negative gradient the direction of the steepest descent?
     b. Why do we say "going in the opposite direction of the gradient"?
- Jacobians and Hessians
  1. Second Derivative Test and Second-Order Optimization
  2. Why not use second-order optimization algorithms for neural nets?
- Computing the Gradient
  1. Using finite differences
  2. Using Calculus
- Implementation
- Conclusion
- References
The Loss Function
A loss function, which also goes by the names objective function, error function, cost function, or criterion, can be any (differentiable) function that helps evaluate an algorithm's output against the ground truth. It can be anything from Mean Squared Error for continuous values to cross-entropy loss for probability distributions. While loss functions are very interesting, today we are only concerned with how to minimize them.
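To make these concrete, here is a minimal NumPy sketch of these two losses (an illustrative example, not from the original article):

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error for continuous targets
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(p_true, p_pred, eps=1e-12):
    # Cross-entropy between a target distribution and predicted probabilities
    return -np.sum(p_true * np.log(p_pred + eps))

print(mse(np.array([1.0, 2.0]), np.array([1.1, 1.9])))             # 0.01
print(cross_entropy(np.array([0.0, 1.0]), np.array([0.2, 0.8])))   # ~0.223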
To understand what exactly we want, let's visualize a loss function. In practice, the loss varies with many parameters, which makes it hard to visualize. For simplicity, we will consider just two.
Let's define a loss: y = L(x, z), where x, y, z are Cartesian coordinates along the X, Y, and Z axes, respectively.
Let's take random values w_x and w_z, for which we get the point in space (w_x, y_w, w_z).
Now, we can either vary some m randomly as y_w′ = L(w_x + m, w_z), which will give us a line plot of the loss values y_w along the xy-plane, or we can vary some m and n as y_w″ = L(w_x + m, w_z + n), which will give us a three-dimensional plot. For the Multiclass SVM loss, this gives a bowl-shaped surface.
So what we ultimately want is to find some w_x* and w_z* such that y_w (i.e., the loss) is the least (the darkest blue point in the plot). Starting from arbitrary values and working our way down to these "optimal" values is called optimization.
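The original figure is omitted here, but as a rough sketch, plots like these can be generated with matplotlib; the loss L below is a hypothetical stand-in:

import numpy as np
import matplotlib.pyplot as plt

def L(x, z):
    # hypothetical stand-in for a real loss surface
    return x ** 2 + z ** 2

w_x, w_z = 1.5, -2.0                     # arbitrary starting parameters
m = np.linspace(-3.0, 3.0, 100)

# vary one parameter -> a line plot of the loss
plt.plot(m, L(w_x + m, w_z))
plt.xlabel("m"); plt.ylabel("loss"); plt.show()

# vary both parameters -> a view of the loss surface
n = np.linspace(-3.0, 3.0, 100)
M, N = np.meshgrid(m, n)
plt.contourf(M, N, L(w_x + M, w_z + N), levels=30, cmap="Blues_r")
plt.colorbar(label="loss"); plt.show()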
Optimization
In the previous section, we saw what exactly optimization means in this context. Now let's see how we can go about it and understand what gradient-based optimization is and why it is used to train neural networks.
First, let's think about some easy ways to do this:
Random Search
Let's pick values randomly and keep track of the values that give the best loss:
import math, random

best_x, best_z = 0.0, 0.0
best_loss = math.inf
for _ in range(1000):                    # a few iterations
    x, z = random.random(), random.random()
    y = L(x, z)                          # L is the loss function
    if y < best_loss:
        best_loss, best_x, best_z = y, x, z
There are infinitely many points in the coordinate space, and the odds of ending up with the optimal values are very low. So this is a pretty bad approach. Let's see if we can do better.
Random Local Search
We can do something similar to what we did while visualizing the loss function. We can start with random values, generate some small random Δx and Δz, and add them to x and z, respectively. If the new values improve the loss, we keep them:
import math, random

best_x, best_z = random.random(), random.random()
best_loss = math.inf
for _ in range(1000):                    # a few iterations
    dx = random.uniform(-0.01, 0.01)     # small random Δx
    dz = random.uniform(-0.01, 0.01)     # small random Δz
    x, z = best_x + dx, best_z + dz
    y = L(x, z)                          # L is the loss function
    if y < best_loss:
        best_loss, best_x, best_z = y, x, z
Analogy: Think of this as a blindfolded hiker trying to climb down a hill, probing every step before advancing.
While it looks good, this approach is still very random and lacks structure. Moreover, an even bigger issue is that it's very expensive. In the worst case, one might have to probe every point on the function, which might not even be possible.
Gradient-based Optimization
Let's go back to high school and recall elementary calculus. For a function y = f(x), the derivative f′(x), or dy/dx, is defined as the rate of change of y w.r.t. small changes in x. It is also the slope of f(x) at point x:

f′(x) = lim (h → 0) [f(x + h) − f(x)] / h

The interpretation and explanation of this expression are out of the scope of today's discussion, but if you watch closely, you will find that it is pretty self-explanatory. It can also be seen as the standard slope formula.

For a small enough h, we can drop the limit, solve for f(x + h), and approximate it to:

f(x + h) ≈ f(x) + h f′(x)

Many textbooks and papers assume the above expression. So this is where it comes from.

From the above expression, we can solve and conclude that:
for a small enough ϵ, f(x − ϵ sign(f′(x))) is smaller than f(x)

(Substituting h = −ϵ sign(f′(x)) into the approximation gives f(x) − ϵ|f′(x)|, which is less than f(x) whenever f′(x) ≠ 0.)
So it looks like by using derivatives like the above, we can take a "small enough" ϵ and reduce f(x) in small steps by updating x. This technique is called Gradient Descent.
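Here is a minimal sketch of this idea in Python for a toy univariate function (f(x) = x² is an assumption for illustration; its derivative is known analytically):

def f(x): return x ** 2
def f_prime(x): return 2 * x

x = 5.0            # arbitrary starting point
eps = 0.1          # a "small enough" step
for _ in range(100):
    x -= eps * f_prime(x)   # step against the sign of the derivative
print(x, f(x))              # x ends up very close to 0, the minimum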
This example, however, was a univariate function. Typically, we will deal with multivariate functions (sometimes with millions of variables in the case of neural nets). Each variable represents the magnitude of the function in a given "direction." So for multivariate functions, we take partial derivatives, i.e., individual derivatives w.r.t. each direction (or variable).
The gradient, ∇f, is nothing but the vector of all the partial derivatives of a function:

∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)
We will discuss Gradient Descent in greater detail in the next section.
Gradient Descent
So far, we've seen how we can minimize a function. Now let's think about where to stop, i.e., how do we know if we have arrived at the "minimum" value of the function?
The Minima
f′(x) = 0 defines the points where the slope is 0. In two dimensions, these are the points where the tangent to the function is parallel to the x-axis, so the slope is 0. Similar ideas follow in higher dimensions. These points are called critical or stationary points. Since the derivative is 0, there is no information about which direction to move.
- A point that gives the absolute lowest value of f(x) is a global minimum, whereas a local minimum is a point where f(x) is lower than at all the neighboring points. Since the derivative at such a point is 0, we cannot make a move toward lower values of f(x).
- Vice versa is true for the global and local maximum.
- Finally, a saddle point is neither a minimum nor a maximum.
It is worthwhile to note that the Multiclass SVM loss function we saw earlier belongs to a type of function called a convex function. There is a separate subfield of mathematical optimization dedicated to convex optimization.
On the contrary, a neural network is highly non-linear, and its losses are highly non-convex. This means that the concepts we derive for convex functions may not be as easily applicable in the case of neural networks.
For instance, a strictly convex function has no more than one minimum. However, in the context of deep learning, we are dealing with very complex loss functions that might have multiple local and global minima and are most likely bumpy terrains, making gradient descent non-trivial.
Therefore, it is very difficult or, in some cases, even impossible to reach a global minimum. So we settle for a very low value of the loss, which might not even be a minimum.
So, since the points we are after are points where f′(x) = 0, one might have a very valid question:
Why can't we just equate the derivative to 0 and solve for x?
Isn't this what we were taught in high school? So what's wrong with it? Is iterative gradient descent really necessary?
Well, no! We can indeed live without it.
But let's consider this: in the case of classic linear regression, equating the derivative of the cost function to zero and solving for the parameters involves inverting a matrix (here's why). Even the most efficient techniques for this have a complexity of O(n³), where n is the number of parameters.
Now with neural networks, we are talking about millions of parameters, and the loss is a much more complex function than in linear regression, so it would be even harder to solve directly. To add to it, these losses may or may not be differentiable everywhere.
Iterative methods like Gradient Descent offer a much more practical and reasonably efficient solution to these problems.
Of course, if it is any easier or faster for a given problem, we can always just equate the derivative to 0.
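For intuition, here is a small NumPy sketch of that closed-form (normal-equation) solution for linear regression; solving the n × n system is the O(n³) step (the data here is synthetic, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # 100 samples, n = 3 parameters
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Setting the gradient of the squared error to zero gives X^T X w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)        # the O(n^3) solve
print(w)                                     # close to [2, -1, 0.5]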
Parameter Update Formula
Till now, we've discussed what exactly we want; then we talked about various approaches and settled on one; then we saw what the stopping condition looks like. Now let's see how to actually solve for x at every step.
To start with, let's take a step back and recall what we deduced about f(x − ϵ sign(f′(x))): this value will be less than f(x), given ϵ is small enough. Let's take a moment and think about x − ϵ sign(f′(x)). If f′(x) is positive, then our new x will be x − ϵ, and if f′(x) is negative, then the new x will be x + ϵ. So we are basically just using the derivative for its sign and stepping in the opposite direction.
In most textbooks/papers/articles on gradient descent, when they say "going in the opposite direction of the gradient," this is what they mean.
However, this is just one variable. In practice, we are dealing with millions of them. So does this apply in the multivariate case? If yes, how? Let's find out.
Note that from now on, we will represent x as a collective set of multiple parameters (variables). So,

x = (x₁, x₂, x₃, …, xₙ) and
f(x) = f(x₁, x₂, x₃, …, xₙ)
Since we are dealing with vectors, we will need directional derivatives.
The components of the gradient vector ∇f(x) represent the rates of change of the function f(x) w.r.t. each of its variables xᵢ. However, we wish to know how f(x) changes as its variables change along any direction from a given point.
So the directional derivative of a function f(x) in a direction u (a unit vector), at a point x, is given by the partial derivative of f(x + αu) w.r.t. α, i.e., ∂f(x + αu)/∂α evaluated at α = 0. Basically, it is the rate of change of f(x + αu) with small changes in α, because u is just a direction vector and cannot change. Let's solve this:

∂f(x + αu)/∂α = uᵀ∇f(x + αu), which at α = 0 becomes uᵀ∇f(x) = ‖u‖ ‖∇f(x)‖ cos θ = ‖∇f(x)‖ cos θ

From the above, we can say that for any given direction u, the directional derivative is proportional to the cosine of the angle θ between the gradient of the function f(x) at point x and the direction itself.
Now, in the context of gradient descent, we can take a unit vector in any possible direction at a given point x and move in that direction. However, it is desirable to move in the direction in which f decreases the fastest. For this, the directional derivative must be the largest in magnitude but negative, i.e., cos θ must be −1. So we have to pick a u for which θ = π, i.e., a u pointing exactly opposite to the gradient. This direction is nothing but the vector −∇f(x).
Using this and the earlier idea, we can come up with the following update:
x′ = x − ϵ∇f(x) … where ϵ is the learning rate
Generally, we pick a very small value for ϵ. Sometimes we can solve for ϵ such that uᵀ∇f(x′) = 0. Think of this as walking down a hill until you reach a point where you would start walking upwards again (since uᵀ∇f(x′) = 0 means cos θ = 0, and hence u and ∇f(x′) are orthogonal). Another strategy is to evaluate x′ for a few arbitrary values of ϵ and pick the one that gives the smallest value of f(x′). This strategy is called line search.
This is known as the method of steepest descent, or gradient descent.
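Putting the update rule and the line-search idea together, here is a toy sketch (the loss and its gradient are assumptions chosen for illustration):

import numpy as np

def f(x): return np.sum(x ** 2)          # toy convex loss
def grad_f(x): return 2 * x

x = np.array([3.0, -4.0])
for _ in range(50):
    g = grad_f(x)
    # naive line search: try a few step sizes and keep the best
    x = min((x - eps * g for eps in (1e-3, 1e-2, 1e-1, 1.0)), key=f)
print(x, f(x))                           # converges toward (0, 0)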
Jacobians and Hessians
So far, we've used the gradient ∇f(x) for multivariate functions that take vectors as input (x = (x₁, x₂, …, xₙ)) and output scalars (y = f(x)). However, when dealing with functions that take vectors as input and also output vectors, we use a Jacobian matrix. It is the matrix of all the partial derivatives: for f: ℝⁿ → ℝᵐ, the Jacobian J is the m × n matrix with entries Jᵢⱼ = ∂fᵢ/∂xⱼ.
This is particularly useful while applying the Chain Rule.
Now, if we take the derivative of a derivative, we get a second derivative, or second-order derivative, f″(x). This quantity tells us how the first-order derivative changes as we vary the input. We can think of the second derivative as measuring curvature.
The use of second-order derivatives is beyond the scope of this article, so we will briefly see what it is:
Second Derivative Test and Second-Order Optimization
If the function is a line, the gradient alone is sufficient to describe its behavior (i.e., if slope < 0, the function descends; if slope > 0, it ascends; and slope = 0 means a flat line, as discussed earlier). Otherwise, we can use the second derivative. For example, for a quadratic function, the derivative is a line, so we can take the derivative of the derivative and use the above analogy.
This is particularly useful for evaluating critical points, where f′(x) = 0. Here, if the second derivative f″(x) > 0, then f′(x) increases as we move to the right and decreases as we move to the left. This means we are at a local minimum. Similarly, when f″(x) < 0, we are at a local maximum. Unfortunately, when f″(x) = 0, the test is inconclusive: the point may be a saddle point or part of a flat region. In multiple dimensions, however, it is actually possible to find positive evidence of saddle points in some cases. This is known as the second derivative test.
In the case of multiple dimensions, or multivariate functions, we have the Hessian matrix. It is like a Jacobian but with second derivatives: Hᵢⱼ = ∂²f/∂xᵢ∂xⱼ. In fact, the Hessian is exactly the Jacobian of the gradient.
Now, if you notice, with information about the curvature of the function, we can actually determine a step size for our parameter update that won't overshoot. We can do this by including the Hessian matrix in the optimization. One of the simplest methods that does this is called Newton's Method:
x′ = x − H⁻¹∇f
This belongs to a whole family of algorithms that use the Hessian matrix: second-order optimization algorithms. On the other hand, the ones that use the gradient are called first-order optimization algorithms.
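A tiny sketch of one Newton step on a quadratic (a hypothetical example; for a quadratic, the Hessian is constant and a single step lands exactly on the critical point):

import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])               # Hessian of f(x) = 0.5 xᵀAx − bᵀx
b = np.array([1.0, -2.0])

x = np.zeros(2)
grad = A @ x - b                         # gradient of f at x
x = x - np.linalg.solve(A, grad)         # x′ = x − H⁻¹∇f
print(A @ x - b)                         # gradient is now ~[0, 0]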
Why not use second-order optimization algorithms for neural nets?
They are very expensive in terms of speed and memory. Constructing the Hessian matrix requires the second derivative for every pair of variables, i.e., O(n²) entries.
Moreover, second derivatives are more complex and much more difficult to implement. The references list other issues with second-order optimization algorithms that make them infeasible for neural networks.
Computing the Gradient
We can do this in two ways: the finite differences formula and using calculus.
Using finite differences
We can directly apply the formula we saw earlier for all partial derivatives. However, it is recommended to use a modified, centered version instead:
f′(x) ≈ (f(x + h) − f(x − h)) / 2h
This one explores both sides of the point x. To calculate the gradient:
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    # Centered finite differences; x is a 1-D float array of parameters
    gradient = np.zeros_like(x)
    for i in range(len(x)):              # enumerate all parameters with i
        backup = x[i]
        x[i] = backup - h
        fx_minus_h = f(x)
        x[i] = backup + h
        fx_plus_h = f(x)
        x[i] = backup                    # restore the original value
        gradient[i] = (fx_plus_h - fx_minus_h) / (2 * h)
    return gradient
The obvious downside of this method is that it is approximate. Moreover, the above loop runs for every single update. So if we have a million parameters and one thousand steps, the inner body runs 1000 × 1,000,000 times. This is very expensive, and in practice, the method is used only for gradient-check sanity tests.
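Such a gradient check might look like the following, comparing the numerical gradient above against a known analytic gradient (the toy function is an assumption for illustration):

import numpy as np

def f(x): return np.sum(x ** 3)          # toy function; analytic gradient is 3x^2
def analytic_grad(x): return 3 * x ** 2

x = np.array([1.0, 2.0, -0.5])
num = numerical_gradient(f, x.copy())    # from the loop above
rel_err = np.abs(num - analytic_grad(x)) / np.maximum(np.abs(num), np.abs(analytic_grad(x)))
print(rel_err)                           # tiny values mean the gradients agree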
Using Calculus
Instead of using the approximate formula, we can apply the exact formula for the derivative of the function. After that, we can obtain the gradient using this derivative, which can be directly used in the update step.
However, this is not exactly how modern machine learning libraries work; instead, they use automatic differentiation. The idea is that all the operations we perform are ultimately a sequence or combination of elementary arithmetic operations (addition, subtraction, multiplication, division) and elementary functions (sin, cos, log, etc.). So one can record the sequence of execution and apply the Chain Rule.
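For example, TensorFlow does this by recording operations on a "tape" and then walking the recorded sequence backward with the Chain Rule:

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + tf.sin(x)     # recorded as a sequence of elementary ops
dy_dx = tape.gradient(y, x)    # chain rule applied automatically
print(dy_dx.numpy())           # 2*x + cos(x) at x = 3.0, i.e., ~5.01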
Implementation
Finally, let's try to implement an optimizer for TensorFlow. For more on implementing custom optimizers in TensorFlow, see the tf.keras.optimizers.Optimizer documentation linked in the references.
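The original gist is not reproduced here; instead, here is a minimal sketch of what such an optimizer might look like with the tf.keras.optimizers.Optimizer subclassing API (as of TF 2.10; the class name SimpleGD and the hyperparameter choices are my own):

import tensorflow as tf

class SimpleGD(tf.keras.optimizers.Optimizer):
    # Plain gradient descent: w <- w - lr * grad (no momentum)

    def __init__(self, learning_rate=0.01, name="SimpleGD", **kwargs):
        super().__init__(name, **kwargs)
        self._set_hyper("learning_rate", learning_rate)

    def _resource_apply_dense(self, grad, var, apply_state=None):
        lr = self._get_hyper("learning_rate", var.dtype)
        return var.assign_sub(lr * grad)          # the update: x′ = x − ϵ∇f(x)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        lr = self._get_hyper("learning_rate", var.dtype)
        return self._resource_scatter_add(var, indices, -lr * grad)

    def get_config(self):
        config = super().get_config()
        config["learning_rate"] = self._serialize_hyperparameter("learning_rate")
        return config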
To test this, I trained a model on a simple MNIST classification task, first with the built-in SGD optimizer and then with this one, and compared the results:
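The comparison itself is not reproduced here; a sketch of the setup might look like this (the model architecture and hyperparameters below are illustrative):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer=SimpleGD(learning_rate=0.01),   # swap in tf.keras.optimizers.SGD(0.01) to compare
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(x_train, y_train, epochs=5)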
We can see that it works fine.
Conclusion
In this rather exhaustive article,
- We discussed the need for optimization. Then, we saw a few optimization strategies and introduced gradient-based optimization algorithms.
- We talked about the gradient descent algorithm and answered a few whys and whats w.r.t. gradient descent. We also derived the expression for the parameter update.
- We then briefly touched on Jacobians, Hessians, and the idea of second-order optimization algorithms.
- We saw how to calculate the gradient for any given function and compared a few approaches.
- And finally, we implemented a simple gradient descent optimizer for TensorFlow.
References
- Deep Learning
- CS231n Convolutional Neural Networks for Visual Recognition
- Why is gradient descent required?
- Why the gradient is the direction of steepest ascent: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/why-the-gradient-is-the-direction-of-steepest-ascent
- Why second order SGD convergence methods are unpopular for deep learning?
- Basic classification: Classify images of clothing | TensorFlow Core
- tf.keras.optimizers.Optimizer | TensorFlow v2.10.0