Breaking it Down: Gradient Descent
Last Updated on January 6, 2023 by Editorial Team
Author(s): Jacob Bumgarner
Exploring and visualizing the mathematical fundamentals of gradient descent with Grad-Descent-Visualizer.
Outline
1. What is Gradient Descent?
2. Breaking Down Gradient Descent
2.1 Computing the Gradient
2.2 Descending the Gradient
3. Visualizing Multivariate Descents with Grad-Descent-Visualizer
3.1 Descent Montage
4. Conclusion: Contextualizing Gradient Descent
5. Resources
1. What is Gradient Descent?
Gradient descent is an optimization algorithm that is used to improve the performance of deep/machine learning models. Over a repeated series of training steps, gradient descent identifies optimal parameter values that minimize the output of a cost function.
In the next two sections of this post, we'll step down from this satellite-view description and break down gradient descent into something a bit easier to understand. We will also visualize the gradient descent of various test functions with my Python package, grad-descent-visualizer.
2. Breaking Down Gradient Descent
To gain an intuitive understanding of gradient descent, let's first ignore machine learning and deep learning. Let's instead start with a simple function, f(x) = x²:
The goal of gradient descent is to find the minimum of a function: the lowest possible output value of that function. Given our function f(x), the goal of gradient descent is to find the value of x that drives the output of f(x) toward 0. By visualizing this function (below), it's easy to see that x = 0 produces the minimum of f(x).
The important question in gradient descent is: if we initialize x to some random number, say x = 1.8, is there some way to automatically update x so that it eventually produces the minimal output of the function? Indeed, we can find this minimal output automatically with a two-step process:
- Find the slope of the function at the point where our input parameter x sits.
- Update our input parameter x by stepping it down the gradient.
In our simple gradient descent algorithm, this two-step process is repeated over and over until the output of our function stabilizes at a minimum, or reaches a defined gradient tolerance level. Of note, other more efficient descent algorithms take different approaches (e.g., RMSProp, AdaGrad, Adam).
2.1 Computing the Gradient
To find the slope (or gradient, hence gradient descent) of the function f(x) at any value of x, we can differentiate* the function. Differentiating our example function is straightforward with the power rule (below), giving us: f′(x) = 2x.
Using our starting point x = 1.8, we find the starting gradient of x (dx) to be dx = 3.6.
Let's write a simple function in Python to automatically compute the derivative of any input value for f(x) = x².
*I'd strongly recommend checking out 3Blue1Brown's video to intuitively understand differentiation. The differentiation of this sample function from first principles can be seen here.
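A minimal sketch of such a derivative function, assuming f(x) = x² (the function and variable names here are illustrative, not necessarily those from the original snippet):

```python
def gradient(x):
    """Derivative of f(x) = x**2, from the power rule: f'(x) = 2x."""
    return 2 * x


x = 1.8
print(f"Gradient at x = {x}: dx = {gradient(x)}")
```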
Gradient at x = 1.8: dx = 3.6
2.2 Descending the Gradient
Once we find the gradient of the starting point, we want to update our input parameter so that it steps down this gradient. Doing this will minimize the output of the function.
To move a variable down its gradient, we can simply subtract the gradient from the input parameter. However, if you've looked closely, you may have noticed that subtracting the entire gradient from the input parameter x = 1.8 would cause it to bounce back and forth between 1.8 and -1.8 forever, preventing it from ever approaching 0.
Instead, we can define a learning rate, here 0.1. We'll scale dx by this learning rate before subtracting it from x. By tuning the learning rate, we can create smoother descents: large learning rates produce large jumps along the function, and small learning rates produce small steps.
Lastly, we'll eventually have to stop the gradient descent; otherwise, the algorithm would continue endlessly as the gradient approaches 0. For this example, we'll simply stop the descent once dx is less than 0.01. In your own IDE, you can alter the learning_rate and tolerance parameters to see how the iteration count and the final value of x vary.
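Putting the two steps together, the descent loop might be sketched like this (variable names are my own; the loop stops once the magnitude of dx falls below the tolerance):

```python
def gradient(x):
    # Derivative of f(x) = x**2
    return 2 * x


x = 1.8
learning_rate = 0.1
tolerance = 0.01

iterations = 0
while abs(gradient(x)) >= tolerance:
    # Step x down its gradient, scaled by the learning rate
    x -= learning_rate * gradient(x)
    iterations += 1

print(f"Function minimum found in {iterations} iterations. X = {x:.2f}")
```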
Function minimum found in 27 iterations. X = 0.00
As seen in the video above, our starting value of x = 1.8 was automatically updated to x = 0.0 through the iterative process of gradient descent.
3. Visualizing Multivariate Descents with Grad-Descent-Visualizer
Hopefully, this univariate example provided some foundational insight into what gradient descent actually does. Now let's expand to the context of multivariate functions.
We'll first visualize a gradient descent of Himmelblau's function.
There are a few key differences in the descent of multivariate functions.
First, we need to compute partial derivatives to update each parameter. In Himmelblau's function, the gradient of x depends on y: sums containing both variables are squared, so the chain rule pulls each variable into the other's derivative. This means that the formula used to differentiate x will contain y, and vice versa.
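As a sketch, Himmelblau's function is f(x, y) = (x² + y − 11)² + (x + y² − 7)², and its partial derivatives (derived by hand with the chain rule) can be written as:

```python
def himmelblau(x, y):
    """Himmelblau's function: f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2."""
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def himmelblau_grads(x, y):
    # Chain rule: each squared term mixes x and y, so both
    # partial derivatives depend on both variables.
    dx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    dy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return dx, dy


# At the known minimum (3, 2), both partial derivatives vanish.
print(himmelblau_grads(3.0, 2.0))  # → (0.0, 0.0)
```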
Second, you may have noticed that there was only one minimum in the simple function from Section 2. In reality, our cost functions may have many unknown local minima. This means that which local minimum our parameters settle into depends on their starting positions and the behavior of the gradient descent algorithm.
To visualize the descent of this landscape, we're going to initialize our starting parameters as x = -0.4 and y = -0.65. We can then watch the descent of each parameter in its own dimension and in a 2D descent, sliced by the position of the opposite parameter.
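A minimal sketch of this two-parameter descent, with the partial derivatives of Himmelblau's function written out by hand (the learning rate and iteration count here are illustrative choices, not taken from the visualization):

```python
def himmelblau(x, y):
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2


def grads(x, y):
    # Partial derivatives of Himmelblau's function
    dx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    dy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return dx, dy


# Starting parameters from the visualization
x, y = -0.4, -0.65
learning_rate = 0.01

for _ in range(1000):
    dx, dy = grads(x, y)
    # Each parameter steps down its own partial derivative
    x -= learning_rate * dx
    y -= learning_rate * dy

print(f"Descended to ({x:.3f}, {y:.3f}), f(x, y) = {himmelblau(x, y):.6f}")
```

All four of Himmelblau's local minima have an output of 0, so wherever the point lands, the final function value should be essentially zero.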
For greater context, let's visualize the descent of the same point in 3D using my grad-descent-visualizer package, created with the help of PyVista.
3.1 Descent Montage
Now let's visualize the descent of some more test functions! We'll place a grid of points across each of these functions and watch how the points move as they descend whatever gradient they are sitting on.
The Sphere Function.
The Griewank Function.
The Six-Hump Camel Function. Notice the many local minima of the function.
Letβs re-visualize a gridded descent of the Himmelblau Function. Notice how different parameter initializations lead to different minima.
And lastly, the Easom Function. Notice how many points sit still because they are initialized on a flat gradient.
4. Conclusion: Contextualizing Gradient Descent
So far, we've worked through gradient descent with a univariate function and have visualized the descent of several multivariate functions. In reality, modern deep learning models have vastly more parameters than the functions we've examined.
For example, BLOOM, the large language model recently released by Hugging Face and the BigScience collaboration, has 176 billion parameters. The chained functions used in this model are also far more complicated than our test functions.
However, it's important to realize that the foundations of what we've learned about gradient descent still apply. During each iteration of training of any deep learning model, the gradient of every parameter is calculated. These gradients are then averaged across the training examples and subtracted from the parameters so that they "step down" their gradients, pushing the model to produce a minimal output from its cost function.
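As an illustrative sketch of that batch-averaged update (the data and one-parameter model here are hypothetical, not from any real training setup), consider fitting a single weight w with a mean squared error cost:

```python
# Toy dataset of (input, target) pairs generated by t = 2 * x,
# so the optimal weight is w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0
learning_rate = 0.05

for _ in range(200):
    # d/dw of (w*x - t)^2 is 2*(w*x - t)*x; average it over the batch
    grad = sum(2 * (w * x - t) * x for x, t in data) / len(data)
    # Step the weight down the averaged gradient
    w -= learning_rate * grad

print(f"Learned w = {w:.2f}")  # prints: Learned w = 2.00
```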
Thanks for reading!
5. Resources
- Grad-Descent-Visualizer
- 3Blue1Brown
- Gradient Descent
- Derivatives
- Simon Fraser University: Test Functions for Optimization
- PyVista
- Michael Nielsen's Neural Networks and Deep Learning