
Gradient Descent Algorithm Explained
Author(s): Pratik Shukla
With Step-By-Step Mathematical Derivation

Index:
- Basics Of Gradient Descent.
- Basic Rules Of Derivation.
- Gradient Descent With One Variable.
- Gradient Descent With Two Variables.
- Gradient Descent For Mean Squared Error Function.
What is Gradient Descent?
Gradient Descent is a machine learning algorithm that operates iteratively to find the optimal values for its parameters. It takes into account a user-defined learning rate and initial parameter values.
How does it work?
- Start with initialย values.
- Calculate cost.
- Update values using the update function.
- Return the minimized cost for our cost function (a code sketch of this loop follows below).
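To make these four steps concrete, here is a minimal Python sketch of the loop. The function name and default values are only illustrative, not part of any library:

```python
# A minimal sketch of the gradient descent loop for a single parameter.
# `derivative` is the derivative of the cost function with respect to theta.

def gradient_descent(derivative, initial_value, learning_rate=0.1, iterations=100):
    theta = initial_value                          # 1. start with an initial value
    for _ in range(iterations):
        cost_slope = derivative(theta)             # 2. evaluate the gradient of the cost
        theta = theta - learning_rate * cost_slope # 3. update using the update rule
    return theta                                   # 4. value that (approximately) minimizes the cost

# Example: J(theta) = theta**2, whose derivative is 2*theta.
print(gradient_descent(lambda theta: 2 * theta, initial_value=5.0))  # close to 0.0
```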
Why do we need it?
Generally, we derive a formula that gives us the optimal values for our parameters. But this algorithm finds those values by itself! Interesting, isn't it?
Formula:
θ := θ − α · ∂J(θ)/∂θ, repeated until convergence, where α is the learning rate and J(θ) is the cost function.
Some Basic Rules For Derivation:
( A ) Scalar Multiple Rule:
d/dx [ c · f(x) ] = c · d/dx [ f(x) ]
( B ) Sum Rule:
d/dx [ f(x) + g(x) ] = d/dx [ f(x) ] + d/dx [ g(x) ]
( C ) Power Rule:
d/dx [ xⁿ ] = n · xⁿ⁻¹
( D ) Chain Rule:
d/dx [ f(g(x)) ] = f′(g(x)) · g′(x)
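These rules are all we need for the derivations below. As a quick sanity check, we can verify the power rule and the scalar multiple rule on the cost function used in the next section, assuming sympy is available:

```python
import sympy

theta = sympy.symbols("theta")
cost = theta ** 2                   # the cost function J(theta) used in the next section

print(sympy.diff(cost, theta))      # power rule: prints 2*theta
print(sympy.diff(3 * cost, theta))  # scalar multiple rule: prints 6*theta
```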
Let's have a look at various examples to understand it better.
Gradient Descent Minimization - Single Variable:
We're going to use gradient descent to find the θ that minimizes the cost. But let's forget the Mean Squared Error (MSE) cost function for a moment and look at gradient descent in general.
Normally, we would find the best values of our parameters analytically, using some algebraic simplification that directly gives us the minimized cost. Here, instead, we'll take some default or random values for our parameters and let our program iterate toward the minimized cost.
Let's Explore It In-Depth:
Let's take a very simple function to begin with: J(θ) = θ², and our goal is to find the value of θ which minimizes J(θ).
From our cost function, we can clearly say that it will be minimum for θ = 0, but it won't be so easy to draw such conclusions when working with more complex functions.
( A ) Cost function: We'll try to minimize the value of this function.
J(θ) = θ²
( B ) Goal: To minimize the cost function.
Minimize J(θ) over θ, i.e., find the θ that makes J(θ) as small as possible.
( C ) Update Function: Initially, we take a random value for our parameter, which is not optimal. To make it optimal, we update it at each iteration. The update function takes care of that.
θ := θ − α · dJ(θ)/dθ
( D ) Learning rate: The descent speed.
α; here we use α = 0.1, which is what makes each update shrink θ to 0.8 · θ in the table below.
( E ) Updating Parameters:
θ := θ − α · 2θ = θ − 0.1 · 2θ = 0.8 · θ
( F ) Table Generation:
Here we are starting with θ = 5.
Keep in mind that with our learning rate and cost function, each update works out to θ_new = 0.8 · θ (since θ − 0.1 · 2θ = 0.8θ).

Iteration | θ | Cost J(θ) = θ²
0 | 5 | 25
1 | 4 | 16
2 | 3.2 | 10.24
3 | 2.56 | 6.5536
4 | 2.048 | 4.1943
5 | 1.6384 | 2.6844
… | … | …
Here we can see that as θ decreases, the cost value also decreases. We just have to find the optimal value for it. To find the optimal value, we have to perform many iterations: the more iterations we run, the closer we get to the optimal value!
( G ) Graph: We can plot the graph of the above points.

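The table and graph above can be reproduced with a few lines of Python. This is just an illustrative sketch of the single-variable example (θ starting at 5, α = 0.1, cost J(θ) = θ²):

```python
# Reproduce the single-variable example: J(theta) = theta**2, starting at theta = 5.
theta = 5.0
alpha = 0.1    # learning rate corresponding to the 0.8 shrink factor above

for iteration in range(6):
    cost = theta ** 2
    print(f"iteration {iteration}: theta = {theta:.4f}, cost = {cost:.4f}")
    theta = theta - alpha * (2 * theta)   # theta := theta - alpha * dJ/dtheta = 0.8 * theta
```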
Cost Function Derivative:
Why does gradient descent use the derivative of the cost function? We want our cost function to be at its minimum, right? Minimizing the cost function simply gives us a lower error rate in predicting values. Ideally, we would set the derivative of the function to 0 and solve for the parameters. Here we do the same thing in spirit, but we start from a random value and drive the cost down iteratively.
The learning rate / ALPHA:
The learning rate gives us solid control over how large a step we take. Selecting the right learning rate is a very critical task. If the learning rate is too high, you might overstep the minimum and diverge. For example, in the above example, if we take alpha = 2, then each iteration takes us further away from the minimum. So we use small alpha values. The only concern with using a small learning rate is that we have to perform more iterations to reach the minimum cost value, which increases training time.
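To see this concretely, here is a small sketch comparing the two learning rates on J(θ) = θ² (the alpha = 2 case is the divergent one mentioned above):

```python
# Compare a small and an overly large learning rate on J(theta) = theta**2.
for alpha in (0.1, 2.0):
    theta = 5.0
    for _ in range(5):
        theta = theta - alpha * (2 * theta)   # gradient step
    print(f"alpha = {alpha}: theta after 5 steps = {theta}")
# alpha = 0.1 moves theta toward 0, while alpha = 2.0 makes |theta| grow (diverges).
```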
Convergence / Stopping gradient descent:
Note that in the above example, gradient descent will never exactly reach the minimum at θ = 0. Methods for deciding when to stop the iterations are beyond the scope of this article, but I can tell you that for assignments we can simply run a fixed number of iterations, like 100 or 1000.
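One common, simple criterion (shown here as an illustrative sketch, not the only option) is to stop when the update to θ becomes smaller than a tolerance, or after a maximum number of iterations:

```python
# Stop when the update becomes tiny or after a fixed number of iterations.
theta, alpha = 5.0, 0.1
tolerance, max_iterations = 1e-6, 1000

for iteration in range(max_iterations):
    step = alpha * (2 * theta)    # alpha * dJ/dtheta for J(theta) = theta**2
    theta = theta - step
    if abs(step) < tolerance:     # the updates are no longer changing theta much
        break

print(f"stopped after {iteration + 1} iterations, theta = {theta:.8f}")
```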
Gradient Descent - Multiple Variables:
Our ultimate goal is to find the parameters of the MSE function, which involves multiple variables. So here we will discuss a cost function that has 2 variables. Understanding this will help us greatly with the MSE cost function.
Let's take this function:

When there are multiple variables in the minimization objective, we have to define a separate update rule for each variable. With more than one parameter in our cost function, we have to use partial derivatives. Here I have simplified the partial derivative process. Let's have a look at this.
( A ) Cost Function:

( B ) Goal:
Minimize J(θ1, θ2) over both θ1 and θ2.
( C ) Update Rules:
θ1 := θ1 − α · ∂J(θ1, θ2)/∂θ1
θ2 := θ2 − α · ∂J(θ1, θ2)/∂θ2
( D ) Derivatives:


( E ) Update Values:

( F ) Learning Rate:

( G ) Table:
Starting with θ1 = 1, θ2 = 1, and then updating the values using the update functions.

( H ) Graph:

Here we can see that as we increase the number of iterations, our cost value goes down.
Note that when implementing the program in Python, the parameters must not be updated until we have computed the new values for both θ1 and θ2. We clearly don't want the new value of θ1 to be used while computing the update for θ2. The sketch below shows one way to keep the update simultaneous.
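Here is a minimal sketch of that simultaneous update. The cost function J(θ1, θ2) = θ1² + θ2² is only an assumed stand-in for the two-variable function above, chosen because its partial derivatives (2 · θ1 and 2 · θ2) are easy to check:

```python
# Simultaneous update of two parameters (assumed cost: J = theta1**2 + theta2**2).
theta1, theta2 = 1.0, 1.0
alpha = 0.1

for iteration in range(10):
    # Compute both partial derivatives with the *current* values first...
    grad1 = 2 * theta1    # dJ/d(theta1)
    grad2 = 2 * theta2    # dJ/d(theta2)
    # ...and only then update both parameters, so theta2 never sees the new theta1.
    theta1 = theta1 - alpha * grad1
    theta2 = theta2 - alpha * grad2
    cost = theta1 ** 2 + theta2 ** 2
    print(f"iteration {iteration + 1}: theta1 = {theta1:.4f}, theta2 = {theta2:.4f}, cost = {cost:.4f}")
```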
Gradient Descent For Mean Squared Error:
Now that we know how to perform gradient descent on an equation with multiple variables, we can return to looking at gradient descent on our MSE cost function.
Let's get started!
( A ) Hypothesis function:
hθ(x) = θ0 + θ1 · x
( B ) Cost function:
J(θ0, θ1) = (1/m) · Σ ( hθ(xᵢ) − yᵢ )²   (the sum runs over the m training examples, i = 1 … m)
( C ) Find the partial derivative of J(θ0, θ1) w.r.t. θ1:
∂J(θ0, θ1)/∂θ1 = ∂/∂θ1 [ (1/m) · Σ ( hθ(xᵢ) − yᵢ )² ]
( D ) Simplify a little:
∂J(θ0, θ1)/∂θ1 = ∂/∂θ1 [ (1/m) · Σ ( θ0 + θ1·xᵢ − yᵢ )² ]
( E ) Define a variable u:
Let u = hθ(xᵢ) − yᵢ, so that J(θ0, θ1) = (1/m) · Σ u².
( F ) Value of u:
u = θ0 + θ1·xᵢ − yᵢ
( G ) Finding partial derivative:
By the chain rule, ∂(u²)/∂θ1 = ( ∂(u²)/∂u ) · ( ∂u/∂θ1 ) = 2u · ∂u/∂θ1
and ∂u/∂θ1 = ∂( θ0 + θ1·xᵢ − yᵢ )/∂θ1 = xᵢ
( H ) Rewriting the equations:
∂(u²)/∂θ1 = 2u · xᵢ = 2 · ( θ0 + θ1·xᵢ − yᵢ ) · xᵢ
( I ) Merge all the calculated data:
∂J(θ0, θ1)/∂θ1 = (2/m) · Σ ( hθ(xᵢ) − yᵢ ) · xᵢ
( J ) Repeat the same process for the derivative of J(θ0, θ1) w.r.t. θ0:
∂J(θ0, θ1)/∂θ0 = (1/m) · Σ 2u · ∂u/∂θ0, and since ∂u/∂θ0 = ∂( θ0 + θ1·xᵢ − yᵢ )/∂θ0 = 1:
( K ) Simplified calculations:
∂J(θ0, θ1)/∂θ0 = (2/m) · Σ ( θ0 + θ1·xᵢ − yᵢ ) = (2/m) · Σ ( hθ(xᵢ) − yᵢ )
( L ) Combine all the calculated data:
∂J(θ0, θ1)/∂θ0 = (2/m) · Σ ( hθ(xᵢ) − yᵢ )
∂J(θ0, θ1)/∂θ1 = (2/m) · Σ ( hθ(xᵢ) − yᵢ ) · xᵢ
One Half Mean Squared Error:
We multiply our MSE cost function by 1/2, so that when we take the derivative, the 2s cancel out. Multiplying the cost function by a positive scalar does not affect the location of the minimum, so we can get away with this.
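To see the cancellation explicitly: with u = hθ(xᵢ) − yᵢ, the derivative of (1/2) · u² is (1/2) · 2u · (∂u/∂θ) = u · (∂u/∂θ), so the factor of 2 from the power rule disappears.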
Final:
( A ) Cost Function: One Half Mean Squared Error:
J(θ0, θ1) = (1/2m) · Σ ( hθ(xᵢ) − yᵢ )²
( B ) Goal:
Minimize J(θ0, θ1) over θ0 and θ1.
( C ) Update Rule:
θ0 := θ0 − α · ∂J(θ0, θ1)/∂θ0
θ1 := θ1 − α · ∂J(θ0, θ1)/∂θ1
(updating both simultaneously)
( D ) Derivatives:
∂J(θ0, θ1)/∂θ0 = (1/m) · Σ ( hθ(xᵢ) − yᵢ )
∂J(θ0, θ1)/∂θ1 = (1/m) · Σ ( hθ(xᵢ) − yᵢ ) · xᵢ
So, that's it. We finally made it!
Conclusion:
We are going to use the same method in various machine learning applications. At that point we won't go into this much depth; we'll just use the final formula. But it's always good to know how it's derived!
Final Formula:
Repeat until convergence:
θ0 := θ0 − α · (1/m) · Σ ( hθ(xᵢ) − yᵢ )
θ1 := θ1 − α · (1/m) · Σ ( hθ(xᵢ) − yᵢ ) · xᵢ
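Putting the final formula to work, here is a compact Python sketch of gradient descent for simple linear regression on a tiny made-up dataset. The data, learning rate, and iteration count are only illustrative:

```python
# Gradient descent for simple linear regression with the one-half MSE cost.
x = [1.0, 2.0, 3.0, 4.0]      # tiny illustrative dataset
y = [3.0, 5.0, 7.0, 9.0]      # generated from y = 2x + 1
m = len(x)

theta0, theta1 = 0.0, 0.0
alpha = 0.05

for _ in range(2000):
    errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]     # h(x_i) - y_i
    grad0 = sum(errors) / m                                          # dJ/d(theta0)
    grad1 = sum(e * xi for e, xi in zip(errors, x)) / m              # dJ/d(theta1)
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update

print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")   # should approach 1 and 2
```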
Is the concept lucid to you now? Please let me know by writing responses. If you enjoyed this article then hit the clap icon.
If you have any further questions, feel free to contact me: [email protected]