The Gradient Descent Algorithm and its Variants

Last Updated on October 25, 2022 by Editorial Team

Author(s): Towards AI Editorial Team

 

Originally published on Towards AI.

Image by Sara from Pixabay

Gradient Descent Algorithm with Code Examples in Python

Author(s): Pratik Shukla

"Educating the mind without educating the heart is no education at all." - Aristotle

The Gradient Descent Series of Blogs:

  1. The Gradient Descent Algorithm
  2. Mathematical Intuition behind the Gradient Descent Algorithm
  3. The Gradient Descent Algorithm & its Variants (You are here!)

Table of contents:

  1. Introduction
  2. Batch Gradient Descent (BGD)
  3. Stochastic Gradient Descent (SGD)
  4. Mini-Batch Gradient Descent (MBGD)
  5. Graph Comparison
  6. End Notes
  7. Resources
  8. References

Introduction:

Drumroll, please: Welcome to the finale of the Gradient Descent series! In this blog, we will dive deeper into the gradient descent algorithm. We will discuss all the fun flavors of the gradient descent algorithm along with their code examples in Python. We will also examine the differences between the algorithms based on the number of calculations performed in each one. We're leaving no stone unturned today, so we request that you run the Google Colab files as you read the document; seeing the code in action will give you a more precise understanding of the topic. Let's get into it!

Batch Gradient Descent:

Working of the Batch Gradient Descent (BGD) Algorithm

The Batch Gradient Descent (BGD) algorithm considers all the training examples in each iteration. If the dataset contains a large number of training examples and a large number of features, implementing the Batch Gradient Descent (BGD) algorithm becomes computationally expensive, so mind your budget! Let's take an example to understand it better.

Batch Gradient Descent (BGD):

Number of training examples per iteration = 1 million = 10^6
Number of iterations = 1000 = 10^3
Number of parameters to be trained = 10000 = 10^4
Total computations = 10^6 * 10^3 * 10^4 = 10^13

Now, let's see how the Batch Gradient Descent (BGD) algorithm is implemented.

Step 1:

First, we download the data file from the GitHub repository.

Step 2:

Next, we import the libraries required to read, manipulate, and visualize the data.

Step 3:

Next, we read the data file and print its first five rows.

Step 4:

Next, we divide the dataset into features and the target variable.

Dimensions: X = (200, 3) & Y = (200,)

Step 5:

To perform matrix calculations in further steps, we need to reshape the target variable.

Dimensions: X = (200, 3) & Y = (200, 1)
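The split and reshape in Steps 4 and 5 can be sketched with NumPy as follows. The array `data` is a synthetic stand-in (an assumption, since the article's data file is not reproduced here), with three feature columns followed by one target column.

```python
import numpy as np

# Synthetic stand-in for the data file (assumption): 200 rows,
# 3 feature columns followed by 1 target column.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))

# Step 4: separate the features from the target.
X = data[:, :3]
Y = data[:, 3]
print(X.shape, Y.shape)   # (200, 3) (200,)

# Step 5: turn the 1-D target into a column vector for matrix math.
Y = Y.reshape(-1, 1)
print(X.shape, Y.shape)   # (200, 3) (200, 1)
```

Keeping Y as a (200, 1) column means that subtracting it from a (200, 1) prediction stays element-wise instead of broadcasting to a (200, 200) matrix.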

Step 6:

Next, we normalize the dataset.

Dimensions: X = (200, 3) & Y = (200, 1)
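The text does not say which normalization is used, so the sketch below assumes z-score standardization (zero mean, unit standard deviation per column). The helper name `normalize` and the synthetic arrays are illustrative only.

```python
import numpy as np

def normalize(values):
    """Standardize each column to zero mean and unit standard deviation.

    Assumption: z-score scaling; the article's exact scheme may differ.
    """
    return (values - values.mean(axis=0)) / values.std(axis=0)

# Synthetic features and target with the same shapes as in the article.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 3))
Y = rng.uniform(0, 30, size=(200, 1))

X_norm = normalize(X)
Y_norm = normalize(Y)
print(X_norm.shape, Y_norm.shape)   # (200, 3) (200, 1)
```

Putting all columns on a comparable scale lets a single learning rate work reasonably well for every weight.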

Step 7:

Next, we get the initial values for the bias and weights matrices. We will use these values in the first iteration while performing forward propagation.

Dimensions: bias = (1, 1) & weights = (1, 3)
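A minimal sketch of the initialization step, assuming the starting values are drawn uniformly from [0, 1). The article does not state how the initial values are chosen, and `initialize` is a hypothetical helper name.

```python
import numpy as np

def initialize(n_features, seed=0):
    """Return a (1, 1) bias and a (1, n_features) weight matrix.

    Assumption: uniform random starting values in [0, 1).
    """
    rng = np.random.default_rng(seed)
    bias = rng.random((1, 1))
    weights = rng.random((1, n_features))
    return bias, weights

bias, weights = initialize(3)
print(bias.shape, weights.shape)   # (1, 1) (1, 3)
```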

Step 8:

Next, we perform the forward propagation step. This step is based on the following formula: predicted_value = bias + X * weights^T.

Dimensions: predicted_value = (1, 1)+(200, 3)*(3, 1) = (1, 1)+(200, 1) = (200, 1)
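The dimensions above correspond to a linear prediction, bias plus X times the transposed weights. A small sketch with made-up values (the function name `forward` is illustrative):

```python
import numpy as np

def forward(X, bias, weights):
    """Linear prediction: (1, 1) + (m, 3) @ (3, 1) -> (m, 1).

    The (1, 1) bias is broadcast across all m rows.
    """
    return bias + X @ weights.T

# Made-up inputs chosen so the result is easy to check by hand.
X = np.ones((200, 3))
bias = np.array([[0.5]])
weights = np.array([[1.0, 2.0, 3.0]])

predicted = forward(X, bias, weights)
print(predicted.shape)   # (200, 1)
print(predicted[0, 0])   # 6.5, i.e. 0.5 + 1 + 2 + 3
```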

Step 9:

Next, we calculate the cost associated with our prediction. This step is based on the following formula.

Dimensions: cost = scalar value
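The cost formula itself is not reproduced in the text. The sketch below assumes the common half mean squared error, J = (1 / 2m) * sum((predicted - Y)^2); if the original uses a different scaling, only the constant factor changes.

```python
import numpy as np

def compute_cost(predicted, Y):
    """Half mean squared error over m examples (assumed cost function)."""
    m = Y.shape[0]
    return float(np.sum((predicted - Y) ** 2) / (2 * m))

# Three toy examples: only the last prediction is off, by 2.
predicted = np.array([[1.0], [2.0], [3.0]])
Y = np.array([[1.0], [2.0], [5.0]])

cost = compute_cost(predicted, Y)
print(cost)   # 4 / 6, roughly 0.667
```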

Step 10:

Next, we update the parameter values of weights and bias using the gradient descent algorithm. This step is based on the following formulas. Please note that the reason why we're not summing over the values of the weights is that our weight matrix is not a 1*1 matrix.

Dimensions: db = sum(200, 1) = (1, 1)

Dimensions: dw = (1, 200) * (200, 3) = (1, 3)

Dimensions: bias = (1, 1) & weights = (1, 3)
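One way to realize these dimensions, assuming a half mean squared error cost so that the gradients are db = (1/m) * sum(error) and dw = (1/m) * error^T X. The function name and the toy inputs are illustrative, not the article's code.

```python
import numpy as np

def update_parameters(X, Y, predicted, bias, weights, learning_rate):
    """One gradient descent update (assumes a half-MSE cost)."""
    m = Y.shape[0]
    error = predicted - Y            # (m, 1)
    db = np.sum(error) / m           # all errors summed into one value
    dw = (error.T @ X) / m           # (1, m) @ (m, 3) -> (1, 3)
    bias = bias - learning_rate * db
    weights = weights - learning_rate * dw
    return bias, weights

# Toy check with two examples and all-zero starting parameters.
X = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
Y = np.array([[1.0], [2.0]])
bias = np.zeros((1, 1))
weights = np.zeros((1, 3))
predicted = bias + X @ weights.T

bias, weights = update_parameters(X, Y, predicted, bias, weights, 0.1)
print(bias)      # [[0.15]]
print(weights)   # approximately [[0.05 0.1 0.]]
```

The matrix product in `dw` already sums over the examples, one column per weight, which is why no explicit sum over the weights is needed.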

Step 11:

Next, we use all the functions we just defined to run the gradient descent algorithm. We also create an empty list called cost_list to store the cost values of all the iterations. We will use this list to plot a graph in further steps.
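Putting the pieces into a single loop might look like the sketch below. It is a self-contained approximation rather than the article's code: it assumes a half-MSE cost and random initial parameters, and it runs on synthetic data.

```python
import numpy as np

def batch_gradient_descent(X, Y, learning_rate=0.01, iterations=200, seed=0):
    """Full-batch gradient descent for linear regression (assumed half-MSE cost)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    bias = rng.random((1, 1))
    weights = rng.random((1, n))
    cost_list = []
    for _ in range(iterations):
        error = bias + X @ weights.T - Y                 # uses all m examples
        cost_list.append(float(np.sum(error ** 2) / (2 * m)))
        bias = bias - learning_rate * np.sum(error) / m
        weights = weights - learning_rate * (error.T @ X) / m
    return bias, weights, cost_list

# Synthetic, roughly linear data with the article's shapes (assumption).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
Y = 1.0 + X @ np.array([[2.0], [-1.0], [0.5]]) + 0.1 * data_rng.normal(size=(200, 1))

bias, weights, cost_list = batch_gradient_descent(X, Y)
print(len(cost_list), cost_list[0] > cost_list[-1])
```

Because every iteration uses the whole dataset, the recorded cost changes smoothly from one iteration to the next.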

Step 12:

Next, we call the function to get the final results. Please note that we run the entire code for 200 iterations, with a learning rate of 0.01.

Step 13:

Next, we plot the graph of iterations vs. cost.

Step 14:

Next, we print the final weight values after all the iterations are done.

Step 15:

Next, we print the final bias value after all the iterations are done.

Step 16:

Next, we plot two graphs with different learning rates to see the effect of the learning rate on optimization. In the following graph, we can see that the run with the higher learning rate (0.01) converges faster than the run with the lower learning rate (0.001). As we learned in Part 1 of the Gradient Descent series, this is because the lower learning rate takes smaller steps.
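The effect of the step size can be reproduced on a toy problem. The sketch below minimizes f(w) = w^2 (an illustrative function, not the article's regression cost) for 200 iterations with both learning rates.

```python
def descend(learning_rate, iterations=200, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    for _ in range(iterations):
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient
    return w

fast = descend(0.01)    # each step shrinks w by a factor of 0.98
slow = descend(0.001)   # each step shrinks w by a factor of 0.998
print(fast, slow)
```

After the same number of iterations, the run with the larger learning rate ends much closer to the minimum at w = 0. This holds only while the learning rate stays small enough for the updates not to overshoot.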

Step 17:

Let's put it all together.

Number of Calculations:

Now, let's count the number of calculations performed in the batch gradient descent algorithm.

Bias: (training examples) x (iterations) x (parameters) = 200 * 200 * 1 = 40000

Weights: (training examples) x (iterations) x (parameters) = 200 * 200 * 3 = 120000

Stochastic Gradient Descent:

Working of the Stochastic Gradient Descent (SGD) Algorithm

In the batch gradient descent algorithm, we consider all the training examples in every iteration of the algorithm. But if our dataset has a large number of training examples and/or features, it gets computationally expensive to calculate the parameter values. A machine learning model generally becomes more accurate as we provide it with more training examples. But as the size of the dataset increases, the computations associated with it also increase. Let's take an example to understand this better.

Batch Gradient Descent (BGD):

Number of training examples per iteration = 1 million = 10^6
Number of iterations = 1000 = 10^3
Number of parameters to be trained = 10000 = 10^4
Total computations = 10^6 * 10^3 * 10^4 = 10^13

Now, if we look at the above number, it does not give us excellent vibes! Using the Batch Gradient Descent algorithm does not seem efficient here. To deal with this problem, we use the Stochastic Gradient Descent (SGD) algorithm. The word "stochastic" means random. So, instead of performing calculations on all the training examples of a dataset, we take one random example and perform the calculations on that. Sounds interesting, doesn't it? We consider just one training example per iteration in the Stochastic Gradient Descent (SGD) algorithm. Let's see how effective Stochastic Gradient Descent is based on its calculations.

Stochastic Gradient Descent (SGD):

Number of training examples per iteration = 1
Number of iterations = 1000 = 10^3
Number of parameters to be trained = 10000 = 10^4
Total computations = 1 * 10^3 * 10^4 = 10^7

Comparison with Batch Gradient Descent:

Total computations in BGD = 10^13
Total computations in SGD = 10^7
Evaluation: SGD is 10^6 times faster than BGD in this example, measured by the number of computations.

Note: Please be aware that the cost might not go down at every iteration, because we use just one random training example per iteration, so don't worry if it fluctuates. However, the cost will gradually decrease as we perform more and more iterations.

Now, let's see how the Stochastic Gradient Descent (SGD) algorithm is implemented.

Step 1:

First, we download the data file from the GitHub repository.

Step 2:

Next, we import the libraries required to read, manipulate, and visualize the data.

Step 3:

Next, we read the data file and print its first five rows.

Step 4:

Next, we divide the dataset into features and the target variable.

Dimensions: X = (200, 3) & Y = (200,)

Step 5:

To perform matrix calculations in further steps, we need to reshape the target variable.

Dimensions: X = (200, 3) & Y = (200, 1)

Step 6:

Next, we normalize the dataset.

Dimensions: X = (200, 3) & Y = (200, 1)

Step 7:

Next, we get the initial values for the bias and weights matrices. We will use these values in the first iteration while performing forward propagation.

Dimensions: bias = (1, 1) & weights = (1, 3)

Step 8:

Next, we perform the forward propagation step. This step is based on the following formula: predicted_value = bias + x * weights^T, where x is the single training example picked for this iteration.

Dimensions (one training example): predicted_value = (1, 1)+(1, 3)*(3, 1) = (1, 1)+(1, 1) = (1, 1)

Step 9:

Next, we calculate the cost associated with our prediction. The formula used for this step is as follows. Because there is only one error value, we don't need to add up the cost values or divide by the size of the dataset.

Dimensions: cost = scalar value

Step 10:

Next, we update the parameter values of weights and bias using the gradient descent algorithm. This step is based on the following formulas. Please note that the reason why we are not summing over the values of the weights is that our weight matrix is not a 1*1 matrix. Also, in this case, since we have only one training example, we don't need to perform the summation over all the examples. The updated formula is given as follows.

Dimensions: db = (1, 1)

Dimensions (one training example): dw = (1, 1) * (1, 3) = (1, 3)

Dimensions: bias = (1, 1) & weights = (1, 3)
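A single stochastic update can be sketched as follows, assuming a half squared-error cost on the one sampled example. The function name `sgd_step` and the one-row dataset are illustrative; the one-row dataset simply makes the random pick predictable.

```python
import numpy as np

def sgd_step(X, Y, bias, weights, learning_rate, rng):
    """One SGD update using a single randomly chosen example."""
    i = rng.integers(X.shape[0])              # random row index
    x_i, y_i = X[i:i + 1], Y[i:i + 1]         # shapes (1, 3) and (1, 1)
    error = bias + x_i @ weights.T - y_i      # (1, 1)
    cost = float(error[0, 0] ** 2 / 2)        # assumed half squared error
    db = error                                # (1, 1), nothing to sum
    dw = error @ x_i                          # (1, 1) @ (1, 3) -> (1, 3)
    return bias - learning_rate * db, weights - learning_rate * dw, cost

# With only one row in the dataset, the sampled index is always 0.
X = np.array([[1.0, 2.0, 3.0]])
Y = np.array([[10.0]])
bias, weights, cost = sgd_step(X, Y, np.zeros((1, 1)), np.zeros((1, 3)),
                               0.1, np.random.default_rng(0))
print(cost)      # 50.0
print(bias)      # [[1.]]
print(weights)   # [[1. 2. 3.]]
```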

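Step 11:

Next, we use all the functions we just defined to run the stochastic gradient descent algorithm. As in the batch version, we store the cost value of every iteration in a list so that we can plot it in further steps.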
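The complete stochastic loop might look like the sketch below. It is an approximation under stated assumptions (half squared-error cost, random initial parameters, synthetic data), not the article's original code.

```python
import numpy as np

def stochastic_gradient_descent(X, Y, learning_rate=0.01, iterations=200, seed=0):
    """SGD for linear regression: one random example per iteration."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    bias = rng.random((1, 1))
    weights = rng.random((1, n))
    cost_list = []
    for _ in range(iterations):
        i = rng.integers(m)                          # pick one example
        x_i, y_i = X[i:i + 1], Y[i:i + 1]
        error = bias + x_i @ weights.T - y_i         # (1, 1)
        cost_list.append(float(error[0, 0] ** 2 / 2))
        bias = bias - learning_rate * error
        weights = weights - learning_rate * (error @ x_i)
    return bias, weights, cost_list

# Synthetic, roughly linear data with the article's shapes (assumption).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
Y = 1.0 + X @ np.array([[2.0], [-1.0], [0.5]]) + 0.1 * data_rng.normal(size=(200, 1))

bias, weights, cost_list = stochastic_gradient_descent(X, Y)
print(np.mean(cost_list[:50]), np.mean(cost_list[-50:]))
```

Individual entries of `cost_list` jump around because each one comes from a different example, so the two printed averages are a better indication of the trend than any single value.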

Step 12:

Next, we call the function to get the final results. Please note that we run the entire code for 200 iterations, with a learning rate of 0.01.

Step 13:

Next, we print the final weight values after all the iterations are done.

Step 14:

Next, we print the final bias value after all the iterations are done.

Step 15:

Next, we plot the graph of iterations vs. cost.

Step 16:

Next, we plot two graphs with different learning rates to see the effect of the learning rate on optimization. In the following graph, we can see that the run with the higher learning rate (0.01) converges faster than the run with the lower learning rate (0.001). Again, this is because the lower learning rate takes smaller steps.

Step 17:

Putting it all together.

Calculations:

Now, let's count the number of calculations performed in implementing the stochastic gradient descent algorithm.

Bias: (training examples) x (iterations) x (parameters) = 1 * 200 * 1 = 200

Weights: (training examples) x (iterations) x (parameters) = 1 * 200 * 3 = 600

Mini-Batch Gradient Descent Algorithm:

Working of the Mini-Batch Gradient Descent (MBGD) Algorithm

In the Batch Gradient Descent (BGD) algorithm, we consider all the training examples in every iteration of the algorithm. In the Stochastic Gradient Descent (SGD) algorithm, however, we consider only one random training example per iteration. In the Mini-Batch Gradient Descent (MBGD) algorithm, we consider a random subset of the training examples in each iteration. Because each update averages over several examples, it is less noisy than an SGD update, so we typically end up closer to the global minimum. However, MBGD can still get stuck in local minima. Let's take an example to understand this better.

Batch Gradient Descent (BGD):

Number of training examples per iteration = 1 million = 10^6
Number of iterations = 1000 = 10^3
Number of parameters to be trained = 10000 = 10^4
Total computations = 10^6 * 10^3 * 10^4 = 10^13

Stochastic Gradient Descent (SGD):

Number of training examples per iteration = 1
Number of iterations = 1000 = 10^3
Number of parameters to be trained = 10000 = 10^4
Total computations = 1 * 10^3 * 10^4 = 10^7

Mini-Batch Gradient Descent (MBGD):

Number of training examples per iteration = 100 = 10^2
→ Here, we are considering 10^2 training examples out of 10^6.
Number of iterations = 1000 = 10^3
Number of parameters to be trained = 10000 = 10^4
Total computations = 10^2 * 10^3 * 10^4 = 10^9

Comparison with Batch Gradient Descent (BGD):

Total computations in BGD = 10^13
Total computations in MBGD = 10^9

Evaluation: MBGD is 10^4 times faster than BGD in this example.

Comparison with Stochastic Gradient Descent (SGD):

Total computations in SGD = 10^7
Total computations in MBGD = 10^9

Evaluation: SGD is 10^2 times faster than MBGD in this example.

Comparison of BGD, SGD, and MBGD:

Total computations in BGD = 10^13
Total computations in SGD = 10^7
Total computations in MBGD = 10^9

Evaluation: In terms of speed (fewest computations first), SGD > MBGD > BGD
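The counts above follow directly from multiplying the three quantities. A small check in Python, where `total_computations` is just an illustrative helper name:

```python
def total_computations(examples_per_iteration, iterations, parameters):
    """Examples per iteration x iterations x parameters, as counted in the text."""
    return examples_per_iteration * iterations * parameters

bgd = total_computations(10**6, 10**3, 10**4)    # all 1 million examples
sgd = total_computations(1, 10**3, 10**4)        # one example
mbgd = total_computations(10**2, 10**3, 10**4)   # mini-batch of 100

print(bgd, sgd, mbgd)     # 10000000000000 10000000 1000000000
print(bgd // sgd)         # 1000000 -> SGD needs 10^6 times fewer computations than BGD
print(bgd // mbgd)        # 10000   -> MBGD needs 10^4 times fewer than BGD
print(mbgd // sgd)        # 100     -> SGD needs 10^2 times fewer than MBGD
```

Note that this only counts arithmetic per run; it says nothing about how many iterations each variant needs to reach a good solution.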

Note: Please be aware that the cost might not go down at every iteration, because we take a random sample of the training examples in each iteration. However, the cost will gradually decrease as we perform more and more iterations.

Now, let's see how the Mini-Batch Gradient Descent (MBGD) algorithm is implemented in practice.

Step 1:

First, we download the data file from the GitHub repository.

Step 2:

Next, we import the libraries required to read, manipulate, and visualize the data.

Step 3:

Next, we read the data file and print its first five rows.

Step 4:

Next, we divide the dataset into features and the target variable.

Dimensions: X = (200, 3) & Y = (200,)

Step 5:

To perform matrix calculations in further steps, we need to reshape the target variable.

Dimensions: X = (200, 3) & Y = (200, 1)

Step 6:

Next, we normalize the dataset.

Dimensions: X = (200, 3) & Y = (200, 1)

Step 7:

Next, we get the initial values for the bias and weights matrices. We will use these values in the first iteration while performing forward propagation.

Dimensions: bias = (1, 1) & weights = (1, 3)

Step 8:

Next, we perform the forward propagation step. This step is based on the following formula: predicted_value = bias + X * weights^T, where X holds only the examples in the current mini-batch.

Dimensions (mini-batch of 20 examples): predicted_value = (1, 1)+(20, 3)*(3, 1) = (1, 1)+(20, 1) = (20, 1)

Step 9:

Next, we calculate the cost associated with our prediction. This step is based on the following formula.

Dimensions: cost = scalar value

Step 10:

Next, we update the parameter values of weights and bias using the gradient descent algorithm. This step is based on the following formulas. Please note that the reason why we are not summing over the values of the weights is that our weight matrix is not a 1*1 matrix.

Dimensions (mini-batch of 20 examples): db = sum(20, 1) = (1, 1)

Dimensions (mini-batch of 20 examples): dw = (1, 20) * (20, 3) = (1, 3)

Dimensions: bias = (1, 1) & weights = (1, 3)

Step 11:

Next, we use all the functions we just defined to run the gradient descent algorithm. We also create an empty list called cost_list to store the cost values of all the iterations. We will use this list to plot a graph in further steps.
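A sketch of the mini-batch loop, assuming a batch size of 20 (the number used in the calculation count for this section), a half-MSE cost, random initial parameters, and synthetic data. Drawing each mini-batch without replacement via `rng.choice` is one reasonable choice, not necessarily the article's.

```python
import numpy as np

def mini_batch_gradient_descent(X, Y, learning_rate=0.01, iterations=200,
                                batch_size=20, seed=0):
    """Mini-batch gradient descent: a random subset of rows per iteration."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    bias = rng.random((1, 1))
    weights = rng.random((1, n))
    cost_list = []
    for _ in range(iterations):
        idx = rng.choice(m, size=batch_size, replace=False)   # random subset
        X_batch, Y_batch = X[idx], Y[idx]
        error = bias + X_batch @ weights.T - Y_batch          # (batch_size, 1)
        cost_list.append(float(np.sum(error ** 2) / (2 * batch_size)))
        bias = bias - learning_rate * np.sum(error) / batch_size
        weights = weights - learning_rate * (error.T @ X_batch) / batch_size
    return bias, weights, cost_list

# Synthetic, roughly linear data with the article's shapes (assumption).
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
Y = 1.0 + X @ np.array([[2.0], [-1.0], [0.5]]) + 0.1 * data_rng.normal(size=(200, 1))

bias, weights, cost_list = mini_batch_gradient_descent(X, Y)
print(np.mean(cost_list[:50]), np.mean(cost_list[-50:]))
```

Setting `batch_size` to 1 gives stochastic behavior, and setting it to the dataset size gives full-batch behavior, so the batch size is the single knob that separates the three variants.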

Step 12:

Next, we call the function to get the final results. Please note that we run the entire code for 200 iterations, with a learning rate of 0.01.

Step 13:

Next, we print the final weight values after all the iterations are done.

Step 14:

Next, we print the final bias value after all the iterations are done.

Step 15:

Next, we plot the graph of iterations vs. cost.

Step 16:

Next, we plot two graphs with different learning rates to see the effect of the learning rate on optimization. In the following graph, we can see that the run with the higher learning rate (0.01) converges faster than the run with the lower learning rate (0.001). The reason is that the lower learning rate takes smaller steps.

Step 17:

Putting it all together.

Calculations:

Now, let's count the number of calculations performed in implementing the mini-batch gradient descent algorithm.

Bias: (training examples) x (iterations) x (parameters) = 20 * 200 * 1 = 4000

Weights: (training examples) x (iterations) x (parameters) = 20 * 200 * 3 = 12000

Graph comparisons:

Comparison of Batch, Stochastic, and Mini Batch Gradient Descent Algorithm

End Notes:

And just like that, we're at the end of the Gradient Descent series! In this installment, we went deep into the code to look at how three of the major types of gradient descent algorithms perform next to each other, summed up by these handy notes:

1. Batch Gradient Descent
Accuracy → High
Time → More

2. Stochastic Gradient Descent
Accuracy → Low
Time → Less

3. Mini-Batch Gradient Descent
Accuracy → Moderate
Time → Moderate

We hope you enjoyed this series and learned something new, no matter your starting point or machine learning background. Knowing this essential algorithm and its variants will likely prove valuable as you continue on your AI journey and understand more about both the technical and grand aspects of this incredible technology. Keep an eye out for other blogs offering even more machine learning lessons, and stay curious!


Resources:

  1. Batch Gradient Descent: Google Colab, GitHub
  2. Stochastic Gradient Descent: Google Colab, GitHub
  3. Mini-Batch Gradient Descent: Google Colab, GitHub

Citation:

For attribution in academic contexts, please cite this work as:

Shukla, et al., "The Gradient Descent Algorithm & its Variants", Towards AI, 2022

BibTex Citation:

@article{pratik_2022, 
 title={The Gradient Descent Algorithm & its Variants}, 
 url={https://towardsai.net/neural-networks-with-python}, 
 journal={Towards AI}, 
 publisher={Towards AI Co.}, 
 author={Pratik, Shukla},
 editor={Lauren, Keegan},  
 year={2022}, 
 month={Oct}
}

References:

  1. Gradient descent, Wikipedia

