The Gradient Descent Algorithm and its Variants

Last Updated on October 25, 2022 by Editorial Team

Author(s): Towards AI Editorial Team

Originally published on Towards AI.

The Gradient Descent Algorithm and its Variants — Image by Sara from Pixabay

Gradient Descent Algorithm with Code Examples in Python

Author(s): Pratik Shukla

“Educating the mind without educating the heart is no education at all.” ― Aristotle


Introduction:

Drumroll, please: Welcome to the finale of the Gradient Descent series! In this blog, we will dive deeper into the gradient descent algorithm. We will discuss all the fun flavors of the gradient descent algorithm along with their code examples in Python. We will also compare the algorithms based on the number of calculations each one performs. We’re leaving no stone unturned today, so we recommend running the Google Colab files as you read; seeing the code in action will give you a more precise understanding of the topic. Let’s get into it!

Batch Gradient Descent:

The Batch Gradient Descent (BGD) algorithm considers all the training examples in each iteration. If the dataset contains a large number of training examples and a large number of features, implementing the Batch Gradient Descent (BGD) algorithm becomes computationally expensive — so mind your budget! Let’s take an example to understand it in a better way.

Batch Gradient Descent (BGD):

Number of training examples per iteration = 1 million = 10⁶
Number of iterations = 1000 = 10³
Number of parameters to be trained = 10000 = 10⁴
Total computations = 10⁶ * 10³ * 10⁴ = 10¹³

Now, let’s see how the Batch Gradient Descent (BGD) algorithm is implemented.

1. Step — 1:

First, we are downloading the data file from the GitHub repository.

2. Step — 2:

Next, we import some required libraries to read, manipulate, and visualize the data.

3. Step — 3:

Next, we are reading the data file, and then printing the first five rows of it.

4. Step — 4:

Next, we are dividing the dataset into features and target variables.

Dimensions: X = (200, 3) & Y = (200, )

5. Step — 5:

To perform matrix calculations in further steps, we need to reshape the target variable.

Dimensions: X = (200, 3) & Y = (200, 1)
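Steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not necessarily the notebook's exact code: the array `data` is a synthetic stand-in for the real data file, and we assume the three feature columns come first and the target column comes last.

```python
import numpy as np

# Hypothetical stand-in for the article's data file: 200 rows with
# 3 feature columns followed by 1 target column (values are synthetic).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))

# Step 4: separate the features from the target variable.
X = data[:, :3]  # shape (200, 3)
Y = data[:, 3]   # shape (200,)

# Step 5: turn the target into a column vector for the matrix math.
Y = Y.reshape(-1, 1)  # shape (200, 1)

print(X.shape, Y.shape)  # (200, 3) (200, 1)
```

The reshape matters because NumPy broadcasting would turn a (200, 1) prediction minus a (200,) target into a (200, 200) matrix instead of a (200, 1) column of errors.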

6. Step — 6:

Next, we are normalizing the dataset.

Dimensions: X = (200, 3) & Y = (200, 1)
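The text does not say which normalization is used, so the sketch below assumes z-score standardization (subtract the column mean, divide by the column standard deviation). The helper name `standardize` and the synthetic data are ours, purely for illustration.

```python
import numpy as np

# Synthetic features and target on deliberately different scales.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 3))
Y = rng.uniform(0, 30, size=(200, 1))

def standardize(a):
    """Rescale each column to zero mean and unit standard deviation."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

X_norm = standardize(X)
Y_norm = standardize(Y)

# Normalizing changes the values, not the dimensions.
print(X_norm.shape, Y_norm.shape)  # (200, 3) (200, 1)
```

Putting every column on a comparable scale lets a single learning rate work reasonably well for all the weights.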

7. Step — 7:

Next, we are getting the initial values for the bias and weights matrices. We will use these values in the first iteration while performing forward propagation.

Dimensions: bias = (1, 1) & weights = (1, 3)
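One possible way to create parameters with these dimensions is shown below. The choice of a zero bias and small random weights is an assumption on our part, as is the function name `initialize_parameters`; any starting values with the right shapes would work for this convex problem.

```python
import numpy as np

def initialize_parameters(n_features, seed=0):
    """Return a (1, 1) bias and a (1, n_features) weight matrix."""
    rng = np.random.default_rng(seed)
    bias = np.zeros((1, 1))                                  # assumed: start at zero
    weights = rng.normal(scale=0.01, size=(1, n_features))   # assumed: small random values
    return bias, weights

bias, weights = initialize_parameters(n_features=3)
print(bias.shape, weights.shape)  # (1, 1) (1, 3)
```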

8. Step — 8:

Next, we perform the forward propagation step. This step is based on the following formula.

Dimensions: predicted_value = (1, 1)+(200, 3)*(3,1) = (1, 1)+(200, 1) = (200, 1)
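Reading the dimensions above, the prediction appears to be the bias plus the features multiplied by the transposed weights. Here is a minimal sketch under that assumption, with synthetic inputs and a function name of our own choosing:

```python
import numpy as np

def forward_propagation(X, weights, bias):
    """Linear prediction: (200, 3) @ (3, 1) gives (200, 1).

    The (1, 1) bias is broadcast across all 200 rows.
    """
    return bias + X @ weights.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # synthetic, already-normalized features
weights = np.array([[0.5, -0.2, 0.1]])  # illustrative values, shape (1, 3)
bias = np.array([[1.0]])                # illustrative value, shape (1, 1)

predicted_value = forward_propagation(X, weights, bias)
print(predicted_value.shape)  # (200, 1)
```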

9. Step — 9:

Next, we are going to calculate the cost associated with our prediction. This step is based on the following formula.

Dimensions: cost = scalar value
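As an illustration, the sketch below assumes the cost is the mean squared error with the conventional factor of 1/(2m), where m is the number of training examples. That choice is consistent with the gradient dimensions used in the next step, but treat it as our assumption rather than the notebook's exact code.

```python
import numpy as np

def compute_cost(predicted_value, Y):
    """Half the mean squared error; returns a single scalar."""
    m = Y.shape[0]
    return float(np.sum((predicted_value - Y) ** 2) / (2 * m))

# Tiny hand-checkable example: only the last prediction is off, by 2.
Y = np.array([[1.0], [2.0], [3.0]])
predicted = np.array([[1.0], [2.0], [5.0]])

cost = compute_cost(predicted, Y)
print(cost)  # (0 + 0 + 4) / (2 * 3), roughly 0.667
```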

10. Step — 10:

Next, we update the parameter values of weights and bias using the gradient descent algorithm. This step is based on the following formulas. Please note that we sum the errors to get db because the bias is a single 1*1 value, but we do not sum dw down to a single number: the weight matrix is not a 1*1 matrix, so we keep one gradient per weight.

Dimensions: db = sum(200, 1) = (1, 1)

Dimensions: dw = (1, 200) * (200, 3) = (1, 3)

Dimensions: bias = (1, 1) & weights = (1, 3)
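The dimension bookkeeping above can be sketched as follows. We assume the gradients are those of the half mean squared error, so both are averaged over the m training examples; the function name and the synthetic data are illustrative only.

```python
import numpy as np

def update_parameters(X, Y, predicted_value, weights, bias, learning_rate):
    """One gradient descent step for a linear model."""
    m = Y.shape[0]
    error = predicted_value - Y                       # (200, 1)
    db = np.sum(error, axis=0, keepdims=True) / m     # sum(200, 1) -> (1, 1)
    dw = (error.T @ X) / m                            # (1, 200) @ (200, 3) -> (1, 3)
    bias = bias - learning_rate * db
    weights = weights - learning_rate * dw
    return weights, bias

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([[2.0], [-1.0], [0.5]]) + 0.5        # synthetic linear target
weights, bias = np.zeros((1, 3)), np.zeros((1, 1))

predicted = bias + X @ weights.T
new_weights, new_bias = update_parameters(X, Y, predicted, weights, bias, 0.01)
print(new_bias.shape, new_weights.shape)  # (1, 1) (1, 3)
```

The matrix product in `dw` already adds up the contribution of every training example, which is why no explicit sum is needed for the weights.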

11. Step — 11:

Next, we are going to use all the functions we just defined to run the gradient descent algorithm. We are also creating an empty list called cost_list to store the cost values of all the iterations. This list will be put to use to plot a graph in further steps.

12. Step — 12:

Next, we are actually calling the function to get the final results. Please note that we are running the entire code for 200 iterations. Also, here we have specified the learning rate of 0.01.
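Putting Steps 11 and 12 together, a compact version of the training loop might look like the sketch below. It assumes a half mean squared error cost, zero-initialized parameters, and synthetic data in place of the real file; the function name `batch_gradient_descent` is ours.

```python
import numpy as np

def batch_gradient_descent(X, Y, iterations, learning_rate):
    """Run batch gradient descent and record the cost at every iteration."""
    m, n = X.shape
    bias = np.zeros((1, 1))
    weights = np.zeros((1, n))
    cost_list = []  # one cost value per iteration, used later for plotting
    for _ in range(iterations):
        error = bias + X @ weights.T - Y              # uses all m examples
        cost_list.append(float(np.sum(error ** 2) / (2 * m)))
        bias = bias - learning_rate * np.sum(error, axis=0, keepdims=True) / m
        weights = weights - learning_rate * (error.T @ X) / m
    return weights, bias, cost_list

# Synthetic stand-in for the normalized dataset (200 examples, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([[2.0], [-1.0], [0.5]]) + 0.5 + rng.normal(scale=0.1, size=(200, 1))

weights, bias, cost_list = batch_gradient_descent(X, Y, iterations=200, learning_rate=0.01)
print(len(cost_list), cost_list[0] > cost_list[-1])  # 200 True
```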

13. Step — 13:

Next, we are plotting the graph of iterations vs. cost.

14. Step — 14:

Next, we are printing the final weights values after all the iterations are done.

15. Step — 15:

Next, we print the final bias value after all the iterations are done.

16. Step — 16:

Next, we plot two graphs with different learning rates to see the effect of the learning rate on optimization. In the following graph, we can see that the curve with the higher learning rate (0.01) converges faster than the curve with the lower learning rate (0.001). As we learned in Part 1 of the Gradient Descent series, this is because a lower learning rate takes smaller steps toward the minimum in each iteration.
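The same effect can be checked numerically without a plot. The sketch below runs a small batch gradient descent loop twice on synthetic data (an assumption, since the real dataset is not reproduced here) and compares the cost after 200 iterations for the two learning rates.

```python
import numpy as np

def final_cost(X, Y, iterations, learning_rate):
    """Train from zero-initialized parameters and return the last cost."""
    m, n = X.shape
    bias, weights = np.zeros((1, 1)), np.zeros((1, n))
    for _ in range(iterations):
        error = bias + X @ weights.T - Y
        bias = bias - learning_rate * np.sum(error, axis=0, keepdims=True) / m
        weights = weights - learning_rate * (error.T @ X) / m
    error = bias + X @ weights.T - Y
    return float(np.sum(error ** 2) / (2 * m))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([[2.0], [-1.0], [0.5]]) + 0.5 + rng.normal(scale=0.1, size=(200, 1))

cost_fast = final_cost(X, Y, iterations=200, learning_rate=0.01)
cost_slow = final_cost(X, Y, iterations=200, learning_rate=0.001)
print(cost_fast < cost_slow)  # True: the larger step size gets further in 200 iterations
```

Keep in mind that this only holds while the learning rate is small enough to be stable; a learning rate that is too large can overshoot the minimum and make the cost grow.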

17. Step — 17:

Let’s put it all together.

Number of Calculations:

Now, let’s count the number of calculations performed in the batch gradient descent algorithm.

Bias: (training examples) x (iterations) x (parameters) = 200 * 200 * 1 = 40000

Weights: (training examples) x (iterations) x (parameters) = 200 * 200 * 3 = 120000
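The same counts, written out as a quick check (the variable names are ours):

```python
training_examples = 200
iterations = 200

# One bias parameter and three weight parameters, as in our dataset.
bias_calculations = training_examples * iterations * 1
weight_calculations = training_examples * iterations * 3

print(bias_calculations, weight_calculations)  # 40000 120000
```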

Stochastic Gradient Descent

In the batch gradient descent algorithm, we consider all the training examples in every iteration of the algorithm. But if our dataset has a large number of training examples and/or features, calculating the parameter values gets computationally expensive. A machine learning model generally becomes more accurate when we provide it with more training examples, but as the size of the dataset increases, so do the computations associated with it. Let’s take an example to understand this in a better way.

Batch Gradient Descent (BGD)

Number of training examples per iteration = 1 million = 10⁶
Number of iterations = 1000 = 10³
Number of parameters to be trained = 10000 = 10⁴
Total computations = 10⁶ * 10³ * 10⁴ = 10¹³

Now, if we look at the above number, it does not give us excellent vibes! Using the Batch Gradient Descent algorithm does not seem efficient here. To deal with this problem, we use the Stochastic Gradient Descent (SGD) algorithm. The word “Stochastic” means random. So, instead of performing calculations on all the training examples of a dataset, we pick one random example per iteration and perform the calculations on that alone. Sounds interesting, doesn’t it? Let’s see how effective Stochastic Gradient Descent is based on its calculations.

Stochastic Gradient Descent (SGD):

Number of training examples per iteration = 1
Number of iterations = 1000 = 10³
Number of parameters to be trained = 10000 = 10⁴
Total computations = 1 * 10³ * 10⁴ = 10⁷

Comparison with Batch Gradient Descent:

Total computations in BGD = 10¹³
Total computations in SGD = 10⁷
Evaluation: SGD performs 10⁶ times fewer computations than BGD in this example.
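This back-of-the-envelope comparison is easy to reproduce. The helper `total_computations` below is a hypothetical name for the simple product used throughout this article:

```python
def total_computations(examples_per_iteration, iterations, parameters):
    """Rough cost model: examples x iterations x parameters."""
    return examples_per_iteration * iterations * parameters

bgd = total_computations(10**6, 10**3, 10**4)  # every example, every iteration
sgd = total_computations(1, 10**3, 10**4)      # one example per iteration

print(bgd, sgd, bgd // sgd)  # 10000000000000 10000000 1000000
```

The ratio equals the number of training examples, since that is the only factor that differs between the two algorithms in this model.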

Note: Because each iteration uses just one random training example, the cost may not go down at every iteration, and it may even go up occasionally, so don’t worry if the curve looks noisy. Overall, the cost still tends to decrease as we perform more and more iterations.
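To get a feel for this behavior, here is a minimal sketch of single-example updates on synthetic data. Everything here is an assumption made for illustration (the data, the half mean squared error cost, the learning rate of 0.01, and the 1000 iterations); it simply counts how often the cost over the full dataset goes up from one iteration to the next.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ np.array([[2.0], [-1.0], [0.5]]) + 0.5 + rng.normal(scale=0.1, size=(200, 1))

def full_cost(weights, bias):
    """Cost measured on all 200 examples, only for monitoring."""
    error = bias + X @ weights.T - Y
    return float(np.sum(error ** 2) / (2 * len(Y)))

weights, bias = np.zeros((1, 3)), np.zeros((1, 1))
learning_rate = 0.01
cost_list = []

for _ in range(1000):
    i = rng.integers(len(Y))                  # pick one random training example
    x_i, y_i = X[i:i + 1], Y[i:i + 1]         # shapes (1, 3) and (1, 1)
    error = bias + x_i @ weights.T - y_i      # (1, 1)
    bias = bias - learning_rate * error
    weights = weights - learning_rate * (error.T @ x_i)  # (1, 1) @ (1, 3) -> (1, 3)
    cost_list.append(full_cost(weights, bias))

upward_steps = sum(b > a for a, b in zip(cost_list, cost_list[1:]))
print(f"first: {cost_list[0]:.3f}, last: {cost_list[-1]:.3f}, upward steps: {upward_steps}")
```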

Now, let’s see how the Stochastic Gradient Descent (SGD) algorithm is implemented.