Mathematical Intuition behind the Gradient Descent Algorithm

Last Updated on October 24, 2022 by Editorial Team

Author(s): Towards AI Editorial Team

Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.

Deriving Gradient Descent Algorithm for Mean Squared Error

Author(s): Pratik Shukla

“The mind is not a vessel to be filled, but a fire to be kindled.” — Plutarch

The Gradient Descent Series of Blogs:

Introduction:

Welcome! Today, we’re working on developing a strong mathematical intuition on how the Gradient Descent algorithm finds the best values for its parameters. Having this sense can help you catch mistakes in machine learning outputs and get even more comfortable with how the gradient descent algorithm makes machine learning so powerful. In the pages to follow, we will derive the equation of the gradient descent algorithm for the mean squared error function. We will use the results of this blog to code the gradient descent algorithm. Let’s dive into it!

Derivation of the Gradient Descent Algorithm for Mean Squared Error:

1. Step — 1:

The input data is shown in the matrix below. Here, we can observe that there are m training examples and n number of features.

Dimensions: X = (m, n)

2. Step — 2:

The expected output matrix is shown below. Our expected output matrix will be of size m*1 because we have m training examples.

Dimensions: Y = (m, 1)

3. Step — 3:

We will add a bias element in our parameters to be trained.

Dimensions: α = (1, 1)

4. Step — 4:

In our parameters, we have our weight matrix. The weight matrix will have n elements. Here, n is the number of features of our training dataset.

Dimensions: β = (1, n)

5. Step — 5:

Figure — 5: Forward propagation in simple linear regression

The predicted values for each training examples are given by,

Please note that we are taking the transpose of the weights matrix (β) to make the dimensions compatible with matrix multiplication rules.

Dimensions: predicted_value = (1, 1) + (m, n) * (1, n)

— Taking the transpose of the weights matrix (β) —

Dimensions: predicted_value = (1, 1) + (m, n) * (n, 1) = (m, 1)

6. Step — 6:

The mean squared error is defined as follows.

Dimensions: cost = scalar function

7. Step — 7:

We will use the following gradient descent rule to determine the best parameters in this case.

Figure — 8: Update the parameters using the gradient descent algorithm

Dimensions: α = (1, 1) & β = (1, n)

8. Step — 8:

Now, let’s find the partial derivative of the cost function with respect to the bias element (α).

Figure — 9: Partial derivative of the cost function w.r.t. bias

Dimensions: (1, 1)

9. Step — 9:

Now, we are trying to simplify the above equation to find the partial derivatives.

Figure — 10: Simplifying the calculations

Dimensions: u = (m, 1)

10. Step — 10:

Based on Step — 9, we can write the cost function as,

Dimensions: scalar function

11. Step — 11:

Next, we will use the chain rule to calculate the partial derivative of the cost function with respect to the intercept (α).

Figure — 12: Finding the partial derivative of the cost function w.r.t. bias

Dimensions: (m, 1)

12. Step — 12:

Next, we are calculating the first part of the partial derivative of Step — 11.

Figure — 13: Finding the partial derivative of the cost function w.r.t. u

Dimensions: (m, 1)

13. Step — 13:

Next, we calculate the second part of the partial derivative of Step — 11.

Figure — 14: Finding the partial derivative of u w.r.t. bias

Dimensions: scalar function

14. Step — 14:

Next, we multiply the results of Step — 12 and Step — 13 to find the final results.

Figure — 15: Finding the partial derivative of the cost function w.r.t. bias

Dimensions: (m, 1)

15. Step — 15:

Next, we will use the chain rule to calculate the partial derivative of the cost function with respect to the weights (β).

Figure — 16: Finding the partial derivative of the cost function w.r.t. weights

Dimensions: (1, n)

16. Step — 16:

Next, we calculate the second part of the partial derivative of Step — 15.

Figure — 17: Finding the partial derivative of u w.r.t. weights

Dimensions: (m, n)

17. Step — 17:

Next, we multiply the results of Step — 12 and Step — 16 to find the final results of the partial derivative.

Figure — 18: Finding the partial derivative of the cost function w.r.t. weights

Now, since we want to have n values of weight, we will remove the summation part from the above equation.

Figure — 19: Finding the partial derivative of the cost function w.r.t. weights

Please note that here we will have to transpose the first part of the calculations to make it compatible with the matrix multiplication rules.

Dimensions: (m, 1) * (m, n)

— Taking the transpose of the error part —

Dimensions: (1, m) * (m, n) = (1, n)

18. Step — 18:

Next, we put all the calculated values in Step — 7 to calculate the gradient rule for updating α.

Figure — 20: Updating the bias using gradient descent

Dimensions: α = (1, 1)

19. Step — 19:

Next, we put all the calculated values in Step — 7 to calculate the gradient rule for updating β.

Figure — 21: Updating the weights using gradient descent

Please note that we must transpose the error value to make the function compatible with matrix multiplication rules.

Dimensions: β = (1, n) — (1, n) = (1, n)

Working Example of the Gradient Descent Algorithm:

Now, let’s take an example to see how the gradient descent algorithm finds the best parameter values.