Mathematical Intuition behind the Gradient Descent Algorithm

Last Updated on October 24, 2022 by Editorial Team

Author(s): Towards AI Editorial Team


Image by Gerd Altmann from Pixabay

Deriving the Gradient Descent Algorithm for Mean Squared Error

Author(s): Pratik Shukla

“The mind is not a vessel to be filled, but a fire to be kindled.” – Plutarch

The Gradient Descent Series of Blogs:

  1. The Gradient Descent Algorithm
  2. Mathematical Intuition behind the Gradient Descent Algorithm (You are here!)
  3. The Gradient Descent Algorithm & its Variants

Table of contents:

  1. Introduction
  2. Derivation of the Gradient Descent Algorithm for Mean Squared Error
  3. Working Example of the Gradient Descent Algorithm
  4. End Notes
  5. References and Resources

Introduction:

Welcome! Today, we’re working on developing a strong mathematical intuition for how the gradient descent algorithm finds the best values for its parameters. This intuition can help you catch mistakes in machine learning outputs and make you even more comfortable with why gradient descent makes machine learning so powerful. In the pages that follow, we will derive the gradient descent update equations for the mean squared error function. We will use the results of this blog to code the gradient descent algorithm. Let’s dive into it!

Derivation of the Gradient Descent Algorithm for Mean Squared Error:

1. Step 1:

The input data is shown in the matrix below. Here, we can observe that there are m training examples and n features.

Figure 1: The input features

Dimensions: X = (m, n)
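
Since Figure 1 is an image, here is the input matrix written out for reference; the element naming is our assumption:

```latex
X =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
\in \mathbb{R}^{m \times n}
```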

2. Step 2:

The expected output matrix is shown below. Our expected output matrix will be of size m×1 because we have m training examples.

Figure 2: The expected output

Dimensions: Y = (m, 1)

3. Step 3:

We will add a bias element to our parameters to be trained.

Figure 3: The bias element

Dimensions: α = (1, 1)

4. Step 4:

In our parameters, we have our weight matrix. The weight matrix will have n elements. Here, n is the number of features of our training dataset.

Figure 4: The weights for inputs

Dimensions: β = (1, n)
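
For readers who cannot see Figures 3 and 4, the trainable parameters can be written as follows; the element naming is again our assumption:

```latex
\alpha \in \mathbb{R}^{1 \times 1},
\qquad
\beta = \begin{bmatrix} \beta_{1} & \beta_{2} & \cdots & \beta_{n} \end{bmatrix} \in \mathbb{R}^{1 \times n}
```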

5. Step 5:

Figure 5: Forward propagation in simple linear regression

The predicted value for each training example is given by,

Figure 6: The predicted values

Please note that we are taking the transpose of the weights matrix (β) to make the dimensions compatible with the matrix multiplication rules.

Dimensions: predicted_value = (1, 1) + (m, n) * (1, n)

– Taking the transpose of the weights matrix (β) –

Dimensions: predicted_value = (1, 1) + (m, n) * (n, 1) = (m, 1)
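
Written out, the prediction in Figure 6 should correspond to the following equation, with the bias α broadcast across all m rows; the symbols are inferred from the dimensions above:

```latex
\hat{y} = \alpha + X \beta^{T},
\qquad
X \in \mathbb{R}^{m \times n},\;
\beta \in \mathbb{R}^{1 \times n},\;
\alpha \in \mathbb{R}^{1 \times 1},\;
\hat{y} \in \mathbb{R}^{m \times 1}
```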

6. Step 6:

The mean squared error is defined as follows.

Figure 7: The cost function

Dimensions: cost = scalar function
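
For reference, a standard form of the mean squared error cost over m examples is shown below; whether the original figure includes the 1/2 factor (which simplifies the derivative) is an assumption on our part:

```latex
J(\alpha, \beta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right)^{2}
```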

7. Step 7:

We will use the following gradient descent rule to determine the best parameters in this case.

Figure 8: Update the parameters using the gradient descent algorithm

Dimensions: α = (1, 1) & β = (1, n)
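
The update rule in Figure 8 has the standard gradient descent form sketched below; the learning rate symbol η is our own choice and may differ from the figure:

```latex
\alpha := \alpha - \eta \, \frac{\partial J}{\partial \alpha},
\qquad
\beta := \beta - \eta \, \frac{\partial J}{\partial \beta}
```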

8. Step 8:

Now, let’s find the partial derivative of the cost function with respect to the bias element (α).

Figure 9: Partial derivative of the cost function w.r.t. bias

Dimensions: (1, 1)

9. Step 9:

Now, we simplify the above equation by introducing an intermediate term u, which makes the partial derivatives easier to find.

Figure 10: Simplifying the calculations

Dimensions: u = (m, 1)

10. Step 10:

Based on Step 9, we can write the cost function as,

Figure 11: The cost function

Dimensions: scalar function
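
Under the assumptions above, Steps 9 and 10 amount to the substitution below; the sign convention (predicted minus expected) is assumed:

```latex
u_{i} = \hat{y}_{i} - y_{i},
\qquad
J = \frac{1}{2m} \sum_{i=1}^{m} u_{i}^{2}
```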

11. Step 11:

Next, we will use the chain rule to calculate the partial derivative of the cost function with respect to the intercept (α).

Figure 12: Finding the partial derivative of the cost function w.r.t. bias

Dimensions: (m, 1)

12. Step 12:

Next, we are calculating the first part of the partial derivative of Step 11.

Figure 13: Finding the partial derivative of the cost function w.r.t. u

Dimensions: (m, 1)

13. Step 13:

Next, we calculate the second part of the partial derivative of Step 11.

Figure 14: Finding the partial derivative of u w.r.t. bias

Dimensions: scalar function

14. Step 14:

Next, we multiply the results of Step 12 and Step 13 to find the final results.

Figure 15: Finding the partial derivative of the cost function w.r.t. bias

Dimensions: (m, 1)
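
Carrying the assumed 1/2m scaling through Steps 11 to 14 gives the result below; with a different scaling in the figures the constant in front would change, but not the structure:

```latex
\frac{\partial J}{\partial \alpha}
= \sum_{i=1}^{m} \frac{\partial J}{\partial u_{i}} \cdot \frac{\partial u_{i}}{\partial \alpha}
= \sum_{i=1}^{m} \frac{u_{i}}{m} \cdot 1
= \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right)
```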

15. Step 15:

Next, we will use the chain rule to calculate the partial derivative of the cost function with respect to the weights (β).

Figure 16: Finding the partial derivative of the cost function w.r.t. weights

Dimensions: (1, n)

16. Step 16:

Next, we calculate the second part of the partial derivative of Step 15.

Figure 17: Finding the partial derivative of u w.r.t. weights

Dimensions: (m, n)

17. Step 17:

Next, we multiply the results of Step 12 and Step 16 to find the final results of the partial derivative.

Figure 18: Finding the partial derivative of the cost function w.r.t. weights

Now, since we want to have n values of weights, we will remove the explicit summation from the above equation and express it as a matrix product.

Figure 19: Finding the partial derivative of the cost function w.r.t. weights

Please note that here we will have to transpose the first part of the calculations to make it compatible with the matrix multiplication rules.

Dimensions: (m, 1) * (m, n)

– Taking the transpose of the error part –

Dimensions: (1, m) * (m, n) = (1, n)
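
In vectorized form, and again assuming the 1/2m scaling, the result of Step 17 should look like the following, which matches the (1, m) * (m, n) = (1, n) dimensions above:

```latex
\frac{\partial J}{\partial \beta}
= \frac{1}{m} \left( \hat{y} - y \right)^{T} X
\in \mathbb{R}^{1 \times n}
```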

18. Step 18:

Next, we put all the calculated values in Step 7 to calculate the gradient rule for updating α.

Figure 20: Updating the bias using gradient descent

Dimensions: α = (1, 1)

19. Step 19:

Next, we put all the calculated values in Step 7 to calculate the gradient rule for updating β.

Figure 21: Updating the weights using gradient descent

Please note that we must transpose the error value to make the function compatible with the matrix multiplication rules.

Dimensions: β = (1, n) - (1, n) = (1, n)
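
Putting Steps 18 and 19 together, here is a minimal NumPy sketch of a single update; the 1/2m cost scaling and the variable names are our assumptions, and the figures may use a slightly different convention:

```python
import numpy as np

def gradient_descent_step(X, y, alpha, beta, lr):
    """One update of the bias alpha (1, 1) and weights beta (1, n) for MSE.

    X: (m, n) input features, y: (m, 1) expected outputs.
    Assumes cost J = (1 / (2 * m)) * sum((y_hat - y) ** 2).
    """
    m = X.shape[0]
    y_hat = alpha + X @ beta.T      # (m, 1) predictions (Step 5)
    error = y_hat - y               # (m, 1) residuals, u in the derivation
    d_alpha = np.sum(error) / m     # scalar, matches Step 14
    d_beta = (error.T @ X) / m      # (1, n), matches Step 17
    alpha = alpha - lr * d_alpha    # Step 18
    beta = beta - lr * d_beta       # Step 19
    return alpha, beta
```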

Working Example of the Gradient Descent Algorithm:

Now, let’s take an example to see how the gradient descent algorithm finds the best parameter values.

1. Step 1:

The input data is shown in the matrix below. Here, we can observe that there are 4 training examples and 2 features.

Figure 22: The input matrix

2. Step 2:

The expected output matrix is shown below. Our expected output matrix will be of size 4×1 because we have 4 training examples.

Figure 23: The expected output

3. Step 3:

We will add a bias element to our parameters to be trained. Here, we are choosing the initial value of 0 for the bias.

Figure 24: The bias element

4. Step 4:

In our parameters, we have our weight matrix. The weight matrix will have 2 elements. Here, 2 is the number of features of our training dataset. Initially, we can choose any random numbers for the weights matrix.

Figure 25: The weights matrix

5. Step 5:

Next, we are going to predict the values using our input matrix, weight matrix, and bias.

Figure 26: The predicted values

6. Step 6:

Next, we calculate the cost using the following equation.

Figure 27: Calculating the cost in prediction

7. Step 7:

Next, we are calculating the partial derivative of the cost function with respect to the bias element. We’ll use this result in the gradient descent algorithm to update the value of the bias parameter.

Figure 28: The partial derivative of the cost function w.r.t. the bias element

8. Step 8:

Next, we are calculating the partial derivative of the cost function with respect to the weights matrix. We’ll use this result in the gradient descent algorithm to update the value of the weights matrix.

Figure 29: The partial derivative of the cost function w.r.t. the weights matrix

9. Step 9:

Next, we define the value of the learning rate. The learning rate is the parameter that controls how fast our model learns.

Figure 30: The learning rate

10. Step 10:

Next, we are using the gradient descent rule to update the parameter value of the bias element.

Figure 31: Updating the value of the bias element using the gradient descent algorithm

11. Step 11:

Next, we are using the gradient descent rule to update the parameter values of the weights matrix.

Figure 32: Updating the value of the weights matrix using the gradient descent algorithm

12. Step 12:

Now, we repeat this process for a number of iterations to find the best parameters for our model. In each iteration, we use the updated values of our parameters.
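
Because the numeric figures are not reproduced here, the sketch below runs the same loop end to end on small made-up data; every specific number in it (the inputs, targets, learning rate of 0.01, and 1,000 iterations) is an illustrative assumption rather than a value from the figures:

```python
import numpy as np

# Hypothetical data standing in for Figures 22-25 (4 examples, 2 features).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])                   # (4, 2) input matrix
y = np.array([[3.0], [3.0], [7.0], [7.0]])   # (4, 1) expected outputs

alpha = np.zeros((1, 1))             # bias initialized to 0 (Step 3)
beta = np.random.randn(1, 2) * 0.1   # small random weights (Step 4)
lr = 0.01                            # assumed learning rate (Step 9)

for i in range(1000):                # Step 12: repeat the updates
    y_hat = alpha + X @ beta.T                  # Step 5: predictions
    error = y_hat - y
    cost = np.sum(error ** 2) / (2 * len(X))    # Step 6: mean squared error
    d_alpha = np.sum(error) / len(X)            # Step 7: gradient w.r.t. bias
    d_beta = (error.T @ X) / len(X)             # Step 8: gradient w.r.t. weights
    alpha -= lr * d_alpha                       # Step 10: update bias
    beta -= lr * d_beta                         # Step 11: update weights

print(alpha, beta, cost)
```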

End Notes:

So, this is how we find the update rules using the gradient descent algorithm for the mean squared error. We hope this sparked your curiosity and made you hungry for more machine learning knowledge. We will use the rules we derived here to implement the gradient descent algorithm in future blogs, so don’t miss the third installment in the Gradient Descent series, where it all comes together – the grand finale!

Buy Pratik a Coffee!

Citation:

For attribution in academic contexts, please cite this work as:

Shukla, et al., “Mathematical Intuition behind the Gradient Descent Algorithm”, Towards AI, 2022

BibTeX Citation:

@article{pratik_2022,
title={Mathematical Intuition behind the Gradient Descent Algorithm},
url={https://towardsai.net/neural-networks-with-python},
journal={Towards AI},
publisher={Towards AI Co.},
author={Shukla, Pratik},
editor={Keegan, Lauren},
year={2022},
month={Oct}
}

References and Resources:

  1. Gradient descent – Wikipedia

