Mathematical Intuition behind the Gradient Descent Algorithm
Deriving the Gradient Descent Algorithm for Mean Squared Error
Author(s): Pratik Shukla
"The mind is not a vessel to be filled, but a fire to be kindled." (Plutarch)
The Gradient Descent Series of Blogs:
- The Gradient Descent Algorithm
- Mathematical Intuition behind the Gradient Descent Algorithm (You are here!)
- The Gradient Descent Algorithm & its Variants
Table of contents:
- Introduction
- Derivation of the Gradient Descent Algorithm for Mean Squared Error
- Working Example of the Gradient Descent Algorithm
- End Notes
Introduction:
Welcome! Today, we're working on developing a strong mathematical intuition for how the gradient descent algorithm finds the best values for its parameters. This intuition can help you catch mistakes in machine learning outputs and get even more comfortable with how gradient descent makes machine learning so powerful. In the pages to follow, we will derive the equations of the gradient descent algorithm for the mean squared error function. We will use the results of this blog to code the gradient descent algorithm. Let's dive into it!
Derivation of the Gradient Descent Algorithm for Mean Squared Error:
1. Step 1:
The input data is shown in the matrix below. Here, we can observe that there are m training examples and n features.
Dimensions: X = (m, n)
2. Step 2:
The expected output matrix is shown below. Our expected output matrix will be of size m × 1 because we have m training examples.
Dimensions: Y = (m, 1)
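For a concrete picture of Steps 1 and 2, the two matrices can be sketched as follows, where x_ij stands for the j-th feature of the i-th training example and y_i for its expected output (these element names are only illustrative):

X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix}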
3. Step 3:
We will add a bias element in our parameters to be trained.
Dimensions: α = (1, 1)
4. Step 4:
In our parameters, we have our weight matrix. The weight matrix will have n elements. Here, n is the number of features of our training dataset.
Dimensions: β = (1, n)
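Similarly, the parameters from Steps 3 and 4 are a single bias value and a row vector of n weights (the element names β_1, …, β_n are only illustrative):

\alpha \in \mathbb{R}^{1 \times 1}, \qquad \beta = \begin{bmatrix} \beta_{1} & \beta_{2} & \cdots & \beta_{n} \end{bmatrix} \in \mathbb{R}^{1 \times n}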
5. Step 5:
The predicted values for the training examples are given by,
Please note that we are taking the transpose of the weights matrix (β) to make the dimensions compatible with matrix multiplication rules.
Dimensions: predicted_value = (1, 1) + (m, n) * (1, n)
Taking the transpose of the weights matrix (β):
Dimensions: predicted_value = (1, 1) + (m, n) * (n, 1) = (m, 1)
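In symbols, the prediction in Step 5 can be written as follows, with the scalar bias α broadcast across all m rows of the product:

\hat{y} = \alpha + X \beta^{T}, \qquad \hat{y} \in \mathbb{R}^{m \times 1}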
6. Step 6:
The mean squared error is defined as follows.
Dimensions: cost = scalar function
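Assuming the common 1/(2m) convention for the mean squared error (the factor of 2 cancels neatly when we differentiate the square), the cost in Step 6 takes this form:

\mathrm{cost} = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right)^{2}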
7. Step 7:
We will use the following gradient descent rule to determine the best parameters in this case.
Dimensions: α = (1, 1) & β = (1, n)
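Written out, the rule in Step 7 takes a small step against each partial derivative. Here L stands for the learning rate (the symbol is our choice for this write-up):

\alpha := \alpha - L \, \frac{\partial \, \mathrm{cost}}{\partial \alpha}, \qquad \beta := \beta - L \, \frac{\partial \, \mathrm{cost}}{\partial \beta}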
8. Step 8:
Now, let's find the partial derivative of the cost function with respect to the bias element (α).
Dimensions: (1, 1)
9. Step 9:
Now, we are trying to simplify the above equation to find the partial derivatives.
Dimensions: u = (m, 1)
10. Step 10:
Based on Step 9, we can write the cost function as,
Dimensions: scalar function
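A natural choice for the substitution in Step 9 is to let u collect the per-example errors, so that the cost in Step 10 becomes a plain sum of squares:

u = \hat{y} - Y = \alpha + X \beta^{T} - Y, \qquad \mathrm{cost} = \frac{1}{2m} \sum_{i=1}^{m} u_{i}^{2}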
11. Step 11:
Next, we will use the chain rule to calculate the partial derivative of the cost function with respect to the intercept (α).
Dimensions: (m, 1)
12. Step 12:
Next, we are calculating the first part of the partial derivative of Step 11.
Dimensions: (m, 1)
13. Step 13:
Next, we calculate the second part of the partial derivative of Step 11.
Dimensions: scalar function
14. Step 14:
Next, we multiply the results of Step 12 and Step 13 to find the final results.
Dimensions: (m, 1)
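Putting Steps 11 through 14 together under the 1/(2m) convention assumed above: the first part of the chain rule is ∂cost/∂u = u/m, the second part is ∂u/∂α = 1, and their product is the per-example error scaled by 1/m:

\frac{\partial \, \mathrm{cost}}{\partial u} = \frac{u}{m}, \qquad \frac{\partial u}{\partial \alpha} = 1, \qquad \frac{\partial \, \mathrm{cost}}{\partial u} \cdot \frac{\partial u}{\partial \alpha} = \frac{1}{m} \left( \hat{y} - Y \right)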
15. Step 15:
Next, we will use the chain rule to calculate the partial derivative of the cost function with respect to the weights (β).
Dimensions: (1, n)
16. Step 16:
Next, we calculate the second part of the partial derivative of Step 15.
Dimensions: (m, n)
17. Step 17:
Next, we multiply the results of Step 12 and Step 16 to find the final results of the partial derivative.
Now, since we want to have n values of weight, we will remove the summation part from the above equation.
Please note that here we will have to transpose the first part of the calculations to make it compatible with the matrix multiplication rules.
Dimensions: (m, 1) * (m, n)
Taking the transpose of the error part:
Dimensions: (1, m) * (m, n) = (1, n)
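Steps 15 through 17 follow the same pattern as before, except that the second part of the chain rule is now ∂u/∂β = X with shape (m, n); transposing the (m, 1) error before multiplying yields one gradient entry per weight:

\frac{\partial \, \mathrm{cost}}{\partial \beta} = \frac{1}{m} \left( \hat{y} - Y \right)^{T} X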
18. Step 18:
Next, we put all the calculated values in Step 7 to calculate the gradient rule for updating α.
Dimensions: α = (1, 1)
19. Step 19:
Next, we put all the calculated values in Step 7 to calculate the gradient rule for updating β.
Please note that we must transpose the error value to make the function compatible with matrix multiplication rules.
Dimensions: β = (1, n) - (1, n) = (1, n)
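Substituting these gradients back into the rule from Step 7 gives update rules of the following form, again writing L for the learning rate:

\alpha := \alpha - \frac{L}{m} \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right), \qquad \beta := \beta - \frac{L}{m} \left( \hat{y} - Y \right)^{T} X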
Working Example of the Gradient Descent Algorithm:
Now, let's take an example to see how the gradient descent algorithm finds the best parameter values.
1. Step 1:
The input data is shown in the matrix below. Here, we can observe that there are 4 training examples and 2 features.
2. Step 2:
The expected output matrix is shown below. Our expected output matrix will be of size 4 × 1 because we have 4 training examples.
3. Step 3:
We will add a bias element in our parameters to be trained. Here, we are choosing the initial value of 0 for bias.
4. Step 4:
In our parameters, we have our weight matrix. The weight matrix will have 2 elements. Here, 2 is the number of features of our training dataset. Initially, we can choose any random numbers for the weights matrix.
5. Step 5:
Next, we are going to predict the values using our input matrix, weight matrix, and bias.
6. Step 6:
Next, we calculate the cost using the following equation.
7. Step 7:
Next, we are calculating the partial derivative of the cost function with respect to the bias element. We'll use this result in the gradient descent algorithm to update the value of the bias parameter.
8. Step 8:
Next, we are calculating the partial derivative of the cost function with respect to the weights matrix. We'll use this result in the gradient descent algorithm to update the value of the weights matrix.
9. Step 9:
Next, we define the value of the learning rate. The learning rate is the hyperparameter that controls how large a step we take when updating our parameters, and therefore how fast the model learns.
10. Step 10:
Next, we are using the gradient descent rule to update the parameter value of the bias element.
11. Step 11:
Next, we are using the gradient descent rule to update the parameter values of the weights matrix.
12. Step 12:
Now, we repeat this process for a number of iterations to find the best parameters for our model. In each iteration, we use the updated values of our parameters.
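To make the twelve steps above concrete, here is a short NumPy sketch of the loop. The specific numbers in X, Y, the learning rate, and the iteration count are illustrative placeholders rather than the values from the original example:

```python
import numpy as np

# Illustrative data: 4 training examples, 2 features (placeholder values).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])                      # Step 1: (4, 2)
Y = np.array([[3.0], [3.0], [7.0], [7.0]])      # Step 2: (4, 1)

m, n = X.shape
alpha = np.zeros((1, 1))                        # Step 3: bias initialized to 0
beta = np.random.rand(1, n)                     # Step 4: random initial weights
L = 0.01                                        # Step 9: learning rate (placeholder)

for _ in range(1000):                           # Step 12: repeat for many iterations
    predicted = alpha + X @ beta.T              # Step 5: predictions, (4, 1)
    cost = np.sum((predicted - Y) ** 2) / (2 * m)   # Step 6: scalar cost
    d_alpha = np.sum(predicted - Y) / m         # Step 7: gradient w.r.t. bias
    d_beta = (predicted - Y).T @ X / m          # Step 8: gradient w.r.t. weights, (1, n)
    alpha = alpha - L * d_alpha                 # Step 10: update bias
    beta = beta - L * d_beta                    # Step 11: update weights

print(alpha, beta, cost)
```

Running this loop for enough iterations drives the cost down and leaves alpha and beta close to the best-fit parameters for whatever data we feed in.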
End Notes:
So, this is how we find the update rules using the gradient descent algorithm for the mean squared error. We hope this sparked your curiosity and made you hungry for more machine learning knowledge. We will use the rules we derived here to implement the gradient descent algorithm in future blogs, so don't miss the third installment in the Gradient Descent series, where it all comes together: the grand finale!
Citation:
For attribution in academic contexts, please cite this work as:
Shukla, et al., "Mathematical Intuition behind the Gradient Descent Algorithm", Towards AI, 2022
BibTex Citation:
@article{pratik_2022,
title={Mathematical Intuition behind the Gradient Descent Algorithm},
url={https://towardsai.net/neural-networks-with-python},
journal={Towards AI},
publisher={Towards AI Co.},
author={Shukla, Pratik},
editor={Keegan, Lauren},
year={2022},
month={Oct}
}