
Gradient Descent Algorithm Explained

Last Updated on June 21, 2020 by Editorial Team

Author(s): Pratik Shukla


With Step-By-Step Mathematical Derivation

Source: Unsplash

Index:

  • Basics Of Gradient Descent.
  • Basic Rules Of Derivation.
  • Gradient Descent With One Variable.
  • Gradient Descent With Two Variables.
  • Gradient Descent For the Mean Squared Error Function.

What is Gradient Descent?

Gradient Descent is a machine learning algorithm that iteratively finds the optimal values for its parameters. It takes into account a user-defined learning rate and the initial parameter values.

How does it work?

  • Start with initial values.
  • Calculate the cost.
  • Update the values using the update function.
  • Repeat until we obtain the minimized cost for our cost function (a short code sketch follows below).
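
To make these steps concrete, here is a minimal Python sketch of the loop (my own illustration, not from the original article, using the J(θ) = θ² example that appears later):

# Minimal gradient descent loop for J(theta) = theta**2 (illustrative sketch).
def cost(theta):
    return theta ** 2

def gradient(theta):
    return 2 * theta                          # dJ/dtheta

theta = 5.0                                   # initial value
alpha = 0.1                                   # user-defined learning rate
for i in range(100):
    theta = theta - alpha * gradient(theta)   # update rule
print(theta, cost(theta))                     # theta ends up very close to the minimum at 0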

Why do we need it?

Generally, what we do is find a formula that gives us the optimal values for our parameters. But this algorithm finds those values by itself! Interesting, isn't it?

Formula:

Gradient Descent Formula
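
The formula image is not reproduced here, but the standard form of the gradient descent update rule it shows is:

\theta := \theta - \alpha \,\frac{\partial}{\partial \theta} J(\theta)

where α is the learning rate and J(θ) is the cost function being minimized.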

Some Basic Rules For Derivation:

( A ) Scalar Multiple Rule:

Source: Image created by the author.

( B ) Sum Rule:

Source: Image created by the author.

( C ) Power Rule:

Source: Image created by the author.

( D ) Chain Rule:

Source: Image created by the author.
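
Written out, these are the four standard differentiation rules shown in the images above:

\frac{d}{dx}\big[c\,f(x)\big] = c\,f'(x) \qquad \text{(scalar multiple rule)}

\frac{d}{dx}\big[f(x) + g(x)\big] = f'(x) + g'(x) \qquad \text{(sum rule)}

\frac{d}{dx}\,x^{n} = n\,x^{n-1} \qquad \text{(power rule)}

\frac{d}{dx}\,f\big(g(x)\big) = f'\big(g(x)\big)\,g'(x) \qquad \text{(chain rule)}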

Let's have a look at various examples to understand it better.

Gradient Descent Minimization - Single Variable:

We're going to use gradient descent to find the θ that minimizes the cost. But let's forget the Mean Squared Error (MSE) cost function for a moment and look at the gradient descent procedure in general.

Now, what we generally do is find the best values of our parameters using some sort of simplification, ending up with a function that gives us the minimized cost. But here, what we'll do is take some default or random values for our parameters and let our program run iteratively to find the minimized cost.

Let's Explore It In-Depth:

Let's take a very simple function to begin with: J(θ) = θ², and our goal is to find the value of θ which minimizes J(θ).

From our cost function, we can clearly say that it will be minimum for θ = 0, but it won't be so easy to derive such conclusions while working with some complex functions.

( A ) Cost function: We'll try to minimize the value of this function.

Source: Image created by the author.

( B ) Goal: To minimize the cost function.

Source: Image created by the author.

( C ) Update Function: Initially, we take a random value for our parameter, which is not optimal. To make it optimal, we have to update it at each iteration; this update function takes care of that.

Source: Image created by the author.

( D ) Learning rate: The descent speed.

Source: Image created by the author.

( E ) Updating Parameters:

Source: Image created by the author.
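
The individual images are not reproduced here, but for this example the pieces take the following standard form (the learning rate α = 0.1 is my inference from the 0.8 factor used in the table below):

J(\theta) = \theta^{2}, \qquad \text{goal:}\ \min_{\theta} J(\theta)

\theta := \theta - \alpha\,\frac{d}{d\theta}J(\theta) = \theta - 2\alpha\,\theta = (1 - 2\alpha)\,\theta

\text{with } \alpha = 0.1:\quad \theta := 0.8\,\theta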

( F ) Table Generation:

Here we are starting with θ = 5.

Keep in mind that here the update works out to θ := 0.8·θ, given our learning rate and cost function.

Source: Image created by the author.
Source: Image created by the author.

Here we can see that as θ decreases, the cost value decreases as well. We just have to find the optimal value, and to do that we have to perform many iterations. The more iterations we run, the closer we get to the optimal value!
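
Since the table image itself is not reproduced, here is a short Python sketch (my own) that regenerates the same sequence of values, starting from θ = 5:

# Regenerate the table: theta starts at 5 and shrinks by a factor of 0.8 each step.
theta = 5.0
alpha = 0.1
for i in range(10):
    print(f"iteration {i}: theta = {theta:.4f}, cost = {theta ** 2:.4f}")
    theta = theta - alpha * (2 * theta)   # equivalent to theta = 0.8 * theta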

( G ) Graph: We can plot the graph of the above points.

Source: Image created by the author.

Cost Function Derivative:

Why does gradient descent use the derivative of the cost function? We want our cost function to be at its minimum, right? Minimizing the cost function simply gives us a lower error rate in predicting values. Ideally, we would set the derivative of the function to 0 and solve for the parameters. Here we do the same thing, but we start from a random number and try to minimize the cost iteratively.

The learning rate / ALPHA:

The learning rate gives us control over how large a step we take at each iteration. Selecting the right learning rate is a very critical task. If the learning rate is too high, we might overstep the minimum and diverge. For example, in the example above, if we take alpha = 2, then each iteration will take us farther away from the minimum. So we use small alpha values. The only concern with using a small learning rate is that we have to perform more iterations to reach the minimum cost value, which increases training time.
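
To see why α = 2 diverges in this example, plug it into the update rule for J(θ) = θ²:

\theta := (1 - 2\alpha)\,\theta = (1 - 4)\,\theta = -3\,\theta

so the magnitude of θ triples at every iteration instead of shrinking toward 0.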

Convergence / Stopping gradient descent:

Note that in the above example, gradient descent will never actually converge to the minimum at θ = 0; it only gets closer and closer. A full treatment of stopping criteria is beyond the scope of this article, but I can tell you that for assignments we can simply run a fixed number of iterations, like 100 or 1,000.
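
Besides a fixed iteration count, one simple and common stopping criterion (shown here as an illustrative sketch, not taken from the article) is to stop once the update becomes negligibly small:

# Stop when the change in theta drops below a small tolerance.
theta = 5.0
alpha = 0.1
tolerance = 1e-6
for i in range(10_000):
    new_theta = theta - alpha * (2 * theta)
    if abs(new_theta - theta) < tolerance:   # the update is now negligible
        break
    theta = new_theta
print(f"stopped after {i} iterations, theta = {theta}")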

Gradient Descent - Multiple Variables:

Our ultimate goal is to find the parameters of the MSE cost function, which involves multiple variables. So here we will first discuss a cost function that has 2 variables. Understanding this will help us a great deal with our MSE cost function.

Let's take this function:

Source: Image created by the author.

When there are multiple variables in the minimization objective, we have to define a separate update rule for each parameter. With more than one parameter in our cost function, we have to use partial derivatives. Here I have simplified the partial derivative process. Let's have a look at this.
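
In general form, for a cost function J(θ1, θ2), each parameter gets its own update rule based on its partial derivative, and both are applied simultaneously:

\theta_1 := \theta_1 - \alpha\,\frac{\partial}{\partial \theta_1}J(\theta_1, \theta_2), \qquad \theta_2 := \theta_2 - \alpha\,\frac{\partial}{\partial \theta_2}J(\theta_1, \theta_2)

(The specific two-variable function shown in the image above is not reproduced here; these are the standard update rules it is plugged into.)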

( A ) Cost Function:

Source: Image created by the author.

( B ) Goal:

Source: Image created by the author.

( C ) Update Rules:

Source: Image created by the author.

( D ) Derivatives:

Source: Image created by the author.
Source: Image created by the author.

( E ) Update Values:

Source: Image created by the author.

( F ) Learning Rate:

Source: Image created by the author.

( G ) Table:

Starting with θ1 = 1 and θ2 = 1, and then updating the values using the update functions.

Source: Image created by the author.

( H ) Graph:

Source: Image created by the author.

Here we can see that as we increase our number of iterations, our cost value is going down.

Note that while implementing the program in Python, the parameters must not be overwritten until we have computed the new values for both θ1 and θ2. We clearly don't want the new value of θ1 to be used when computing the update for θ2 from its old value.
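
As a small illustration (my own sketch, assuming a simple cost such as J = θ1² + θ2² for the gradient functions), the simultaneous update looks like this in Python:

# Compute both gradients from the OLD values, then assign simultaneously.
def grad_theta1(t1, t2):
    return 2 * t1            # partial derivative w.r.t. theta1 (for the assumed cost)

def grad_theta2(t1, t2):
    return 2 * t2            # partial derivative w.r.t. theta2 (for the assumed cost)

theta1, theta2 = 1.0, 1.0
alpha = 0.1
for _ in range(100):
    new_theta1 = theta1 - alpha * grad_theta1(theta1, theta2)
    new_theta2 = theta2 - alpha * grad_theta2(theta1, theta2)   # still uses the old theta1
    theta1, theta2 = new_theta1, new_theta2                     # update both at once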

Gradient Descent For Mean Squared Error:

Now that we know how to perform gradient descent on an equation with multiple variables, we can return to looking at gradient descent on our MSE cost function.

Let's get started!

( A ) Hypothesis function:

Source: Image created by the author.
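
The image is not reproduced here; assuming the usual single-feature linear regression setup, the hypothesis is:

h_{\theta}(x) = \theta_0 + \theta_1 x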

( B ) Cost Function:

Source: Image created by the author.
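
Again assuming the standard setup with m training examples (x_i, y_i), the MSE cost is (my reconstruction of the standard form; the original image may already include the 1/2 factor discussed in the "One-Half Mean Squared Error" section below):

J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\big(h_{\theta}(x_i) - y_i\big)^{2}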

( C ) Find the partial derivative of J(θ0, θ1) w.r.t. θ1:

Source: Image created by the author.

( D ) Simplify a little:

Source: Image created by the author.

( E ) Define a variable u:

Source: Image created by the author.

( F ) Value of u:

Source: Image created by the author.

( G ) Finding the partial derivative:

Source: Image created by the author.
Source: Image created by the author.

( H ) Rewriting the equations:

Source: Image created by the author.

( I ) Merge all the calculated data:

Source: Image created by the author.

( J ) Repeat the same process to derive J(θ0, θ1) w.r.t. θ0:

Source: Image created by the author.

( K ) Simplified calculations:

Source: Image created by the author.

( L ) Combine all the calculated data:

Source: Image created by the author.
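
For reference, under the plain MSE cost written above, the derivation in steps (C) through (L) ends with the following partial derivatives (my reconstruction of the standard result, stated here because the images are not reproduced):

\frac{\partial J}{\partial \theta_0} = \frac{2}{m}\sum_{i=1}^{m}\big(h_{\theta}(x_i) - y_i\big), \qquad \frac{\partial J}{\partial \theta_1} = \frac{2}{m}\sum_{i=1}^{m}\big(h_{\theta}(x_i) - y_i\big)\,x_i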

One-Half Mean Squared Error:

We multiply our MSE cost function by 1/2 so that when we take the derivative, the 2s cancel out. Multiplying the cost function by a scalar does not affect the location of the minimum, so we can get away with this.
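
Concretely, for a single term of the cost:

\frac{d}{d\theta}\left[\frac{1}{2}\big(h_{\theta}(x) - y\big)^{2}\right] = \big(h_{\theta}(x) - y\big)\,\frac{d}{d\theta}h_{\theta}(x)

so the 2 produced by the power rule is cancelled by the 1/2.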

Final:

( A ) Cost Function: One-Half Mean Squared Error:

Source: Image created by the author.

( B ) Goal:

Source: Image created by the author.

( C ) Update Rule:

Source: Image created by the author.

( D ) Derivatives:

Source: Image created by the author.
Source: Image created by the author.
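
Putting the pieces together, with the one-half MSE cost the derivatives and update rules take the standard form (my reconstruction, since the images are not reproduced here):

\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(h_{\theta}(x_i) - y_i\big), \qquad \frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\big(h_{\theta}(x_i) - y_i\big)\,x_i

\theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_{\theta}(x_i) - y_i\big), \qquad \theta_1 := \theta_1 - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_{\theta}(x_i) - y_i\big)\,x_i

with both parameters updated simultaneously on every iteration.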

So, that's it. We finally made it!

Conclusion:

We are going to use the same method in various applications of machine learning algorithms. But at that point, we won't go into this much depth; we'll just use the final formula. Still, it's always good to know how it's derived!

Final Formula:

Gradient Descent Formula

Is the concept clear to you now? Please let me know by writing a response. If you enjoyed this article, then hit the clap icon.

If anything is still unclear, feel free to contact me: [email protected]


Gradient Descent Algorithm Explained was originally published in Towards AI - Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
