How To Explain Gradient Descent to Your Mom: Complete Tutorial
Author(s): Igor Novikov
Originally published on Towards AI.
Gradient descent is at the core of most AI/ML techniques. It sounds strange and kinda scary. Descent? Oh man, I hope I won't have to jump out of a plane with a parachute… 😒
Well, worry not: you might have to jump, but only if you want to. Here is an explanation even your 10-year-old nephew can understand.
Let's imagine you are trying to learn a new skill. Suppose you live in an ancient tribe, and your task is to tell people their weight from their height. Scales have not been invented yet. I know it sounds ridiculous, but for simplicity, let's assume the survival of the tribe depends on this. So a fellow member of the tribe comes and tells you his height, and you, using your vast experience, should tell him how much he weighs in, say, bags of potatoes.
So John comes and tells you: I am 174 centimeters tall. And you don't like John very much, so you tell him he weighs 100500 bags of potatoes, and he runs away in tears… 😂 But really, you know from experience that this bloke probably weighs around 70 kilos, or 7 bags of potatoes (10 kilos each).
How do you know that? Well, you've seen a lot of people with the same height and build as John, so you guess that should be about right.
Now we want to train a system that can do this. For that, we need to simulate this prior experience of seeing many people of different heights and knowing their weights. We do that with training data that looks like this:
This is the average height-to-weight correlation we know is true.
If we plot this data, it looks like this:
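If you want to reproduce a plot like this yourself, here is a minimal Python sketch. The height/weight pairs are made-up illustrative numbers (chosen to roughly match the two points mentioned later in this article), not real measurements:

import matplotlib.pyplot as plt

# Hypothetical training data: (height in cm, weight in kg)
heights = [150, 157, 163, 168, 174, 180]
weights = [29, 40, 51, 60, 70, 81]

plt.scatter(heights, weights)
plt.xlabel("Height, cm")
plt.ylabel("Weight, kg")
plt.title("Training data: height vs. weight")
plt.show()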
There is, obviously, a pattern: the higher the height, the higher the weight. Ok, let's try to draw a line that represents this insight. Using this line, we can estimate the weight for any given height:
For example, for someone with a height of around 157, the weight is around 40 kilos. The function that represents this line is, basically, our AI model. Sounds suspiciously simple, right? Well, it is. 😎
All the complexity comes from complex dependencies. The weight-to-height dependency is simple and can be represented by a simple function. But in the real world, many dependencies (or patterns) are very complex and are represented by very complex functions. Let's stick with ours for now.
A line function like ours is represented by the following equation:
f(x) = a * x + b
or
y = a * x + b
where a is the slope and b is the intercept. Ah, more definitions 🤯. But these are very simple:
The intercept is the point where the line crosses the y-axis, and the slope describes how steeply the line rises relative to the x-axis. From simple triangle geometry, we know that the slope here equals A/B (or dy/dx), as below:
These two parameters, a and b, are the coefficients, or parameters, of our model. In big models like OpenAI's, there are billions of parameters and the underlying function is different, but the principle is the same.
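In code, such a model is literally a one-line function. A minimal sketch (the function name and the example coefficients below are mine, purely illustrative):

def predict(height, a, b):
    # The entire "model": a straight line with slope a and intercept b
    return a * height + b

# Arbitrary coefficients, just to show the call:
print(predict(174, 1.0, -100.0))  # 74.0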
So, on to the training process. We decided that in our case we are going to use a simple line to represent the correlation between height and weight, so the function is linear. This is why this approach is also called linear regression. The "regression" part of the name is historical; I'll explain it at the end.
How do we find the correct slope and intercept? The same way we learn any skill: we start with a random guess, observe the result, and correct the guess if it is too far from the truth. Kinda similar to the hot-and-cold game. So we place our line at a random location, like this:
We measure the distances from the line to all of our points; the sum of those distances is the total error of our current line position. It is called the loss. Our goal is to minimize the loss, so that the line is located in such a way that the sum of distances is as close to zero as possible. Ideally the distance would be zero, meaning all points lie on the line, but that is not feasible with our linear function. We just want it as small as possible.
Now we have formalized our training objective: we want to find coefficients a and b such that the loss is as small as possible, or in other words:
f(x) = a * x + b
sum(distances_to_line(a, b)) -> 0
The formula for the distance D between two points {x1, y1} and {x2, y2} is:

D = sqrt((x2 - x1)² + (y2 - y1)²)
Let's randomly pick a = 0.3125 and b = -14.375.
Given that, let's pick a point from our training data, say {174 cm, 70 kg}. The second point we get from our function y = a * x + b with the a and b we selected: y = 174 * 0.3125 - 14.375 = 40. So the point is {174 cm, 40 kg}.
The distance is D = sqrt((174 - 174)² + (70 - 40)²) = 30. So we are 30 kg off. Not a good model; we need to find better a and b.
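Here is the same worked example in Python, so you can check the arithmetic (using the coefficients picked above):

import math

a, b = 0.3125, -14.375      # our random guess
height, weight = 174, 70    # training point {174 cm, 70 kg}

predicted = a * height + b  # 174 * 0.3125 - 14.375 = 40.0
d = math.sqrt((height - height) ** 2 + (weight - predicted) ** 2)
print(d)                    # 30.0 -> the model is 30 kg off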
We don't actually need the square root in the formula above. If you think about it, we only use the distance to score the error, so we can just as well use:

(calculated_value - expected_value)²

where the calculated value comes from our function and the expected value is taken from the training set. Less computation, and it works just as well. That is what is typically used; this method is called the method of least squares.
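As a sketch, the least-squares loss over a whole training set could look like this in Python (representing the data as a list of (height, weight) pairs is my assumption):

def loss(data, a, b):
    # Sum of squared errors between predictions and training values
    return sum(((a * h + b) - w) ** 2 for h, w in data)

# For the single point above: (40 - 70)² = 900, i.e. 30 squared
print(loss([(174, 70)], 0.3125, -14.375))  # 900.0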
If we take all the points one by one and sum up the distances, we get the total loss. Now we slightly change a and b and calculate the loss again. If it is lower than before, we are moving in the right direction. When do we stop changing coefficients? That is, how do we know that the a and b we found are good enough? Well, that decision is up to us, but I would say that if the error for each point is small, that is good enough. If our model guesses 70.2 kg for a height of 174 cm, that is very close to 70 kg. If the error is that small for every point, we don't need to improve the model further.
So if:
average error = (sum of errors / number of points) ≤ 0.2
we stop training.
Now we only need to find a way to move the line, by changing a and b, in the direction that minimizes the total error! If we just change them randomly, it's going to take way too long. Finally, we have come to gradient descent. Such a long way!
Gradient descent is a method for finding a function's minimum. That's just what we need, how convenient! 😂
In our case, the total error we are going to minimize is a function of a and b, that is:
total_error = f(a, b) = sum(distances_to_line(a, b)) =
distance({x1, y1_train}, {x1, y1}) +
distance({x2, y2_train}, {x2, y2}) +
…
+ distance({xn, yn_train}, {xn, yn})
If we try several more random a values and leave b constant for now, we will see that the total error function looks like this:
At certain values of a, the total error is lower. The error curve has a slope, like a valley, and since we are looking for the minimum, we need to move down the slope. In other words, we need to descend. We are at some random point we picked at the beginning, and the minimum we are looking for is a treasure at the bottom of the valley; we need to go down to get it.
The thing is, we don't actually know where the treasure is in the beginning; otherwise, we would have just picked that point and not a random one 🤔. So it looks more like this:
We know we are on the slope somewhere; we know the treasure is somewhere at the bottom, but we don't know where the bottom is. What we do know is that we need to go down and, on the way down, look around for the treasure. So we go slowly and look. This is the nature of gradient descent.
It works like this: we take the derivative of the error function at our random point. The derivative is the slope of the tangent line to the function at that point; simply speaking, it points in the direction in which the function increases or decreases, and its value tells us how steep the slope is. At the minimum, the derivative is zero, because at a minimum there is no slope (like at the bottom of the valley, the ground is even):
So we need to move in the direction of decreasing slope: we subtract the derivative, multiplied by a step size (called the learning rate), from our coefficient, calculate a new total error, and repeat this process until our total error is less than 0.2. The learning rate is a number picked via trial and error, but usually it is small enough that we don't jump too far, or this can happen:
Jumping too far can result in landing further from the target than before
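You can see this overshooting on a toy function. A quick sketch (the parabola and the oversized learning rate are my own, chosen to make the jumps visible):

f = lambda x: x ** 2     # toy loss with its minimum at x = 0
df = lambda x: 2 * x     # its derivative

x = 1.0
for _ in range(5):
    x -= 1.1 * df(x)     # learning rate 1.1 is far too large
    print(x)             # ≈ -1.2, 1.44, -1.73, ... bouncing ever further away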
So this is the process of gradient descent (sketched in code right after this list):
- take the derivative of the error function
- multiply it by the learning rate
- subtract the result from the parameter we are trying to optimize
- use the updated parameter to calculate a new loss
- if the loss is bigger than your desired threshold, repeat the process
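Putting the whole list together, here is a minimal sketch of that loop in Python. The data, learning rate, and step budget are illustrative assumptions; the gradient formulas are the standard derivatives of the squared error with respect to a and b:

# Hypothetical training data: (height in cm, weight in kg)
data = [(150, 29), (157, 40), (163, 51), (168, 60), (174, 70), (180, 81)]

a, b = 0.3125, -14.375   # random starting guess
learning_rate = 0.00001  # small step so we don't overshoot
threshold = 0.2          # stop once the average error is this small

for step in range(100_000):
    # Derivatives of the mean squared error with respect to a and b
    grad_a = sum(2 * ((a * h + b) - w) * h for h, w in data) / len(data)
    grad_b = sum(2 * ((a * h + b) - w) for h, w in data) / len(data)

    # Descend: move against the slope by subtracting derivative * rate
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

    avg_error = sum(abs((a * h + b) - w) for h, w in data) / len(data)
    if avg_error <= threshold:
        break

print(f"a = {a:.4f}, b = {b:.4f}, average error = {avg_error:.2f}")

One honest caveat: with raw centimeter values the intercept moves very slowly, so this loop may well use up its step budget before the error drops to 0.2. That slow crawl is one reason real pipelines usually normalize inputs before running gradient descent.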
I won't go into how to calculate a derivative here; it's basic calculus, and it's not that important as long as you understand what a derivative is. A derivative shows how the function is changing and in which direction:
- Rate of Change: Geometrically, the derivative tells you how steep the graph is at any given point. A large absolute value of the derivative indicates a steep slope (either rising or falling rapidly), while a small absolute value indicates a shallow slope (rising or falling gently).
- Positive or Negative Slope: If the derivative at a point is positive, the function is increasing at that point, meaning the slope of the tangent line is upwards. Conversely, if the derivative is negative, the function is decreasing, and the slope of the tangent line is downwards.
- Zero Derivative: When the derivative at a point is zero, it indicates that the slope of the tangent line is horizontal. Such points are local maxima, local minima, or inflection points of the function.
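You can verify all three bullet points numerically with a finite-difference approximation of the derivative. A small sketch (the toy parabola is my own example):

def derivative(f, x, eps=1e-6):
    # Finite-difference approximation of the slope of f at point x
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda x: (x - 3) ** 2   # parabola with its minimum at x = 3

print(derivative(f, 5.0))    # ~4:    positive and steep, f is rising fast
print(derivative(f, 2.9))    # ~-0.2: negative and shallow, f is falling gently
print(derivative(f, 3.0))    # ~0:    flat, the bottom of the valley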
A few remarks on gradient descent peculiarities
One thing to note is that gradient descent is not guaranteed to find the global minimum. It can find a local minimum and stop there if the stopping threshold conditions are met. This depends on factors like the shape of the function, the initially selected random parameters, the learning rate, and so on…
Historical remarks on "regression"
As promised, here are the historical roots of the word "regression" in "linear regression". The term was first used by Sir Francis Galton, a British polymath, in the context of his studies on heredity. In the late 19th century, Galton was investigating the relationship between the heights of parents and their children. He observed a phenomenon that he termed "regression towards mediocrity" (now more commonly referred to as "regression to the mean").
Galton noticed that, although tall parents tended to have taller-than-average children, these children were generally not as tall as their parents. Similarly, the children of short parents tended to be shorter than average, but not as short as their parents. In other words, the children's heights tended to "regress," or move towards, the average height of the population.
Galton's work was later formalized mathematically by Karl Pearson and others, who applied the method of least squares to find the best-fitting line through a set of data points. This method was used to quantify the strength and nature of the relationship between two variables, a process they termed "regression analysis."
Over time, the term βregressionβ became associated with statistical techniques used to model and analyze the relationship between variables.
You can see an example of the implementation of linear regression here: https://colab.research.google.com/drive/1PkuGT8sK6CLYz5_CheUa21aMSBKCySKY?usp=sharing
Published via Towards AI