Understanding Gradient Descent: How Machines Learn Step by Step

Last Updated on September 23, 2025 by Editorial Team

Author(s): Aditya Gupta

Originally published on Towards AI.

Gradient Descent Explained the Easiest Way, A Beginner’s Guide You’ll Actually Remember

Have you ever played Angry Birds?

What is the first thing you do when you start the game?

If you are a beginner, you might just estimate a shot and try to hit the piggies. If you miss, you notice how far off you were and adjust your aim and force for the next try. You repeat this process, learning from each attempt, until you finally hit the piggies or achieve your goal.

Figure: An Angry Birds projectile hitting the pigs

Think about hitting the piggies in Angry Birds. Whether you succeed depends mainly on two things: the force of your shot and the angle of your aim. These are the key factors that control the outcome. In machine learning terms, we can call these parameters, the variables the machine adjusts to get the desired result.

We can write a simple equation to represent this idea:

Hit Success = f(Force, Angle)

Here, Force and Angle are the parameters, and Hit Success is the outcome, whether you hit the piggies or how close you get. Initially, you don’t know the perfect combination of force and angle, so you start with a guess. Each time you shoot and miss, you measure how far off you were. This feedback is like the machine calculating an error, which tells it how to adjust the parameters in the next attempt.

Just like adjusting force and angle step by step in Angry Birds helps you eventually hit the piggies, gradient descent allows the machine to adjust its parameters step by step to reach the desired output.

How does this work in machine terms?

Now, when playing the game, your brain naturally analyzes how far you are from hitting the piggies. That distance, how far off your shot was, is essentially your loss. In machine learning, this is exactly what a loss function does. It measures how far the machine’s prediction is from the desired output. This loss is vital for gradient descent because it tells the machine how much and in which direction to adjust its parameters to improve the next attempt.
To learn more about loss functions, check out my earlier article.
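
As a small illustration of what a loss function does, the sketch below computes the mean squared error, the same loss used in the code later in this article. The function name and the sample numbers are made up for this example.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical targets and two sets of guesses
targets = [2.0, 4.0, 6.0]
far_off = [0.0, 0.0, 0.0]   # a poor first shot
closer = [1.5, 3.5, 5.5]    # a better second shot

loss_far = mean_squared_error(far_off, targets)
loss_close = mean_squared_error(closer, targets)
print(loss_far, loss_close)
```

The closer guesses give the smaller loss, which is the signal gradient descent uses to decide whether it is improving.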

Gradient Descent

So far, we have not yet looked at how gradient descent actually works in a machine. Now, let’s focus on the method itself and understand how it helps a model find the best parameters to minimize the loss.

Gradient descent is at the heart of machine learning. It is an iterative method used to find the values of parameters that minimize the loss function. In other words, it tells the machine how to adjust its parameters to make better predictions.

The key idea behind gradient descent is the gradient, which is the derivative of the loss function with respect to a parameter (with several parameters, it is the collection of these derivatives, one per parameter). The derivative tells us how fast the loss is changing at the current value of the parameter and in which direction the loss is increasing. Just as you would make a bigger correction in Angry Birds when your shot was way off, a larger derivative leads to a larger adjustment. By knowing this, we can move the parameter in the opposite direction to reduce the loss.
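
To see what the derivative reports, here is a minimal sketch that estimates it numerically. The loss L(θ) = (θ - 3)², the helper name numerical_gradient, and the sample points are assumptions chosen for illustration, not part of the original article.

```python
def loss(theta):
    # Hypothetical one-parameter loss with its minimum at theta = 3
    return (theta - 3.0) ** 2

def numerical_gradient(f, theta, h=1e-5):
    # Central difference approximation of df/dtheta
    return (f(theta + h) - f(theta - h)) / (2 * h)

grad_far = numerical_gradient(loss, 10.0)   # far to the right of the minimum
grad_near = numerical_gradient(loss, 3.5)   # close to the minimum
grad_left = numerical_gradient(loss, 1.0)   # to the left of the minimum

print(grad_far, grad_near, grad_left)
```

Far from the minimum the derivative is large, near the minimum it is small, and on the left side it is negative, so its size and sign together say how far and which way to move.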

Mathematically, we can write the update for a parameter θ as:

θ ← θ - α * dL/dθ

Here,

θ is the parameter we want to optimize,

L(θ) is the loss function,

dL/dθ is the derivative of the loss with respect to θ, also called the gradient,

α is the learning rate, which controls the size of each step.

Let’s break this down:

dL/dθ tells us the slope of the loss function at the current parameter value. A positive slope means increasing the parameter increases the loss, while a negative slope means increasing the parameter decreases the loss.

We subtract α * dL/dθ because we want to move in the direction that reduces the loss, not increases it.

The learning rate α controls how big each adjustment is. Too small, and the process will take a long time. Too large, and we might overshoot the minimum.
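
To make the update concrete, here is a single step worked by hand. The loss L(θ) = (θ - 3)², the starting value θ = 0, and the learning rate α = 0.1 are chosen purely for illustration.

```latex
\begin{aligned}
L(\theta) &= (\theta - 3)^2, \qquad \frac{dL}{d\theta} = 2(\theta - 3) \\
\theta_0 &= 0: \quad \frac{dL}{d\theta}\Big|_{\theta_0} = 2(0 - 3) = -6 \\
\theta_1 &= \theta_0 - \alpha \frac{dL}{d\theta}\Big|_{\theta_0} = 0 - 0.1 \times (-6) = 0.6 \\
L(\theta_0) &= 9, \qquad L(\theta_1) = (0.6 - 3)^2 = 5.76
\end{aligned}
```

The slope is negative, so subtracting α times the slope increases θ, and the loss drops from 9 to 5.76 in one step.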

The process is repeated for many iterations. Each time, the parameters are adjusted slightly based on the gradient, the loss is recalculated, and the parameters move closer to the values that minimize the loss. Eventually, the machine reaches a point where further changes do not significantly reduce the loss. This point is called convergence.
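
The repeated process can be sketched as a loop with a stopping rule. This is a minimal sketch under stated assumptions: the loss is the hypothetical L(θ) = (θ - 3)², and "convergence" is approximated by stopping once a step changes θ by less than a small tolerance, which is one common choice among several.

```python
def gradient(theta):
    # Derivative of the hypothetical loss L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, learning_rate=0.1, tolerance=1e-6, max_iterations=1000):
    """Repeat the update until a step changes theta by less than `tolerance`."""
    for iteration in range(1, max_iterations + 1):
        step = learning_rate * gradient(theta)
        theta = theta - step
        if abs(step) < tolerance:
            break
    return theta, iteration

theta_final, steps_taken = gradient_descent(theta=0.0)
print(f"theta = {theta_final:.5f} after {steps_taken} iterations")
```

The max_iterations cap is a safety net: if the learning rate is poorly chosen, the loop still ends.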

In short, gradient descent is like a step-by-step guide that tells the machine how to improve its predictions by constantly checking the slope of the loss function and adjusting parameters in the best possible direction.

Visualization

Figure: How gradient descent reaches the minimum

This image shows a 3D view of a loss function with respect to two parameters, X and Y. The vertical axis, Z, represents the loss. The shape you see is like a bowl, where the lowest point at the bottom of the bowl represents the minimum loss, meaning the optimal values for the parameters.

Gradient descent works by starting at a random point on this surface and then moving step by step in the direction that reduces the loss. You can imagine a ball rolling down the surface: it gradually moves toward the lowest point. Each step is guided by the slope of the surface at that point, which is calculated using derivatives.

The colors in the image also help you visualize the height of the surface: warmer colors like red show higher loss, and cooler colors like blue show lower loss. The goal of gradient descent is to adjust the parameters until the ball reaches the blue region at the bottom, achieving the best possible outcome.

This 3D visualization makes it easier to understand how gradient descent navigates a loss surface with two parameters. The same idea carries over to real models, which can have thousands, millions, or even billions of parameters.
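
The ball-rolling picture can be sketched in a few lines for two parameters. The bowl-shaped loss x² + y², the starting point, and the learning rate below are assumptions for illustration and are not the surface shown in the image.

```python
def loss(x, y):
    # Hypothetical bowl-shaped surface with its lowest point at (0, 0)
    return x ** 2 + y ** 2

def gradient(x, y):
    # Partial derivatives of the loss with respect to x and y
    return 2.0 * x, 2.0 * y

x, y = 4.0, -3.0          # arbitrary starting point on the surface
learning_rate = 0.1
path = [(x, y, loss(x, y))]

for _ in range(50):
    grad_x, grad_y = gradient(x, y)
    x -= learning_rate * grad_x   # step downhill along x
    y -= learning_rate * grad_y   # step downhill along y
    path.append((x, y, loss(x, y)))

print(f"start loss = {path[0][2]:.2f}, final loss = {path[-1][2]:.6f}")
```

Each parameter gets its own derivative, but the update rule is the same one used for a single parameter.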

Coding this ourselves

Before we dive into the code, let’s understand what it does. We’re going to implement gradient descent in Python to find the best parameter for a simple linear relationship, y = 2x. Think of it as teaching the computer to guess the correct slope step by step.

# Simple gradient descent to fit y = 2x
import numpy as np
import matplotlib.pyplot as plt

# Our dataset
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 6, 8, 10]) # True relationship: y = 2x

# Initialize parameter
theta = 0.0

# Hyperparameters
learning_rate = 0.01
iterations = 20

# Store loss values for plotting
loss_history = []

# Gradient Descent Loop
for i in range(iterations):
    predictions = theta * X
    error = predictions - Y
    loss = np.mean(error**2)  # Mean Squared Error
    loss_history.append(loss)

    gradient = (2/len(X)) * np.dot(error, X)  # Derivative of MSE w.r.t theta
    theta = theta - learning_rate * gradient  # Update theta

    print(f"Iteration {i+1}: theta = {theta:.4f}, loss = {loss:.4f}")

# Plot how loss decreases over iterations
plt.plot(range(iterations), loss_history, marker='o')
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.title("Gradient Descent: Loss over Iterations")
plt.show()

Start with a guess: We initialize the parameter theta with a starting value (0.0 in this code; a random value would also work). This is our starting point, just like your first shot in Angry Birds.
Predict and calculate error: Using the current theta, the code predicts outputs for all input values and measures the error — how far the predictions are from the actual results.
Compute the gradient: The gradient tells us the direction and size of the step we need to take to reduce the error. If the gradient is positive, we need to decrease theta; if it’s negative, we need to increase it.
Update the parameter: Using the learning rate, the code adjusts theta slightly in the direction that reduces the error. This is repeated over many iterations.
Track progress: We calculate and store the loss at each step so we can visualize how the error decreases over time.
Visualize the result: Finally, we plot the loss over iterations to see how gradient descent gradually helps theta approach the correct value.
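
The gradient line in the code follows from differentiating the mean squared error with respect to theta. A short derivation, writing n for the number of data points and x_i, y_i for the entries of X and Y (notation introduced here for convenience):

```latex
\begin{aligned}
L(\theta) &= \frac{1}{n} \sum_{i=1}^{n} (\theta x_i - y_i)^2 \\
\frac{dL}{d\theta} &= \frac{1}{n} \sum_{i=1}^{n} 2(\theta x_i - y_i)\, x_i
  = \frac{2}{n} \sum_{i=1}^{n} (\theta x_i - y_i)\, x_i
\end{aligned}
```

The sum is the dot product of error and X, which is what (2/len(X)) * np.dot(error, X) computes. For the dataset above, at theta = 0 the errors are -2, -4, -6, -8, -10, the dot product with X is -110, and the gradient is (2/5) × (-110) = -44, so the first update moves theta from 0 to 0 - 0.01 × (-44) = 0.44.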

Figure: Loss versus iterations for the gradient descent run above

Learning Rate and Its Impact

The learning rate controls how big each step is in gradient descent. If it is too small, learning is very slow. If it is too large, the algorithm can overshoot the minimum and fail to converge.

Think of it like a ball rolling down a hill. A small step size moves slowly, a large step size jumps over the bottom, and the right step size moves steadily toward the minimum.

A simple plot can show this clearly: with a small learning rate the loss decreases slowly, with a well-chosen rate it drops quickly and smoothly, and with a rate that is too large it bounces around without settling or even grows.
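
The comparison can be tried on the y = 2x data from the earlier example. This is a sketch in plain Python; the function name run_gradient_descent and the three learning rates are chosen for illustration only.

```python
X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]  # same data as the example above: y = 2x

def run_gradient_descent(learning_rate, iterations=20):
    """Return the loss recorded at each iteration for one learning rate."""
    theta = 0.0
    losses = []
    n = len(X)
    for _ in range(iterations):
        errors = [theta * x - y for x, y in zip(X, Y)]
        losses.append(sum(e ** 2 for e in errors) / n)
        gradient = (2 / n) * sum(e * x for e, x in zip(errors, X))
        theta -= learning_rate * gradient
    return losses

# Learning rates chosen for illustration only
for learning_rate in (0.001, 0.04, 0.1):
    losses = run_gradient_descent(learning_rate)
    print(f"lr={learning_rate}: first loss={losses[0]:.2f}, last loss={losses[-1]:.4g}")
```

On this dataset, each update multiplies the distance between theta and 2 by (1 - 22 × learning rate). With 0.001 that factor is close to 1, so progress is slow. With 0.04 it is small, so the loss collapses quickly. With 0.1 it is -1.2, so theta jumps back and forth across 2 with growing swings and the loss increases.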

Types of Gradient Descent

1. Batch Gradient Descent

How it works: Uses all the data points to calculate the gradient and update the parameters.

Analogy: Imagine you play Angry Birds and analyze every single previous shot before deciding how to adjust your next aim. You calculate the average error of all shots before making a move.

Pros:

Very stable and accurate updates
Moves steadily toward the minimum, as long as the learning rate is not too large

Cons:

Very slow for large datasets
Requires a lot of memory and computation

2. Stochastic Gradient Descent (SGD)

How it works: Updates the parameters after every single data point instead of using all the data.

Analogy: You adjust your aim in Angry Birds after just one shot, without considering the other previous shots. You learn quickly but the updates can be jumpy or inconsistent.

Pros:

Fast updates, can start improving the model immediately
Works well for very large datasets

Cons:

The path to the minimum is noisy and fluctuates
Can overshoot or bounce around the optimal value
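
A minimal sketch of stochastic gradient descent on the y = 2x data from the earlier example, written in plain Python. The seed, the number of passes over the data, and the shuffling are assumptions made for this illustration.

```python
import random

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]  # y = 2x, as in the earlier example

random.seed(0)        # fixed seed so the run is repeatable
theta = 0.0
learning_rate = 0.01
indices = list(range(len(X)))

for epoch in range(20):
    random.shuffle(indices)                # visit the points in a new order each pass
    for i in indices:
        error = theta * X[i] - Y[i]        # error on a single data point
        gradient = 2 * error * X[i]        # gradient of that point's squared error
        theta -= learning_rate * gradient  # update immediately

print(f"theta after SGD: {theta:.4f}")
```

Because this toy dataset has no noise, every point agrees that theta should be 2, so the path here is smoother than usual. On real data the individual points disagree, and that disagreement is where the jumpiness comes from.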

3. Mini-Batch Gradient Descent

How it works: Uses a small batch of data points to calculate the gradient and update the parameters.

Analogy: You adjust your aim in Angry Birds using the average of your last few shots, not all shots and not just one. This gives a balance between speed and stability.

Pros:

Faster than batch gradient descent
More stable than SGD
Most commonly used in real-world applications

Cons:

Slightly more complex to implement than SGD
Needs careful selection of batch size
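
The three variants differ only in how many data points feed each update, so one function with a batch_size parameter can cover all of them. This is a sketch under stated assumptions: the function name minibatch_gradient_descent, its defaults, and the y = 2x data reused from the earlier example are choices made for illustration.

```python
import random

def minibatch_gradient_descent(X, Y, batch_size, learning_rate=0.01, epochs=50, seed=0):
    """Fit y = theta * x, updating theta once per batch of `batch_size` points."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    theta = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of the squared error over the batch
            gradient = sum(2 * (theta * X[i] - Y[i]) * X[i] for i in batch) / len(batch)
            theta -= learning_rate * gradient
    return theta

X = [1, 2, 3, 4, 5]
Y = [2, 4, 6, 8, 10]

theta_batch = minibatch_gradient_descent(X, Y, batch_size=len(X))  # batch gradient descent
theta_sgd = minibatch_gradient_descent(X, Y, batch_size=1)         # stochastic gradient descent
theta_mini = minibatch_gradient_descent(X, Y, batch_size=2)        # mini-batch gradient descent
print(theta_batch, theta_sgd, theta_mini)
```

Setting batch_size to the dataset size gives one update per pass, while smaller batches give more frequent updates from less data each time.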

Conclusion

Gradient descent is the backbone of how machines learn. Just like adjusting your aim in Angry Birds, the algorithm takes small steps, learns from mistakes, and gradually improves. By understanding the concepts of loss, learning rate, and types of gradient descent, you now know how machines figure out the best parameters to make accurate predictions.

Remember these key points:

The learning rate controls the step size and affects how quickly or smoothly the model learns.
Batch, stochastic, and mini-batch gradient descent offer different ways to update parameters depending on dataset size and speed requirements.
Gradient descent is a step-by-step process: with a suitable learning rate, many small adjustments move the parameters toward values that minimize the loss.

Once you grasp this, you’re ready to explore more advanced topics, like training neural networks, using momentum, and optimizing large-scale models. The principles are the same: it’s all about learning gradually from errors and moving toward the best outcome.

To get started in AI with a roadmap: AI roadmap

Follow for more explanations and to make artificial intelligence feel real!

“Some men go through a forest and see no firewood.” — English proverb

Published via Towards AI
