
Understanding Gradient Descent: How Machines Learn Step by Step
Last Updated on September 23, 2025 by Editorial Team
Author(s): Aditya Gupta
Originally published on Towards AI.
Gradient Descent Explained the Easiest Way, A Beginner’s Guide You’ll Actually Remember
Have you ever played Angry Birds?
What is the first thing you do when you start the game?
If you are a beginner, you might just estimate a shot and try to hit the piggies. If you miss, you notice how far off you were and adjust your aim and force for the next try. You repeat this process, learning from each attempt, until you finally hit the piggies or achieve your goal.

Think about hitting the piggies in Angry Birds. Whether you succeed depends mainly on two things: the force of your shot and the angle of your aim. These are the key factors that control the outcome. In machine learning terms, we can call these parameters, the variables the machine adjusts to get the desired result.
We can write a simple equation to represent this idea:
Hit Success = f(Force, Angle)
Here, Force and Angle are the parameters, and Hit Success is the outcome, whether you hit the piggies or how close you get. Initially, you don’t know the perfect combination of force and angle, so you start with a guess. Each time you shoot and miss, you measure how far off you were. This feedback is like the machine calculating an error, which tells it how to adjust the parameters in the next attempt.
Just like adjusting force and angle step by step in Angry Birds helps you eventually hit the piggies, gradient descent allows the machine to adjust its parameters step by step to reach the desired output.
How does this work in machine terms?
Now, when playing the game, your brain naturally analyzes how far you are from hitting the piggies. That distance, how far off your shot was, is essentially your loss. In machine learning, this is exactly what a loss function does. It measures how far the machine’s prediction is from the desired output. This loss is vital for gradient descent because it tells the machine how much and in which direction to adjust its parameters to improve the next attempt.
To learn more about loss functions, check out my earlier article
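To make the idea of a loss concrete, here is a minimal sketch that scores a shot by the squared distance between where it lands and where the target sits. The function name `squared_error`, the target position, and the landing points are all illustrative assumptions, not something taken from the game.

```python
def squared_error(landing_x: float, target_x: float) -> float:
    """Loss for one shot: squared distance between landing point and target."""
    return (landing_x - target_x) ** 2

# Hypothetical target position and three hypothetical shots
target = 10.0
far_miss = squared_error(4.0, target)    # landed 6 units short
near_miss = squared_error(9.0, target)   # landed 1 unit short
direct_hit = squared_error(10.0, target) # landed exactly on the target

print(far_miss, near_miss, direct_hit)   # prints 36.0 1.0 0.0
```

The further the shot is from the target, the larger the loss, and a perfect shot has a loss of zero.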
Gradient Descent
So far, we have not yet looked at how gradient descent actually works in a machine. Now, let’s focus on the method itself and understand how it helps a model find the best parameters to minimize the loss.
Gradient descent is at the heart of machine learning. It is an iterative method used to find the values of parameters that minimize the loss function. In other words, it tells the machine how to adjust its parameters to make better predictions.
The key idea behind gradient descent is the gradient, which is the derivative of the loss function with respect to a parameter. The derivative tells us how fast the loss is changing at the current value of the parameter and in which direction the loss is increasing. Just as you would make a bigger correction in Angry Birds when your shot lands far from the target, a steeper slope leads to a larger adjustment. By knowing the slope, we can move the parameter in the opposite direction to reduce the loss.
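One way to get a feel for the derivative is to estimate it numerically. The sketch below assumes a made-up loss, L(θ) = (θ − 3)², chosen only because its minimum at θ = 3 is easy to see; the helper name `numerical_derivative` is also hypothetical.

```python
def loss(theta: float) -> float:
    """Hypothetical bowl-shaped loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def numerical_derivative(f, theta: float, h: float = 1e-5) -> float:
    """Central-difference estimate of the slope of f at theta."""
    return (f(theta + h) - f(theta - h)) / (2 * h)

slope_far = numerical_derivative(loss, 10.0)   # far to the right of the minimum
slope_near = numerical_derivative(loss, 3.5)   # just to the right of the minimum
slope_left = numerical_derivative(loss, 1.0)   # to the left of the minimum

print(f"theta = 10.0 -> slope is about {slope_far:.3f}")
print(f"theta =  3.5 -> slope is about {slope_near:.3f}")
print(f"theta =  1.0 -> slope is about {slope_left:.3f}")
```

Far from the minimum the slope is large, close to it the slope is small, and on the other side of the minimum the sign flips. That is the information gradient descent uses to decide how far and in which direction to move.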
Mathematically, we can write the update for a parameter θ as:
θ_new = θ_old − α * dL/dθ
Here,
θ is the parameter we want to optimize,
L(θ) is the loss function,
dL/dθ is the derivative of the loss with respect to θ, also called the gradient,
α is the learning rate, which controls the size of each step.
Let’s break this down:
dL/dθ tells us the slope of the loss function at the current parameter value. A positive slope means increasing the parameter increases the loss, while a negative slope means increasing the parameter decreases the loss.
We subtract α * dL/dθ because we want to move in the direction that reduces the loss, not increases it.
The learning rate α controls how big each adjustment is. Too small, and the process will take a long time. Too large, and we might overshoot the minimum.
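The sign logic described above can be checked with a single update step. This is only a sketch: the loss L(θ) = (θ − 3)², the starting values, and the learning rate of 0.1 are assumptions picked to keep the arithmetic easy to follow.

```python
def gradient(theta: float) -> float:
    """Derivative of the hypothetical loss (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

def update(theta: float, alpha: float) -> float:
    """One gradient descent step: subtract alpha times the slope."""
    return theta - alpha * gradient(theta)

alpha = 0.1

# Right of the minimum: the slope is positive (+4), so theta decreases.
theta_from_right = update(5.0, alpha)

# Left of the minimum: the slope is negative (-4), so theta increases.
theta_from_left = update(1.0, alpha)

print(f"5.0 -> {theta_from_right:.2f}")  # moves down toward 3
print(f"1.0 -> {theta_from_left:.2f}")   # moves up toward 3
```

In both cases subtracting α * dL/dθ moves θ toward the minimum at 3, which is why the same rule works regardless of which side you start on.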
The process is repeated for many iterations. Each time, the parameters are adjusted slightly based on the gradient, the loss is recalculated, and the parameters move closer to the values that minimize the loss. Eventually, the machine reaches a point where further changes do not significantly reduce the loss. This point is called convergence.
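A simple way to detect convergence is to stop when the loss barely changes between two iterations. The sketch below assumes the same kind of toy loss, L(θ) = (θ − 3)², and a hypothetical `tolerance` threshold; real training loops often use other stopping rules.

```python
def loss(theta: float) -> float:
    """Hypothetical loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def gradient(theta: float) -> float:
    """Derivative of the loss above."""
    return 2.0 * (theta - 3.0)

def gradient_descent(theta, alpha=0.1, tolerance=1e-8, max_iterations=1000):
    """Repeat the update until the loss improves by less than `tolerance`."""
    previous_loss = loss(theta)
    for step in range(1, max_iterations + 1):
        theta = theta - alpha * gradient(theta)
        current_loss = loss(theta)
        if abs(previous_loss - current_loss) < tolerance:
            return theta, step  # converged: further steps barely help
        previous_loss = current_loss
    return theta, max_iterations  # gave up before reaching the tolerance

theta_final, steps_taken = gradient_descent(theta=0.0)
print(f"stopped after {steps_taken} steps at theta = {theta_final:.4f}")
```

The `max_iterations` cap is a safety net so the loop still ends if the tolerance is never reached.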
In short, gradient descent is like a step-by-step guide that tells the machine how to improve its predictions by constantly checking the slope of the loss function and adjusting parameters in the best possible direction.
Visualization

This image shows a 3D view of a loss function with respect to two parameters, X and Y. The vertical axis, Z, represents the loss. The shape you see is like a bowl, where the lowest point at the bottom of the bowl represents the minimum loss, meaning the optimal values for the parameters.
Gradient descent works by starting at a random point on this surface and then moving step by step in the direction that reduces the loss. You can imagine a ball rolling down the surface: it gradually moves toward the lowest point. Each step is guided by the slope of the surface at that point, which is calculated using derivatives.
The colors in the image also help you visualize the height of the surface: warmer colors like red show higher loss, and cooler colors like blue show lower loss. The goal of gradient descent is to adjust the parameters until the ball reaches the blue region at the bottom, achieving the best possible outcome.
This 3D visualization makes it easier to understand how gradient descent navigates surfaces with multiple parameters, even though real models can have thousands or even millions of parameters.
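The ball-rolling picture can be sketched in a few lines for a two-parameter bowl. The surface used here, loss = x² + y², is an assumption standing in for the plotted one, and the starting point and learning rate are arbitrary choices.

```python
def loss(x: float, y: float) -> float:
    """Hypothetical bowl-shaped surface with its lowest point at (0, 0)."""
    return x ** 2 + y ** 2

def gradient(x: float, y: float) -> tuple[float, float]:
    """Partial derivatives of the bowl with respect to x and y."""
    return 2.0 * x, 2.0 * y

x, y = 4.0, -3.0          # arbitrary starting point on the surface
learning_rate = 0.1
start_loss = loss(x, y)

for step in range(50):
    grad_x, grad_y = gradient(x, y)
    x -= learning_rate * grad_x   # step downhill along x
    y -= learning_rate * grad_y   # step downhill along y

print(f"start loss = {start_loss:.1f}")
print(f"after 50 steps: x = {x:.6f}, y = {y:.6f}, loss = {loss(x, y):.2e}")
```

With two parameters the gradient has two components, one per axis, and each parameter is updated with its own slope. The same pattern extends to any number of parameters.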
Coding this ourselves
Before we dive into the code, let’s understand what it does. We’re going to implement gradient descent in Python to find the best parameter for a simple linear relationship, y = 2x. Think of it as teaching the computer to guess the correct slope step by step.
# Simple gradient descent to fit y = 2x
import numpy as np
import matplotlib.pyplot as plt
# Our dataset
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 6, 8, 10]) # True relationship: y = 2x
# Initialize parameter
theta = 0.0
# Hyperparameters
learning_rate = 0.01
iterations = 20
# Store loss values for plotting
loss_history = []
# Gradient Descent Loop
for i in range(iterations):
    predictions = theta * X
    error = predictions - Y
    loss = np.mean(error**2)  # Mean Squared Error
    loss_history.append(loss)
    gradient = (2/len(X)) * np.dot(error, X)  # Derivative of MSE w.r.t theta
    theta = theta - learning_rate * gradient  # Update theta
    print(f"Iteration {i+1}: theta = {theta:.4f}, loss = {loss:.4f}")
# Plot how loss decreases over iterations
plt.plot(range(iterations), loss_history, marker='o')
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.title("Gradient Descent: Loss over Iterations")
plt.show()
- Start with a guess: We initialize a parameter, theta, with a starting value (0.0 in this code). This is our starting point, just like your first shot in Angry Birds.
- Predict and calculate error: Using the current theta, the code predicts outputs for all input values and measures the error — how far the predictions are from the actual results.
- Compute the gradient: The gradient tells us the direction and size of the step we need to take to reduce the error. If the gradient is positive, we need to decrease theta; if it’s negative, we need to increase it.
- Update the parameter: Using the learning rate, the code adjusts theta slightly in the direction that reduces the error. This is repeated over many iterations.
- Track progress: We calculate and store the loss at each step so we can visualize how the error decreases over time.
- Visualize the result: Finally, we plot the loss over iterations to see how gradient descent gradually helps theta approach the correct value.
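The gradient line in the code comes from differentiating the mean squared error. As a short derivation, write the n data points as (x_i, y_i) and the prediction as θ·x_i (the subscript i and the count n are notation introduced here for convenience):

```latex
\begin{aligned}
L(\theta) &= \frac{1}{n}\sum_{i=1}^{n} (\theta x_i - y_i)^2 \\
\frac{dL}{d\theta} &= \frac{2}{n}\sum_{i=1}^{n} (\theta x_i - y_i)\, x_i
\end{aligned}
```

For the dataset in the code, the sum of x_i² is 55 and the sum of x_i·y_i is 110, so the gradient simplifies to (2/5)(55θ − 110) = 22(θ − 2). It is zero exactly at θ = 2, the true slope. With a learning rate of 0.01, each update shrinks the distance between theta and 2 by a factor of 1 − 0.22 = 0.78.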

Learning Rate and Its Impact
The learning rate controls how big each step is in gradient descent. If it is too small, learning is very slow. If it is too large, the algorithm can overshoot the minimum and fail to converge.
Think of it like a ball rolling down a hill. A small step size moves slowly, a large step size jumps over the bottom, and the right step size moves steadily toward the minimum.
A simple plot can show this clearly: small learning rates decrease the loss slowly, well-chosen rates make it drop quickly and smoothly, and rates that are too large make the loss fluctuate without settling, or even grow.
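The effect can also be seen numerically. The sketch below reuses the article's y = 2x dataset and tries four learning rates; the specific values 0.001, 0.05, and 0.1 are illustrative choices, and the helper names are hypothetical. For this particular dataset the gradient works out to 22(θ − 2), so updates stay stable only while the learning rate is below roughly 0.09.

```python
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 6.0, 8.0, 10.0]  # true relationship: y = 2x

def mse(theta: float) -> float:
    """Mean squared error of the predictions theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(X, Y)) / len(X)

def mse_gradient(theta: float) -> float:
    """Derivative of the mean squared error with respect to theta."""
    return (2.0 / len(X)) * sum((theta * x - y) * x for x, y in zip(X, Y))

def run(learning_rate: float, iterations: int = 20) -> tuple[float, float]:
    """Run gradient descent from theta = 0 and return (theta, final loss)."""
    theta = 0.0
    for _ in range(iterations):
        theta -= learning_rate * mse_gradient(theta)
    return theta, mse(theta)

results = {}
for learning_rate in (0.001, 0.01, 0.05, 0.1):
    results[learning_rate] = run(learning_rate)
    theta, final_loss = results[learning_rate]
    print(f"lr = {learning_rate}: theta = {theta:.4f}, loss = {final_loss:.4g}")
```

After 20 iterations the smallest rate is still far from theta = 2, the middle rates get close, and the largest rate ends up with a loss bigger than where it started. Storing the loss at every iteration instead of only the final value would give the curves for the plot described above.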
Types of Gradient Descent
1. Batch Gradient Descent
How it works: Uses all the data points to calculate the gradient and update the parameters.
Analogy: Imagine you play Angry Birds and analyze every single previous shot before deciding how to adjust your next aim. You calculate the average error of all shots before making a move.
Pros:
- Very stable and accurate updates
- Moves smoothly toward the minimum, as long as the learning rate is not too large
Cons:
- Very slow for large datasets
- Requires a lot of memory and computation
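As a minimal sketch of batch gradient descent, the code below fits a slope on synthetic data. The data (1,000 points following y = 2x plus a little noise), the learning rate, and the number of passes are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption): y = 2x plus a little Gaussian noise
xs = [random.uniform(0.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]

def full_gradient(theta: float) -> float:
    """MSE gradient averaged over every data point in the dataset."""
    n = len(xs)
    return (2.0 / n) * sum((theta * x - y) * x for x, y in zip(xs, ys))

theta = 0.0
learning_rate = 0.5
epochs = 50
updates = 0

for epoch in range(epochs):
    theta -= learning_rate * full_gradient(theta)  # one update per full pass
    updates += 1

print(f"{updates} updates, theta = {theta:.3f}")
```

Every update touches all 1,000 points, so 50 passes over the data yield only 50 updates. That is the cost that grows with dataset size.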
2. Stochastic Gradient Descent (SGD)
How it works: Updates the parameters after every single data point instead of using all the data.
Analogy: You adjust your aim in Angry Birds after just one shot, without considering the other previous shots. You learn quickly but the updates can be jumpy or inconsistent.
Pros:
- Fast updates, can start improving the model immediately
- Works well for very large datasets
Cons:
- The path to the minimum is noisy and fluctuates
- Can overshoot or bounce around the optimal value
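A rough sketch of stochastic gradient descent looks like this. The synthetic data (1,000 points following y = 2x plus noise), the learning rate, and the number of passes are illustrative assumptions.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption): y = 2x plus a little Gaussian noise
xs = [random.uniform(0.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]

theta = 0.0
learning_rate = 0.05
epochs = 5
updates = 0
indices = list(range(len(xs)))

for epoch in range(epochs):
    random.shuffle(indices)  # visit the points in a new random order each pass
    for i in indices:
        # Gradient of the squared error for this single point only
        gradient = 2.0 * (theta * xs[i] - ys[i]) * xs[i]
        theta -= learning_rate * gradient
        updates += 1

print(f"{updates} updates, theta = {theta:.3f}")
```

Each pass over the data produces 1,000 updates instead of one. Because every step is based on a single noisy point, theta jitters around the best value rather than settling exactly on it.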
3. Mini-Batch Gradient Descent
How it works: Uses a small batch of data points to calculate the gradient and update the parameters.
Analogy: You adjust your aim in Angry Birds using the average of your last few shots, not all shots and not just one. This gives a balance between speed and stability.
Pros:
- Faster than batch gradient descent
- More stable than SGD
- Most commonly used in real-world applications
Cons:
- Slightly more complex to implement than SGD
- Needs careful selection of batch size
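Mini-batch gradient descent can be sketched as follows. The synthetic data, the batch size of 32, the learning rate, and the number of passes are all assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption): y = 2x plus a little Gaussian noise
xs = [random.uniform(0.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]

theta = 0.0
learning_rate = 0.2
batch_size = 32
epochs = 20
updates = 0
indices = list(range(len(xs)))

for epoch in range(epochs):
    random.shuffle(indices)  # new random batches on every pass
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]  # last batch may be smaller
        # Average the gradient over the points in this batch only
        gradient = (2.0 / len(batch)) * sum(
            (theta * xs[i] - ys[i]) * xs[i] for i in batch
        )
        theta -= learning_rate * gradient
        updates += 1

print(f"{updates} updates, theta = {theta:.3f}")
```

Averaging over 32 points smooths out much of the noise of single-point updates while still giving many updates per pass over the data.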
Conclusion
Gradient descent is the backbone of how machines learn. Just like adjusting your aim in Angry Birds, the algorithm takes small steps, learns from mistakes, and gradually improves. By understanding the concepts of loss, learning rate, and types of gradient descent, you now know how machines figure out the best parameters to make accurate predictions.
Remember these key points:
- The learning rate controls the step size and affects how quickly or smoothly the model learns.
- Batch, stochastic, and mini-batch gradient descent offer different ways to update parameters depending on dataset size and speed requirements.
- Gradient descent is a step-by-step process, and with a suitable learning rate, many small adjustments move the parameters toward the values that minimize the loss.
Once you grasp this, you’re ready to explore more advanced topics, like training neural networks, using momentum, and optimizing large-scale models. The principles are the same: it’s all about learning gradually from errors and moving toward the best outcome.
To get started in AI, see the AI roadmap.
Follow for more explanations that make artificial intelligence feel real!
“Some men go through a forest and see no firewood.” — English proverb
Published via Towards AI