Gradient-Based Learning In Numpy

Last Updated on September 17, 2024 by Editorial Team

Author(s): Shashank Bhushan

Originally published on Towards AI.

What is Gradient-Based Learning?

Mathematically, training a neural network can be framed as an optimization problem in which we try to either maximize or minimize some function f(x). f(x) is usually referred to as the cost function, loss function, or error function. Common examples are Mean Squared Error for regression problems and Cross Entropy for classification problems. Maximizing f(x) can be reframed as minimizing −f(x), which is what is done in practice. So from now on, we will only discuss minimizing f(x).

As the loss function f is fixed, minimizing f(x) means finding the x at which f(x) is smallest. Gradient Descent is a gradient-based learning/optimization algorithm that can be used to find this minimum. It's the most popular way to train Neural Networks. To understand how Gradient Descent works, we first need to review a bit of calculus. For a given function y = f(x), the derivative or gradient, denoted by f’(x), describes the slope of f(x) at x. The slope at a given point x allows us to estimate how small changes to x change f(x):

f(x + ε) ≈ f(x) + εf’(x)

If it’s unclear why the above formulation works, consider the following example.

For the function specified by the blue curve (figure on the left), we want to know how small changes to x (specified by the red dot) change f(x), i.e. what will be the value of f(x+ε) on the blue curve? The orange line specifies the slope at x. Now, if we were to zoom in a lot at x (figure on the right), the blue curve would turn into a straight line, and this line would be the same as the orange line. For any line f(x) = cx + b, the value at x+ε will be cx + b + cε = f(x) + cε, where c is the slope of the line. Thus we can generalize to f(x + ε) ≈ f(x) + εf’(x) for small values of ε.
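The approximation can be checked numerically. The sketch below is illustrative only: the curve in the figure is not specified, so f(x) = x² and the point x = 1.5 are assumptions chosen because the derivative is easy to write by hand.

```python
def f(x):
    # Assumed example function; not the curve from the figure
    return x ** 2

def f_prime(x):
    # Derivative of x^2
    return 2 * x

def linear_estimate(x, eps):
    # First-order estimate: f(x + eps) ≈ f(x) + eps * f'(x)
    return f(x) + eps * f_prime(x)

x = 1.5
for eps in (0.1, 0.01, 0.001):
    exact = f(x + eps)
    approx = linear_estimate(x, eps)
    print(f"eps={eps}: exact={exact:.6f}, approx={approx:.6f}, "
          f"error={abs(exact - approx):.8f}")
```

For this choice of f the error is ε², so it shrinks much faster than ε itself, which is why the estimate is only trusted for small steps.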

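Based on the above formulation, the derivative/slope can be used to make small changes to x, x’ = x+ε, such that f(x’) < f(x). If ε is given the opposite sign to f’(x), the term εf’(x) is negative, so f(x + ε) ≈ f(x) + εf’(x) is smaller than f(x).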

We can thus reduce f(x) by moving x in small steps in the opposite direction of the gradient or slope. This is the Gradient Descent technique. Formally, gradient descent is an iterative process in which we start with a random value of x and at each step update it by:

x ← x − γf’(x)

The term γ controls how big a step we take along the derivative/slope. It is commonly known as the learning rate of the algorithm.
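As a minimal sketch of this update rule, the snippet below runs gradient descent on a one-dimensional function. The function f(x) = (x − 3)², the helper name gradient_descent_1d, and the fixed starting point (used instead of a random one so the run is reproducible) are all assumptions for illustration.

```python
def gradient_descent_1d(f_prime, x0, lr=0.1, iters=100):
    # Repeatedly step against the slope: x <- x - lr * f'(x)
    x = x0
    for _ in range(iters):
        x = x - lr * f_prime(x)
    return x

# Assumed example: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3); the minimum is at x = 3
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=10.0)
print(x_min)
```

With lr=0.1 each step shrinks the distance to the minimum by a constant factor; a learning rate that is too large for the function would instead overshoot and diverge.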

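Note: Gradient descent is only guaranteed to reach the global minimum when the loss function is convex. On a non-convex loss it can still be applied, but it may settle in a local minimum or a saddle point.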

Using Gradient Descent To Solve a System of Linear Equations

A system of linear equations can be seen as a simple single-layer Neural Network (without the bias and activation). So solving one using gradient descent allows us to see gradient descent in practice without getting distracted by the other components of a Neural Network. With that in mind, suppose we are given the following system of linear equations:

5x₁ + x₂ − x₃ = 1
2x₁ − x₂ + x₃ = 4
x₁ + 3x₂ − 2x₃ = 0

We want to find x₁, x₂, and x₃ such that the left side of the equation sums up to the right side.

Note: x₁, x₂, and x₃ are generally referred to as weights and denoted by the term w. I am using x instead of w to be consistent with the notation above.

To solve this using Gradient Descent, we first need to define the loss or objective function that should be optimized/reduced. Mean Squared Error (MSE) is a good option here. For n equations, MSE is given by:

MSE = (1/n) Σᵢ (Yᵢ − Ŷᵢ)²

Here Yᵢ is the expected output (1, 4, or 0) and Ŷᵢ is the value of the LHS of the i-th equation. We will write the set of equations as Ax = Ŷ to simplify the notation. Now, to use gradient descent, we need to compute the derivative of the loss with respect to x:

dMSE/dx = dMSE/dŶ · dŶ/dx

While the gradient of the loss with respect to x can be computed directly here, I am breaking it down into the gradient of the loss with respect to Ŷ and the gradient of Ŷ with respect to x (using the chain rule). That is how it is usually done in code, as it allows for automatic gradient computation. Now let’s calculate the two terms. I am ignoring the summation term to simplify things, but we need to apply the summation and averaging before updating the values of x.

dMSE/dŶ = 2(Ŷ − Y)
dŶ/dx = A
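One way to gain confidence in the chain-rule breakdown is to compare it against a finite-difference estimate. The sketch below uses the A and Y of the system in this section; the helper names (mse, mse_grad, numerical_grad) and the test point x are illustrative assumptions.

```python
import numpy as np

A = np.array([[5, 1, -1], [2, -1, 1], [1, 3, -2]], dtype=float)
Y = np.array([1, 4, 0], dtype=float)

def mse(x):
    # Mean squared error between the targets and A.dot(x)
    return np.mean((Y - A.dot(x)) ** 2)

def mse_grad(x):
    # Chain rule: dMSE/dY_hat = 2 * (Y_hat - Y), dY_hat/dx = A,
    # then average over the equations
    loss_wrt_Yhat = 2 * (A.dot(x) - Y)
    return loss_wrt_Yhat.dot(A) / len(Y)

def numerical_grad(fn, x, h=1e-6):
    # Central finite difference, one coordinate at a time
    grad = np.zeros_like(x)
    for j in range(len(x)):
        step = np.zeros_like(x)
        step[j] = h
        grad[j] = (fn(x + step) - fn(x - step)) / (2 * h)
    return grad

x = np.array([0.5, -1.0, 2.0])  # arbitrary test point (assumption)
analytic = mse_grad(x)
numeric = numerical_grad(mse, x)
print(analytic, numeric)
```

If the two vectors disagree beyond rounding error, either a sign or the averaging step in the analytic gradient is wrong.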

Putting it all together, the following is a code snippet of how to use gradient descent to solve the system of equations
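A minimal sketch of such a solver is shown below. The function name and its arguments follow the call used later in this article, and loss_wrt_x is the name the text refers to; the zero initialization of x, the logging interval, and the remaining variable names are assumptions.

```python
import numpy as np

def sovleUsingGradient(A, Y, lr=0.01, iters=10000):
    A = np.asarray(A, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = A.shape[0]
    x = np.zeros(A.shape[1])  # assumed starting point
    for i in range(iters):
        Y_hat = A.dot(x)                      # forward pass: LHS of the equations
        loss = np.mean((Y - Y_hat) ** 2)      # MSE
        if i % 1000 == 0:
            print(f"Loss at iter: {i}: {loss}")
        loss_wrt_Yhat = 2 * (Y_hat - Y)       # dMSE/dY_hat, per equation
        Yhat_wrt_x = A                        # dY_hat/dx
        # Chain rule, with the summation and averaging applied
        loss_wrt_x = loss_wrt_Yhat.dot(Yhat_wrt_x) / n
        x = x - lr * loss_wrt_x               # gradient descent update
    return x
```

The learning rate has to be small enough for the given A; too large a value makes the loss grow instead of shrink.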

One thing to note above is that the shape of loss_wrt_x, or dMSE/dx, is the same as that of x. This means that the gradient is computed and applied per dimension, though the step size, i.e. the learning rate γ, is the same for all dimensions. While the above is the basic Gradient Descent setup, there are more advanced gradient descent algorithms, such as AdaGrad and Adam, which adapt the step size separately for each dimension. For more information about advanced gradient-based algorithms, refer to this excellent article.

Running the above function on the set of equations from earlier, we get:

A = np.array([[5, 1, -1], [2, -1, 1], [1, 3, -2]])
Y = np.array([1, 4, 0])
x = sovleUsingGradient(A, Y, lr=0.01, iters=10000)

Loss at iter: 0: 5.666666666666667
Loss at iter: 1000: 0.5797261956378661
Loss at iter: 2000: 0.13383958398415846
Loss at iter: 3000: 0.0308991285469566
Loss at iter: 4000: 0.007133585719112376
Loss at iter: 5000: 0.0016469087513129724
Loss at iter: 6000: 0.0003802167019434473
Loss at iter: 7000: 8.77794475992898e-05
Loss at iter: 8000: 2.0265368095217914e-05
Loss at iter: 9000: 4.678602511941347e-06

x
array([0.71418766, 4.42539244, 6.99559903])

A.dot(x)
array([ 1.00073170e+00, 3.99858189e+00, -8.33062319e-04])

As we can see, the loss goes down over time, and after 10000 iterations Ax gets close to Y.
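For a system this small, the result can also be compared with a direct solve. The sketch below uses np.linalg.solve as the reference and copies x_gd from the output above; the variable names are assumptions.

```python
import numpy as np

A = np.array([[5, 1, -1], [2, -1, 1], [1, 3, -2]], dtype=float)
Y = np.array([1, 4, 0], dtype=float)

# Direct solution of Ax = Y, used only as a reference
x_exact = np.linalg.solve(A, Y)

# Value reported by the gradient descent run above
x_gd = np.array([0.71418766, 4.42539244, 6.99559903])

print(x_exact)
print(np.max(np.abs(x_exact - x_gd)))
```

The remaining gap reflects that gradient descent was stopped after a fixed number of iterations rather than run to full convergence.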

Building a Multi-Layer Perceptron

Let’s now see how gradient descent is used in actual neural network training. For this, we will build a Multi-Layer Perceptron or MLP on a simplified MNIST dataset with only 2 classes (0 and 1) instead of 10. The following figure captures the overall architecture that we will be building.

Fig: MLP Setup for Determining if Digit is 0 or 1, figure by author

We will go over the different components mentioned in the diagram above one by one.

Binary Cross Entropy

As this is a binary classification problem, we will use the binary cross entropy loss. The following is the definition of binary cross-entropy:

BCE = −(1/n) Σᵢ [Yᵢ log(Ŷᵢ) + (1 − Yᵢ) log(1 − Ŷᵢ)]

Similar to MSE, in order to train a model using this loss we need to compute its gradient. Ignoring the averaging term as before:

dBCE/dŶ = −Y/Ŷ + (1 − Y)/(1 − Ŷ) = (Ŷ − Y) / (Ŷ(1 − Ŷ))

Writing this out in python/numpy:
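A minimal sketch of the loss and its gradient with respect to Ŷ follows. The function names and the clipping constant eps (added to avoid log(0) and division by zero) are assumptions, and the gradient is returned per sample, without the averaging term.

```python
import numpy as np

def binary_cross_entropy(Y, Y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 so the logs stay finite
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

def binary_cross_entropy_grad(Y, Y_hat, eps=1e-12):
    # Per-sample dBCE/dY_hat = (Y_hat - Y) / (Y_hat * (1 - Y_hat))
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    return (Y_hat - Y) / (Y_hat * (1 - Y_hat))

Y = np.array([1.0, 0.0])
Y_hat = np.array([0.5, 0.5])  # hypothetical predictions
print(binary_cross_entropy(Y, Y_hat))
print(binary_cross_entropy_grad(Y, Y_hat))
```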

Sigmoid Activation

Another new component is the sigmoid activation function. Activation functions in general are meant to add non-linearity to a Neural Network. The sigmoid activation function is defined as:

σ(x) = 1 / (1 + e^(−x))

Here again, we need to compute the derivative of the function in order to use it in gradient descent/backpropagation:

σ’(x) = σ(x)(1 − σ(x))

Note: σ(x) is the same as Ŷ in the binary cross entropy formula. By the chain rule, the gradient of binary cross entropy, (Ŷ − Y)/(Ŷ(1 − Ŷ)), is multiplied by the sigmoid derivative, Ŷ(1 − Ŷ), so the combined gradient of the sigmoid and binary cross entropy becomes Ŷ − Y. This has the same form as the gradient of MSE, up to a constant factor. As the sigmoid function is responsible for converting the logits/neural network outputs into probability values for the cross-entropy function, the similarity between the two gradient formulations is interesting. It suggests that, when it comes to learning the weights, a regression problem and a classification problem lead to updates of the same form.

Writing this out in python/numpy:
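A minimal sketch of the sigmoid and its derivative is given below, followed by a numerical check of the combined gradient from the note above. The function names and the sample logits and labels are assumptions, and the sigmoid is written in its plain form, which is not hardened against overflow for very large negative inputs.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# Check that dBCE/dY_hat * dY_hat/dz collapses to Y_hat - Y
z = np.array([-2.0, 0.0, 3.0])   # hypothetical logits
Y = np.array([0.0, 1.0, 1.0])    # hypothetical labels
Y_hat = sigmoid(z)
bce_wrt_Yhat = (Y_hat - Y) / (Y_hat * (1.0 - Y_hat))
combined = bce_wrt_Yhat * sigmoid_grad(z)
print(combined, Y_hat - Y)
```

In practice this cancellation is often used directly: computing Ŷ − Y avoids dividing by Ŷ(1 − Ŷ), which gets close to zero when the prediction saturates.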

Dense Layer

The final component is the Dense Layer. A dense layer is defined as:

Ŷ = wX + b

Here w is the weight matrix, X is the input, and b is the bias term. Mathematically the representation is very similar to that of a system of linear equations. So the derivative needed to update the weights will be the same. There are however two things worth discussing here:

Weight Initialization: Unlike the system of linear equations, the choice of initial weights plays a big role in training convergence, i.e. in the ability of gradient descent to work. Generally, unit norm initialization tends to give good results. For a more in-depth discussion on weight initialization, refer to this excellent article.
Automatic Gradient Computation: In an MLP there can be multiple dense layers stacked on top of each other. Mathematically, the gradient for any layer represented by x whose output is Ŷ can be written as: dLoss/dx = dLoss/dŶ · dŶ/dx

The latter part of the multiplication is easy to compute. For the first part, dLoss/dŶ, we essentially need the combined gradient for all the layers after the given layer. As Ŷ is the input to the next layer, this means each layer needs to return the derivative of its output with respect to its input, multiplied by the combined gradient of the layers above it.

Putting this all together in python/numpy:
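A minimal sketch of a dense layer with a forward and a backward pass is shown below. The class name, method names, and constructor arguments are assumptions, as is the batch convention: inputs are stored as rows of X with shape (batch, in_dim), so wX + b is computed as X.dot(w) + b. Unit norm initialization is interpreted here as scaling each column of w to length 1, which is also an assumption.

```python
import numpy as np

class Dense:
    def __init__(self, in_dim, out_dim, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.standard_normal((in_dim, out_dim))
        # Assumed reading of "unit norm": each weight column has length 1
        self.w = w / np.linalg.norm(w, axis=0, keepdims=True)
        self.b = np.zeros(out_dim)
        self.lr = lr

    def forward(self, X):
        # Cache the input; it is needed for the weight gradient
        self.X = X
        return X.dot(self.w) + self.b

    def backward(self, grad_out):
        # grad_out is dLoss/dY_hat from the layers above, shape (batch, out_dim)
        n = self.X.shape[0]
        grad_w = self.X.T.dot(grad_out) / n   # dLoss/dw, averaged over the batch
        grad_b = grad_out.mean(axis=0)        # dLoss/db, averaged over the batch
        # Gradient passed to the layer below: dLoss/dX, using the old weights
        grad_in = grad_out.dot(self.w.T)
        self.w -= self.lr * grad_w
        self.b -= self.lr * grad_b
        return grad_in
```

Returning grad_in from backward is what lets layers be chained: each layer only needs the gradient handed down by the one above it.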

We now have all the pieces we need to build and train an MLP. I am leaving the final stitching of all the components together as an exercise. If you however are stuck or are unsure about anything you can refer to the full implementation below:

numpynet/mlp.py at main · monkeydunkey/numpynet (github.com): Neural Net implementation in pure numpy and python

