# Gradient-Based Learning In Numpy

Last Updated on September 17, 2024 by Editorial Team

**Author(s): Shashank Bhushan**

Originally published on Towards AI.

## What is Gradient-Based Learning?

Mathematically, training a neural network can be framed as an optimization problem in which we either try to maximize or minimize some function *f(x)*. *f(x)* is usually referred to as the cost function, loss function, or error function. Common examples are Mean Squared Error for regression problems and Cross Entropy for classification problems. Maximization of *f(x)* can be framed as a minimization problem by changing the objective function to *−f(x)*, which is what is done in practice. So from now on, we will only discuss minimizing *f(x)*.

As the loss function *f* is fixed, minimizing *f(x)* becomes finding the *x* that minimizes *f(x)*. Gradient Descent is a gradient-based learning/optimization algorithm that can be used to find this minimum, and it is the most popular way to train Neural Networks. To understand how Gradient Descent works, we first need to review a bit of calculus. For a given function *y = f(x)*, the derivative or gradient, denoted by *f′(x)*, describes the slope of *f(x)* at *x*. The slope at a given point *x* allows us to estimate how small changes to *x* change *f(x)*:

*f(x + ε) ≈ f(x) + εf′(x)*

If it's unclear why the above formulation works, consider the following example.

For the function specified by the blue curve (figure on the left), we want to know how small changes to *x* (specified by the red dot) change *f(x)*, i.e., what will be the value of *f(x + ε)* on the blue curve? The orange line specifies the slope at *x*. Now, if we were to zoom in a lot at *x* (figure on the right), the blue curve would turn into a straight line, and this line would be the same as the orange line. For any line *f(x) = cx + b*, the value at *x + ε* will be *cx + b + cε = f(x) + cε*, where *c* is the slope of the line. Thus we can generalize to *f(x + ε) ≈ f(x) + εf′(x)* for small values of ε.

Based on the above formulation, the derivative/slope can be used to make small changes to *x*, *x′ = x + ε*, such that *f(x′) < f(x)*.

We can thus reduce *f(x)* by moving *x* in small steps in the opposite direction of the gradient or slope. **This is the Gradient Descent technique.** Formally, gradient descent is an iterative process in which we start with a random value of *x* and at each step update it by:

*x ← x − γf′(x)*

The term *γ* controls how big of a step we take along the derivative/slope. It is commonly known as the learning rate of the algorithm.

Note: Gradient descent is only guaranteed to find the global minimum when the loss function is convex; on non-convex losses it may settle in a local minimum.
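To make the update rule concrete, here is a minimal sketch of gradient descent on the convex function *f(x) = x²*, whose derivative is *f′(x) = 2x* (the function name and hyperparameters are illustrative choices, not from the article):

```python
def gradient_descent(x, lr=0.1, iters=50):
    """Minimize f(x) = x^2 starting from x, using f'(x) = 2x."""
    for _ in range(iters):
        grad = 2 * x        # f'(x)
        x = x - lr * grad   # the update: x <- x - lr * f'(x)
    return x

x_min = gradient_descent(5.0)  # ends up very close to the minimum at 0
```

Each step scales *x* by *(1 − 2γ)*, so with a small enough learning rate the iterates shrink toward the minimum at 0.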

## Using Gradient Descent To Solve a System of Linear Equations

A system of linear equations can be seen as a simple single-layer Neural Network (without the bias and activation). So solving one using gradient descent lets us see gradient descent in practice without getting distracted by the other components of a Neural Network. With this out of the way, suppose we are given the following system of linear equations:

*5x₁ + x₂ − x₃ = 1*
*2x₁ − x₂ + x₃ = 4*
*x₁ + 3x₂ − 2x₃ = 0*

We want to find *x₁*, *x₂*, and *x₃* such that the left side of each equation equals the right side.

Note: x₁, x₂, and x₃ are generally referred to as weights and denoted by the term *w*. I am using *x* instead of *w* to be consistent with the notation above.

To solve this using Gradient Descent, we first need to define the loss or objective function to be optimized/reduced. Mean Squared Error (MSE) is a good option here. MSE is given by:

*MSE = (1/n) Σᵢ (Yᵢ − Ŷᵢ)²*

Here *Yᵢ* is the expected output (1, 4, or 0) and *Ŷᵢ* is the value of the LHS of the corresponding equation. We will use *Ax = Ŷ* to denote the set of equations to simplify the notation. Now, to use gradient descent, we need to compute the derivative of the loss w.r.t. *x*:

*dMSE/dx = (dMSE/dŶ) · (dŶ/dx)*

While the gradient of the loss w.r.t. *x* can be computed directly here, I am breaking it down into the gradient of the loss w.r.t. *Ŷ* and then the gradient of *Ŷ* w.r.t. *x* (using the chain rule), as that is how it is usually done in code: it allows for automatic gradient computation. Now let's calculate the two terms. I am ignoring the summation term to simplify things, but we need to apply the summation and averaging before updating the values of *x*:

*dMSE/dŶᵢ = 2(Ŷᵢ − Yᵢ)*
*dŶᵢ/dx = Aᵢ* (the *i*-th row of *A*)

Putting it all together, the following is a code snippet showing how to use gradient descent to solve the system of equations.
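The snippet was originally embedded from the accompanying repo; here is a minimal sketch consistent with the derivation above (the function name `solveUsingGradient`, the zero initialization, and the logging cadence are chosen to line up with the usage later in the article, not taken verbatim from the repo):

```python
import numpy as np

def solveUsingGradient(A, Y, lr=0.01, iters=10000):
    """Solve Ax = Y by minimizing the MSE between A.dot(x) and Y."""
    x = np.zeros(A.shape[1])
    n = len(Y)
    for i in range(iters):
        Y_hat = A.dot(x)                      # forward pass: Ŷ = Ax
        loss = np.mean((Y - Y_hat) ** 2)      # MSE
        loss_wrt_Y_hat = 2 * (Y_hat - Y) / n  # dMSE/dŶ, averaged over equations
        Y_hat_wrt_x = A                       # dŶᵢ/dx = Aᵢ
        loss_wrt_x = loss_wrt_Y_hat.dot(Y_hat_wrt_x)  # chain rule
        x = x - lr * loss_wrt_x               # gradient descent update
        if i % 1000 == 0:
            print(f'Loss at iter: {i}: {loss}')
    return x
```

Note that `loss_wrt_Y_hat.dot(A)` performs the summation over equations mentioned above, producing one gradient entry per dimension of `x`.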

One thing to note above is that the shape of `loss_wrt_x`, or *dMSE/dx*, is the same as that of *x*. This means the gradient is computed and applied per dimension, though the update step size, i.e., the learning rate *γ*, is the same for all dimensions. While the above is the basic Gradient Descent setup, there are advanced gradient descent algorithms, such as AdaGrad and Adam, which try to learn different update steps for different dimensions. For more information about advanced gradient-based algorithms, refer to the excellent article *An overview of gradient descent optimisation algorithms* listed in the references.

Running the above function on the set of equations from earlier, we get:

```python
A = np.array([[5, 1, -1], [2, -1, 1], [1, 3, -2]])
Y = np.array([1, 4, 0])
x = solveUsingGradient(A, Y, lr=0.01, iters=10000)
```

```
Loss at iter: 0: 5.666666666666667
Loss at iter: 1000: 0.5797261956378661
Loss at iter: 2000: 0.13383958398415846
Loss at iter: 3000: 0.0308991285469566
Loss at iter: 4000: 0.007133585719112376
Loss at iter: 5000: 0.0016469087513129724
Loss at iter: 6000: 0.0003802167019434473
Loss at iter: 7000: 8.77794475992898e-05
Loss at iter: 8000: 2.0265368095217914e-05
Loss at iter: 9000: 4.678602511941347e-06
```

```python
>>> x
array([0.71418766, 4.42539244, 6.99559903])
>>> A.dot(x)
array([ 1.00073170e+00,  3.99858189e+00, -8.33062319e-04])
```

As we can see, the loss goes down over time, and after 10,000 iterations *Ax* gets very close to *Y*.

## Building a Multi-Layer Perceptron

Letβs now see how gradient descent is used in actual neural network training. For this, we will build a Multi-Layer Perceptron or MLP on a simplified MNIST dataset with only 2 classes (0 and 1) instead of 10. The following figure captures the overall architecture that we will be building.

We will go over the different components mentioned in the diagram above one by one.

## Binary Cross Entropy

As this is a binary classification problem, we will use the binary cross-entropy loss, which is defined as:

*BCE = −(1/n) Σᵢ [Yᵢ log(Ŷᵢ) + (1 − Yᵢ) log(1 − Ŷᵢ)]*

Similar to MSE, in order to train a model using this loss we need to compute its gradient:

*dBCE/dŶᵢ = −(Yᵢ/Ŷᵢ − (1 − Yᵢ)/(1 − Ŷᵢ)) = (Ŷᵢ − Yᵢ) / (Ŷᵢ(1 − Ŷᵢ))*

Writing this out in python/numpy:
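The original code is embedded from the repo; a minimal sketch could look like the following (the class interface with `forward`/`backward` methods, and the clipping added to avoid `log(0)`, are my assumptions rather than the repo's exact implementation):

```python
import numpy as np

class BinaryCrossEntropy:
    """Binary cross-entropy loss and its gradient w.r.t. the predictions."""

    def forward(self, Y, Y_hat):
        # Clip predictions away from 0 and 1 to avoid log(0).
        Y_hat = np.clip(Y_hat, 1e-9, 1 - 1e-9)
        return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

    def backward(self, Y, Y_hat):
        # dBCE/dŶ = (Ŷ - Y) / (Ŷ (1 - Ŷ)), averaged over the batch
        Y_hat = np.clip(Y_hat, 1e-9, 1 - 1e-9)
        return (Y_hat - Y) / (Y_hat * (1 - Y_hat) * len(Y))
```

The `backward` method divides by the batch size so the returned gradient corresponds to the averaged loss computed in `forward`.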

## Sigmoid Activation

Another new component is the sigmoid activation function. Activation functions in general add non-linearity to a Neural Network. The sigmoid activation function is defined as:

*σ(x) = 1 / (1 + e⁻ˣ)*

Here again, we need to compute the derivative of the function in order to use it in gradient descent/backpropagation:

*σ′(x) = σ(x)(1 − σ(x))*

Note: *σ(x)* will be the same as *Ŷ* in the binary cross-entropy formula. So the combined gradient of the sigmoid and BinaryCrossEntropy becomes *Ŷ − Y*, which matches the gradient of MSE (up to a constant factor). As the sigmoid function is responsible for converting the logits/neural network outputs into probability values for the cross-entropy function, the similarity between the two gradient formulations is interesting: it implies that when it comes to learning the weights, there isn't much difference between a regression and a classification problem.

Writing this out in python/numpy:
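A minimal sketch of such a layer, caching the forward output so the backward pass can reuse *σ(x)(1 − σ(x))* (the class interface is my assumption, mirroring the loss above):

```python
import numpy as np

class Sigmoid:
    """Sigmoid activation: σ(x) = 1 / (1 + e^-x)."""

    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))  # cache σ(x) for the backward pass
        return self.out

    def backward(self, grad):
        # σ'(x) = σ(x)(1 - σ(x)), multiplied by the incoming gradient (chain rule)
        return grad * self.out * (1 - self.out)
```

Note that `backward` takes the combined gradient of the layers above it and multiplies it in, following the chain-rule convention discussed below for the Dense layer.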

## Dense Layer

The final component is the Dense Layer. A dense layer is defined as:

*Ŷ = Xw + b*

Here *w* is the weight matrix, *X* is the input, and *b* is the bias term. Mathematically the representation is very similar to that of a system of linear equations. So the derivative needed to update the weights will be the same. There are however two things worth discussing here:

- **Weight Initialization**: Unlike the system of linear equations, the choice of initial weights plays a big role in training convergence, i.e., in the ability of gradient descent to work. Generally, unit-norm initialization gives the best results. For a more in-depth discussion of weight initialization, refer to the excellent article *Initializing Neural Networks* listed in the references.
- **Automatic Gradient Computation**: In an MLP there can be multiple dense layers stacked on top of each other. Mathematically, the gradient for any layer with input *x* and output *Ŷ* can be written as *dLoss/dx = (dLoss/dŶ) · (dŶ/dx)*.

The latter part of the multiplication is easy to compute. For the first part, *dLoss/dŶ*, we essentially need the combined gradient of all the layers after the given layer. As *Ŷ* is the input to the next layer, this means each layer needs to return the derivative of its output w.r.t. its input, multiplied by the combined gradient of the layers above it.

Putting this all together in python/numpy:
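A minimal sketch following the two points above (the class interface, the scaled random initialization, and performing the weight update inside `backward` are my design assumptions, not necessarily the repo's exact choices):

```python
import numpy as np

class Dense:
    """Fully connected layer: Ŷ = Xw + b."""

    def __init__(self, in_dim, out_dim, lr=0.01):
        # Scale random weights by 1/sqrt(in_dim) so each output has roughly unit norm.
        self.w = np.random.randn(in_dim, out_dim) / np.sqrt(in_dim)
        self.b = np.zeros(out_dim)
        self.lr = lr

    def forward(self, X):
        self.X = X  # cache the input for the backward pass
        return X.dot(self.w) + self.b

    def backward(self, grad):
        # grad is dLoss/dŶ, the combined gradient of the layers above.
        grad_w = self.X.T.dot(grad)   # dLoss/dw
        grad_b = grad.sum(axis=0)     # dLoss/db
        grad_X = grad.dot(self.w.T)   # dLoss/dX, returned to the layer below
        self.w -= self.lr * grad_w    # gradient descent update
        self.b -= self.lr * grad_b
        return grad_X
```

The key detail is that `backward` returns `dLoss/dX` so that a stack of layers can propagate the combined gradient all the way down, as described above.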

We now have all the pieces we need to build and train an MLP. I am leaving the final stitching of all the components together as an exercise. If, however, you are stuck or unsure about anything, you can refer to the full implementation below:

**numpynet/mlp.py at main · monkeydunkey/numpynet** — Neural Net implementation in pure numpy and python (github.com)

## Additional References

Finally, here are some additional references for gradient descent and multi-layer perceptrons.

- *An overview of gradient descent optimisation algorithms*
- *Initializing Neural Networks*
- *Deep Learning Book: Machine Learning Basics*
- D2l.ai: Multilayer Perceptrons


Published via Towards AI