Gradient-Based Learning In Numpy

Last Updated on September 17, 2024 by Editorial Team

Author(s): Shashank Bhushan

Originally published on Towards AI.

Photo by Erol Ahmed on Unsplash

What is Gradient-Based Learning?

Mathematically, training a neural network can be framed as an optimization problem in which we try to either maximize or minimize some function f(x). f(x) is usually referred to as the cost function, loss function, or error function. Common examples are Mean Squared Error for regression problems and Cross Entropy for classification problems. Maximizing f(x) is equivalent to minimizing -f(x), which is what is done in practice, so from now on we will only discuss minimizing f(x).

As the loss function f is fixed, minimizing f(x) means finding the x that minimizes it. Gradient Descent is a gradient-based learning/optimization algorithm that can be used to find this minimum, and it is the most popular way to train Neural Networks. To understand how Gradient Descent works, we first need to review a bit of calculus. For a given function y = f(x), the derivative or gradient, denoted by f’(x), describes the slope of f(x) at x. The slope at a given point x allows us to estimate how small changes to x change f(x):

f(x + ε) ≈ f(x) + εf’(x)

If it’s unclear why the above formulation works, consider the following example.

Image by Author

For the function shown by the blue curve (figure on the left), we want to know how a small change to x (marked by the red dot) changes f(x), i.e. what is the value of the blue curve at x + ε? The orange line shows the slope at x. If we zoom in far enough at x (figure on the right), the blue curve looks like a straight line that coincides with the orange line. For any line f(x) = cx + b, the value at x + ε is c(x + ε) + b = cx + b + cε = f(x) + cε, where c is the slope of the line. Thus we can generalize to f(x + ε) ≈ f(x) + εf’(x) for small values of ε.
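To make the approximation concrete, here is a small numerical check. It is only a sketch: the function f(x) = x³ and the point x = 2 are my own choices for illustration and do not come from the figure above.

```python
def f(x):
    # Example function chosen for illustration: f(x) = x^3
    return x ** 3


def f_prime(x):
    # Analytical derivative of the example function: f'(x) = 3x^2
    return 3 * x ** 2


x = 2.0
errors = []
for eps in [0.1, 0.01, 0.001]:
    exact = f(x + eps)                 # true value of the function at x + eps
    approx = f(x) + eps * f_prime(x)   # first-order estimate using the slope at x
    errors.append(abs(exact - approx))
    print(f"eps={eps}: exact={exact:.6f}, approx={approx:.6f}, error={errors[-1]:.2e}")
```

The error shrinks much faster than ε itself, which is why the estimate is only trusted for small steps.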

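Based on the above formulation, the derivative/slope can be used to make a small change to x, x’ = x + ε, such that f(x’) < f(x). Choosing ε with the opposite sign to f’(x) makes the term εf’(x) negative, so f(x + ε) comes out smaller than f(x).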

We can thus reduce f(x) by moving x in small steps in the opposite direction of the gradient or slope. This is the Gradient Descent technique. Formally, gradient descent is an iterative process in which we start with a random value of x and at each step update it by:

x ← x - γf’(x)

The term γ controls how big a step we take along the derivative/slope. It is commonly known as the learning rate of the algorithm.

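Note: Gradient descent is only guaranteed to reach the global minimum when the loss function is convex. On a non-convex loss, which is what most neural networks have, it can settle in a local minimum or a saddle point instead.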
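The update rule can be sketched in a few lines of plain Python. The helper name gradient_descent, the test function f(x) = (x - 3)², and the two learning rates are assumptions picked for illustration, not part of the original article.

```python
def gradient_descent(f_prime, x0, lr=0.1, iters=100):
    """Repeatedly step against the derivative: x <- x - lr * f'(x)."""
    x = x0
    for _ in range(iters):
        x = x - lr * f_prime(x)
    return x


def example_slope(x):
    # Derivative of the convex example f(x) = (x - 3)^2, whose minimum is at x = 3
    return 2 * (x - 3)


# A small learning rate walks steadily towards the minimum.
x_min = gradient_descent(example_slope, x0=-5.0, lr=0.1, iters=100)

# A learning rate that is too large overshoots further on every step and diverges.
x_div = gradient_descent(example_slope, x0=-5.0, lr=1.1, iters=20)

print(x_min, x_div)
```

With lr=0.1 the distance to the minimum shrinks by a constant factor at every step, while with lr=1.1 it grows, which is the practical reason the learning rate has to be tuned.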

Using Gradient Descent To Solve a System of Linear Equations

A system of linear equations can be seen as a simple single-layer Neural Network (without the bias and activation). Solving one using gradient descent therefore lets us see gradient descent in practice without getting distracted by the other components of a Neural Network. Suppose we are given the following system of linear equations:

5x₁ + x₂ - x₃ = 1
2x₁ - x₂ + x₃ = 4
x₁ + 3x₂ - 2x₃ = 0

We want to find x₁, x₂, and x₃ such that the left-hand side of each equation equals its right-hand side.

Note: x₁, x₂, and x₃ are generally referred to as weights and denoted by the term w. I am using x instead of w to be consistent with the notation above.

To solve this using Gradient Descent, we first need to define the loss or objective function that should be optimized/reduced. Mean Squared Error (MSE) is a good option here. MSE is given by:

MSE = (1/N) Σᵢ (Yᵢ - Ŷᵢ)²

where N is the number of equations.

Here Yᵢ is the expected output (1, 4, or 0) and Ŷᵢ is the value of the left-hand side of the i-th equation. To simplify the notation, we will write the set of equations as Ax = Ŷ. To use gradient descent, we need to compute the derivative of the loss with respect to x:

dMSE/dx = dMSE/dŶ · dŶ/dx

While the gradient of the loss with respect to x can be computed directly here, I am breaking it down into the gradient of the loss with respect to Ŷ and the gradient of Ŷ with respect to x (using the chain rule). This is how it is usually done in code, because it allows for automatic gradient computation. Now let’s calculate the two terms. I am ignoring the summation term to simplify things, but we need to apply the summation and averaging before updating the values of x.

dMSE/dŶ = 2(Ŷ - Y)
dŶ/dx = A
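A quick way to gain confidence in a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this for the MSE loss, using the coefficient matrix A and targets Y of this section's example. The helper names mse, mse_grad, and numerical_grad, as well as the test point x, are hypothetical and chosen for illustration.

```python
import numpy as np

A = np.array([[5, 1, -1], [2, -1, 1], [1, 3, -2]], dtype=float)
Y = np.array([1, 4, 0], dtype=float)


def mse(x):
    # Mean squared error between the left-hand sides A.x and the targets Y
    return np.mean((A.dot(x) - Y) ** 2)


def mse_grad(x):
    loss_wrt_y_hat = 2 * (A.dot(x) - Y)  # dMSE/dY_hat, one entry per equation
    y_hat_wrt_x = A                      # dY_hat/dx
    # Chain rule, then average over the equations
    return loss_wrt_y_hat.dot(y_hat_wrt_x) / len(Y)


def numerical_grad(func, x, h=1e-6):
    # Central finite differences, one dimension at a time
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad


x = np.array([0.5, -1.0, 2.0])  # arbitrary test point
analytic = mse_grad(x)
numeric = numerical_grad(mse, x)
print(analytic, numeric)
```

If the two vectors agree to several decimal places, the chain-rule breakdown is very likely implemented correctly.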

Putting it all together, the following is a code snippet of how to use gradient descent to solve the system of equations
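The original snippet is not reproduced in this copy of the article, so what follows is a reconstruction rather than the author's exact code. It keeps the function name used in the call further down and the loss_wrt_x variable mentioned in the text. Starting from x = 0 and logging every 1000 iterations are assumptions.

```python
import numpy as np


def sovleUsingGradient(A, Y, lr=0.01, iters=10000):
    """Solve Ax = Y approximately by gradient descent on the MSE loss."""
    A = np.asarray(A, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x = np.zeros(A.shape[1])  # assumed starting point: all zeros

    for i in range(iters):
        y_hat = A.dot(x)  # current left-hand sides
        if i % 1000 == 0:
            print(f"Loss at iter: {i}: {np.mean((y_hat - Y) ** 2)}")

        loss_wrt_y_hat = 2 * (y_hat - Y)                 # dMSE/dY_hat
        loss_wrt_x = loss_wrt_y_hat.dot(A) / len(Y)      # chain rule + averaging
        x = x - lr * loss_wrt_x                          # gradient descent update
    return x
```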

One thing to note above is that the shape of loss_wrt_x, or dMSE/dx, is the same as that of x. This means that the gradient is computed and applied per dimension, though the step size, i.e. the learning rate γ, is the same for all dimensions. While the above is the basic Gradient Descent setup, there are advanced gradient descent algorithms such as AdaGrad and Adam that try to learn different update steps for different dimensions. For more information about advanced gradient-based algorithms, refer to this excellent article.

Running the above function on the set of equations from earlier, we get:

A = np.array([[5, 1, -1], [2, -1, 1], [1, 3, -2]])
Y = np.array([1, 4, 0])
x = sovleUsingGradient(A, Y, lr=0.01, iters=10000)

Loss at iter: 0: 5.666666666666667
Loss at iter: 1000: 0.5797261956378661
Loss at iter: 2000: 0.13383958398415846
Loss at iter: 3000: 0.0308991285469566
Loss at iter: 4000: 0.007133585719112376
Loss at iter: 5000: 0.0016469087513129724
Loss at iter: 6000: 0.0003802167019434473
Loss at iter: 7000: 8.77794475992898e-05
Loss at iter: 8000: 2.0265368095217914e-05
Loss at iter: 9000: 4.678602511941347e-06

x
array([0.71418766, 4.42539244, 6.99559903])

A.dot(x)
array([ 1.00073170e+00, 3.99858189e+00, -8.33062319e-04])

As we can see, the loss goes down over time, and after 10000 iterations Ax gets close to Y.
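As a sanity check, the same system can be solved directly with NumPy's linear solver. This is not part of the original walkthrough. It is only a way to see how close gradient descent gets to the exact answer.

```python
import numpy as np

A = np.array([[5, 1, -1], [2, -1, 1], [1, 3, -2]], dtype=float)
Y = np.array([1, 4, 0], dtype=float)

# Direct solution of Ax = Y, for comparison with the gradient descent result
x_exact = np.linalg.solve(A, Y)
print(x_exact)
```

The exact solution is x = (5/7, 31/7, 7), i.e. roughly (0.714, 4.429, 7), so the gradient descent estimate above is accurate to about two decimal places.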

Building a Multi-Layer Perceptron

Let’s now see how gradient descent is used in actual neural network training. For this, we will build a Multi-Layer Perceptron or MLP on a simplified MNIST dataset with only 2 classes (0 and 1) instead of 10. The following figure captures the overall architecture that we will be building.

Fig: MLP Setup for Determining if Digit is 0 or 1, figure by author

We will go over the different components mentioned in the diagram above one by one.

Binary Cross Entropy

As this is a binary classification problem, we will use the binary cross entropy loss. The following is the definition of binary cross entropy:

BCE = -(1/N) Σᵢ [Yᵢ log(Ŷᵢ) + (1 - Yᵢ) log(1 - Ŷᵢ)]

Similar to MSE, in order to train a model using this loss we need to compute its gradient. Ignoring the summation and averaging, it is:

dBCE/dŶ = (Ŷ - Y) / (Ŷ(1 - Ŷ))

Writing this out in python/numpy:
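The original snippet is not reproduced in this copy of the article, so here is a minimal sketch of the loss and its gradient with respect to the predictions. The function names and the eps clipping (used to avoid log(0) and division by zero) are my assumptions, not necessarily what the author's code does.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions so that log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))


def binary_cross_entropy_grad(y_true, y_pred, eps=1e-12):
    # Gradient of the mean loss with respect to each prediction
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return (y_pred - y_true) / (y_pred * (1 - y_pred)) / y_true.size


y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(binary_cross_entropy(y_true, y_pred))
print(binary_cross_entropy_grad(y_true, y_pred))
```

Note that the gradient here already includes the 1/N averaging, so it can be passed straight to the previous layer.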

Sigmoid Activation

Another new component is the sigmoid activation function. Activation functions in general are meant to add non-linearity to a Neural Network. The sigmoid activation function is defined as:

σ(x) = 1 / (1 + e^(-x))

Here again, we need to compute the derivative of the function in order to use it in gradient descent/backpropagation:

σ’(x) = σ(x)(1 - σ(x))

Note: σ(x) is the same as Ŷ in the binary cross entropy formula. Multiplying the binary cross entropy gradient (Ŷ - Y) / (Ŷ(1 - Ŷ)) by the sigmoid gradient Ŷ(1 - Ŷ) cancels the shared factor, so the combined gradient of the sigmoid and the binary cross entropy becomes Ŷ - Y, which, up to a constant factor, is the same as the gradient of MSE. As the sigmoid function is responsible for converting the logits/neural network outputs into probability values for the cross-entropy function, the similarity between the two gradient formulations is interesting. It suggests that, with these pairings of output activation and loss, the weight updates take the same form for a regression and a classification problem.

Writing this out in python/numpy:
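The original snippet is not reproduced in this copy of the article, so the following is a minimal sketch of the activation and its derivative. The function names and the sample logits are assumptions for illustration. The last lines check numerically that chaining the binary cross entropy gradient with the sigmoid gradient gives Ŷ - Y.

```python
import numpy as np


def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    # Note: np.exp may emit overflow warnings for very large negative inputs.
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)


logits = np.array([-2.0, 0.0, 3.0])   # raw network outputs (made up)
y_true = np.array([0.0, 1.0, 1.0])    # matching labels (made up)

y_hat = sigmoid(logits)
bce_wrt_y_hat = (y_hat - y_true) / (y_hat * (1.0 - y_hat))  # per-sample BCE gradient
bce_wrt_logits = bce_wrt_y_hat * sigmoid_grad(logits)       # chain rule through sigmoid

print(np.allclose(bce_wrt_logits, y_hat - y_true))
```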

Dense Layer

The final component is the Dense Layer. A dense layer is defined as:

Ŷ = wX + b

Here w is the weight matrix, X is the input, and b is the bias term. Mathematically the representation is very similar to that of a system of linear equations. So the derivative needed to update the weights will be the same. There are however two things worth discussing here:

  1. Weight Initialization: Unlike the system of linear equations, the choice of initial weights plays a big role in training convergence, i.e. in the ability of gradient descent to work. Generally, unit norm initialization gives good results. For a more in-depth discussion on weight initialization, refer to this excellent article.
  2. Automatic Gradient Computation: In an MLP there can be multiple dense layers stacked on top of each other. Mathematically, the gradient for any layer with input x and output Ŷ can be written as: dLoss/dx = dLoss/dŶ · dŶ/dx

The latter part of the multiplication is easy to compute. For the first part, dLoss/dŶ, we essentially need the combined gradient of all the layers after the given layer. As Ŷ is the input to the next layer, this means each layer needs to return the derivative of its output with respect to its input, multiplied by the combined gradient of the layers above it.

Putting this all together in python/numpy:
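The original snippet is not reproduced in this copy of the article, so here is a minimal sketch of such a layer, not the author's implementation. The class name Dense, the forward/backward method names, the row-per-sample layout (so the layer computes X @ w + b), and the 1/sqrt(in_features) weight scaling are all assumptions. The incoming gradient is assumed to already include any averaging over the batch.

```python
import numpy as np


class Dense:
    """Hypothetical minimal dense layer computing Y_hat = X @ w + b."""

    def __init__(self, in_features, out_features, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Assumed initialization: Gaussian weights scaled by 1/sqrt(in_features)
        self.w = rng.normal(0.0, 1.0 / np.sqrt(in_features),
                            size=(in_features, out_features))
        self.b = np.zeros(out_features)
        self.lr = lr

    def forward(self, X):
        self.X = X  # cache the input, it is needed in backward
        return X @ self.w + self.b

    def backward(self, grad_output):
        # grad_output is dLoss/dY_hat with shape (batch, out_features)
        grad_w = self.X.T @ grad_output        # dLoss/dw
        grad_b = grad_output.sum(axis=0)       # dLoss/db
        grad_input = grad_output @ self.w.T    # dLoss/dX, computed before the update

        # Gradient descent update of this layer's parameters
        self.w -= self.lr * grad_w
        self.b -= self.lr * grad_b

        # Returned so the layer below can continue the chain rule
        return grad_input
```

Returning grad_input is what lets several layers be stacked: each layer only needs the gradient handed down from the layer above it.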

We now have all the pieces we need to build and train an MLP. I am leaving the final stitching together of the components as an exercise. If, however, you get stuck or are unsure about anything, you can refer to the full implementation below:

numpynet/mlp.py at main · monkeydunkey/numpynet: Neural Net implementation in pure numpy and python (github.com)

Additional References

Finally here are some additional references for gradient descent and multi-layer perceptrons.

  1. An overview of gradient descent optimisation algorithms
  2. Initializing Neural Networks
  3. Deep Learning Book: Machine Learning Basics
  4. D2l.ai: Multilayer Perceptrons
