Deep Learning Weight Initialization Techniques
Last Updated on June 18, 2024 by Editorial Team
Author(s): Ayo Akinkugbe
Originally published on Towards AI.
Introduction
A neural network is a constellation of neurons arranged in layers. Each layer is a mathematical transformation that can be linear, non-linear, or a combination of both. Linear transformations use weights and sometimes biases. To build the layers of a neural network, the values for weights and biases first need to be guessed and then updated via training. The first guess at the values of the weights is referred to as weight initialization. A lot can go wrong if this first guess is lousy. In fact, lousy weight initialization can prevent the neural network from producing accurate results. Some fundamental problems that can arise from improper weight initialization include:
- Vanishing gradient problem: Since weights are initialized by a mathematical guess before the first forward pass, they have to be updated with gradients in the backward pass to optimize the neural network. Imagine you're trying to pass a message through a long line of people. As each person whispers the message, it morphs as it passes down the line. By the time it reaches the end, the message might have lost its actual meaning or intent. Similarly, in a deep network, the gradient (the "message" about how to adjust weights) can get so small by the time it reaches the early layers that it is almost useless for updating their weights. Starting out with very small weights can create vanishing gradients. The goal of a good weight initialization technique is to start with weights scaled so that the gradients stay large enough to meaningfully update the weights as they travel back through the many layers of the network.
- Exploding gradient problem: The exploding gradient problem is the inverse of the vanishing gradient problem. In this case, the gradients get too large. Large initial weights can cause the gradients to grow as they propagate back through the layers, producing excessively large updates to the weights. This can cause the network to become unstable and unable to learn from training data.
Though other components of a network can also cause these problems, initializing weights properly helps prevent both vanishing and exploding gradients.
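To make the effect of weight scale concrete, here is a minimal sketch in plain Python. It assumes a simplified network of purely linear layers with no activations, where backpropagation amounts to repeatedly multiplying a vector by random weight matrices. The function name `backprop_scale`, the layer width of 64, and the depth of 20 are illustrative choices, not values from any particular model.

```python
import math
import random

def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def backprop_scale(weight_std, width=64, depth=20, seed=0):
    """Multiply a unit-scale vector by `depth` random weight matrices
    (a stand-in for a gradient flowing back through linear layers)
    and return its final RMS magnitude."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # A fresh width x width matrix with zero-mean Gaussian entries.
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        g = [sum(w_ij * g_j for w_ij, g_j in zip(row, g)) for row in w]
    return rms(g)

tiny = backprop_scale(0.01)                    # weights too small
huge = backprop_scale(1.0)                     # weights too large
scaled = backprop_scale(1.0 / math.sqrt(64))   # scaled to the layer width

print(f"std = 0.01        -> final magnitude {tiny:.3e}")
print(f"std = 1.0         -> final magnitude {huge:.3e}")
print(f"std = 1/sqrt(64)  -> final magnitude {scaled:.3e}")
```

Under these assumptions the first run shrinks toward zero, the second blows up, and only the run whose weights are scaled to the layer width keeps the vector at a usable magnitude.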
Although there are several well-researched and tested techniques for weight initialization, three tenets should guide the process:
- Weights should not be relatively too small or too big
- Weights should not be the same
- Weights should have good variance.
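The second tenet can be illustrated with a small sketch. It assumes a single hidden layer with a tanh activation, and the helper name `hidden_outputs` is made up for this example. When every neuron starts with the same weights, every neuron computes the same output, and if their outgoing weights also match they receive identical updates, so they never learn different features.

```python
import math
import random

def hidden_outputs(weights, x):
    """Outputs of a tanh hidden layer; each row of `weights` is one neuron."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

x = [0.5, -1.2, 0.3]

# Four neurons that all start from the same weights.
same = [[0.1, 0.1, 0.1] for _ in range(4)]

# Four neurons with randomly sampled weights.
rng = random.Random(0)
varied = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(4)]

same_out = hidden_outputs(same, x)
varied_out = hidden_outputs(varied, x)

print("identical weights:", same_out)
print("random weights:   ", varied_out)
```

The identically initialized neurons are indistinguishable, while the randomly initialized ones each respond differently to the same input.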
Weight Initialization Techniques
There are a number of well-researched and proven weight initialization techniques suitable for different training scenarios and activation functions in a neural network. These are mostly named after the researchers who developed them: the Kaiming and He variants both take their name from Kaiming He, and the Xavier/Glorot variants from Xavier Glorot. A common thread among these techniques is that weights are sampled from a normal or uniform distribution whose spread depends on the size of the layer. They include Kaiming, He, and Xavier/Glorot initialization.
Kaiming (uniform) Initialization
In the uniform Kaiming initialization technique, weights are sampled from a uniform distribution between a and b, where a is the negative inverse of the square root of the number of inputs to the layer and b is the positive inverse of the square root of the number of inputs to the layer.
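As a rough sketch of the rule described above, the following plain-Python function samples a weight matrix from that uniform range. The names `kaiming_uniform`, `fan_in` (number of layer inputs), and `fan_out` (number of layer outputs) are labels chosen for this example and are not taken from a specific library.

```python
import math
import random

def kaiming_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_out x fan_in weight matrix from U(-b, b),
    where b = 1 / sqrt(fan_in) as described in the text."""
    bound = 1.0 / math.sqrt(fan_in)
    rng = random.Random(seed)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

weights = kaiming_uniform(fan_in=256, fan_out=128)
bound = 1.0 / math.sqrt(256)
largest = max(abs(v) for row in weights for v in row)
print(f"bound = {bound}, largest sampled magnitude = {largest:.4f}")
```

Note that the range depends only on the number of inputs: the more inputs a neuron sums over, the narrower the range each weight is drawn from.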
Benefits:
- Works well with the ReLU activation function
- Mitigates the vanishing and exploding gradient problems
- Produces good results in the training of neural networks with many layers
- Helps preserve the variance of the signal from layer to layer in the forward pass of training.
Introductory paper:
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study…
arxiv.org
Xavier/Glorot Normal Initialization
In the normal Xavier/Glorot initialization technique, weights are sampled from a normal distribution with a mean of zero and a standard deviation σ, where σ is the square root of two divided by the sum of the number of layer inputs and the number of layer outputs.
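The rule above can be sketched in a few lines of plain Python. The function names `xavier_normal` and `sample_std` and the layer sizes are illustrative assumptions. The sketch samples a weight matrix and compares the empirical standard deviation against the target value.

```python
import math
import random

def xavier_normal(fan_in, fan_out, seed=0):
    """Sample a fan_out x fan_in weight matrix from N(0, sigma^2),
    with sigma = sqrt(2 / (fan_in + fan_out))."""
    sigma = math.sqrt(2.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_in)]
            for _ in range(fan_out)]

def sample_std(matrix):
    """Empirical standard deviation over all entries of a matrix."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

weights = xavier_normal(fan_in=300, fan_out=100)
target = math.sqrt(2.0 / (300 + 100))
empirical = sample_std(weights)
print(f"target sigma = {target:.4f}, empirical sigma = {empirical:.4f}")
```

Because the denominator includes both the inputs and the outputs, the spread is a compromise between what the forward pass and the backward pass would each prefer.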
Benefits:
- Works well with, and was designed for, the sigmoid and tanh activation functions; less suited to ReLU
- Mitigates the vanishing and exploding gradient problems
- Ensures balanced propagation of the input signal through the layers of the network
- Produces good results in the training of neural networks with many layers
Introductory paper:
Understanding the difficulty of training deep feedforward neural networks
Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then severalβ¦
proceedings.mlr.press
Xavier/Glorot Uniform Initialization
In the uniform Xavier/Glorot initialization technique, weights are sampled from a uniform distribution between a and b, where a is the negative of the square root of 6 divided by the square root of the sum of the layer inputs and outputs, and b is the positive of the same quantity.
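One way to see where the factor of 6 comes from is to use the fact that a uniform distribution on [-a, a] has variance a²/3. In the short derivation below, n_in and n_out are shorthand introduced here for the number of layer inputs and outputs. It shows that the uniform bounds give the same variance as the normal variant, whose standard deviation is the square root of two divided by the sum of inputs and outputs.

```latex
\begin{aligned}
W &\sim \mathcal{U}(-a,\, a), \qquad a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}} \\
\operatorname{Var}(W) &= \frac{a^{2}}{3} = \frac{1}{3} \cdot \frac{6}{n_{\text{in}} + n_{\text{out}}} = \frac{2}{n_{\text{in}} + n_{\text{out}}}
\end{aligned}
```

So the uniform and normal variants differ in the shape of the distribution but not in its spread.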
He Normal Initialization
In the normal He initialization technique, weights are sampled from a normal distribution with a mean of zero and a standard deviation σ, where σ is the square root of two divided by the number of layer inputs.
Benefits:
- Works well with the ReLU activation function
- Mitigates the vanishing and exploding gradient problems
- Produces good results in the training of neural networks with many layers
- Helps preserve the variance of the signal from layer to layer in the forward pass of training.
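The following plain-Python sketch gives a rough check of the ReLU claim. It assumes a stack of equally wide, bias-free ReLU layers, and the names `he_normal`, `relu_forward`, and `mean_square` are chosen for this example. It pushes a random input through ten layers initialized with the normal He rule and compares the average squared activation at the end with that of the input.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in matrix from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

def mean_square(v):
    """Average squared value of a vector (a measure of signal size)."""
    return sum(x * x for x in v) / len(v)

def relu_forward(x, depth, rng):
    """Pass x through `depth` bias-free ReLU layers of equal width."""
    for _ in range(depth):
        w = he_normal(len(x), len(x), rng)
        x = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return x

rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(128)]
x10 = relu_forward(x0, depth=10, rng=rng)

ratio = mean_square(x10) / mean_square(x0)
print(f"signal size after 10 layers / signal size at input = {ratio:.3f}")
```

ReLU zeroes out roughly half of each layer's pre-activations, and the factor of two in the variance compensates for that, which is why the ratio stays within a modest factor of one instead of collapsing or exploding.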
He Uniform Initialization
In the uniform He initialization technique, weights are sampled from a uniform distribution between a and b, where a is the negative of the square root of 6 divided by the square root of the number of layer inputs, and b is the positive of the same quantity.
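A minimal sketch of the uniform He rule in plain Python follows, with `he_uniform`, `fan_in`, and `fan_out` as illustrative names. It samples weights between the negative and positive square root of 6 divided by the number of inputs, then checks that the resulting variance is close to two divided by the number of inputs, the variance the normal He variant uses.

```python
import math
import random

def he_uniform(fan_in, fan_out, seed=0):
    """Sample a fan_out x fan_in weight matrix from U(-b, b),
    with b = sqrt(6 / fan_in)."""
    bound = math.sqrt(6.0 / fan_in)
    rng = random.Random(seed)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

fan_in, fan_out = 200, 150
weights = he_uniform(fan_in, fan_out)

flat = [v for row in weights for v in row]
mean = sum(flat) / len(flat)
variance = sum((v - mean) ** 2 for v in flat) / len(flat)
target_variance = 2.0 / fan_in

print(f"target variance = {target_variance:.4f}, empirical = {variance:.4f}")
```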
Conclusion
Research across a variety of datasets has shown that proper weight initialization using the techniques in this article is effective. Using these methods can significantly improve the stability and performance of neural networks, leading to more accurate and reliable outcomes. As neural networks continue to evolve and grow in complexity, there are research opportunities to explore further weight initialization techniques that leverage different distributions and scaling factors.