Deep Learning Weight Initialization Techniques

Last Updated on June 18, 2024 by Editorial Team

Author(s): Ayo Akinkugbe

Originally published on Towards AI.

Introduction

A neural network is a collection of neurons arranged in layers. Each layer applies a mathematical transformation that can be linear, non-linear, or a combination of both. Linear transformations use weights and sometimes biases. To build the layers of a neural network, the weights and biases must first be given starting values and then updated through training. Choosing those starting values for the weights is called weight initialization. A lot can go wrong if this first guess is poorly chosen. In fact, poor weight initialization can prevent the neural network from producing accurate results. Some fundamental problems that can result from improper weight initialization include:

  • Vanishing gradient problem: Weights start out as an initial guess, so they must be updated with gradients during the backward pass to optimize the network. Imagine passing a message down a long line of people. Each person whispers it to the next, and the message changes a little every time. By the time it reaches the end, it may have lost its original meaning. Similarly, in a deep network, the gradient (the “message” about how to adjust the weights) can shrink as it is propagated backward, so that by the time it reaches the early layers it is too small to update their weights meaningfully. Starting with very small weights can cause vanishing gradients. A good weight initialization technique starts with weights scaled so that gradients stay large enough to update the weights as they travel back through the many layers of the network.
  • Exploding gradient problem: This is the inverse of the vanishing gradient problem. Here the gradients grow too large. Starting with large weights can cause the gradients to grow as they are propagated backward, producing very large weight updates. This can make the network unstable and unable to learn from the training data.

Although other components of a network can also cause these problems, initializing weights properly helps prevent both the vanishing and the exploding gradient problems.
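To make the effect of weight scale concrete, here is a minimal NumPy sketch. It is an illustration only and not taken from the article: the layer width, the depth, and the three standard deviations are arbitrary assumptions, and the layers are purely linear so that only the weight scale matters. It tracks the forward signal; gradients flow back through the same weight matrices, so they tend to shrink or grow in a similar way.

```python
import numpy as np

def final_signal_scale(weight_std, width=256, depth=20, seed=0):
    """Push a random vector through `depth` linear layers and return
    the root-mean-square of the final output."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    for _ in range(depth):
        # Each layer draws its weights from N(0, weight_std^2).
        w = rng.normal(0.0, weight_std, size=(width, width))
        x = w @ x
    return float(np.sqrt(np.mean(x ** 2)))

too_small = final_signal_scale(0.01)               # signal shrinks layer by layer
too_large = final_signal_scale(1.0)                # signal grows layer by layer
balanced = final_signal_scale(1.0 / np.sqrt(256))  # scale tied to the layer width

print(f"std = 0.01      -> {too_small:.3e}")
print(f"std = 1.0       -> {too_large:.3e}")
print(f"std = 1/sqrt(n) -> {balanced:.3e}")
```

Only the third setting, where the standard deviation depends on the number of inputs, keeps the signal at a usable magnitude. The techniques described later in the article all scale the weights by the layer size in this spirit.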

Although there are several well-researched and tested techniques for weight initialization, three tenets should guide the process:

  • Weights should be neither too small nor too large
  • Weights should not all be the same
  • Weights should have a variance suited to the size of the layer.
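The second tenet can be checked with a small hand-written gradient. The sketch below is a hypothetical example, not from the article: it assumes a tiny network with one tanh hidden layer of three neurons, a squared-error loss, and made-up input and target values. When every weight starts at the same constant, every hidden neuron receives exactly the same gradient, so the neurons can never learn different features.

```python
import numpy as np

def hidden_weight_gradient(w1, w2, x, y):
    """Gradient of the squared error (w2 . tanh(w1 @ x) - y)^2 with respect to w1."""
    h = np.tanh(w1 @ x)                               # hidden activations
    pred = w2 @ h                                     # scalar prediction
    delta_h = 2.0 * (pred - y) * w2 * (1.0 - h ** 2)  # backpropagate through tanh
    return np.outer(delta_h, x)                       # one row per hidden neuron

x = np.array([0.5, -1.0, 2.0, 0.1])  # made-up input
y = 5.0                              # made-up target

# Every weight set to the same constant: the hidden neurons are interchangeable.
same_grad = hidden_weight_gradient(np.full((3, 4), 0.5), np.full(3, 0.5), x, y)

# Random weights break the symmetry between neurons.
rng = np.random.default_rng(0)
rand_grad = hidden_weight_gradient(
    rng.normal(0.0, 0.5, size=(3, 4)), rng.normal(0.0, 0.5, size=3), x, y
)

print("identical rows with constant weights:", np.allclose(same_grad[0], same_grad[1]))
print("identical rows with random weights:  ", np.allclose(rand_grad[0], rand_grad[1]))
```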

Weight Initialization Techniques

There are a number of well-researched and proven weight initialization techniques suitable for different training scenarios and activation functions in a neural network. These are mostly named after the researchers who developed them. A common thread among these techniques is that weights are sampled from a normal or uniform distribution. They include the Kaiming, He, and Xavier/Glorot initializations. (Kaiming and He both refer to the researcher Kaiming He, and Xavier and Glorot both refer to Xavier Glorot.)

Kaiming (uniform) Initialization

In the Uniform or Kaiming initialization technique, weights are sampled from a uniform distribution between a and b, where a is the negative inverse of the square root of the number of inputs to the layer, a = -1/sqrt(n_in), and b is the positive inverse of the square root of the number of inputs to the layer, b = +1/sqrt(n_in).
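As a minimal sketch of the bounds described above, the NumPy snippet below samples a weight matrix for one layer. The function name `uniform_init`, the `fan_in`/`fan_out` parameter names, and the layer sizes are assumptions made for illustration; this is not any particular library's implementation.

```python
import numpy as np

def uniform_init(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix from
    U(-1/sqrt(fan_in), +1/sqrt(fan_in))."""
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
weights = uniform_init(fan_in=400, fan_out=100, rng=rng)

print("shape:", weights.shape)
print("bound:", 1.0 / np.sqrt(400))
print("largest absolute weight:", float(np.abs(weights).max()))
```

Note that the bound shrinks as the layer gets wider, which keeps the sum over many inputs from growing with the layer size.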

Benefits:

  • Works well with the ReLU activation function
  • Mitigates the vanishing and exploding gradient problems
  • Produces good results in the training of neural networks with many layers
  • Helps preserve the variance of activations between layers in the forward pass of training.

Introductory paper:

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Xavier/Glorot Normal Initialization

In the Normal Xavier/Glorot initialization technique, weights are sampled from a normal distribution with a mean of zero and standard deviation σ, where σ is the square root of the quantity two divided by the sum of the number of layer inputs and the number of layer outputs, σ = sqrt(2 / (n_in + n_out)).
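The following NumPy sketch samples weights this way and compares the empirical standard deviation with the target value. The function name `xavier_normal`, the parameter names, and the layer sizes are illustrative assumptions rather than a library API.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix from N(0, std^2)
    with std = sqrt(2 / (fan_in + fan_out))."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
weights = xavier_normal(fan_in=300, fan_out=100, rng=rng)
target_std = np.sqrt(2.0 / (300 + 100))

print(f"target std:    {target_std:.4f}")
print(f"empirical std: {weights.std():.4f}")
```

Because both the input and output counts appear in the formula, the same scale is a compromise between keeping the signal stable in the forward pass and keeping the gradient stable in the backward pass.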

Benefits:

  • Designed for, and works well with, the Sigmoid and Tanh activation functions. Less suited to ReLU
  • Mitigates the vanishing and exploding gradient problems
  • Ensures balanced propagation of the input signal through the layers of the network
  • Produces good results in the training of neural networks with many layers

Introductory paper:

Understanding the difficulty of training deep feedforward neural networks

Xavier/Glorot Uniform Initialization

In the Uniform Xavier/Glorot initialization technique, weights are sampled from a uniform distribution between a and b, where a is the negative of the square root of 6 divided by the square root of the sum of the layer's inputs and outputs, a = -sqrt(6) / sqrt(n_in + n_out), and b is the positive counterpart, b = +sqrt(6) / sqrt(n_in + n_out).
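A short derivation suggests where the factor of 6 comes from. The symbols n_in and n_out are shorthand introduced here for the number of layer inputs and outputs. A uniform distribution on [-b, b] has variance b²/3, so:

```latex
\operatorname{Var}\bigl[\,U(-b,\, b)\,\bigr] = \frac{(2b)^2}{12} = \frac{b^2}{3},
\qquad
b = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}
\;\Longrightarrow\;
\operatorname{Var}[w] = \frac{2}{n_{\text{in}} + n_{\text{out}}}
```

This is the same variance as the normal Xavier/Glorot variant, whose standard deviation is the square root of 2 / (n_in + n_out). The two variants therefore differ only in the shape of the distribution, not in the scale of the weights.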

He Normal Initialization

In the Normal He initialization technique, weights are sampled from a normal distribution with a mean of zero and standard deviation σ, where σ is the square root of the quantity two divided by the number of layer inputs, σ = sqrt(2 / n_in).

Benefits:

  • Works well with the ReLU activation function
  • Mitigates the vanishing and exploding gradient problems
  • Produces good results in the training of neural networks with many layers
  • Helps preserve the variance of activations between layers in the forward pass of training.
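The variance-preservation benefit can be illustrated with a small experiment. The sketch below is a toy setup rather than a benchmark: the width, depth, and seed are arbitrary assumptions, and the layers are square so that the number of inputs equals the number of outputs. It pushes a random vector through a stack of ReLU layers and reports the mean squared pre-activation of the last layer for two choices of standard deviation.

```python
import numpy as np

def relu_forward_scale(std_fn, width=512, depth=20, seed=0):
    """Return the mean squared pre-activation of the last layer of a ReLU
    network whose weights are drawn from N(0, std_fn(width)^2)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(width)
    for _ in range(depth):
        w = rng.normal(0.0, std_fn(width), size=(width, width))
        z = w @ a               # pre-activation
        a = np.maximum(z, 0.0)  # ReLU zeroes out the negative half
    return float(np.mean(z ** 2))

# He normal: std = sqrt(2 / n_in)
he_scale = relu_forward_scale(lambda n: np.sqrt(2.0 / n))
# Xavier/Glorot normal with n_in == n_out: std = sqrt(2 / (n + n))
xavier_scale = relu_forward_scale(lambda n: np.sqrt(2.0 / (n + n)))

print(f"He normal:     {he_scale:.3e}")
print(f"Xavier normal: {xavier_scale:.3e}")
```

ReLU discards roughly half of the signal at each layer, and the factor of two in the He formula compensates for that loss. With the Xavier/Glorot scale the signal is roughly halved at every layer instead, which is consistent with the article's note that Xavier/Glorot is less suited to ReLU.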

He Uniform Initialization

In the Uniform He initialization technique, weights are sampled from a uniform distribution between a and b, where a is the negative of the square root of 6 divided by the square root of the number of layer inputs, a = -sqrt(6) / sqrt(n_in), and b is the positive counterpart, b = +sqrt(6) / sqrt(n_in).
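A minimal NumPy sketch of this variant follows, with a check that its variance matches the 2 / n_in used by the normal He variant (a uniform distribution on [-b, b] has variance b²/3). The function name `he_uniform`, the parameter names, and the layer sizes are assumptions for illustration only.

```python
import numpy as np

def he_uniform(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix from
    U(-sqrt(6 / fan_in), +sqrt(6 / fan_in))."""
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
weights = he_uniform(fan_in=600, fan_out=200, rng=rng)
target_var = 2.0 / 600  # variance of the normal He variant

print(f"empirical variance: {weights.var():.5f}")
print(f"2 / fan_in:         {target_var:.5f}")
```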

Conclusion

Proper weight initialization using the techniques described in this article has been shown to be effective in research across various datasets. Using these methods can significantly improve the stability and performance of neural networks, leading to more accurate and reliable outcomes. As neural networks continue to evolve and grow in complexity, there are research opportunities to explore further weight initialization techniques that use different distributions and scaling factors.
