Is GELU the ReLU Successor?

Last Updated on August 30, 2022 by Editorial Team

Author(s): Poulinakis Kon

Originally published on Towards AI.

Can we combine regularization and activation functions? In 2016, Dan Hendrycks and Kevin Gimpel published a paper, since revised four times, that introduced a new activation function: the Gaussian Error Linear Unit (GELU) [1].

Demystifying GELU

The motivation behind GELU is to bridge stochastic regularizers, such as dropout, with non-linearities, i.e., activation functions.

Dropout regularization stochastically multiplies a neuron’s inputs by 0, randomly rendering them inactive. ReLU, on the other hand, deterministically multiplies inputs by 0 or 1, depending on the input’s value.

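GELU merges both ideas. The input is multiplied by a zero-one mask that is stochastically determined, but the probability of keeping the input depends on the input’s value. Taking the expected value of this transformation yields a deterministic activation that scales each input by a weight between 0 and 1.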

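Mathematically, GELU is formulated as:

$$\mathrm{GELU}(x) = x\,P(X \le x) = x\,\Phi(x) = x \cdot \tfrac{1}{2}\left[1 + \operatorname{erf}\left(x/\sqrt{2}\right)\right]$$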

Here Φ(x) = P(X ≤ x) is the cumulative distribution function (CDF) of the standard normal distribution, with X ~ N(0, 1). The choice of this function stems from the fact that neuron inputs tend to follow a normal distribution, especially when Batch Normalization is used. In the stochastic view, the zero-one mask is drawn from a Bernoulli distribution that keeps the input with probability Φ(x), so a neuron is more likely to be dropped (multiplied by 0) as x decreases, since P(X ≤ x) becomes smaller. Take a moment to let this sink in. GELU is the expected value of that stochastic transformation, x·Φ(x): the activation itself is deterministic, yet it inherits the input-dependent weighting through Φ(x).
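To make the link between the stochastic mask and the activation concrete, here is a minimal sketch using only the Python standard library. The function names (`phi`, `gelu`, `stochastic_mask_mean`), the trial count, and the seed are illustrative choices, not something taken from [1]. The sketch computes x·Φ(x) exactly and compares it with the average of x·m over many draws of a mask m that equals 1 with probability Φ(x).

```python
import math
import random


def phi(x: float) -> float:
    """Standard normal CDF, P(X <= x) for X ~ N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Exact GELU: the input weighted by its own normal CDF value."""
    return x * phi(x)


def stochastic_mask_mean(x: float, trials: int = 100_000, seed: int = 0) -> float:
    """Average of x * m over many draws of a zero-one mask m ~ Bernoulli(phi(x))."""
    rng = random.Random(seed)
    keep_prob = phi(x)
    # Count how often the mask keeps the input (m = 1).
    kept = sum(1 for _ in range(trials) if rng.random() < keep_prob)
    return x * kept / trials


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  mask_average={stochastic_mask_mean(x):+.4f}")
```

The two columns should agree up to sampling noise, which is the sense in which GELU is the expectation of an input-dependent dropout-like mask.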

Figure 1: The Gaussian Error Linear Unit (μ=0, σ=1), the Rectified Linear Unit, and the Exponential Linear Unit (ELU) (α=1). Source [1]

Observe how GELU(x) stays close to zero for large negative values of x, since the CDF P(X≤x) is almost equal to 0 there. Around x = -2, P(X≤x) starts increasing noticeably, so GELU(x) begins to deviate from zero. For positive values, P(X≤x) moves closer to 1, so GELU(x) starts approximating ReLU(x). In the figure below, the red line represents the CDF of the Standard Normal Distribution N(0,1), i.e., P(X≤x).
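The shape described above can be checked numerically. The following sketch (standard library only; the sample points and the grid resolution are arbitrary choices) prints Φ(x), GELU(x), and ReLU(x) side by side, then runs a coarse grid search for the lowest point of the GELU curve. Unlike ReLU, GELU dips slightly below zero for moderately negative inputs before returning to zero.

```python
import math


def phi(x: float) -> float:
    """Standard normal CDF, P(X <= x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    return x * phi(x)


def relu(x: float) -> float:
    return max(0.0, x)


for x in (-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0):
    print(f"x={x:+.1f}  phi={phi(x):.4f}  gelu={gelu(x):+.4f}  relu={relu(x):.4f}")

# Coarse grid search over [-4, 0] for the minimum of GELU.
grid = [i / 1000.0 for i in range(-4000, 1)]
x_min = min(grid, key=gelu)
print(f"GELU is lowest near x={x_min:.3f}, where it equals {gelu(x_min):.4f}")
```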

Figure 2: Cumulative distribution functions for different Gaussian distributions. The red line represents the CDF of the Standard Normal N(0,1). Source: Wikipedia [2].

Approximations

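GELU can also be approximated through the formulas

$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right)$$

or

$$\mathrm{GELU}(x) \approx x\,\sigma(1.702\,x)$$

if greater feedforward speed is worth the cost of exactness. Here σ denotes the logistic sigmoid function.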
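As a rough check of how much exactness is traded away, the sketch below compares the tanh-based and sigmoid-based approximations from [1] against the exact erf-based GELU. The evaluation grid of [-6, 6] with step 0.01 is an arbitrary choice, and the function names are illustrative.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation from [1]."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_sigmoid(x: float) -> float:
    """Sigmoid-based approximation from [1]: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


grid = [i / 100.0 for i in range(-600, 601)]
max_err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
max_err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in grid)

print(f"max |tanh approx - exact|    = {max_err_tanh:.6f}")
print(f"max |sigmoid approx - exact| = {max_err_sigmoid:.6f}")
```

On this grid the tanh form tracks the exact curve more closely than the sigmoid form, while the sigmoid form needs fewer operations per input.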

Variations

GELU can also be modified by using different CDFs. For example, if the Logistic Distribution CDF σ(x) is used, we get the Sigmoid Linear Unit (SiLU), xσ(x). Moreover, we could pick the CDF of N(μ, σ²), with μ and σ being learnable hyperparameters.
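A minimal sketch of these variations follows, again with the standard library only. The name `gelu_general` and its `mu` and `sigma` arguments are hypothetical; here they are plain function arguments rather than parameters trained by an optimizer.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic distribution CDF."""
    return 1.0 / (1.0 + math.exp(-x))


def silu(x: float) -> float:
    """Sigmoid Linear Unit: the input weighted by the logistic CDF."""
    return x * sigmoid(x)


def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gelu_general(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """GELU-style activation built on the CDF of N(mu, sigma^2)."""
    return x * normal_cdf(x, mu, sigma)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(
        f"x={x:+.1f}  silu={silu(x):+.4f}  "
        f"gelu(0,1)={gelu_general(x):+.4f}  "
        f"gelu(0,0.1)={gelu_general(x, sigma=0.1):+.4f}"
    )
```

With μ = 0, shrinking σ sharpens the CDF into a step function, so the activation approaches ReLU; the default μ = 0, σ = 1 recovers the standard GELU.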

Advantages

The authors of [1] compared GELU against the ReLU and ELU activation functions on benchmark datasets covering three tasks: computer vision (CIFAR-10/100 classification), natural language processing (Twitter part-of-speech tagging), and audio phoneme recognition (TIMIT frame classification).

Throughout their experiments, they observed a consistent improvement in accuracy when using GELU compared to ReLU and ELU. In detail:

The table above presents the test error rate on 4 datasets. GELU consistently achieves the lowest test error rate, making it a promising alternative to ReLU and ELU activations.

An Interesting Fact

The well-known paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale” [3], which made Vision Transformers popular, uses the GELU activation inside the MLP of the Transformer encoder block (section 3.1). This suggests that researchers working on state-of-the-art architectures consider GELU a solid choice.

REFERENCES

[1] Gaussian Error Linear Units (GELUs)

[2] https://en.wikipedia.org/wiki/Normal_distribution

[3] An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

Thanks for reading, feel free to reach out!


Is GELU the ReLU Successor? was originally published in Towards AI on Medium.
