
# Let’s Learn: Neural Nets #3 — Activation Functions

Last Updated on July 20, 2023 by Editorial Team

#### Author(s): Bradley Stephen Shaw

Originally published on Towards AI.

## A beginner’s guide to activation functions in neural nets.

Today we’ll be looking at activation functions in neural nets — the who, what, where, and why of it.

If you’re interested in what I’m doing — or are on a similar path yourself — take a look at my journey¹.

## A brief recap

From our research on nodes², we know a few things about activation functions:

• An activation function — or a transfer function — is simply a mathematical function.
• The activation function takes the sum of weighted inputs and bias and turns it into a number (i.e., an output).
• This output flows into either another node, or another activation function.
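The recap above can be sketched in a few lines of NumPy. The helper names (`node_output`, `sigmoid`) are my own for illustration, not from any particular library:

```python
import numpy as np

def node_output(inputs, weights, bias, activation):
    """Sum of weighted inputs plus bias, passed through the activation."""
    z = np.dot(inputs, weights) + bias  # pre-activation value
    return activation(z)                # the node's output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

out = node_output(np.array([0.5, -1.2]), np.array([0.8, 0.4]),
                  bias=0.1, activation=sigmoid)
```

Whatever `out` turns out to be, it then flows on as an input to the next node.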

## What don’t I know?

Apart from “lots”, a few specific things:

1. Does every node have an activation function?
2. Why do we need activation functions?
3. What kinds of activation functions are used? What makes a good activation function?
4. How do we choose the “best” activation function?

## Does every node have an activation function?

If I were to guess, I would suspect that nodes that receive the input data would not have activation functions (since it seems strange to me that the network would perform some sort of transformation immediately). KDNuggets³ explains further:

Activation functions reside within neurons, but not all neurons… Hidden and output layer neurons possess activation functions, but input layer neurons do not.

## Why do we need activation functions?

No idea about this one — I’ll refer to the experts.

First up, our friends at KDNuggets³:

Activation functions perform a transformation on the input received, in order to keep values within a manageable range. Since values in the input layers are generally centered around zero and have already been appropriately scaled, they do not require transformation.

However, these values, once multiplied by weights and summed, quickly get beyond the range of their original scale, which is where the activation functions come into play, forcing values back within this acceptable range and making them useful.

Sounds like we need to control the information within the neural network. Perhaps something to do with optimization, or the way the network learns?

What else can we find out?

From V7 Labs⁴:

…the purpose of an activation function is to add non-linearity to the neural network.

They elaborate further — on the need for activation functions and why the additional effort is worth it:

Let’s suppose we have a neural network working without the activation functions.

In that case, every neuron will only be performing a linear transformation on the inputs using the weights and biases. It’s because it doesn’t matter how many hidden layers we attach in the neural network; all layers will behave in the same way because the composition of two linear functions is a linear function itself.

Although the neural network becomes simpler, learning any complex task is impossible, and our model would be just a linear regression model.

Aha! So that’s where the nonlinearity in the model comes from!
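We can verify the "composition of linear functions is linear" claim numerically. A minimal sketch, with random weights purely for illustration: two stacked linear layers with no activation collapse into one equivalent linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # a small batch of inputs

# Two "layers" with no activation functions: just linear maps.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

two_layer = (x @ W1 + b1) @ W2 + b2

# The same mapping collapses into a single linear layer:
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b  # identical output, one layer
```

No matter how many such layers we stack, the result is always expressible as a single `x @ W + b`.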

## What kinds of activation functions are used?

I’m guessing that we could potentially use any function, as long as it conforms to a few requirements (maybe continuity, smoothness, and a well-behaved gradient?).

A few articles (see [3]–[7] below) agree on at least three kinds of activation function: the sigmoid, the hyperbolic tangent (tanh), and the rectified linear unit (ReLU). There are loads of variations mentioned there; I’ll review these at a later time.

Mathematically, these are:

• sigmoid(x) = 1 / (1 + e⁻ˣ)
• tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
• ReLU(x) = max(0, x)

Plotted, the sigmoid and tanh are S-shaped curves that saturate at their bounds, while ReLU is zero for negative inputs and the identity for positive ones.

Taking a look at each in more detail.
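As a quick sanity check, the three functions can be sketched directly in NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)  # NumPy ships tanh directly

def relu(x):
    return np.maximum(0.0, x)  # elementwise max(0, x)
```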

Sigmoid:

• Also known as the logistic function.
• Transforms input to lie within the (0, 1) interval (the bounds are approached but never reached). Potentially useful when dealing with probabilities?
• Seems to be smooth and continuous. A useful characteristic in general as gradient descent is likely to be involved at some point in some way?
• Its derivative is sigmoid(x) * (1 − sigmoid(x)), which is defined everywhere.
• The output of the logistic function is not symmetric around zero. So the output of all the neurons will be of the same sign. This makes the training of the neural network more difficult and unstable⁴.
• Apparently, the sigmoid suffers from the “vanishing gradient” problem⁴.
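The vanishing gradient issue is easy to see numerically. A minimal sketch: the sigmoid's gradient peaks at 0.25 (at x = 0) and shrinks towards zero for large |x|, so repeated multiplication of such gradients through many layers makes updates tiny.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative: sigmoid(x) * (1 - sigmoid(x))

peak = sigmoid_grad(0.0)   # largest possible gradient, 0.25
tail = sigmoid_grad(10.0)  # gradient nearly vanishes for large inputs
```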

Hyperbolic tangent:

• Also referred to as “tanh”.
• Input is transformed to lie within (-1, 1) and is centered on zero.
• Appears to be a scaled and shifted sigmoid, and so might be preferable to the sigmoid itself³.
• Since the output is centered on zero, we can map the output values as strongly negative, neutral, or strongly positive⁴.
• Usually used in hidden layers of a neural network as its values lie between -1 to 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. It helps in centering the data and makes learning for the next layer much easier⁴.

The gradient seems to be more pronounced in this activation function — does it also suffer from the vanishing gradient issue?

Apparently, it does⁴:

It also faces the problem of vanishing gradients similar to the sigmoid activation function. Plus the gradient of the tanh function is much steeper as compared to the sigmoid function.

Note: Although both sigmoid and tanh face vanishing gradient issue, tanh is zero centered, and the gradients are not restricted to move in a certain direction. Therefore, in practice, tanh nonlinearity is always preferred to sigmoid nonlinearity.

Rectified linear unit:

• Also referred to as the “ReLU”.
• Output values have a hard lower bound of zero. There is no upper bound for output values. Can this lead to issues?
• Shown to accelerate the convergence of gradient descent compared to the above functions but could lead to neuron death³. What is this? Sounds scary!
• ReLU has become the default activation function for hidden layers³.
• The main catch here is that the ReLU function does not activate all the neurons at the same time. The neurons will only be deactivated if the output of the linear transformation is less than zero.⁴
• Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions. ReLU accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.⁴
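That deactivation behaviour can be sketched directly. Note how ReLU zeroes out negative pre-activations, and how its gradient is exactly zero there: a neuron whose pre-activation is always negative outputs zero and receives zero gradient, which (as I understand it) is the "neuron death" mentioned above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Negative pre-activations are zeroed out (those neurons are "deactivated"):
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
activations = relu(z)

# ReLU's gradient is 1 for positive inputs and 0 otherwise. A neuron whose
# pre-activation is always negative gets zero gradient and cannot recover.
grad = np.where(z > 0, 1.0, 0.0)
```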

## What makes a good activation function?

From the above, it sounds like a good activation function will have the following characteristics:

1. The activation function should be continuous and differentiable.
2. The activation function should avoid the vanishing gradient issue.
3. The activation function should not introduce overly demanding computational requirements — i.e., it should allow for fast training.

## How do we choose the “best” activation function?

… and do we need to choose separate activation functions for hidden neurons and the final output?

Turns out we do: both the structure of the output layer and its activation function depend on the modeling problem at hand.

V7 Labs again⁴:

You need to match your activation function for your output layer based on the type of prediction problem that you are solving — specifically, the type of predicted variable.

… and Machine Learning Mastery⁸ with a little more detail:

You must choose the activation function for your output layer based on the type of prediction problem that you are solving. Specifically, the type of variable that is being predicted.

For example, you may divide prediction problems into two main groups, predicting a categorical variable (classification) and predicting a numerical variable (regression).

If your problem is a regression problem, you should use a linear activation function. If your problem is a classification problem, then there are three main types of classification problems and each may use a different activation function.

Predicting a probability is not a regression problem; it is classification. In all cases of classification, your model will predict the probability of class membership (e.g. probability that an example belongs to each class) that you can convert to a crisp class label by rounding (for sigmoid) or argmax (for softmax).

If there are two mutually exclusive classes (binary classification), then your output layer will have one node and a sigmoid activation function should be used. If there are more than two mutually exclusive classes (multiclass classification), then your output layer will have one node per class and a softmax activation should be used. If there are two or more mutually inclusive classes (multilabel classification), then your output layer will have one node for each class and a sigmoid activation function is used.
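The multiclass and multilabel cases above can be sketched in NumPy. The `softmax` helper below is my own minimal implementation, and the logits are made-up numbers for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])  # raw output-layer values for 3 classes

# Multiclass: softmax gives mutually exclusive probabilities summing to 1;
# argmax converts them to a crisp class label.
multiclass_probs = softmax(logits)
label = int(np.argmax(multiclass_probs))

# Multilabel: an independent sigmoid per class; rounding at 0.5 gives
# a crisp yes/no per class.
multilabel = (sigmoid(logits) >= 0.5).astype(int)
```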

And then, what about choosing activation functions for the neurons in the hidden layer?

A neural network will almost always have the same activation function in all hidden layers. It is most unusual to vary the activation function through a network model…

Both the sigmoid and tanh functions can make the model more susceptible to problems during training, via the so-called vanishing gradients problem…

The activation function used in hidden layers is typically chosen based on the type of neural network architecture…

Modern neural network models with common architectures, such as MLP and CNN, will make use of the ReLU activation function, or extensions.⁸

And V7 Labs⁴ gives us some guidelines and rules of thumb:

Here’s what you should keep in mind.

As a rule of thumb, you can begin with using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.

And here are a few other guidelines to help you out.

1. ReLU activation function should only be used in the hidden layers.

2. Sigmoid (logistic) and tanh functions should not be used in hidden layers as they make the model more susceptible to problems during training (due to vanishing gradients).
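Putting the guidelines together, here is a minimal forward-pass sketch (random weights, purely illustrative) of the recommended setup for binary classification: ReLU in the hidden layer, sigmoid only at the output.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random weights for a tiny 3 -> 8 -> 1 network (illustrative only).
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(x):
    h = relu(x @ W1 + b1)        # hidden layer: ReLU
    return sigmoid(h @ W2 + b2)  # output layer: sigmoid for binary classes

p = forward(rng.normal(size=(4, 3)))  # predicted probabilities, one per row
```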

## Wrapping up

As it turns out, not as quick an article as I had initially thought.

## What have we learned today?

Activation functions are used in neural nets to scale information as it passes through the network, but their more important role is to introduce non-linearity into the model.

Activation functions are applied to neurons in the hidden and output layers only. The activation function used in the output layer depends on the problem at hand and will differ between regression and classification tasks (as will the structure of the output layer).

Usually, the same activation is applied to all neurons in the hidden layers.

Technically any function can be used as an activation function. However, activation functions will ideally be continuous, smooth, and have well-behaved gradients. Producing output centered around zero is an advantage.

We investigated three commonly used activation functions: the sigmoid, the hyperbolic tangent, and the rectified linear unit. We visualized each and learned that some of these (the sigmoid and hyperbolic tangent) suffer from the “vanishing gradient” problem. We also learned that while simple, the rectified linear unit provides some advantage in terms of computation speed (although it also suffers from something called the “dying ReLU” problem).

We also found out that the rectified linear unit should only be used in the hidden layers. We should avoid using the sigmoid and hyperbolic tangent in the hidden layers as they can introduce issues when training the model.

More importantly, there do not seem to be any hard-and-fast rules for selecting activation functions; modelers should trial a few and select the combination which delivers better outcomes, perhaps starting with the simpler ReLU and moving on to more complex activation functions as part of the development cycle.

Things for me to add to my “to review” list:

1. Vanishing (and exploding) gradients.
2. Neuron death.
3. (Non)-saturation in neural nets.
4. The “dying ReLU” problem.

Next up, structuring the layers in a neural network.

## References

