# Multi-task Learning (MTL) and The Role of Activation Functions in Neural Networks [Train MLP With and Without Activation]

**Author(s): JAIGANESAN**

Originally published on Towards AI.

In this article, we’ll explore two important concepts in deep learning: multi-task learning (MTL) and the role of activation functions in neural networks. We’ll see how MTL works by training a multi-layer perceptron (MLP) for both **binary** and **multi-class** **classification**. We’ll also see how activation functions help a network learn complex patterns by training an MLP with and without activation functions. By the end of this article, you’ll have a clear understanding of **MTL and activation functions**.

Note: I have already published these articles on my LinkedIn profile (I publish short articles there). Because these are important concepts, I also want to share them with the Medium community.

## Multi-Task Learning (MTL) with Multi-Layer Perceptron (MLP) and Deep Learning Techniques

Multi-task learning is a machine learning method in which multiple related tasks are learned simultaneously, leveraging shared information among them to improve performance. Instead of training a separate model for each task, MTL trains a single model to handle multiple tasks, so the same network learns all of them. Given one record (a vector of independent variables), the model returns multiple outputs (targets, or dependent variables).

We’ll start by exploring the concepts behind MTL, its benefits, and its drawbacks. Then we’ll look at how it works, using architecture, code, and visual workflow examples. I use a Kaggle dataset as a running example to illustrate these concepts, and you’ll see how MTL can be applied in real-world scenarios.

## Concepts Behind Multi-Task Learning (MTL):

In MTL, **some layers or parameters** are shared across tasks, allowing the model to learn common features that benefit all tasks. The model is trained on different tasks simultaneously, and the parameters are updated based on the **combined loss from all tasks**.

In addition to shared layers, MTL models typically have **task-specific layers** that handle the **unique aspects of each task**. The final output layer of the model provides the desired output for each task.

## So, what are the advantages and disadvantages of MTL?

On the plus side, MTL can improve the performance of individual tasks when they are related. It can also act as a regularizer, preventing the model from **overfitting on a single task**. Additionally, MTL can be seen as a form of **transfer learning**.

However, there are **some drawbacks to consider**. Conflicting gradients from different tasks can affect the learning process, making it challenging to balance the learning across tasks. Furthermore, as the number of tasks increases, the complexity and computational cost of MTL can grow significantly.
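One common way to balance learning across tasks is to weight each task’s loss before summing them. The sketch below is only an illustration: the function name `combined_loss`, the weight values, and the loss values are assumptions made for this example and do not come from the notebook, which simply adds the two losses with equal weight.

```python
def combined_loss(task_losses, task_weights=None):
    """Weighted sum of per-task losses; equal weights (1.0) by default."""
    if task_weights is None:
        task_weights = [1.0] * len(task_losses)
    if len(task_weights) != len(task_losses):
        raise ValueError("need exactly one weight per task loss")
    return sum(w * loss for w, loss in zip(task_weights, task_losses))


# Hypothetical per-task loss values for one batch
loss_thal, loss_heart = 1.10, 0.69

equal = combined_loss([loss_thal, loss_heart])                 # plain sum
weighted = combined_loss([loss_thal, loss_heart], [0.5, 1.0])  # down-weight task 1

print(f"equal weights: {equal:.2f}, weighted: {weighted:.2f}")
```

With a weighted sum, a task whose loss dominates the total can be scaled down so it does not drown out the gradients of the other task.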

## Architecture, Code, and Visual Workflow

We are going to explore multi-task learning through a real-world use case. I use a Kaggle dataset (Heart Disease Dataset) to predict two targets. It has 12 independent variables (features) such as age, sex, chest pain type, and resting blood pressure. The two target variables (dependent variables) are thal (thalassemia) and heart disease.

**Two Tasks:**

**Task 1:** Predicting Thalassemia (Multi-Class Classification). The first task is to predict the type of thalassemia a patient has, if any. This is a multi-class classification problem, where we need to predict one of three outcomes: reversed thalassemia, fixed thalassemia, or normal (no thalassemia).

**Task 2:** Predicting Heart Disease (Binary Classification). The second task is to determine whether a patient has heart disease or not. This is a binary classification problem, where we need to predict one of two outcomes: yes or no.

## Multi-Task Learning with MLP

Let’s take a closer look at the neural network architecture we’re using for our Multi-Task Learning (MTL) tasks. As shown in Image 1, our model has two hidden layers that act as a shared representation, learning jointly for both tasks. Each task then has its own separate hidden layer. The output layers are determined by the target of each task, with one layer for binary classification (heart disease) and another for multi-class classification (thalassemia).

## Multi-Task Learning Code

Now, let’s take a look at the code that brings this architecture to life. The code snippet below replicates the architecture we saw in Image 1. If you’re interested in exploring further, the References section links to my Kaggle notebook, where you can see the code in action. Please review this code thoroughly to gain a complete understanding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Two shared hidden layers: their parameters learn the general nature
        # of the input and its relationship with both outputs.
        self.shared_fc1 = nn.Linear(12, 32)
        self.shared_fc2 = nn.Linear(32, 64)
        # Task 1 head: thalassemia (multi-class)
        self.thal_fc1 = nn.Linear(64, 32)
        self.thal_fc2 = nn.Linear(32, 3)  # 3 classes for thalassemia
        # Task 2 head: heart disease (binary)
        self.heart_fc1 = nn.Linear(64, 16)
        self.heart_fc2 = nn.Linear(16, 1)  # 1 output for heart disease

    def forward(self, x):
        # Shared representation used by both tasks
        x = F.relu(self.shared_fc1(x))
        x = F.relu(self.shared_fc2(x))
        # Task 1: predicting thalassemia (returns raw logits)
        thal_out = F.relu(self.thal_fc1(x))
        thal_out = self.thal_fc2(thal_out)
        # Task 2: predicting heart disease (returns a probability)
        heart_out = F.relu(self.heart_fc1(x))
        heart_out = torch.sigmoid(self.heart_fc2(heart_out))
        return thal_out, heart_out


model = MultiTaskNet()
```

```python
import torch.nn as nn
import torch.optim as optim

# Cost functions
criterion_thal = nn.CrossEntropyLoss()  # multi-class: applies log-softmax to the raw logits internally
criterion_heart = nn.BCELoss()  # binary: expects the sigmoid probabilities returned by the model

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
# train_loader is built in the notebook and yields (inputs, labels_thal, labels_heart) batches.
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    running_loss_thal = 0.0
    running_loss_heart = 0.0
    for inputs, labels_thal, labels_heart in train_loader:
        optimizer.zero_grad()  # clear the gradients left over from the previous step
        outputs_thal, outputs_heart = model(inputs)
        loss_thal = criterion_thal(outputs_thal, labels_thal)
        # squeeze(1) turns (batch, 1) into (batch,), even when the batch holds a single record
        loss_heart = criterion_heart(outputs_heart.squeeze(1), labels_heart)
        loss = loss_thal + loss_heart  # combined loss from both tasks
        loss.backward()  # compute the gradients (slopes) of the loss
        optimizer.step()  # update the parameters using those gradients
        running_loss_thal += loss_thal.item()
        running_loss_heart += loss_heart.item()
    if epoch % 10 == 0:
        print(f'Epoch {epoch+1}/{num_epochs}, Loss Thal: {running_loss_thal/len(train_loader)}, Loss Heart: {running_loss_heart/len(train_loader)}')
```

## Multi-Task Learning Visual Workflow

Before we dive into the workflow of our Multi-Task Learning project, I want to clarify an important point about how neural networks operate. You may have learned about Multi-Layer Perceptrons (MLPs) or Artificial Neural Networks (ANNs) from a neuron-centric perspective, where each neuron performs a series of operations on the input data, such as multiplying it by weights and adding bias.

However, in my articles, I present this operation in a different way, one that I believe is more accurate and intuitive. The operation of a whole layer of neurons can be thought of as a simple matrix multiplication of the input vectors and the weights, followed by the addition of a bias vector. This perspective can help simplify the complex workings of neural networks and make them easier to understand. If you are not familiar with this view, that is okay; this article will give you an in-depth understanding of how a neural network really works.

I also want to include one more piece of evidence for why it is a matrix multiplication. Image 2 shows the linear layer operation, where x is the input vector (12 independent variables in our case), A is the weight matrix, and b is the bias vector. For simplicity, I leave out the bias vector in this workflow.

Note: The numbers and calculations in the images are for illustration purposes only. They are there to help you understand the workflow.

I use a batch size of 32 and we have 12 independent variables, so the input data A has the shape (32, 12), as shown in Image 3. A is multiplied by the transpose of the 1st hidden layer weight matrix W1. W1 has the shape (32, 12), so its transpose has the shape (12, 32), and the result is the 1st hidden layer output O1 with the shape (32, 32). The 1st hidden layer has 32 neurons, which means each vector now has 32 features (from 12 to 32). ReLU activation is then applied.

“If you look closely, the neuron operation and this matrix multiplication are the same.”
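To make this equivalence concrete, here is a small pure-Python sketch. It computes one layer twice: once neuron by neuron, and once as a single matrix multiplication over the whole batch. The tiny sizes (2 records, 3 features, 2 neurons) and all numbers are made up for illustration, and the helper names are my own; the weights are stored as (neurons, features), the same layout `nn.Linear` uses.

```python
def relu(v):
    return max(0.0, v)


def layer_by_neuron(x, W, b):
    """Neuron-centric view: each neuron takes a weighted sum of x, adds its bias, applies ReLU."""
    out = []
    for weights, bias in zip(W, b):
        weighted_sum = sum(w * xi for w, xi in zip(weights, x))
        out.append(relu(weighted_sum + bias))
    return out


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    B_cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols] for row in A]


def layer_by_matmul(X, W, b):
    """Matrix view: relu(X @ W^T + b) for the whole batch at once."""
    Z = matmul(X, transpose(W))
    return [[relu(z + bias) for z, bias in zip(row, b)] for row in Z]


X = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]    # batch of 2 records, 3 features each
W = [[0.1, 0.2, 0.3], [-0.5, 0.4, -0.1]]   # 2 neurons, 3 weights each
b = [0.0, 0.1]                             # one bias per neuron

per_neuron = [layer_by_neuron(x, W, b) for x in X]
batched = layer_by_matmul(X, W, b)

print(per_neuron)
print(batched)
```

Both views produce the same (2, 2) output: one row per record and one column per neuron.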

The 1st hidden layer output O1 (32, 32) is then multiplied by the transpose of the 2nd hidden layer weight matrix W2, which has the shape (64, 32), resulting in the 2nd hidden layer output O2 with the shape (32, 64) (ReLU activation applied). The 2nd layer has 64 neurons, so each of the 32 vectors now has 64 features.

So far we have seen the shared hidden layers. Next, we will look at the task-specific hidden layers.

We have seen in the architecture (Image 1) and the code that the thalassemia task-specific hidden layer has 32 neurons (width) and the heart disease task-specific hidden layer has 16 neurons. So the output O2 is multiplied by two weight matrices here.

O2 is multiplied by the transpose of the thalassemia task hidden layer weight matrix W31, which has the shape (32, 64), resulting in output O31 with the shape (32, 32). The same O2 is multiplied by the transpose of the heart disease task hidden layer weight matrix W32, which has the shape (16, 64), resulting in output O32 with the shape (32, 16), as shown in Image 2.

First, we will look at the heart disease (Task 2) output layer. Here we have only one neuron in the output layer. The output O32 is multiplied by the transpose of the heart disease output layer weight matrix W42, which has the shape (1, 16). This results in output logits with the shape (32, 1). Since the batch size is 32, we get one logit score in the final layer for each of the 32 records, as shown in Image 6.

The sigmoid activation is applied to the output logits of the 32 records. It maps each logit into the range (0, 1), which converts the logits into probability scores. These probabilities are then rounded, or compared against a threshold value, to get the final outputs as shown in Image 7.
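Here is a minimal sketch of that last step for four records. The logit values are invented for illustration, and the 0.5 threshold is an assumption (it is equivalent to rounding the probability).

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical logits for 4 patients (one value per record)
logits = [2.0, 0.3, 1.2, -1.5]

probs = [sigmoid(z) for z in logits]
preds = [1 if p >= 0.5 else 0 for p in probs]  # 1 = heart disease, 0 = no heart disease

for z, p, y in zip(logits, probs, preds):
    print(f"logit={z:+.1f}  probability={p:.3f}  prediction={y}")
```

A positive logit always gives a probability above 0.5 and a negative logit a probability below 0.5, so thresholding at 0.5 is the same as checking the sign of the logit.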

For example, the first 3 patients have heart disease and the 4th patient doesn’t, as shown in Image 7 (for illustration purposes only). Deep learning is simple, guys, but to understand it you have to know the basics: optimizers, activation functions, gradients, cost functions, backward propagation, and so on. Here we have explored only the working mechanism, not those fundamental concepts. Now let’s look at Task 1 (multi-class classification).

The output O31 is multiplied by the transpose of the Task 1 output layer weight matrix W41, which has the shape (3, 32), resulting in a multi-class output logits matrix with the shape (32, 3).
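The shapes from this walkthrough can be checked with a few lines of Python. This is only a bookkeeping sketch: the helper `linear_output_shape` is a name I made up, and it tracks shapes without doing any real computation. Weight shapes follow the (out_features, in_features) layout used in the code above.

```python
def linear_output_shape(input_shape, weight_shape):
    """Shape of X @ W^T, where W is stored as (out_features, in_features)."""
    batch, in_features = input_shape
    out_features, expected_in = weight_shape
    if in_features != expected_in:
        raise ValueError(f"cannot multiply {input_shape} by transposed {weight_shape}")
    return (batch, out_features)


A = (32, 12)                   # batch of 32 records, 12 features
W1, W2 = (32, 12), (64, 32)    # shared hidden layers
W31, W41 = (32, 64), (3, 32)   # thalassemia head
W32, W42 = (16, 64), (1, 16)   # heart disease head

O1 = linear_output_shape(A, W1)                # (32, 32)
O2 = linear_output_shape(O1, W2)               # (32, 64)
O31 = linear_output_shape(O2, W31)             # (32, 32)
O32 = linear_output_shape(O2, W32)             # (32, 16)
thal_logits = linear_output_shape(O31, W41)    # (32, 3)
heart_logits = linear_output_shape(O32, W42)   # (32, 1)

print(O1, O2, O31, O32, thal_logits, heart_logits)
```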

The logits matrix is not shown here. Softmax activation is then applied to each row of these logits, which gives a probability score for each of the 3 classes for every record. The predicted output is whichever class has the highest probability score.
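As a small sketch of this step, the code below applies softmax row by row and picks the most probable class. The two rows of logits are made-up values; in the real model there would be 32 rows, one per record.

```python
import math


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    largest = max(row)
    exps = [math.exp(v - largest) for v in row]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for 2 records and 3 classes
logits = [[2.0, 0.5, -1.0],
          [0.1, 0.2, 3.0]]

probs = [softmax(row) for row in logits]
preds = [max(range(len(p)), key=p.__getitem__) for p in probs]  # index of the highest probability

for p, y in zip(probs, preds):
    print([round(v, 3) for v in p], "-> class", y)
```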

From this, you can understand how the cost function and loss calculation work. We have seen many weight matrices; they are all parameters, and they get updated during training.

“By understanding these deep learning concepts very well, we can use them based on our use case and requirements.” To illustrate this point, here is a link to an article where I explain how we can use image features and text features to achieve a use case.
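To connect the probabilities to the loss, here is a hand-written sketch of the two loss formulas for a single record. The probability values are invented for illustration. In the training code, `nn.BCELoss` and `nn.CrossEntropyLoss` compute these quantities and average them over the batch by default, and `nn.CrossEntropyLoss` takes the raw logits rather than probabilities.

```python
import math


def binary_cross_entropy(prob, label):
    """Loss for one record of the binary task; prob is the sigmoid output."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def cross_entropy(probs, label):
    """Loss for one record of the multi-class task; probs is the softmax output."""
    return -math.log(probs[label])


# Binary task: the true label is 1 (has heart disease)
confident_right = binary_cross_entropy(0.9, 1)
confident_wrong = binary_cross_entropy(0.1, 1)

# Multi-class task: the true class is index 0
multi_class = cross_entropy([0.7, 0.2, 0.1], 0)

print(f"BCE when right: {confident_right:.4f}, BCE when wrong: {confident_wrong:.4f}")
print(f"Cross-entropy: {multi_class:.4f}")
```

The loss is small when the model puts high probability on the true label and grows quickly when it does not, which is what pushes the weight matrices in the right direction during training.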

**From Pixels to Words: How Model Understands?** 🤝🤝 (pub.towardsai.net)

## The Role of Activation Functions in Neural Networks: With and Without Activation

What does an activation function do? Activation functions introduce non-linearity 😁 into the neural network. This allows the network to capture complex patterns in the data. Without activation functions, a neural network can only learn linear relationships, so it cannot learn complex relationships in the data.

In this part, we will first explore the ReLU and Leaky ReLU activation functions, then an MLP that uses LeakyReLU as its activation function (with code and output), and then an MLP without activation (with code and output). Finally, we will explore what happens when we don’t use activation in the neural network.

## ReLU (Rectified Linear Unit) and Leaky ReLU

What do ReLU and LeakyReLU actually do? ReLU is a simple yet effective function that converts all negative numbers into zero. In other words, it ensures that there are no negative numbers in the output of the linear layer (neurons). ReLU’s range is [0, infinity). ReLU helps mitigate the vanishing gradient problem and is cheap to compute. But ReLU also has one important drawback: the “dying neuron” problem.

As I said earlier, ReLU converts negative numbers to zero, so a neuron stops learning if it enters a state of always outputting zero (zero gradient). To solve this problem, Leaky ReLU was proposed. LeakyReLU takes a slightly different approach. Instead of completely eliminating negative values, it scales them down to very small values, allowing a tiny amount of the negative signal to pass through (small non-zero gradients).

As shown in Image 2, when negative values are multiplied by 0.01, the result is a very small negative value instead of zero. This prevents the dying neuron problem, so the weights of those neurons can keep learning.
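The difference is easy to see in a few lines of plain Python. This is a sketch of the two functions and their slopes, not the PyTorch implementation; the slope 0.01 matches the `negative_slope` used in the model code, and the gradient value chosen at exactly x = 0 is a convention.

```python
def relu(x):
    return max(0.0, x)


def leaky_relu(x, negative_slope=0.01):
    return x if x >= 0 else negative_slope * x


def relu_grad(x):
    """Slope of ReLU: 0 for negative inputs, so no learning signal gets through."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, negative_slope=0.01):
    """Slope of Leaky ReLU: small but non-zero for negative inputs."""
    return 1.0 if x > 0 else negative_slope


values = [-3.0, -0.5, 0.0, 2.0]

relu_out = [relu(v) for v in values]
leaky_out = [leaky_relu(v) for v in values]

for v in values:
    print(f"x={v:+.1f}  relu={relu(v):.3f} (grad {relu_grad(v)})  "
          f"leaky={leaky_relu(v):.3f} (grad {leaky_relu_grad(v)})")
```

For negative inputs, ReLU has a gradient of exactly zero, while Leaky ReLU keeps a gradient of 0.01, which is what lets a “dead” neuron recover.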

## The Problem/Dataset

In our dataset, we have only one independent variable (x), and our goal is to predict the target variable y. Interestingly, this was an interview assessment question I faced just two months ago. Unfortunately, I didn’t get selected 😂. However, this example is perfect for illustrating the importance of non-linearity, which is exactly what we’re going to explore.

If we take a closer look at the dataset by plotting it, we can see that it has many curves and complex patterns.

I trained a model with and without an activation function and used the test data to make predictions. We are going to explore the results I got from training and prediction.

## MLP with Activation (Leaky ReLU)

I have given the Kaggle notebook link in the References, where you can access the complete code. Here we will focus only on the MLP code and output. If you are wondering why the MLP architecture seems complex or has redundant layers, I will tell you the reason at the end. (I tried 3 architectures: simple, medium complex, and complex. This one will help us understand the problem.)

```python
import torch.nn as nn


class ANN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 16)
        self.fc2 = nn.Linear(16, 32)
        self.fc3 = nn.Linear(32, 128)
        self.fc4 = nn.Linear(128, 32)
        self.fc5 = nn.Linear(32, 16)
        self.fc6 = nn.Linear(16, 1)
        # self.relu = nn.ReLU()
        self.relu = nn.LeakyReLU(negative_slope=0.01)
        self.bn0 = nn.BatchNorm1d(16)
        self.bn = nn.BatchNorm1d(32)
        self.bn1 = nn.BatchNorm1d(128)

    def forward(self, x):
        # Each hidden block: Linear -> LeakyReLU -> BatchNorm.
        # bn0 and bn are each reused by the two layers of matching width.
        x = self.bn0(self.relu(self.fc1(x)))
        x = self.bn(self.relu(self.fc2(x)))
        x = self.bn1(self.relu(self.fc3(x)))
        x = self.bn(self.relu(self.fc4(x)))
        x = self.bn0(self.relu(self.fc5(x)))
        x = self.fc6(x)  # output layer: no activation for regression
        return x
```

## Prediction

MSE and R2 scores:

`Mean Squared Error: 0.15280713140964508`

`R2 Score: 0.838662102679086`

## MLP without Activation

I didn’t change much in this architecture; I just removed the activation function. The batch normalization layers are still there.

```python
import torch.nn as nn


class ANN_without_activation(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 16)
        self.fc2 = nn.Linear(16, 32)
        self.fc3 = nn.Linear(32, 128)
        self.fc4 = nn.Linear(128, 32)
        self.fc5 = nn.Linear(32, 16)
        self.fc6 = nn.Linear(16, 1)
        # self.relu = nn.ReLU()
        # self.relu = nn.LeakyReLU(negative_slope=0.01)
        self.bn0 = nn.BatchNorm1d(16)
        self.bn = nn.BatchNorm1d(32)
        self.bn1 = nn.BatchNorm1d(128)

    def forward(self, x):
        # Each hidden block: Linear -> BatchNorm, with no activation in between.
        x = self.bn0(self.fc1(x))
        x = self.bn(self.fc2(x))
        x = self.bn1(self.fc3(x))
        x = self.bn(self.fc4(x))
        x = self.bn0(self.fc5(x))
        x = self.fc6(x)
        return x
```

## Prediction

MSE and R2 scores:

`Mean Squared Error: 0.9459459185600281`

`R2 Score: 0.0012448664167900025`
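To read these two metrics, it helps to see how they are computed. The sketch below implements MSE and the R2 score in plain Python on a tiny made-up dataset; the numbers are for illustration only and are not taken from the notebook.

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.8]   # predictions that follow the data
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predict the mean of y_true

print("close fit:", mean_squared_error(y_true, close_fit), r2_score(y_true, close_fit))
print("mean only:", mean_squared_error(y_true, mean_only), r2_score(y_true, mean_only))
```

A model that always predicts the mean of the target gets an R2 score of 0. So the R2 score of about 0.001 for the MLP without activation means it explains almost none of the variation in y, while the score of about 0.84 with Leaky ReLU means it captures most of it.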

## What happens When we don’t use Activation Function?

Starting with the metrics: the model without the activation function has a much larger error, and its R2 score is very low. What is really happening here?

Let’s take a closer look at the below equations:

Image 6: A is the input vector (x), W is a linear layer weight matrix, and B is a bias vector. This image shows three linear layer operations without activation.
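The image itself is not reproduced here, so the following is my reconstruction of the three operations from the caption, not a copy of the original figure. It is written in the row-vector convention of `nn.Linear` (output = input times transposed weight, plus bias); the symbols W' and B' are labels I introduce for the collapsed weight and bias.

```latex
\begin{aligned}
Z_1 &= A W_1^{\top} + B_1 \\
Z_2 &= Z_1 W_2^{\top} + B_2
     = A W_1^{\top} W_2^{\top} + B_1 W_2^{\top} + B_2 \\
Z_3 &= Z_2 W_3^{\top} + B_3
     = A \underbrace{W_1^{\top} W_2^{\top} W_3^{\top}}_{W'}
       + \underbrace{B_1 W_2^{\top} W_3^{\top} + B_2 W_3^{\top} + B_3}_{B'}
     = A W' + B'
\end{aligned}
```

However many linear layers are stacked, the result has the same form as a single linear layer with weight W' and bias B'.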

What can you understand from these equations? In simple terms, if we don’t apply any activation function, our output is merely a linear combination of the input data, because the 1st linear layer output (Z1) is the input of the 2nd linear layer, and so on. This means our neural network (MLP) won’t be capable of learning any non-linear relationships between input and output. Essentially, it won’t be able to capture the complexities often present in real-world data. Without activation functions, our model won’t create any curves or intricate patterns to fit the data better. By using activation, we bend the straight line.
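Here is a small numeric check of that claim, as a pure-Python sketch. It stacks three linear layers without activation (sizes 1 → 2 → 2 → 1) and shows that the whole stack behaves exactly like one straight line y = slope · x + intercept. The layer sizes and all weight values are made up for this example and are much smaller than the article’s model.

```python
def linear(x, W, b):
    """One nn.Linear-style layer for a single record: x @ W^T + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


# Hypothetical weights, stored as (out_features, in_features)
LAYERS = [
    ([[2.0], [-1.0]], [0.5, 1.0]),             # 1 -> 2
    ([[1.0, 3.0], [0.5, -2.0]], [0.0, -1.0]),  # 2 -> 2
    ([[1.0, 1.0]], [0.25]),                    # 2 -> 1
]


def forward(x):
    """Pass a scalar input through all layers with no activation in between."""
    h = [x]
    for W, b in LAYERS:
        h = linear(h, W, b)
    return h[0]


# Recover the single straight line that the stack is equivalent to
intercept = forward(0.0)
slope = forward(1.0) - intercept

xs = [-2.0, 0.5, 3.0, 10.0]
stacked = [forward(x) for x in xs]
single_line = [slope * x + intercept for x in xs]

print(f"slope={slope}, intercept={intercept}")
print(stacked)
print(single_line)
```

The two lists match for every input, so the extra layers add parameters but no extra expressive power. The batch normalization layers kept in the model do not change this at inference time, because each one then applies a fixed scale and shift per feature, which is again a linear operation.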

For example, when training a model to classify images, the model should have a generalized understanding of the image features. Without activation, the model cannot learn these features completely, and its approximation of the image features will be very poor.

To illustrate this point more clearly, let’s take a look at the image below.

I also trained a linear regression model on this dataset. The output of linear regression and of the MLP without an activation function is the same. Linear regression can learn relationships and patterns in the data only up to a certain level (the linear level).

I used many linear layers without activation. The MLP has approximately 9,500 parameters (weights and biases), yet without an activation function it still cannot learn complex patterns in the data.

Activation functions are essential for learning the complex patterns found in real-world data, and I have explained why they are needed. The choice of activation function can significantly affect the performance and convergence of the network, so people choose it based on the nature of the data and the target.
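The parameter count can be verified from the layer sizes in the model code. This is a quick sketch in plain Python; it counts the linear layers and the batch normalization layers separately, assuming each BatchNorm1d layer has its default learnable scale and shift per feature.

```python
# Layer widths of the MLP: 1 -> 16 -> 32 -> 128 -> 32 -> 16 -> 1
layer_sizes = [1, 16, 32, 128, 32, 16, 1]


def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output neuron."""
    return in_features * out_features + out_features


linear_total = sum(
    linear_params(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])
)

# Three BatchNorm1d modules (bn0, bn, bn1), each with a scale and a shift per feature
batchnorm_total = 2 * (16 + 32 + 128)

print("linear layers:", linear_total)                    # 9473
print("batch norm:   ", batchnorm_total)                 # 352
print("total:        ", linear_total + batchnorm_total)  # 9825
```

The linear layers alone account for 9,473 parameters, which is where the figure of approximately 9,500 comes from.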

**Thanks** for reading this article 🤩. If you found it **useful** 👍, don’t forget to give a few **Clapssss** 👏 (+50). Feel free to **follow for more insights** 😉.

Let’s stay connected and explore the exciting world of AI together!

Join me on **LinkedIn** 👨💻: **linkedin.com/in/jaiganesan-n/** 🌍❤️

## References:

[1] Multi-Task Learning Kaggle Implementation Notebook

[3] Neural Network/ Multi-Layer Perceptron (MLP) Working (My article)

[4] MLP (Activation) Implementation Kaggle Notebook
