What Stops Neural Networks from Becoming Linear Models

Last Updated on May 27, 2026 by Editorial Team

Author(s): Nelson Cruz

Originally published on Towards AI.

Deep neural networks are built from surprisingly simple mathematical components.

One of the most important is the activation function — the mechanism that allows neural networks to escape linearity and model complex patterns.

Without activation functions, even extremely deep networks would collapse into simple linear transformations.

And interestingly, the evolution of activation functions mirrors the evolution of deep learning itself:

Sigmoid dominated early neural networks
ReLU helped unlock the deep learning revolution
GELU became part of the Transformer era powering modern LLMs

Understanding activation functions is not just about memorizing formulas.
It is about understanding how deep networks escape linearity, and therefore why deep learning works at all.

What Is an Activation Function?

At its core, a neural network neuron performs two operations:

A weighted linear transformation
A non-linear transformation

Mathematically, a neuron can be represented as:

$$z = Wx + b$$

Then the activation function transforms this value:

$$a = f(z)$$

Or more compactly:

$$a = f(Wx + b)$$

Where:

x is the input
W represents the learned weights
b is the bias
f is the activation function

The linear part alone is not enough to model complex behavior.

The activation function is what introduces non-linearity into the network.
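The two operations described above can be sketched in a few lines of plain Python. This is only an illustration: the input, weights, and bias values are hypothetical, and sigmoid is used as one possible choice of f.

```python
import math

def neuron(x, w, b, f):
    """Compute a = f(w . x + b) for a single neuron."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # weighted linear part
    return f(z)                                       # non-linear part

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, chosen only for illustration.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

pre_activation = neuron(x, w, b, lambda z: z)  # identity: linear part only
activation = neuron(x, w, b, sigmoid)          # linear part followed by sigmoid
print(pre_activation, activation)
```

Passing the identity function as f shows what the neuron computes before the activation is applied.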

Why Non-Linearity Changes Everything

This is one of the most important ideas in deep learning.

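Suppose we stack multiple layers together, but every layer is purely linear. With three layers, for example:

$$y = W_3 (W_2 (W_1 x + b_1) + b_2) + b_3$$

Even though this looks deep, the entire operation can still be simplified into a single linear transformation:

$$y = W' x + b', \quad W' = W_3 W_2 W_1, \quad b' = W_3 W_2 b_1 + W_3 b_2 + b_3$$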

That means:

A deep neural network without activation functions is mathematically equivalent to a shallow linear model.
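This collapse is easy to check numerically. The sketch below assumes three linear layers with hypothetical shapes (4 to 8 to 8 to 3) and random weights. It compares the stacked computation with a single merged layer, then repeats the comparison with ReLU inserted between the layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three purely linear layers with hypothetical shapes 4 -> 8 -> 8 -> 3.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)
W3, b3 = rng.normal(size=(3, 8)), rng.normal(size=3)

x = rng.normal(size=4)

# "Deep" forward pass with no activation functions.
deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The same computation folded into one weight matrix and one bias vector.
W = W3 @ W2 @ W1
b = W3 @ W2 @ b1 + W3 @ b2 + b3
shallow = W @ x + b

print("linear stack equals single layer:", np.allclose(deep, shallow))

# With ReLU between the layers, the merged layer no longer reproduces the output.
def relu(v):
    return np.maximum(0, v)

deep_relu = W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3
print("ReLU stack equals single layer:", np.allclose(deep_relu, shallow))
```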

This is a massive limitation.

Linear models can only create flat decision boundaries: a straight line in two dimensions, or a hyperplane in higher dimensions.

But real-world problems are rarely linear. Images, language, speech, and many other kinds of data all contain highly non-linear relationships.

Activation functions allow neural networks to bend and reshape the decision space.

A useful way to think about it is:

Activation functions give neural networks the ability to model curved and complex relationships instead of just straight lines.

This is why non-linearity is fundamental to modern AI.

The Evolution of Activation Functions

Interestingly, the history of activation functions closely follows the evolution of deep learning architectures themselves.

Different eras of AI favored different activation functions because they solved different optimization challenges.

Sigmoid — The Early Neural Network Era

For many years, the sigmoid function was one of the dominant activation functions in neural networks.

Its formula is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid curve smoothly compresses any input into a value between 0 and 1.

This made it attractive because:

outputs resemble probabilities
it is smooth and differentiable
it was loosely inspired by biological neuron activation

Visually, the function behaves like an “S-shaped” curve:

very negative inputs approach 0
very positive inputs approach 1
values near 0 remain more sensitive

This worked reasonably well in shallow neural networks.
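To make the S-shaped behavior concrete, here is a small sketch that evaluates sigmoid at a few points. It uses a numerically stable formulation, which is an implementation choice for this example rather than something the formula requires: the branch on the sign of x avoids exponentiating a large positive number.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so this cannot overflow
    return e / (1.0 + e)

# Very negative inputs approach 0, very positive inputs approach 1,
# and inputs near 0 are where the output changes fastest.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.5f}")
```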

But as networks became deeper, a major problem appeared.

The Vanishing Gradient Problem

Training neural networks relies on gradients flowing backward through the network during backpropagation.

The sigmoid derivative is:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

The important insight is not the formula itself.

The important insight is that the derivative never exceeds 0.25 (its value at x = 0) and becomes very small when the neuron saturates near 0 or 1.

During backpropagation, many small gradients get multiplied together across layers.

As a result:

gradients shrink
updates become tiny
early layers learn extremely slowly

This phenomenon became known as the vanishing gradient problem.
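A rough back-of-the-envelope sketch shows how quickly this happens. It looks only at the activation derivatives and ignores the weight matrices, which also enter the product during backpropagation, so treat it as a simplified picture rather than a full analysis.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at x = 0) and decays quickly as |x| grows.
for x in (0.0, 2.0, 5.0):
    print(f"sigmoid'({x}) = {sigmoid_grad(x):.6f}")

# Best case for a chain of sigmoid layers: every factor equals 0.25.
for depth in (5, 10, 20):
    print(f"depth {depth:2d}: product of activation derivatives <= {0.25 ** depth:.3e}")
```

Even in the best case, where every neuron sits exactly at the point of maximum slope, the product shrinks geometrically with depth.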

As neural networks became deeper, sigmoid increasingly struggled, and deep learning needed something better.

ReLU — The Deep Learning Revolution

Then came one of the simplest, yet most influential, activation functions ever introduced: ReLU (Rectified Linear Unit).

Its formula is almost absurdly simple:

$$\text{ReLU}(x) = \max(0, x)$$

Negative values become zero and positive values pass through unchanged. That’s it.

And yet this simple function helped transform deep learning.

Why ReLU Changed Deep Learning

ReLU solved several important optimization problems simultaneously.

1. Better Gradient Flow

Unlike sigmoid, ReLU does not saturate for positive values.

For positive inputs its derivative is exactly 1, so gradients pass through unchanged instead of shrinking at every layer. As a result, training became faster, more stable, and more scalable.

2. Computational Simplicity

ReLU is extremely cheap to compute. It requires only a comparison with zero, with no exponentials or other expensive operations.

This became important as models and datasets grew larger.

3. Sparse Activations

Because negative values become zero, many neurons remain inactive for certain inputs.

This creates sparse activations, which can sometimes improve efficiency and representation learning.
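The gradient and sparsity properties can both be seen in a short sketch. The pre-activations here are hypothetical: they are drawn from a standard normal distribution, so roughly half of them are negative. Real networks will show different proportions depending on weights and inputs.

```python
import random

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative is 1 for positive inputs and 0 for negative inputs.
    # The value at exactly 0 is a convention; 0 is used here.
    return 1.0 if x > 0 else 0.0

random.seed(0)

# Hypothetical pre-activations, standard normal, for illustration only.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
activations = [relu(z) for z in pre_activations]

inactive_fraction = sum(a == 0.0 for a in activations) / len(activations)
print(f"fraction of inactive units: {inactive_fraction:.3f}")
print("gradient for a positive input:", relu_grad(3.0))
print("gradient for a negative input:", relu_grad(-3.0))
```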

ReLU and the Rise of Modern Deep Learning

ReLU became especially important during the rise of large convolutional neural networks.

Architectures such as AlexNet, the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), demonstrated that deep networks trained with ReLU could dramatically outperform previous approaches.

This period helped trigger the modern deep learning revolution, led by researchers such as Geoffrey Hinton. For many years afterward, ReLU was the default activation function across deep learning.

But eventually, newer architectures introduced new requirements.

GELU — The Transformer Era

Modern Transformer architectures often use a different activation function: GELU (Gaussian Error Linear Unit).

Its formula is:

$$\text{GELU}(x) = x \, \Phi(x)$$

Where Φ(x) is the cumulative distribution function of the standard Gaussian distribution, Φ(x) = ½ (1 + erf(x / √2)).

At first glance, GELU may look more complicated than ReLU. But conceptually, the idea is elegant.

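Instead of abruptly zeroing negative values like ReLU, GELU scales each input x by Φ(x), the probability that a standard Gaussian variable falls below x. Large positive inputs pass through almost unchanged, large negative inputs are pushed toward zero, and values near zero are only partially attenuated.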

You can think of GELU as a softer and more probabilistic version of ReLU.
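The sketch below compares ReLU with GELU at a few points, using only the standard library. It also includes the tanh-based approximation of GELU that many implementations offer. That approximation is not discussed in this article and is shown only for comparison.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Commonly used tanh approximation of GELU (not covered in the article)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}  gelu_tanh={gelu_tanh(x):+.4f}")
```

Note that GELU returns small negative values for moderately negative inputs, whereas ReLU returns exactly zero.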

Why GELU Works Well in Transformers

Transformers process extremely rich contextual embeddings.

Small variations in token representations can carry important semantic meaning.

Because GELU behaves more smoothly than ReLU:

subtle information is preserved better
optimization becomes smoother
representations remain richer

This became particularly useful in architectures such as BERT from Google and GPT-style models from OpenAI.

Today, GELU is strongly associated with the Transformer era of deep learning.

Visualizing Activation Functions in Python

Activation functions become much easier to understand when we visualize them.

The following code compares Sigmoid, ReLU, and GELU.

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf

x = np.linspace(-5, 5, 500)

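def sigmoid(x):
    # Squashes any real input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def relu(x):
    # Zeroes negative inputs, passes positive inputs through unchanged
    return np.maximum(0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))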

plt.figure(figsize=(10, 6))

plt.plot(x, sigmoid(x), label="Sigmoid")
plt.plot(x, relu(x), label="ReLU")
plt.plot(x, gelu(x), label="GELU")

plt.axhline(0, linestyle="--", linewidth=0.8)
plt.axvline(0, linestyle="--", linewidth=0.8)

plt.title("Activation Functions")
plt.xlabel("Input")
plt.ylabel("Output")
plt.legend()
plt.grid(True)

plt.show()

Several important behaviors become immediately visible:

Sigmoid saturates at extreme values
ReLU creates hard thresholds
GELU behaves more smoothly

This simple visualization already explains much of the historical evolution of activation functions.

Why Non-Linearity Matters in Practice

Let’s now build a simple non-linear classification problem.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

n_samples = 500

X = np.random.uniform(-1, 1, size=(n_samples, 2))

radius = np.sqrt(X[:, 0]**2 + X[:, 1]**2)

y = (radius > 0.5).astype(int)

plt.figure(figsize=(6, 6))

plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)

plt.title("Non-Linear Classification Problem")
plt.xlabel("x1")
plt.ylabel("x2")
plt.grid(True)

plt.show()

A linear model would struggle with this dataset because the decision boundary is circular.

A straight line cannot separate the classes properly.

This is exactly where activation functions become essential.
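To see how a linear model fares, here is a minimal sketch that fits logistic regression with plain gradient descent on the same kind of data. The data is regenerated so the block runs on its own, and the learning rate and number of steps are hypothetical choices. Because about 80% of the points fall outside the circle, the fair comparison is against always predicting the majority class.

```python
import numpy as np

np.random.seed(42)
n_samples = 500
X = np.random.uniform(-1, 1, size=(n_samples, 2))
y = (np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2) > 0.5).astype(int)

# Logistic regression: a single linear layer followed by a sigmoid output.
w = np.zeros(2)
b = 0.0
lr = 0.5  # hypothetical learning rate for this sketch

for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(y = 1)
    grad_w = X.T @ (p - y) / n_samples      # gradient of the mean log loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

linear_accuracy = np.mean(((X @ w + b) > 0).astype(int) == y)
majority_accuracy = max(y.mean(), 1 - y.mean())

print(f"linear model accuracy:   {linear_accuracy:.3f}")
print(f"majority-class baseline: {majority_accuracy:.3f}")
```

If the two numbers are close, the linear model has learned essentially nothing about the circular structure.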

A Small Neural Network Example

Now let’s train a small neural network using PyTorch.

import torch
import torch.nn as nn
import torch.optim as optim

X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

class SmallNeuralNetwork(nn.Module):

    def __init__(self):
        super().__init__()

        # Two hidden layers of 16 units, each followed by a ReLU
        self.network = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),

            nn.Linear(16, 16),
            nn.ReLU(),

            nn.Linear(16, 2)
        )

    def forward(self, x):
        return self.network(x)

model = SmallNeuralNetwork()

# CrossEntropyLoss expects raw logits, so the model has no final Softmax
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(500):

    predictions = model(X_tensor)

    loss = criterion(predictions, y_tensor)

    optimizer.zero_grad()

    loss.backward()

    optimizer.step()

    if epoch % 100 == 0:

        predicted_classes = torch.argmax(predictions, dim=1)

        accuracy = (
            predicted_classes == y_tensor
        ).float().mean()

        print(
            f"Epoch {epoch}, "
            f"Loss: {loss.item():.4f}, "
            f"Accuracy: {accuracy.item():.4f}"
        )

The critical component is not only the linear layers:

nn.Linear(...)

It is the activation functions inserted between them:

nn.ReLU()

Without activation functions, stacking multiple linear layers would still produce a linear model.

With activation functions, the network can progressively reshape the feature space into highly complex decision boundaries.

Softmax — More Than Just an Activation Function

Many explanations treat Softmax as “just another activation function.”

But Softmax plays a very different role.

Its formula is:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

Softmax converts raw outputs (called logits) into a probability distribution.

The outputs:

become positive
sum to 1
can be interpreted as probabilities

This makes Softmax especially useful for multi-class classification.
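A minimal sketch of Softmax in plain Python is shown below. The logits are hypothetical. Subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                           # subtracting the max avoids overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probs = softmax(logits)

print([round(p, 3) for p in probs])
print("sum:", sum(probs))
```

The largest logit receives the largest probability, and the outputs always sum to 1.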

Why Softmax Matters in Transformers

Softmax became even more important with the rise of Transformers.

In attention mechanisms, Softmax is used to transform similarity scores into attention weights.

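The attention equation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

Here Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the keys.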

Conceptually:

tokens compute similarity scores with one another
Softmax normalizes these scores
the model decides how much attention each token should receive
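These steps can be sketched with NumPy as a single-head scaled dot-product attention. This is a simplified illustration under stated assumptions: random hypothetical Q, K, and V matrices, no masking, no batching, and no learned projections.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row-wise max for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every token with every other
    weights = softmax(scores, axis=-1)  # each row becomes a probability distribution
    return weights @ V, weights         # weighted mix of the value vectors

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8  # hypothetical sizes

Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(3))
print("row sums:", weights.sum(axis=1))
```

Each row of the weights matrix describes how one token distributes its attention over all tokens.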

This mechanism is fundamental to modern LLMs. Softmax is therefore not merely an activation function: it is a core probabilistic mechanism behind modern attention systems.

Practical Engineering Takeaways

Different activation functions are useful for different reasons.

Binary classification output layer => Sigmoid
Deep CNN hidden layers => ReLU
Transformers and LLMs => GELU
Multi-class output layer => Softmax

Activation functions directly affect:

gradient propagation
convergence speed
training stability
computational efficiency
representation quality

Choosing an activation function is partly about mathematics — but also about optimization behavior and architecture design.

Final Thoughts

Activation functions are some of the simplest mathematical components inside neural networks.

But they fundamentally define what a model is capable of learning.

The evolution from Sigmoid to ReLU to GELU closely mirrors the evolution of deep learning itself.

Without activation functions, deep networks would collapse into linear systems, modern computer vision would struggle, and Transformers and LLMs would not work the way they do today.

In many ways, activation functions are one of the hidden engines behind modern AI.

If this article helped you better understand how neural networks actually learn complex behavior, feel free to connect or share your thoughts.

Published via Towards AI
