What Stops Neural Networks from Becoming Linear Models
Last Updated on May 27, 2026 by Editorial Team
Author(s): Nelson Cruz
Originally published on Towards AI.
Deep neural networks are built from surprisingly simple mathematical components.
One of the most important is the activation function — the mechanism that allows neural networks to escape linearity and model complex patterns.
Without activation functions, even extremely deep networks would collapse into simple linear transformations.
And interestingly, the evolution of activation functions mirrors the evolution of deep learning itself:
- Sigmoid dominated early neural networks
- ReLU helped unlock the deep learning revolution
- GELU became part of the Transformer era powering modern LLMs
Understanding activation functions is not just about memorizing formulas.
It is about understanding why deep learning works at all, and how networks escape linearity.
What Is an Activation Function?
At its core, a neural network neuron performs two operations:
- A weighted linear transformation
- A non-linear transformation
Mathematically, a neuron can be represented as:
$$z = Wx + b$$
Then the activation function transforms this value:
$$a = f(z)$$
Or more compactly:
$$a = f(Wx + b)$$
Where:
- x is the input
- W represents the learned weights
- b is the bias
- f is the activation function
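To make the notation concrete, here is a minimal NumPy sketch of one small layer computing f(Wx + b). The weight, bias, and input values are made up for illustration, and tanh is only one possible choice for f.

```python
import numpy as np

# Hypothetical values chosen only for illustration, not from a trained model.
W = np.array([[0.5, -1.0],
              [2.0,  0.0]])   # learned weights (2 neurons, 2 inputs)
b = np.array([0.1, -0.3])     # bias, one entry per neuron
x = np.array([1.0, 2.0])      # input vector

def f(z):
    """Activation function; tanh is an arbitrary choice for this sketch."""
    return np.tanh(z)

z = W @ x + b   # weighted linear transformation
a = f(z)        # non-linear transformation

print("z =", z)
print("a =", a)
```

Swapping `f` for another function changes only the second step; the linear part stays the same.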
The linear part alone is not enough to model complex behavior.
The activation function is what introduces non-linearity into the network.
Why Non-Linearity Changes Everything
This is one of the most important ideas in deep learning.
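Suppose we stack multiple layers together, but every layer is purely linear:
$$y = W_3\,(W_2\,(W_1 x + b_1) + b_2) + b_3$$
Even though this looks deep, the entire operation can still be simplified into a single linear transformation:
$$y = W' x + b', \quad W' = W_3 W_2 W_1, \quad b' = W_3 W_2 b_1 + W_3 b_2 + b_3$$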
That means:
A deep neural network without activation functions is mathematically equivalent to a shallow linear model.
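The collapse can be checked numerically. The sketch below assumes three linear layers with random weights (the shapes and the seed are arbitrary) and compares their stacked output with a single layer built by multiplying the weight matrices together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three purely linear layers with random (hypothetical) weights and biases.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=5)
W3, b3 = rng.normal(size=(2, 5)), rng.normal(size=2)

def deep_linear(x):
    """Apply the three layers one after another, with no activation."""
    return W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The equivalent single layer, obtained by expanding the expression above.
W_single = W3 @ W2 @ W1
b_single = W3 @ W2 @ b1 + W3 @ b2 + b3

x = rng.normal(size=3)
same = np.allclose(deep_linear(x), W_single @ x + b_single)
print("stack equals single layer:", same)
```

Placing a non-linear function between the layers breaks this algebra, which is why the stack can no longer be folded into one matrix.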
This is a massive limitation.
Linear models can only create straight decision boundaries (lines, planes, or hyperplanes).
But real-world problems are rarely linear. Images, language, speech, and many other kinds of data all contain highly non-linear relationships.
Activation functions allow neural networks to bend and reshape the decision space.
A useful way to think about it is:
Activation functions give neural networks the ability to model curved and complex relationships instead of just straight lines.
This is why non-linearity is fundamental to modern AI.
The Evolution of Activation Functions
Interestingly, the history of activation functions closely follows the evolution of deep learning architectures themselves.
Different eras of AI favored different activation functions because they solved different optimization challenges.
Sigmoid — The Early Neural Network Era
For many years, the sigmoid function was one of the dominant activation functions in neural networks.
Its formula is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The sigmoid curve smoothly compresses any input into a value between 0 and 1.

This made it attractive because:
- outputs resemble probabilities
- it is smooth and differentiable
- it was loosely inspired by biological neuron activation
Visually, the function behaves like an “S-shaped” curve:
- very negative inputs approach 0
- very positive inputs approach 1
- inputs near 0 produce the steepest change in output
This worked reasonably well in shallow neural networks.
But as networks became deeper, a major problem appeared.
The Vanishing Gradient Problem
Training neural networks relies on gradients flowing backward through the network during backpropagation.
The derivative of the sigmoid, written σ(x), is:
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
The important insight is not the formula itself.
The important insight is that the derivative becomes very small when the neuron saturates near 0 or 1.
During backpropagation, many small gradients get multiplied together across layers.
As a result:
- gradients shrink
- updates become tiny
- early layers learn extremely slowly
This phenomenon became known as the vanishing gradient problem.
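A rough back-of-the-envelope sketch shows how quickly this happens. It is a simplification: it looks only at the sigmoid derivative factors and ignores the weight matrices, which also multiply into the real gradient.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and shrinks quickly as the neuron saturates.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  derivative = {sigmoid_grad(x):.6f}")

# Best case: every layer sits exactly at the peak derivative of 0.25.
for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers -> derivative product at most {0.25 ** depth:.2e}")
```

Even in the best case, the product of derivative factors shrinks geometrically with depth, and saturated neurons make it far smaller.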
As neural networks became deeper, sigmoid increasingly struggled, and deep learning needed something better.
ReLU — The Deep Learning Revolution
Then came one of the simplest, yet most influential, activation functions ever introduced: ReLU (Rectified Linear Unit).
Its formula is almost absurdly simple:
$$\mathrm{ReLU}(x) = \max(0, x)$$
Negative values become zero and positive values pass through unchanged. That’s it.

And yet this simple function helped transform deep learning.
Why ReLU Changed Deep Learning
ReLU solved several important optimization problems simultaneously.
1. Better Gradient Flow
Unlike sigmoid, ReLU does not saturate for positive values.
This helps gradients propagate much more effectively through deep networks. As a result, training became faster, more stable, and more scalable.
2. Computational Simplicity
ReLU is extremely cheap to compute. It requires only a threshold operation, with no exponentials or other expensive operations.
This became important as models and datasets grew larger.
3. Sparse Activations
Because negative values become zero, many neurons remain inactive for certain inputs.
This creates sparse activations, which can sometimes improve efficiency and representation learning.
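Two of these properties, the non-saturating gradient and the sparsity, are easy to see in a few lines of NumPy. The pre-activations here are random draws from a standard normal distribution, which is an assumption for illustration; real networks will show different sparsity levels.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 is used at x = 0)."""
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)   # hypothetical zero-centered pre-activations

activations = relu(pre_activations)
sparsity = np.mean(activations == 0)        # fraction of neurons that output exactly zero

# The gradient stays at 1 however large the positive input becomes.
print("gradient for positive inputs:", relu_grad(np.array([0.5, 3.0, 100.0])))
print(f"fraction of inactive neurons: {sparsity:.2f}")
```

With zero-centered inputs, roughly half of the units are inactive, while every active unit passes its gradient through unchanged.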
ReLU and the Rise of Modern Deep Learning
ReLU became especially important during the rise of large convolutional neural networks.
Architectures associated with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), most notably AlexNet in 2012, demonstrated that deep networks trained with ReLU could dramatically outperform previous approaches.
This period helped trigger the modern deep learning revolution led by researchers such as Geoffrey Hinton and others. ReLU then remained the default activation function across deep learning for many years.
But eventually, newer architectures introduced new requirements.
GELU — The Transformer Era
Modern Transformer architectures often use a different activation function: GELU (Gaussian Error Linear Unit).
Its formula is:
$$\mathrm{GELU}(x) = x\,\Phi(x)$$
Where Φ(x) is the cumulative distribution function of the standard normal (Gaussian) distribution.

At first glance, GELU may look more complicated than ReLU. But conceptually, the idea is elegant.
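Instead of abruptly removing negative values like ReLU, GELU weights each input by Φ(x): large positive inputs pass through almost unchanged, large negative inputs are pushed toward zero, and inputs near zero are scaled smoothly in between.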
You can think of GELU as a softer and more probabilistic version of ReLU.
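A quick comparison at a handful of points illustrates the "softer ReLU" intuition. This sketch uses the exact erf-based form of GELU from the standard library; the sample inputs are arbitrary.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

# Compare the two functions on a few arbitrary inputs.
for x in (-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0):
    print(f"x = {x:5.1f}  ReLU = {relu(x):7.4f}  GELU = {gelu(x):7.4f}")
```

Small negative inputs produce small negative outputs under GELU rather than exactly zero, while large positive inputs are nearly identical to ReLU.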
Why GELU Works Well in Transformers
Transformers process extremely rich contextual embeddings.
Small variations in token representations can carry important semantic meaning.
Because GELU is smooth and does not zero out every negative input, it is often credited with:
- preserving subtle information better
- making optimization smoother
- keeping representations richer
This became particularly useful in architectures such as BERT from Google and GPT-style models from OpenAI.
Today, GELU is strongly associated with the Transformer era of deep learning.
Visualizing Activation Functions in Python
Activation functions become much easier to understand when we visualize them.
The following code compares Sigmoid, ReLU, and GELU.
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf

x = np.linspace(-5, 5, 500)

def sigmoid(x):
    # Squashes any input into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))

plt.figure(figsize=(10, 6))
plt.plot(x, sigmoid(x), label="Sigmoid")
plt.plot(x, relu(x), label="ReLU")
plt.plot(x, gelu(x), label="GELU")
plt.axhline(0, linestyle="--", linewidth=0.8)
plt.axvline(0, linestyle="--", linewidth=0.8)
plt.title("Activation Functions")
plt.xlabel("Input")
plt.ylabel("Output")
plt.legend()
plt.grid(True)
plt.show()

Several important behaviors become immediately visible:
- Sigmoid saturates at extreme values
- ReLU creates hard thresholds
- GELU behaves more smoothly
This simple visualization already explains much of the historical evolution of activation functions.
Why Non-Linearity Matters in Practice
Let’s now build a simple non-linear classification problem.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n_samples = 500

# Points sampled uniformly from the square [-1, 1] x [-1, 1]
X = np.random.uniform(-1, 1, size=(n_samples, 2))

# Label is 1 when the point lies outside a circle of radius 0.5
radius = np.sqrt(X[:, 0]**2 + X[:, 1]**2)
y = (radius > 0.5).astype(int)

plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)
plt.title("Non-Linear Classification Problem")
plt.xlabel("x1")
plt.ylabel("x2")
plt.grid(True)
plt.show()

A linear model would struggle with this dataset because the decision boundary is circular.
A straight line cannot separate the classes properly.
This is exactly where activation functions become essential.
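To see the struggle directly, the sketch below fits a purely linear classifier (logistic regression trained with plain gradient descent) on the same kind of circular data. The learning rate, step count, and random generator are assumptions for illustration, so the sampled points differ from the ones plotted above. The sigmoid at the output only converts the score into a probability; the decision boundary is still a straight line.

```python
import numpy as np

rng = np.random.default_rng(42)

# Points in a square, labeled by whether they fall outside a circle of radius 0.5.
X = rng.uniform(-1, 1, size=(500, 2))
y = (np.sqrt(X[:, 0]**2 + X[:, 1]**2) > 0.5).astype(float)

# Logistic regression: one linear layer followed by a sigmoid output.
w = np.zeros(2)
b = 0.0
lr = 0.5   # hypothetical learning rate

for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of class 1
    grad_w = X.T @ (p - y) / len(y)          # gradient of the mean cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((X @ w + b > 0) == (y == 1))
majority = max(y.mean(), 1 - y.mean())       # accuracy of always guessing the larger class

print(f"linear model accuracy: {accuracy:.3f}")
print(f"majority-class baseline: {majority:.3f}")
```

The linear model ends up close to the majority-class baseline, because no straight line can separate the inside of the circle from the outside.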
A Small Neural Network Example
Now let’s train a small neural network using PyTorch.
import torch
import torch.nn as nn
import torch.optim as optim

# X and y come from the classification dataset generated above
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

class SmallNeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),
            nn.Linear(16, 16),
            nn.ReLU(),
            nn.Linear(16, 2)  # two output logits, one per class
        )

    def forward(self, x):
        return self.network(x)

model = SmallNeuralNetwork()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(500):
    # Forward pass on the full dataset
    predictions = model(X_tensor)
    loss = criterion(predictions, y_tensor)

    # Backward pass and parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        predicted_classes = torch.argmax(predictions, dim=1)
        accuracy = (predicted_classes == y_tensor).float().mean()
        print(
            f"Epoch {epoch}, "
            f"Loss: {loss.item():.4f}, "
            f"Accuracy: {accuracy.item():.4f}"
        )
The critical component is not only the linear layers:
nn.Linear(...)
It is the activation functions inserted between them:
nn.ReLU()
Without activation functions, stacking multiple linear layers would still produce a linear model.
With activation functions, the network can progressively reshape the feature space into highly complex decision boundaries.

Softmax — More Than Just an Activation Function
Many explanations treat Softmax as “just another activation function.”
But Softmax plays a very different role.
Its formula is:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
Softmax converts raw outputs (called logits) into a probability distribution.
The outputs:
- become positive
- sum to 1
- can be interpreted as probabilities
This makes Softmax especially useful for multi-class classification.
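A minimal NumPy sketch of Softmax follows. The logits are made-up scores for three classes, and subtracting the maximum before exponentiating is a common stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(logits):
    """Convert a vector of logits into a probability distribution."""
    shifted = logits - np.max(logits)   # avoids overflow; the output is the same
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])      # hypothetical raw scores for three classes
probs = softmax(logits)

print("probabilities:", np.round(probs, 3))
print("sum:", probs.sum())
print("predicted class:", int(np.argmax(probs)))
```

The largest logit receives the largest probability, but the other classes keep non-zero probability, which is what makes the output usable with a cross-entropy loss.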
Why Softmax Matters in Transformers
Softmax became even more important with the rise of Transformers.
In attention mechanisms, Softmax is used to transform similarity scores into attention weights.
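The attention equation is:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Here Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the keys.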
Conceptually:
- tokens compute similarity scores with one another
- Softmax normalizes these scores
- the model decides how much attention each token should receive
This mechanism is fundamental to modern LLMs. Softmax is therefore not merely an activation function; it is a core probabilistic mechanism behind modern attention systems.
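The three conceptual steps can be sketched in NumPy using the standard scaled dot-product formulation, softmax(QK^T / sqrt(d_k)) V. This is a single-head version without masking, and the token count, dimension, and random Q, K, V matrices are hypothetical; real models compute them from learned projections.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head, without masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # each row becomes a distribution over tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8                     # hypothetical sizes
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

output, weights = attention(Q, K, V)
print("attention weights shape:", weights.shape)
print("row sums:", weights.sum(axis=1))
```

Each row of `weights` sums to 1, so every output token is a weighted average of the value vectors.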
Practical Engineering Takeaways
Different activation functions are useful for different reasons.
- Binary classification output => Sigmoid
- Deep CNN hidden layers => ReLU
- Transformers and LLMs => GELU
- Multi-class output layer => Softmax
Activation functions directly affect:
- gradient propagation
- convergence speed
- training stability
- computational efficiency
- representation quality
Choosing an activation function is partly about mathematics — but also about optimization behavior and architecture design.
Final Thoughts
Activation functions are some of the simplest mathematical components inside neural networks.
But they fundamentally define what a model is capable of learning.
The evolution from Sigmoid to ReLU to GELU closely mirrors the evolution of deep learning itself.
Without activation functions, deep networks would collapse into linear systems, modern computer vision would struggle, and Transformers and LLMs would not work the way they do today.
In many ways, activation functions are one of the hidden engines behind modern AI.
If this article helped you better understand how neural networks actually learn complex behavior, feel free to connect or share your thoughts.
Published via Towards AI