What Stops Neural Networks from Becoming Linear Models

Last Updated on May 27, 2026 by Editorial Team

Author(s): Nelson Cruz

Originally published on Towards AI.

Deep neural networks are built from surprisingly simple mathematical components.

One of the most important is the activation function — the mechanism that allows neural networks to escape linearity and model complex patterns.

Without activation functions, even extremely deep networks would collapse into simple linear transformations.

And interestingly, the evolution of activation functions mirrors the evolution of deep learning itself:

  • Sigmoid dominated early neural networks
  • ReLU helped unlock the deep learning revolution
  • GELU became part of the Transformer era powering modern LLMs

Understanding activation functions is not just about memorizing formulas.
It is about understanding how deep networks escape linearity, and therefore why deep learning works at all.

What Is an Activation Function?

At its core, a neural network neuron performs two operations:

  1. A weighted linear transformation
  2. A non-linear transformation

Mathematically, a neuron can be represented as:

$$z = Wx + b$$

Then the activation function transforms this value:

$$a = f(z)$$

Or more compactly:

$$a = f(Wx + b)$$

Where:

  • x is the input
  • W represents the learned weights
  • b is the bias
  • f is the activation function

The linear part alone is not enough to model complex behavior.

The activation function is what introduces non-linearity into the network.
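As a rough illustration of the two operations, the sketch below computes the output of one small layer with NumPy. The numeric values of x, W, and b, the helper name `neuron_layer`, and the choice of ReLU as f are all assumptions made for the example; only the structure (a linear step followed by an activation) comes from the description above.

```python
import numpy as np

def relu(z):
    # One possible choice for the activation function f
    return np.maximum(0, z)

def neuron_layer(x, W, b, f):
    # Step 1: weighted linear transformation
    z = W @ x + b
    # Step 2: non-linear transformation
    return f(z)

# Illustrative values only (not taken from the article)
x = np.array([1.0, 2.0])
W = np.array([[0.5, 1.0],
              [-1.0, 0.25]])
b = np.array([0.1, 0.2])

output = neuron_layer(x, W, b, relu)
print(output)
```

The first unit has a positive pre-activation and passes it through, while the second has a negative pre-activation and is set to zero. That asymmetry is exactly what a purely linear layer cannot produce.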

Why Non-Linearity Changes Everything

This is one of the most important ideas in deep learning.

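Suppose we stack multiple layers together (three, for example), but every layer is purely linear:

$$y = W_3\,(W_2\,(W_1 x + b_1) + b_2) + b_3$$

Even though this looks deep, the entire operation can still be simplified into a single linear transformation:

$$y = W'x + b', \quad W' = W_3 W_2 W_1, \quad b' = W_3 W_2 b_1 + W_3 b_2 + b_3$$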

That means:

A deep neural network without activation functions is mathematically equivalent to a shallow linear model.
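This equivalence is easy to check numerically. The following is a minimal sketch, assuming three linear layers with random weights (the layer sizes and the names `deep_linear`, `W_single`, and `b_single` are arbitrary choices for illustration). It folds the three layers into one and compares the outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three purely linear layers with arbitrary (random) weights and biases
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)
W3, b3 = rng.normal(size=(1, 4)), rng.normal(size=1)

def deep_linear(x):
    # No activation function between the layers
    return W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# Fold the three layers into one equivalent linear layer
W_single = W3 @ W2 @ W1
b_single = W3 @ W2 @ b1 + W3 @ b2 + b3

x = rng.normal(size=2)
same = np.allclose(deep_linear(x), W_single @ x + b_single)
print(same)  # True
```

However many linear layers are stacked, the same folding applies, so depth alone adds no expressive power.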

This is a massive limitation.

Linear models can only create straight decision boundaries: lines in two dimensions, and flat hyperplanes in higher dimensions.

But real-world problems are rarely linear. Highly non-linear relationships appear in:

  • images
  • language
  • speech

and in most other real-world data.

Activation functions allow neural networks to bend and reshape the decision space.

A useful way to think about it is:

Activation functions give neural networks the ability to model curved and complex relationships instead of just straight lines.

This is why non-linearity is fundamental to modern AI.

The Evolution of Activation Functions

Interestingly, the history of activation functions closely follows the evolution of deep learning architectures themselves.

Different eras of AI favored different activation functions because they solved different optimization challenges.

Sigmoid — The Early Neural Network Era

For many years, the sigmoid function was one of the dominant activation functions in neural networks.

Its formula is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The sigmoid curve smoothly compresses any input into a value between 0 and 1.

Source: Image by the author.

This made it attractive because:

  • outputs resemble probabilities
  • it is smooth and differentiable
  • it was loosely inspired by biological neuron activation

Visually, the function behaves like an “S-shaped” curve:

  • very negative inputs approach 0
  • very positive inputs approach 1
  • inputs near 0 fall on the steepest part of the curve, where the output is most sensitive to changes

This worked reasonably well in shallow neural networks.

But as networks became deeper, a major problem appeared.

The Vanishing Gradient Problem

Training neural networks relies on gradients flowing backward through the network during backpropagation.

The sigmoid derivative is:

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$

The important insight is not the formula itself.

The important insight is that the derivative becomes very small when the neuron saturates near 0 or 1.

During backpropagation, many small gradients get multiplied together across layers.

As a result:

  • gradients shrink
  • updates become tiny
  • early layers learn extremely slowly

This phenomenon became known as the vanishing gradient problem.
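To get a feel for the scale of the effect, here is a small sketch using the standard identity σ'(x) = σ(x)(1 − σ(x)). It is a deliberately simplified picture: it ignores the weight matrices and assumes every layer contributes at most the maximum sigmoid derivative (0.25), so the second loop shows a best-case bound rather than a real training run.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and shrinks quickly as the neuron saturates
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  derivative = {sigmoid_derivative(x):.6f}")

# Best case: every layer contributes the maximum derivative of 0.25
for n_layers in [5, 10, 20]:
    print(f"{n_layers:2d} layers  upper bound on gradient factor = {0.25 ** n_layers:.2e}")
```

Even in this best case the factor shrinks geometrically with depth, which is why the earliest layers receive almost no learning signal.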

As neural networks became deeper, sigmoid increasingly struggled, and deep learning needed something better.

ReLU — The Deep Learning Revolution

Then came one of the simplest, yet most influential, activation functions ever introduced: ReLU (Rectified Linear Unit).

Its formula is almost absurdly simple:

$$\text{ReLU}(x) = \max(0, x)$$

Negative values become zero and positive values pass through unchanged. That’s it.

Source: Image by the author.

And yet this simple function helped transform deep learning.

Why ReLU Changed Deep Learning

ReLU solved several important optimization problems simultaneously.

1. Better Gradient Flow

Unlike sigmoid, ReLU does not saturate for positive values.

This helps gradients propagate much more effectively through deep networks. So training became faster, more stable and more scalable.

2. Computational Simplicity

ReLU is extremely cheap to compute. It requires only a threshold operation, with no exponentials or other expensive operations.

This became important as models and datasets grew larger.

3. Sparse Activations

Because negative values become zero, many neurons remain inactive for certain inputs.

This creates sparse activations, which can sometimes improve efficiency and representation learning.
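The gradient and sparsity properties can be seen in a few lines. This sketch assumes hypothetical pre-activations drawn from a standard normal distribution, which is an illustrative choice and not something measured from a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations: 10,000 values from a standard normal
z = rng.normal(size=10_000)

activations = np.maximum(0, z)       # ReLU
gradients = (z > 0).astype(float)    # ReLU derivative: 1 for z > 0, else 0

sparsity = np.mean(activations == 0)
print(f"Fraction of inactive units: {sparsity:.2f}")
print(f"Gradient for active units:  {gradients[z > 0].min():.1f}")
```

Under this assumption roughly half the units are inactive, and every active unit passes its gradient through unchanged instead of shrinking it.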

ReLU and the Rise of Modern Deep Learning

ReLU became especially important during the rise of large convolutional neural networks.

Architectures associated with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), most notably AlexNet in 2012, demonstrated that deep networks trained with ReLU could dramatically outperform previous approaches.

This period helped trigger the modern deep learning revolution led by researchers such as Geoffrey Hinton. ReLU then remained the default activation function across deep learning for many years.

But eventually, newer architectures introduced new requirements.

GELU — The Transformer Era

Modern Transformer architectures often use a different activation function: GELU (Gaussian Error Linear Unit).

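Its formula is:

$$\text{GELU}(x) = x\,\Phi(x), \quad \Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where Φ(x) is the cumulative distribution function of the standard normal (Gaussian) distribution.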

Source: Image by the author.

At first glance, GELU may look more complicated than ReLU. But conceptually, the idea is elegant.

Instead of abruptly removing negative values like ReLU, GELU scales each input x by Φ(x), so small negative values are damped smoothly rather than cut to zero.

You can think of GELU as a softer and more probabilistic version of ReLU.
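A quick numerical comparison makes the "softer ReLU" intuition concrete. The sketch below uses the exact form x·Φ(x), computed with the standard library's `math.erf`; the sample inputs are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact form: x * Phi(x), with Phi the standard normal CDF
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

for x in [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]:
    print(f"x = {x:5.1f}  ReLU = {relu(x):7.4f}  GELU = {gelu(x):7.4f}")
```

For large positive inputs the two functions nearly coincide, and for large negative inputs both are close to zero. The difference is concentrated around zero, where GELU lets a small negative signal through instead of discarding it.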

Why GELU Works Well in Transformers

Transformers process extremely rich contextual embeddings.

Small variations in token representations can carry important semantic meaning.

Because GELU behaves more smoothly than ReLU:

  • subtle information is preserved better
  • optimization becomes smoother
  • representations remain richer

This became particularly useful in architectures such as BERT from Google and GPT-style models from OpenAI.

Today, GELU is strongly associated with the Transformer era of deep learning.

Visualizing Activation Functions in Python

Activation functions become much easier to understand when we visualize them.

The following code compares Sigmoid, ReLU, and GELU.

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf

# Evenly spaced inputs covering both the saturated and the active regions
x = np.linspace(-5, 5, 500)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), written with the error function
    return 0.5 * x * (1 + erf(x / np.sqrt(2)))

plt.figure(figsize=(10, 6))

plt.plot(x, sigmoid(x), label="Sigmoid")
plt.plot(x, relu(x), label="ReLU")
plt.plot(x, gelu(x), label="GELU")

# Dashed reference lines through the origin
plt.axhline(0, linestyle="--", linewidth=0.8)
plt.axvline(0, linestyle="--", linewidth=0.8)

plt.title("Activation Functions")
plt.xlabel("Input")
plt.ylabel("Output")
plt.legend()
plt.grid(True)

plt.show()

Source: Image by the author.

Several important behaviors become immediately visible:

  • Sigmoid saturates at extreme values
  • ReLU creates hard thresholds
  • GELU behaves more smoothly

This simple visualization already explains much of the historical evolution of activation functions.

Why Non-Linearity Matters in Practice

Let’s now build a simple non-linear classification problem.

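import numpy as np
import matplotlib.pyplot as plt

# Fix the seed so the dataset is reproducible
np.random.seed(42)

n_samples = 500

# Points sampled uniformly from the square [-1, 1] x [-1, 1]
X = np.random.uniform(-1, 1, size=(n_samples, 2))

# Distance of each point from the origin
radius = np.sqrt(X[:, 0]**2 + X[:, 1]**2)

# Class 1 lies outside a circle of radius 0.5, class 0 inside it
y = (radius > 0.5).astype(int)

plt.figure(figsize=(6, 6))

plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.7)

plt.title("Non-Linear Classification Problem")
plt.xlabel("x1")
plt.ylabel("x2")
plt.grid(True)

plt.show()

Source: Image by the author.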

A linear model would struggle with this dataset because the decision boundary is circular.

A straight line cannot separate the classes properly.

This is exactly where activation functions become essential.
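One way to see the limitation without training anything is a crude random search over straight lines. This is an illustrative sketch, not a fitted linear model: it recreates the dataset above, tries 5,000 random lines (the count and the sampling ranges are arbitrary assumptions), and compares the best one against a simple non-linear rule on the radius.

```python
import numpy as np

# Recreate the circular dataset from above
np.random.seed(42)
X = np.random.uniform(-1, 1, size=(500, 2))
y = (np.sqrt(X[:, 0]**2 + X[:, 1]**2) > 0.5).astype(int)

# Try many random straight lines (w . x + c > 0) and keep the best accuracy
rng = np.random.default_rng(0)
best_linear_accuracy = 0.0
for _ in range(5000):
    w = rng.normal(size=2)
    c = rng.uniform(-1.5, 1.5)
    predictions = (X @ w + c > 0).astype(int)
    best_linear_accuracy = max(best_linear_accuracy, np.mean(predictions == y))

# A non-linear rule on the radius reproduces the labels
radius_predictions = (np.sqrt((X**2).sum(axis=1)) > 0.5).astype(int)
radius_accuracy = np.mean(radius_predictions == y)

print(f"Best straight line: {best_linear_accuracy:.3f}")
print(f"Radius rule:        {radius_accuracy:.3f}")
```

The best line does little better than always predicting the majority class (the points outside the circle), while the non-linear rule matches the labels. A network with activation functions has to learn something like that radius rule on its own.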

A Small Neural Network Example

Now let’s train a small neural network using PyTorch.

import torch
import torch.nn as nn
import torch.optim as optim

# X and y are the NumPy arrays from the circular dataset above
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

class SmallNeuralNetwork(nn.Module):

    def __init__(self):
        super().__init__()

        self.network = nn.Sequential(
            nn.Linear(2, 16),
            nn.ReLU(),

            nn.Linear(16, 16),
            nn.ReLU(),

            nn.Linear(16, 2)  # two output logits, one per class
        )

    def forward(self, x):
        return self.network(x)

model = SmallNeuralNetwork()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(500):

    # Forward pass on the full dataset
    predictions = model(X_tensor)

    loss = criterion(predictions, y_tensor)

    # Backward pass and parameter update
    optimizer.zero_grad()

    loss.backward()

    optimizer.step()

    if epoch % 100 == 0:

        predicted_classes = torch.argmax(predictions, dim=1)

        accuracy = (
            predicted_classes == y_tensor
        ).float().mean()

        print(
            f"Epoch {epoch}, "
            f"Loss: {loss.item():.4f}, "
            f"Accuracy: {accuracy.item():.4f}"
        )

The critical component is not only the linear layers:

nn.Linear(...)

It is the activation functions inserted between them:

nn.ReLU()

Without activation functions, stacking multiple linear layers would still produce a linear model.

With activation functions, the network can progressively reshape the feature space into highly complex decision boundaries.

Source: Image by the author.

Softmax — More Than Just an Activation Function

Many explanations treat Softmax as “just another activation function.”

But Softmax plays a very different role.

Its formula is:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

Softmax converts the raw outputs z (called logits) into a probability distribution.

The outputs:

  • become positive
  • sum to 1
  • can be interpreted as probabilities

This makes Softmax especially useful for multi-class classification.
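A minimal NumPy sketch of Softmax is shown below. The logits are made-up values for a three-class problem, and subtracting the maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged. It is an addition for the example, not something the article prescribes.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max does not change the result but avoids overflow in exp
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])   # illustrative values
probs = softmax(logits)

print(probs)        # all positive
print(probs.sum())  # sums to 1 (up to floating-point rounding)
```

The largest logit receives the largest probability, but the smaller ones are not zeroed out, so the output is a full distribution over classes rather than a hard decision.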

Why Softmax Matters in Transformers

Softmax became even more important with the rise of Transformers.

In attention mechanisms, Softmax is used to transform similarity scores into attention weights.

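The attention equation is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where Q, K, and V are the query, key, and value matrices and d_k is the dimension of the keys.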

Conceptually:

  • tokens compute similarity scores with one another
  • Softmax normalizes these scores
  • the model decides how much attention each token should receive

This mechanism is fundamental to modern LLMs. Softmax is therefore not merely an activation function; it is a core probabilistic mechanism behind modern attention systems.
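The three conceptual steps can be sketched in NumPy as follows. This assumes the standard scaled dot-product form, softmax(QKᵀ / √d_k)·V, with a single head and no masking. The toy sizes and random Q, K, V matrices are placeholders, since real models derive them from learned projections of the token embeddings.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; subtracting the row max is for numerical stability
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of every token pair
    weights = softmax(scores)         # each row becomes a distribution
    return weights @ V, weights       # weighted mix of value vectors

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8                  # toy sizes, chosen arbitrarily
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

output, weights = attention(Q, K, V)
print(weights.sum(axis=1))  # each row of attention weights sums to 1
print(output.shape)
```

Each row of `weights` describes how one token distributes its attention over all tokens, which is what Softmax contributes to the mechanism.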

Practical Engineering Takeaways

Different activation functions are useful for different reasons.

  • Binary classification output => Sigmoid
  • Deep CNN hidden layers => ReLU
  • Transformers and LLMs => GELU
  • Multi-class output layer => Softmax

Activation functions directly affect:

  • gradient propagation
  • convergence speed
  • training stability
  • computational efficiency
  • representation quality

Choosing an activation function is partly about mathematics — but also about optimization behavior and architecture design.

Final Thoughts

Activation functions are some of the simplest mathematical components inside neural networks.

But they fundamentally define what a model is capable of learning.

The evolution from Sigmoid to ReLU to GELU closely mirrors the evolution of deep learning itself.

Without activation functions, deep networks would collapse into linear systems, modern computer vision would struggle, and Transformers and LLMs would not work the way they do today.

In many ways, activation functions are one of the hidden engines behind modern AI.

If this article helped you better understand how neural networks actually learn complex behavior, feel free to connect or share your thoughts.

Published via Towards AI

