Going Beyond the 1000-Layer Convolution Network
Author(s): Bartosz Ludwiczuk
Originally published on Towards AI.
· Introduction
· Vanishing gradient issue
· Mitigation of the vanishing gradient issue
· Training a 1000-layer network
· Training component analysis
· Diving Deeper into Skip Connections
· >1000-layer network
Introduction
One of the largest Convolutional Networks, ConvNext-XXLarge[1] from OpenCLIP[2], boasts approximately 850 million parameters and 120 layers (counting all convolutional and linear layers). That is a dramatic increase over the 8 layers of AlexNet[3], yet still fewer than the 1,001 layers explored in the PreResNet[4] paper.
Interestingly, about a decade ago, training networks with more than 100 layers was considered nearly impossible due to the vanishing gradient problem. However, advancements such as improved activation functions, normalization layers, and skip connections have significantly mitigated this issue, or so it seems. But is the problem truly solved?
In this blog post, I will explore:
- What components enable training neural networks with more than 1,000 layers?
- Is it possible to train a 10,000-layer Convolutional Neural Network successfully?
Vanishing gradient issue
Before diving into experiments, let's briefly revisit the vanishing gradient problem, a challenge that many sources have already explored in detail.
The vanishing gradient problem occurs when the gradients in the early layers of a neural network become extremely small, effectively halting their ability to learn useful features. This issue arises due to the chain rule used during backpropagation, where the gradient is propagated backward from the final layer to the first. If the gradient in any layer is close to zero, the gradients for preceding layers shrink exponentially. A major cause of this behavior is the saturation of activation functions.
To illustrate this, I trained a simple 5-layer network using the sigmoid activation function, which is particularly prone to saturation. You can find the code for this experiment on GitHub; a simplified sketch also follows the observations below. The goal was to observe how the gradient norms of the network's weights evolve over time.
The plot above shows the gradient norms for each linear layer over several training iterations. FC5 represents the final layer, while FC1 represents the first.
Vanishing Gradient Problem:
- In the first training iteration, there's a huge difference in gradient norms between FC5 and FC4, with FC4 being approximately 10x smaller.
- By the time we reach FC1, the gradient is reduced by a factor of ~10,000 compared to FC5, leaving almost nothing of the original gradient to update the weights.
This is a textbook example of the vanishing gradient problem, primarily driven by activation function saturation.
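For reference, the kind of setup behind this experiment can be sketched as follows. This is a hypothetical reconstruction (layer sizes and the random data are made up), not the exact code from the GitHub repo:

```python
import torch
import torch.nn as nn

# A simple 5-layer fully connected network with sigmoid activations,
# mirroring the FC1..FC5 naming used in the plot.
net = nn.Sequential(
    nn.Linear(784, 256), nn.Sigmoid(),   # FC1
    nn.Linear(256, 256), nn.Sigmoid(),   # FC2
    nn.Linear(256, 256), nn.Sigmoid(),   # FC3
    nn.Linear(256, 256), nn.Sigmoid(),   # FC4
    nn.Linear(256, 10),                  # FC5 (logits)
)

x = torch.randn(64, 784)                 # random batch standing in for real data
y = torch.randint(0, 10, (64,))
nn.CrossEntropyLoss()(net(x), y).backward()

# Gradient norm of each linear layer's weights: FC1 comes out orders of
# magnitude smaller than FC5.
linear_layers = [m for m in net if isinstance(m, nn.Linear)]
for i, layer in enumerate(linear_layers, start=1):
    print(f"FC{i}: grad norm = {layer.weight.grad.norm():.2e}")
```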
Let's delve deeper into the root cause: the sigmoid activation function. To understand its impact, I analyzed the first layer's pre-activation values (the inputs to the sigmoid). The findings:
- Most pre-activation values lie in the flat regions of the sigmoid curve, resulting in activations close to 0 or 1.
- In these regions, the sigmoid gradient is nearly zero, as shown in the plot above.
This means that any gradient passed backward through these layers is severely diminished, effectively disappearing by the time it reaches the first layers.
The maximum gradient of the sigmoid function is 0.25, achieved at the midpoint of the curve. Even under ideal conditions, with 5 layers the maximum gradient diminishes to 0.25^5 ≈ 1e-3. This reduction becomes catastrophic for networks with 1,000 layers, rendering the first layers' gradients negligible.
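A quick back-of-the-envelope check of this compounding effect (purely illustrative):

```python
max_sigmoid_grad = 0.25          # sigmoid'(0), the best possible case

print(max_sigmoid_grad ** 5)     # ~9.8e-04: roughly a 1000x shrinkage over just 5 layers
print(max_sigmoid_grad ** 1000)  # 0.0: underflows, nothing reaches the first layer
```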
Mitigation of the vanishing gradient issue
Several advancements have been instrumental in addressing the vanishing gradient problem, making it possible to train very deep neural networks. The key components that contribute to this solution are:
1. Activation Functions (e.g., Tanh, ReLU, GeLU)
Modern activation functions have been designed to mitigate vanishing gradients by offering higher maximum gradient values and reducing the regions where the gradient is zero; a short comparison in code follows this list. For example:
- ReLU (Rectified Linear Unit) has a maximum gradient of 1.0 and eliminates the saturation problem for positive inputs. This ensures gradients remain significant during backpropagation.
- Other functions, such as GeLU[5] and Swish[6], smooth out the gradient landscape, further improving training stability.
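Here is that comparison: a small sketch that evaluates each activation's gradient at a few input values (-5, 0, and 5 are chosen arbitrarily for illustration):

```python
import torch

x = torch.tensor([-5.0, 0.0, 5.0], requires_grad=True)

activations = {
    "sigmoid": torch.sigmoid,
    "tanh": torch.tanh,
    "relu": torch.relu,
    "gelu": torch.nn.functional.gelu,
}

# Sigmoid and tanh gradients collapse towards zero at +/-5, while ReLU and GeLU
# keep a gradient close to 1 for positive inputs.
for name, fn in activations.items():
    x.grad = None                    # reset the gradient between activations
    fn(x).sum().backward()
    print(f"{name:8s} gradients at -5/0/+5: {x.grad.tolist()}")
```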
2. Normalization Techniques (e.g., BatchNorm[7], LayerNorm[8])
Normalization layers play a crucial role by adjusting pre-activation values to have a mean close to zero and a consistent variance. This helps in two significant ways:
- It reduces the likelihood of pre-activation values entering the saturation regions of activation functions, where gradients are nearly zero.
- Normalization ensures more stable training by keeping the activations well-distributed across layers.
For instance:
- BatchNorm[7] normalizes the input to each layer based on the batch statistics during training.
- LayerNorm[8] normalizes across features for each sample, making it more effective in some scenarios.
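In PyTorch terms, the two normalization layers look roughly like this (the tensor shapes are assumed for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(64, 32, 16, 16)      # (batch, channels, height, width)

# BatchNorm2d: statistics are computed per channel, across the batch and
# spatial dimensions.
bn_out = nn.BatchNorm2d(num_features=32)(x)

# LayerNorm: statistics are computed per sample, across the normalized shape.
ln_out = nn.LayerNorm(normalized_shape=[32, 16, 16])(x)

# Both outputs end up with roughly zero mean and unit variance.
print(bn_out.mean().item(), bn_out.std().item())
print(ln_out.mean().item(), ln_out.std().item())
```

Architectures differ in exactly which dimensions they normalize; ConvNext, for instance, applies LayerNorm over the channel dimension only.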
3. Skip Connections (Residual Connections)
Skip connections, introduced in architectures like ResNet[9], allow input signals to bypass one or more intermediate layers by directly adding the input to the layer's output. This mechanism addresses the vanishing gradient problem by:
- Providing a direct pathway for gradients to flow back to earlier layers without being multiplied by small derivatives or passed through saturating activation functions.
- Preserving gradients even in very deep networks, ensuring effective learning for earlier layers.
By avoiding multiplications or transformations in the skip path, gradients remain intact, making them a simple yet powerful tool for training ultra-deep networks.
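A minimal residual block, simplified relative to the real ResNet[9]/ConvNext[1] blocks, could be sketched like this:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = x + f(x): the identity path carries gradients back unchanged."""

    def __init__(self, channels: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Because d(x + f(x))/dx = I + df/dx, the backward signal always has an
        # identity component that never passes through a saturating activation.
        return x + self.f(x)
```

Stacking many such blocks keeps a clean gradient path from the loss all the way back to the first layer.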
Training a 1000-layer network
For this experiment, all training was conducted on the CIFAR-10[10] dataset. The baseline architecture was ConvNext[1], chosen for its scalability and effectiveness in modern vision tasks. To define successful convergence, I used a validation accuracy of >50% (compared to the 10% accuracy of random guessing). Source code on GitHub. All runs are available at Wandb.
The following parameters were used across all experiments:
- Batch size: 64
- Optimizer: AdamW[11]
- Learning rate scheduler: OneCycleLR
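In code, this shared setup looks roughly like the sketch below. The learning rate, weight decay, and epoch count are assumptions (they are not stated above), and `convnext_tiny` merely stands in for the custom deep ConvNext variants:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from torchvision.models import convnext_tiny

model = convnext_tiny(num_classes=10)      # placeholder for the deep ConvNext variants

batch_size = 64
epochs = 30                                # assumed value
steps_per_epoch = 50_000 // batch_size     # CIFAR-10 has 50k training images

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = OneCycleLR(optimizer, max_lr=1e-3,
                       epochs=epochs, steps_per_epoch=steps_per_epoch)

# In the training loop, optimizer.step() is followed by scheduler.step() every batch.
```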
My primary objective was to replicate the findings of the PreResNet paper and investigate how adding more layers impacts training. Starting with a 26-layer network as the baseline, I gradually increased the number of layers, ultimately reaching 1,004 layers.
Throughout the training process, I collected statistics on the mean absolute gradient of the first convolutional layer. This allowed me to evaluate how effectively gradients propagated back through the network as the depth increased.
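Collecting that statistic can be as simple as the helper below (a sketch: `first_conv_mean_abs_grad` is a hypothetical name, and it assumes a standard loop where it is called right after `loss.backward()`):

```python
import torch.nn as nn

def first_conv_mean_abs_grad(model: nn.Module) -> float:
    """Return the mean absolute weight gradient of the first Conv2d layer."""
    for module in model.modules():
        # model.modules() yields layers in registration order, so the first
        # Conv2d encountered is typically the network's stem convolution.
        if isinstance(module, nn.Conv2d):
            return module.weight.grad.abs().mean().item()
    raise ValueError("model contains no Conv2d layer")
```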
Key Observations
- Despite increasing the depth to 1,000 layers, the networks successfully converged, consistently achieving the validation accuracy threshold (>50%).
- The mean absolute gradient of the first layer remained sufficiently large across all tested depths, indicating effective gradient propagation even in the deepest networks.
- The final accuracy of ~94% is weak compared to the ~99% state of the art. I could not get better scores, which leaves room for further investigation.
Training component analysis
Before diving deeper into ultra-deep networks, it's crucial to identify which components most significantly impact the ability to train a 1000-layer network. The candidates are:
- Activation functions
- Normalization layers
- Skip connections
Skip Connections: The Clear Winner
Among all components, skip connections stand out as the most critical factor. Without skip connections, no other modification (advanced activation functions or normalization techniques) can sustain training for such deep networks. This confirms that skip connections are the cornerstone of vanishing gradient mitigation.
Activation Functions: Sigmoid and Tanh Still Competitive
Surprisingly, the Sigmoid and Tanh activation functions were competitive with modern alternatives like GeLU when paired with a normalization layer. Even without LayerNorm, Sigmoid achieved a score comparable to GeLU without LayerNorm. The mean gradient is also quite similar across all experiments; Tanh without LayerNorm has the highest mean value, yet the lowest accuracy.
Mean Gradient Values
The mean gradient values are relatively consistent across experiments, but the gradient trajectories differ. In experiments with LayerNorm, gradients initially rise to approximately 0.5 early in training before steadily declining. In contrast, experiments without LayerNorm exhibit a nearly constant gradient throughout the training process. Importantly, the gradient remains present in all cases, with no evidence of vanishing gradients in the network's first layer.
Diving Deeper into Skip Connections
Skip connections can be implemented in various ways, with the main difference being how the raw input and the transformed output are merged, often controlled by a learnable scaling factor γ. In ConvNext, for instance, the LayerScale[12] trick is employed, where the transformed output is scaled by a small learnable γ initialized to 1e-6 (sketched in code below).
This setup has a profound implication:
- During the initial training stages, most information flows through the skip connections, as the contribution from the transformation branch (via matrix multiplication and activation functions) is minimal.
- As a result, the vanishing gradient issue is effectively bypassed.
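Here is that sketch: a block with LayerScale-style scaling of the transformation branch, with the γ initialization exposed as a parameter (the branch itself is simplified relative to a real ConvNext block):

```python
import torch
import torch.nn as nn

class LayerScaleBlock(nn.Module):
    """Computes y = x + gamma * f(x) with a learnable per-channel scale gamma."""

    def __init__(self, channels: int, gamma_init: float = 1e-6):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # With gamma_init = 1e-6 the transformation branch contributes almost
        # nothing at the start, so every block initially acts like an identity.
        self.gamma = nn.Parameter(gamma_init * torch.ones(channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.gamma * self.f(x)
```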
Experiment: Varying LayerScale Initialization
To test whether the initialization of γ plays a critical role, I experimented with different starting values for LayerScale. Below is a diagram of a typical skip connection and a table summarizing the results:
The results show that even with γ initialized to 1 (effectively turning on all transformation branches from the start), training a 1000-layer network remained stable. This suggests that while different versions of skip connections may vary slightly in their implementation, all are equally effective at mitigating the vanishing gradient problem.
>1000-layer network
Since we've established that skip connections are the key to training very deep networks, let's push the limits further by experimenting with even deeper architectures. To do this, I will gradually increase the network depth, but deeper networks require significantly more computational resources. Therefore, I've decided to fit the largest possible network that can run on an RTX 4090 with 24 GB of memory.
The 1,607-layer ConvNext was the biggest model I could fit into GPU memory. There was still no issue with convergence, and the CIFAR-10 results matched those of the shallower networks.
Summary
To sum up the key findings:
- Skip connections are the main tool for mitigating the vanishing gradient problem.
- Tanh and Sigmoid are competitive with GeLU when used together with skip connections and LayerNorm. Despite their flat-gradient regions, Tanh and Sigmoid work well in that combination.
- With skip connections, you can train at any depth you want, no matter which activation function you choose; only your resources constrain you.
If anyone disagrees with this thesis during a recruitment process, send them the link to my blog post, as my experiments show clear evidence!
[1] A ConvNet for the 2020s, Zhuang Liu, CVPR 2022
[2] OpenCLIP, Ross Wightman, Romain Beaumont, Cade Gordon, Vaishaal Shankar, 2021
[3] ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, NIPS 2012
[4] Identity Mappings in Deep Residual Networks, Kaiming He, ECCV 2016
[5] Gaussian Error Linear Units, Dan Hendrycks, 2016
[6] Searching for Activation Functions, Prajit Ramachandran, ICLR 2018
[7] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe, ICML 2015
[8] Layer Normalization, Jimmy Lei Ba, 2016
[9] Deep Residual Learning for Image Recognition, Kaiming He, CVPR 2016
[10] Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009
[11] Decoupled Weight Decay Regularization, Ilya Loshchilov, ICLR 2019
[12] Going deeper with Image Transformers, Hugo Touvron, ICCV 2021
Published via Towards AI