Training the Same Neural Network with Different Optimizers

Last Updated on January 2, 2026 by Editorial Team

Author(s): Gradient Thoughts

Originally published on Towards AI.

Source: Image by Conny Schneider on Unsplash

Optimizers are often discussed through a simplistic, surface-level lens: adaptive methods like Adam are said to converge faster, while SGD is believed to generalize better but take longer. I realized that while I had absorbed these claims over time, my understanding of them was fragile and limited to received wisdom. Most explanations I encountered were either theoretical abstractions or anecdotal rules of thumb.

This led me to a subtle question: under identical conditions, i.e., with the same model, data, initialization, and training loop, how much does the choice of optimizer actually change training behavior? Not in terms of final accuracy alone, but in how quickly models learn, how stable the learning is, and how these differences evolve over time.

Hence, rather than rely on benchmark claims or isolated examples, I decided to run a controlled comparative study where the optimizer was the only variable. The goal of this experiment is not to declare the “best” optimizer, but to observe how different optimization strategies behave when everything else is held constant.

You can find the source code for the experiment here.

Experimental setup

To isolate the effect of the optimizer, all other components of the training pipeline were kept constant across all the training runs.

Dataset

  • MNIST handwritten digits dataset
  • Training set: 60,000 samples
  • Test set: 10,000 samples
  • Inputs normalized using dataset mean and standard deviation

Model Architecture

  • A simple convolutional neural network consisting of:
    – Two convolutional layers with ReLU activations and max pooling
    – A fully connected hidden layer
    – A final linear classification layer with 10 outputs
  • The same architecture and parameter initialization were used for all experiments
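The described architecture could look like the following PyTorch sketch; the channel counts (32/64) and hidden width (128) are assumptions, since the article does not specify them:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Two conv layers with ReLU and max pooling, one fully connected
    hidden layer, and a 10-way linear classifier, as described above.
    Channel and hidden sizes are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),  # fully connected hidden layer
            nn.ReLU(),
            nn.Linear(128, 10),          # 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```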

Training Constraints

  • Loss function: cross-entropy
  • Batch size: 64
  • Number of epochs: 10
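With these constraints fixed, one epoch of the shared training loop can be sketched as follows (function and variable names are my own, not taken from the linked source):

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one epoch with cross-entropy loss; returns the mean batch loss."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    total_loss, n_batches = 0.0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()   # backpropagate
        optimizer.step()  # the only line that differs in effect across runs
        total_loss += loss.item()
        n_batches += 1
    return total_loss / n_batches
```

The loop is identical for every optimizer; only the `optimizer` object passed in changes between runs.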

Optimizers

The following optimizers were evaluated:

  • SGD (learning rate = 0.01)
  • SGD with momentum (momentum = 0.9, learning rate = 0.01)
  • Adam (learning rate = 0.001)
  • RMSprop (learning rate = 0.001)
  • AdamW (learning rate = 0.001)

Learning rates were chosen based on commonly used defaults.
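A sketch of how these five configurations might be constructed in PyTorch; only the learning rate and momentum are set explicitly, with everything else left at library defaults, matching the description above (the helper name is illustrative):

```python
import torch

OPTIMIZER_NAMES = ["sgd", "sgd_momentum", "adam", "rmsprop", "adamw"]

def make_optimizer(name, model):
    """Build one of the five compared optimizers for a freshly
    initialized model (each run uses its own model instance)."""
    params = model.parameters()
    if name == "sgd":
        return torch.optim.SGD(params, lr=0.01)
    if name == "sgd_momentum":
        return torch.optim.SGD(params, lr=0.01, momentum=0.9)
    if name == "adam":
        return torch.optim.Adam(params, lr=0.001)
    if name == "rmsprop":
        return torch.optim.RMSprop(params, lr=0.001)
    if name == "adamw":
        return torch.optim.AdamW(params, lr=0.001)
    raise ValueError(f"unknown optimizer: {name}")
```

Building the optimizer from a fresh model per run (rather than reusing one parameter list) mirrors the article's protocol of training each optimizer from a freshly initialized model.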

A fixed random seed was set before each run to control for randomness in weight initialization and data shuffling. Each optimizer was trained from a freshly initialized model.
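Seeding might look like this minimal sketch (the helper name and default seed value are my assumptions):

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Fix the RNGs that affect weight initialization and data
    shuffling, so each optimizer run starts from identical state."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
```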

Results

Source: Image by the author

Training loss dynamics

Across all runs, training loss decreased monotonically, but the rate and smoothness of convergence differed noticeably.

  • Plain SGD showed the slowest initial convergence, with relatively high training loss in the first few epochs. However, its loss decreased steadily across training, without large oscillations.
  • SGD with momentum converged significantly faster than plain SGD, reaching low training loss within the first few epochs and maintaining a smooth downward trajectory thereafter.
  • Adaptive optimizers (Adam, RMSprop, and AdamW) exhibited rapid initial loss reduction, achieving low training loss within the first 2–3 epochs. After this point, loss improvements were incremental and relatively small.
  • Toward later epochs, adaptive optimizers continued to reduce training loss, but the rate of improvement diminished compared to early training.

Overall, momentum-based and adaptive methods reduced training loss substantially faster than vanilla SGD under identical conditions.

Source: Image by the author

Validation loss and accuracy

Validation accuracy increased rapidly for all optimizers during early epochs, followed by a slower phase of improvement.

  • Plain SGD demonstrated gradual improvement in validation accuracy, reaching its peak performance later in training compared to other optimizers.
  • SGD with momentum achieved high validation accuracy early and maintained consistently strong performance across epochs.
  • Adam, RMSprop, and AdamW reached high validation accuracy within the first few epochs, with relatively small fluctuations thereafter.
  • Minor oscillations in validation loss and accuracy were observed for adaptive optimizers in later epochs, despite continued decreases in training loss.

By the end of training, all optimizers achieved comparable validation accuracy, with differences largely confined to early convergence behavior and stability rather than final performance.

Experiment Constraints

While the results highlight differences in convergence behavior and stability across optimizers, several important limitations constrain the conclusions that can be drawn from this experiment.

  1. Each optimizer was evaluated using a single random seed and a single training run. Optimizer behavior is inherently stochastic, and results may vary across different initializations or data orderings. Multiple runs would be required to assess variance and statistical robustness.
  2. Learning rates were not individually tuned for each optimizer. Instead, commonly used default values were selected to reflect typical usage. As a result, the observed behavior reflects practical defaults rather than optimal performance for each method.
  3. The training duration was limited to 10 epochs. This favors optimizers that converge rapidly in early training and may underrepresent the long-term behavior of methods like plain SGD, which often continue improving with extended training.

Finally, this experiment isolates optimizer choice while holding architecture and training protocol constant. The results should therefore be interpreted as conditional observations, not universal claims about optimizer superiority.

Interpretation

Under the constraints of this experiment, the observed results suggest that optimizer choice influences how learning unfolds over time more than where training ultimately converges.

A consistent pattern across runs was the rapid early convergence of adaptive optimizers such as Adam, RMSprop, and AdamW. These methods reduced training loss and achieved high validation accuracy within the first few epochs.

In contrast, plain SGD exhibited slower initial progress but continued to improve steadily throughout training. This supports the common intuition that adaptive methods chiefly accelerate the early stages of optimization.

The trajectory of SGD with momentum suggests that momentum helps smooth gradient updates and traverse the loss landscape more efficiently, without introducing the additional adaptivity present in methods like Adam.
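To make this mechanical difference concrete (this toy example is mine, not part of the experiment), here are the three update rules applied to the 1-D quadratic loss f(w) = w², whose gradient is 2w:

```python
def sgd_step(w, grad, lr=0.1):
    # Plain SGD: step directly along the negative gradient.
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # Velocity accumulates a decaying sum of past gradients,
    # smoothing the update direction.
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, m, s, grad, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam additionally rescales each step by a running estimate of
    # the gradient's second moment (with bias correction).
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (s_hat ** 0.5 + eps), m, s

# Minimize f(w) = w^2 from w = 1.0 for 20 steps with each rule.
w_sgd = w_mom = w_adam = 1.0
v = m = s = 0.0
for t in range(1, 21):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd)
    w_mom, v = momentum_step(w_mom, v, 2 * w_mom)
    w_adam, m, s = adam_step(w_adam, m, s, 2 * w_adam, t)
```

All three iterates head toward the minimum at 0; the momentum iterate overshoots and oscillates on the way, while Adam's per-step size is normalized by the gradient magnitude.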

Taken together, these observations suggest that optimizer choice shapes the trajectory of learning more than the destination, at least under the controlled conditions examined here.

Author: Pranav Bhatlapenumarthi
Let’s connect on LinkedIn and GitHub


Published via Towards AI

