Training the Same Neural Network with Different Optimizers

Last Updated on January 2, 2026 by Editorial Team

Author(s): Gradient Thoughts

Originally published on Towards AI.

Source: Image by Conny Schneider on Unsplash

Optimizers are often discussed through a simplistic, surface-level lens: adaptive methods like Adam are said to converge faster, while SGD is believed to generalize better but take longer. I realized that while I had absorbed these claims over time, my understanding of them was fragile and limited to received wisdom. Most explanations I encountered were either theoretical abstractions or anecdotal rules of thumb.

This led me to a subtle question: under identical conditions, i.e., with the same model, data, initialization, and training loop, how much does the choice of optimizer actually change training behavior? Not in terms of final accuracy alone, but in how quickly models learn, how stable the learning is, and how these differences evolve over time.

Hence, rather than rely on benchmark claims or isolated examples, I decided to run a controlled comparative study where the optimizer was the only variable. The goal of this experiment is not to declare the “best” optimizer, but to observe how different optimization strategies behave when everything else is held constant.

You can find the source code for the experiment here.

Experimental setup

To isolate the effect of the optimizer, all other components of the training pipeline were kept constant across all the training runs.

Dataset

  • MNIST handwritten digits dataset
  • Training set: 60,000 samples
  • Test set: 10,000 samples
  • Inputs normalized using dataset mean and standard deviation

Model Architecture

  • A simple convolutional neural network consisting of:
    – Two convolutional layers with ReLU activations and max pooling
    – A fully connected hidden layer
    – A final linear classification layer with 10 outputs
  • The same architecture and parameter initialization were used for all experiments
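The described architecture could look like the following PyTorch sketch; the channel counts (32/64) and hidden width (128) are assumptions, since the article does not specify them:

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Two conv layers with ReLU and max pooling, one fully connected
    hidden layer, and a 10-way linear classifier, as described above.
    Channel and hidden sizes are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),  # fully connected hidden layer
            nn.ReLU(),
            nn.Linear(128, 10),          # 10 digit classes
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```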

Training Constraints

  • Loss function: cross-entropy
  • Batch size: 64
  • Number of epochs: 10
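With these constraints fixed, one epoch of the shared training loop can be sketched as follows (function and variable names are my own, not taken from the linked source):

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cpu"):
    """Run one epoch with cross-entropy loss; returns the mean batch loss."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    total_loss, n_batches = 0.0, 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()   # backpropagate
        optimizer.step()  # the only line that differs in effect across runs
        total_loss += loss.item()
        n_batches += 1
    return total_loss / n_batches
```

The loop is identical for every optimizer; only the `optimizer` object passed in changes between runs.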

Optimizers

The following optimizers were evaluated:

  • SGD (learning rate = 0.01)
  • SGD with momentum (momentum = 0.9, learning rate = 0.01)
  • Adam (learning rate = 0.001)
  • RMSprop (learning rate = 0.001)
  • AdamW (learning rate = 0.001)

Learning rates were chosen based on commonly used defaults.
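A sketch of how these five configurations might be constructed in PyTorch; only the learning rate and momentum are set explicitly, with everything else left at library defaults, matching the description above (the helper name is illustrative):

```python
import torch

OPTIMIZER_NAMES = ["sgd", "sgd_momentum", "adam", "rmsprop", "adamw"]

def make_optimizer(name, model):
    """Build one of the five compared optimizers for a freshly
    initialized model (each run uses its own model instance)."""
    params = model.parameters()
    if name == "sgd":
        return torch.optim.SGD(params, lr=0.01)
    if name == "sgd_momentum":
        return torch.optim.SGD(params, lr=0.01, momentum=0.9)
    if name == "adam":
        return torch.optim.Adam(params, lr=0.001)
    if name == "rmsprop":
        return torch.optim.RMSprop(params, lr=0.001)
    if name == "adamw":
        return torch.optim.AdamW(params, lr=0.001)
    raise ValueError(f"unknown optimizer: {name}")
```

Building the optimizer from a fresh model per run (rather than reusing one parameter list) mirrors the article's protocol of training each optimizer from a freshly initialized model.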

A fixed random seed was set before each run to control for randomness in weight initialization and data shuffling. Each optimizer was trained from a freshly initialized model.
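Seeding might look like this minimal sketch (the helper name and default seed value are my assumptions):

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Fix the RNGs that affect weight initialization and data
    shuffling, so each optimizer run starts from identical state."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
```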

Results

Source: Image by the author

Training loss dynamics

Across all runs, training loss decreased monotonically, but the rate and smoothness of convergence differed noticeably.

  • Plain SGD showed the slowest initial convergence, with relatively high training loss in the first few epochs. However, its loss decreased steadily across training, without large oscillations.
  • SGD with momentum converged significantly faster than plain SGD, reaching low training loss within the first few epochs and maintaining a smooth downward trajectory thereafter.
  • Adaptive optimizers (Adam, RMSprop, and AdamW) exhibited rapid initial loss reduction, achieving low training loss within the first 2–3 epochs. After this point, loss improvements were incremental and relatively small.
  • Toward later epochs, adaptive optimizers continued to reduce training loss, but the rate of improvement diminished compared to early training.

Overall, momentum-based and adaptive methods reduced training loss substantially faster than vanilla SGD under identical conditions.

Source: Image by the author

Validation loss and accuracy

Validation accuracy increased rapidly for all optimizers during early epochs, followed by a slower phase of improvement.

  • Plain SGD demonstrated gradual improvement in validation accuracy, reaching its peak performance later in training compared to other optimizers.
  • SGD with momentum achieved high validation accuracy early and maintained consistently strong performance across epochs.
  • Adam, RMSprop, and AdamW reached high validation accuracy within the first few epochs, with relatively small fluctuations thereafter.
  • Minor oscillations in validation loss and accuracy were observed for adaptive optimizers in later epochs, despite continued decreases in training loss.

By the end of training, all optimizers achieved comparable validation accuracy, with differences largely confined to early convergence behavior and stability rather than final performance.

Experiment Constraints

While the results highlight differences in convergence behavior and stability across optimizers, several important limitations constrain the conclusions that can be drawn from this experiment.

  1. Each optimizer was evaluated using a single random seed and a single training run. Optimizer behavior is inherently stochastic, and results may vary across different initializations or data orderings. Multiple runs would be required to assess variance and statistical robustness.
  2. Learning rates were not individually tuned for each optimizer. Instead, commonly used default values were selected to reflect typical usage. As a result, the observed behavior reflects practical defaults rather than optimal performance for each method.
  3. The training duration was limited to 10 epochs. This favors optimizers that converge rapidly in early training and may underrepresent the long-term behavior of methods like plain SGD, which often continue improving with extended training.

Finally, this experiment isolates optimizer choice while holding architecture and training protocol constant. The results should therefore be interpreted as conditional observations, not universal claims about optimizer superiority.

Interpretation

Under the constraints of this experiment, the observed results suggest that optimizer choice influences how learning unfolds over time more than where training ultimately converges.

A consistent pattern across runs was the rapid early convergence of adaptive optimizers such as Adam, RMSprop, and AdamW. These methods reduced training loss and achieved high validation accuracy within the first few epochs.

In contrast, plain SGD exhibited slower initial progress but continued to improve steadily throughout training. This supports the common intuition that adaptive methods chiefly accelerate the early stages of optimization.

The trajectory of SGD with momentum suggests that momentum helps smooth gradient updates and traverse the loss landscape more efficiently, without introducing the additional adaptivity present in methods like Adam.
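To make this mechanical difference concrete (this toy example is mine, not part of the experiment), here are the three update rules applied to the 1-D quadratic loss f(w) = w², whose gradient is 2w:

```python
def sgd_step(w, grad, lr=0.1):
    # Plain SGD: step directly along the negative gradient.
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # Velocity accumulates a decaying sum of past gradients,
    # smoothing the update direction.
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, m, s, grad, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam additionally rescales each step by a running estimate of
    # the gradient's second moment (with bias correction).
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (s_hat ** 0.5 + eps), m, s

# Minimize f(w) = w^2 from w = 1.0 for 20 steps with each rule.
w_sgd = w_mom = w_adam = 1.0
v = m = s = 0.0
for t in range(1, 21):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd)
    w_mom, v = momentum_step(w_mom, v, 2 * w_mom)
    w_adam, m, s = adam_step(w_adam, m, s, 2 * w_adam, t)
```

All three iterates head toward the minimum at 0; the momentum iterate overshoots and oscillates on the way, while Adam's per-step size is normalized by the gradient magnitude.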

Taken together, these observations suggest that optimizer choice shapes the trajectory of learning more than the destination, at least under the controlled conditions examined here.

Author: Pranav Bhatlapenumarthi
Let’s connect on LinkedIn and GitHub


Published via Towards AI

