Impact of Optimizers in Image Classifiers
Author(s): Toluwani Aremu
INTRODUCTION
Have you ever wondered why a DNN falls short of its expected accuracy, especially when experts and enthusiasts report (officially or unofficially) top performance with the same network on the same dataset you are using? I remember struggling to wrap my head around why my models failed when they were expected to perform well. What causes this? In reality, there are many factors with varying potential to impact the performance of your architecture. However, I'll discuss just one in this article: the choice of optimization algorithm.
What is an optimizer? An optimizer is a function or algorithm that modifies a neural network's attributes (i.e., its weights and learning rates) in order to speed up convergence while minimizing loss and maximizing accuracy. DNNs use millions or even billions of parameters, and you need the right weights to ensure that your DNN learns well from the given data while generalizing and adapting well enough to perform on unseen, related data.
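To make this concrete, here is a minimal PyTorch sketch (my own illustration, not this article's code) of where the optimizer sits in a single training step, using a toy model and random data:

```python
import torch
import torch.nn as nn

# A minimal sketch (not this article's code) of where the optimizer sits in training.
model = nn.Linear(10, 2)                                  # a toy model standing in for a DNN
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # the optimizer holds the weights it will update
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))      # a fake batch of inputs and labels
optimizer.zero_grad()                                     # clear gradients from the previous step
loss = criterion(model(x), y)                             # forward pass and loss
loss.backward()                                           # backpropagation computes the gradients
optimizer.step()                                          # the optimizer adjusts the weights using those gradients
```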
Different optimization algorithms have been built over the years, and each has its own advantages and drawbacks. Therefore, it is important to know the basics of these algorithms, as well as to understand the problem being worked on, so that we can select the best optimizer to work with.
Furthermore, I have noticed that many researchers use SGD-M (Stochastic Gradient Descent with Momentum), while in industry, Adam tends to be favored. In this article, I will give brief, high-level descriptions of the most popular optimizers used in the AI world. I also ran a number of experiments to see the differences between these optimizers, answer some questions I had about their use, and offer clues, based on my observations, about which optimizer works best and when/how to use each.
BASIC DESCRIPTION OF DIFFERENT OPTIMIZERS
In this section, I will briefly discuss Stochastic Gradient Descent with Momentum (SGDM), the Adaptive Gradient Algorithm (Adagrad), Root Mean Squared Propagation (RMSProp), and the Adam optimizer.
SGDM: Since the Gradient Descent (GD) optimizer uses the whole training set for every weight update, it becomes computationally expensive when we have millions of data points. Stochastic Gradient Descent (SGD) was created to solve this problem by updating the weights one data point at a time. Still, this was computationally expensive for neural networks (NNs), since each data point required both a forward and a backward pass. Also, with SGD we can't safely increase the learning rate as it tries to reach the global minimum, which makes convergence very slow. SGDM was the solution: it adds a momentum term to plain SGD, which improves the speed of convergence. For deeper explanations, click here.
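As an illustration (my own paraphrase of the momentum update, not pseudocode from any paper), the idea is to keep a velocity buffer that smooths the raw gradients:

```python
import numpy as np

# Illustrative SGDM update: step along a decaying accumulation of past gradients.
def sgdm_step(w, grad, velocity, lr=1e-3, momentum=0.9):
    velocity = momentum * velocity + grad   # decaying accumulation of past gradients
    w = w - lr * velocity                   # step along the smoothed direction, not the raw gradient
    return w, velocity

# Example: one update on a three-parameter "model".
w, v = np.zeros(3), np.zeros(3)
w, v = sgdm_step(w, np.array([0.1, -0.2, 0.05]), v)
```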
ADAGRAD: The Adaptive Gradient Algorithm (Adagrad) is a gradient-based optimization algorithm that adapts the learning rate to the parameters. The learning rate is adjusted component by component, incorporating information from past gradients. It makes small updates to parameters associated with frequently occurring features and larger updates to those associated with infrequent features. Adagrad also eliminates the need to tune the learning rate manually, since it adjusts the learning rate per parameter automatically. However, the effective learning rate shrinks quickly, so training often stalls somewhat short of the expected performance. To learn more, click here.
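A rough, illustrative sketch of the Adagrad update: the per-parameter accumulator only ever grows, which is exactly why the effective step size keeps shrinking.

```python
import numpy as np

# Illustrative Adagrad update (not this article's code).
def adagrad_step(w, grad, accum, lr=1e-2, eps=1e-10):
    accum = accum + grad ** 2                     # per-parameter sum of squared gradients
    w = w - lr * grad / (np.sqrt(accum) + eps)    # frequently updated parameters get smaller steps
    return w, accum
```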
RMSProp: Proposed by Geoffrey Hinton (though it remains formally unpublished), RMSProp is an extension of GD and of Adagrad that uses a decaying average of squared gradients to adapt the step size for each parameter. Gradient magnitudes can differ across parameters and change during training, so Adagrad's automatic, ever-shrinking learning rate can end up far from optimal. Hinton addressed this by scaling the weight updates with a moving average of the squared gradients instead of their full running sum. To learn more, click here.
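An illustrative sketch of the RMSProp update: replacing Adagrad's running sum with a decaying average keeps the denominator from growing without bound.

```python
import numpy as np

# Illustrative RMSProp update (not this article's code).
def rmsprop_step(w, grad, sq_avg, lr=1e-3, alpha=0.99, eps=1e-8):
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2   # moving average of squared gradients
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)         # scale each step by the recent gradient magnitude
    return w, sq_avg
```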
Adam: This optimizer was proposed by Diederik Kingma and Jimmy Ba in 2015 and is arguably the most popular optimizer ever created. It combines the benefits of SGDM and RMSProp: it uses momentum from SGDM and scaling from RMSProp. It is computationally efficient, unlike GD and SGD, and requires little memory. It was designed for problems with very noisy or sparse gradients. To learn more, click here or here.
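An illustrative sketch of the Adam update: a momentum-style first moment plus an RMSProp-style second moment, both bias-corrected.

```python
import numpy as np

# Illustrative Adam update (not this article's code). t is the 1-based step count.
def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (scaling term)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```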
EXPERIMENTS
Due to the size of my computing resources, I decided to focus on LeNet and AlexNet on the CIFAR-10 dataset. CIFAR-10 consists of 50,000 training images and 10,000 test images. I trained these models for 50 epochs using the SGD, SGDM, Adagrad, RMSProp, and Adam optimizers. For SGDM, I used a momentum of 0.9. The global learning rate for my first set of experiments was 0.001 (1e-3).
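For reference, here is how these five optimizer configurations might be constructed in PyTorch. This is a sketch of my own; any hyperparameters beyond the stated lr=1e-3 and momentum=0.9 are the library defaults rather than values confirmed from the repository.

```python
import torch

# Build one of the five optimizer configurations used in these experiments (a sketch).
def build_optimizer(name, params, lr=1e-3):
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr)
    if name == "sgdm":
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    if name == "adagrad":
        return torch.optim.Adagrad(params, lr=lr)
    if name == "rmsprop":
        return torch.optim.RMSprop(params, lr=lr)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")
```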
Note: I am not seeking very good results. I am instead trying to see the impact of each optimizer on the model's performance.
I start by importing the necessary libraries.
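The exact import list is in the linked repository; a typical set for this kind of experiment would look like this (my reconstruction, so the original may differ slightly):

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
```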
Then, I load and transform the CIFAR-10 dataset.
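Something along these lines should work, building on the imports above; the batch size and normalization statistics here are my assumptions rather than values confirmed from the repository:

```python
# Basic transform: convert images to tensors and normalize each channel.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)
```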
Next, the LeNet and AlexNet models are defined.
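Below is a LeNet-5-style definition adapted to 32x32 RGB CIFAR-10 inputs, offered as a sketch rather than the repository's exact code; the layer sizes are my choice. AlexNet can be adapted analogously, for example from torchvision.models.alexnet with a 10-class classifier head.

```python
# A LeNet-5-style network for 32x32 RGB inputs (a sketch; the repository may differ).
class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```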
To get the full code, check out this repository (give it a star if you don't mind).
The results are as follows.
On the LeNet model, SGDM had the highest test accuracy at almost 70%, with a training loss of 0.635. Adam had the lowest training loss, but its test accuracy was just 67%. LeNet with Adagrad performed woefully, with a 48% test accuracy, well below plain SGD at 54.03%. RMSProp gave a test accuracy of 65% and a training loss of 0.630.
As for the AlexNet model, SGDM again had the best test accuracy at 83.75%, closely followed by Adagrad at 82.79%. However, the training loss of SGD was 0.016, while Adagrad's was 0.005, which is so small that it left the model little room for improvement. The Adam result was surprisingly low, given how highly it is rated in the AI sector. RMSProp did not seem to converge confidently, but its test accuracy was similar to Adam's.
From the LeNet results, one could easily conclude that Adagrad is a bad optimizer, and from the AlexNet results, RMSProp looked like an optimizer prone to letting the model overfit the training data. But there is more to this than such early conclusions, so further experiments were needed to investigate the issue.
FURTHER EXPERIMENTS
Because of the RMSProp and Adam results on the AlexNet model, another experiment was carried out, this time using a learning rate of 1e-5.
Now, this is more like it. A lower learning rate stabilized RMSProp and improved Adam's performance. It would be easy to conclude that lower learning rates are simply better for optimizers that employ scaling, but to check that this isn't a general rule, I also tried a lower learning rate with SGDM, which gave very poor results. Hence, lower learning rates appear to be better suited to scaling optimizers specifically.
Still, we don't have enough experiments to make further observations, so in the next section, I will discuss what can be observed so far from these short experiments on each optimizer.
DISCUSSIONS AND CONCLUSION
SGD: Not recommended! While it is sure to converge, it normally takes a long time to learn. What SGDM or Adam can learn in 50 epochs, SGD will learn in about 500 epochs. However, there is a good chance of decent results if you start with a large learning rate (e.g., 1e-1). You can also use it if you have enough time to wait for convergence; otherwise, stay away.
SGDM: Recommended! This optimizer has given the best results in the experiments. However, it might not work well if the starting learning rate is low. Otherwise, it converges fast and also helps the model's generalizability. It is totally recommended!
Adagrad: Recommended, with caveats! From the experiments, it could be said that this optimizer is the worst to use with a small model like LeNet on a complex dataset. However, in deeper networks it can give good results, though optimal performance isn't guaranteed.
RMSProp: Recommended! This optimizer also performed very well, and with a lower learning rate it can do even better. Aside from raw performance, it converges quickly, which helps explain why it is sometimes used in production (industry).
Adam: Recommended! According to some experts, Adam learns all patterns, including the noise in the training set, which is why it converges quickly. In the experiments above, it did not converge to as good a solution as SGDM, but it converges and learns fast. I would also bet that on bigger datasets (which, of course, contain more noise), its performance would beat that of the other optimizers discussed above.
With this practical look at the popular optimizers in use today, I hope you have gained some insight and intuition into why optimizers are needed and how they affect model performance. If you have suggestions or feedback, please leave a comment or connect with me on LinkedIn. Thank you.
To learn about these optimizers, as well as other optimizers not touched on in this article, please use this link.
To access the code used here, check out the repository.