Improve Your ML Models' Training
Last Updated on January 6, 2023 by Editorial Team
Author(s): Fabiana Clemente
Deep Learning
Cyclical learning rates in TensorFlow 2.0
Deep learning has found its way into all kinds of research areas and has become an integral part of our lives. The words of Andrew Ng sum it up really well:
“Artificial Intelligence is the new electricity.”
However, every great technical breakthrough brings a large number of challenges with it. From Alexa to Google Photos to your Netflix recommendations, everything at its core is deep learning, but it comes with a few hurdles of its own:
- Availability of huge amounts of data
- Availability of suitable hardware for high performance
- Overfitting on available data
- Lack of transparency
- Optimization of hyperparameters
This article will help you tackle one of these hurdles: the optimization of hyperparameters, in particular the learning rate.
The problem with the typical approach:
A deep neural network usually learns by stochastic gradient descent, where the parameters θ (or weights ω) are updated as follows:

θ ← θ − α ∇θ L(θ)

where L is a loss function and α is the learning rate.
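As a concrete illustration of this update rule, here is a minimal single gradient-descent step in TensorFlow; the toy quadratic loss and single parameter below are purely illustrative and are not part of the model built later in this article:

```python
import tensorflow as tf

theta = tf.Variable(5.0)   # a single parameter θ
alpha = 0.1                # learning rate α

with tf.GradientTape() as tape:
    loss = (theta - 2.0) ** 2      # a toy loss L(θ)

grad = tape.gradient(loss, theta)  # ∇θ L(θ)
theta.assign_sub(alpha * grad)     # θ ← θ − α ∇θ L(θ)
```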
We know that if we set the learning rate too small, the algorithm will take too long to converge, and if it's too large, the algorithm will diverge instead of converging. Hence, it is important to experiment with a variety of learning rates and schedules to see what works best for our model.
In practice, a few more problems arise with this method:
- The deep learning model and optimizer are sensitive to the initial learning rate. A bad choice of starting learning rate can hamper the performance of our model right from the beginning.
- It could leave the model stuck at a local minimum or a saddle point. When that happens, we may not be able to descend to a region of lower loss even if we keep lowering the learning rate further.
Cyclical Learning Rates help us overcome these problems.
Using Cyclical Learning Rates, you can dramatically reduce the number of experiments required to tune and find an optimal learning rate for your model.
Now, instead of monotonically decreasing the learning rate, we:
- Define a lower bound on our learning rate (base_lr).
- Define an upper bound on the learning rate (max_lr).
So the learning rate oscillates between these two bounds while training, slowly increasing and then decreasing after every batch update.
With this CLR method, we no longer have to manually tune the learning rate, and we can still achieve near-optimal classification accuracy. Furthermore, unlike adaptive learning rates, the CLR method requires essentially no extra computation.
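To make the policy concrete, here is a minimal sketch of the triangular schedule from Smith's paper [1], written as a plain Python function; base_lr, max_lr, and step_size correspond to the bounds described above, and the default values are illustrative assumptions:

```python
def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=200):
    """Triangular cyclical learning rate (Smith, 2017).

    The rate climbs linearly from base_lr to max_lr over step_size
    iterations, then falls back to base_lr over the next step_size.
    """
    cycle = iteration // (2 * step_size) + 1         # index of the current cycle
    x = abs(iteration / step_size - 2 * cycle + 1)   # position within the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

For example, triangular_clr(0) returns base_lr and triangular_clr(step_size) returns max_lr, after which the rate descends back toward base_lr.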
The improvement will become clear from an example.
Implementing CLR on a dataset
Now we will train a simple neural network model and compare the different optimization techniques. I have used a cardiovascular disease dataset here.
These are all the imports you'll need over the course of the implementation:
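A minimal set of imports covering the steps below might look like this; pandas and scikit-learn for the data handling are assumptions, and tensorflow_addons is needed for the CLR schedule used later:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
import tensorflow_addons as tfa
```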
And this is what the data looks like:
The column cardio is the target variable, and we perform some simple scaling of the data and split it into features (X_data) and targets (y_data).
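A sketch of that preprocessing, assuming the data has already been loaded into a pandas DataFrame named df (the file name, separator, and exact column handling are assumptions based on the public version of this dataset):

```python
# Load the cardiovascular disease data (file name and separator are assumptions).
df = pd.read_csv("cardio_train.csv", sep=";")

# 'cardio' is the target column; everything else is used as a feature.
X_data = df.drop(columns=["cardio"])
y_data = df["cardio"]

# Simple scaling of the features.
X_data = StandardScaler().fit_transform(X_data)
```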
Now we use train_test_split to get a standard train-to-test split of 80–20. Then we define a very basic neural network using a Sequential model from Keras. I have used 3 dense layers in my model, but you can experiment with any number of layers or activation functions of your choice.
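A minimal sketch of that split and model definition; the layer widths and activations are illustrative choices rather than the exact ones from the original run:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.2, random_state=42
)

# A small fully connected network with 3 dense layers.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # binary target: cardio
])
```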
Training without CLR:
Here I have compiled the model using the basic “SGD” optimizer, which has a default learning rate of 0.01. The model is then trained over 50 epochs.
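A sketch of that baseline setup; the batch size of 350 is an assumption chosen to match the iteration count computed later in the article:

```python
# Plain SGD with its default learning rate of 0.01.
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    batch_size=350,
    epochs=50,
)
```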
Showing just the last few epochs: the model takes about 3 seconds per epoch and ends with 64.1% training accuracy and 64.7% validation accuracy. In short, this is the result our model gives us after roughly 150 seconds of training:
Training using CLR:
Now we use Cyclical Learning Rates and see how our model performs. TensorFlow Addons provides this schedule ready to use; we import it from there and define it as follows:
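A sketch of that definition with TensorFlow Addons; the base and maximum learning rates here are illustrative assumptions, and scale_fn=lambda x: 1.0 gives the plain triangular policy:

```python
steps_per_epoch = 200   # iterations per epoch = 70000 / 350, as computed below

clr = tfa.optimizers.CyclicalLearningRate(
    initial_learning_rate=1e-4,      # base_lr
    maximal_learning_rate=1e-2,      # max_lr
    step_size=2 * steps_per_epoch,   # 2x the iterations per epoch
    scale_fn=lambda x: 1.0,          # plain triangular policy
    scale_mode="cycle",
)

# The schedule is passed to a regular optimizer in place of a fixed learning rate.
optimizer = tf.keras.optimizers.SGD(clr)
```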
The value of step_size can easily be computed from the number of iterations in one epoch. Here, iterations per epoch = (number of training examples) / (batch size) = 70000 / 350 = 200.
“…experiments show that it often is good to set stepsize equal to 2–10 times the number of iterations in an epoch.” [1]
Now, compiling our model using this newly defined optimizer, we see that it trains much faster, taking less than 50 seconds in total.
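For completeness, that compile-and-fit step could look like this (a sketch continuing the assumptions above; in practice you would re-initialize the model first so the comparison with plain SGD is fair):

```python
model.compile(
    optimizer=optimizer,   # SGD driven by the cyclical learning rate schedule
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

history_clr = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    batch_size=350,
    epochs=50,
)
```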
Loss value converges faster and oscillates slightly in the CLR model, as we would expect.
Training accuracy has increased from 64.1% to 64.3%.
Testing accuracy also improves, from 64.7% to 65%.
Conclusion
When you start working with a new dataset, the learning rate values you used on previous datasets will not necessarily work on the new data. So you first perform an LR Range Test, which gives you a good range of learning rates suited to your data: run the model for a small number of epochs while increasing the learning rate linearly, and pick the bounds from where the loss starts to fall and where it begins to rise again. Then you can compare CLR with a fixed-learning-rate optimizer, as we did above, to see which best suits your performance goal. Oscillating the learning rate between these bounds is then enough to give you a close-to-optimal result within a few iterations.
This optimization technique is clearly a boon as we no longer have to tune the learning rate ourselves. We achieve better accuracy in fewer iterations.
References:
[1] Smith, Leslie N. “Cyclical Learning Rates for Training Neural Networks.” 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.