Ten Patterns and Antipatterns of Deep Learning Experimentation

Last Updated on November 15, 2023 by Editorial Team

Author(s): Dmitrii Khizbullin

Originally published on Towards AI.

The image is generated by DALL-E 3 from the author’s prompt.

Introduction

In this article, I present a list of patterns and antipatterns I have collected over 10 years of experience as a deep learning engineer. Deep learning engineering is all about experimentation. Coming up with the initial minimum viable product is frequently very fast. In contrast, most of a deep learning project’s life cycle is taken up by iterative improvement of the code and the metrics. More often than not, the road to a high score is built from a multitude of marginal improvements discovered through extensive experimentation. In my practice, only one in five or even one in ten experiments improves the score. It is therefore essential to streamline the process of experimentation, especially when working in a team.

First of all, let’s clarify the terms used. In the following material, I consider maximizing a single metric, further called “the metric”. By the baseline, I mean the state of the code, together with the corresponding value of the metric, that does not contain the change under consideration. By regression, I mean a decrease in the metric, i.e., an undesired outcome. Conversely, by progression, I mean an increase in the metric, i.e., a desired outcome.

Patterns and Antipatterns

#1 Reliability of results

Antipattern:

Run the baseline and an experiment once each and draw conclusions from the comparison. If the outcomes have significant variance, for instance, due to poor training convergence caused by a high learning rate or by overfitting, you may draw the wrong conclusion about whether you observe progression or regression.

Pattern:

Run multiple training cycles to estimate the variance of outcomes. Running multiple deep learning training cycles may be resource- and time-consuming, but it is crucial for understanding the variance of the outcomes and for reliably accumulating improvements. If needed, analyze the measurements with a two-sided or one-sided Student’s t-test.
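
As a minimal sketch of such an analysis, the snippet below computes the classic pooled-variance t-statistic for two groups of repeated runs using only the Python standard library. The function name `student_t` and all metric values are hypothetical and serve only as an illustration; turning the statistic into a p-value additionally requires the t-distribution, which is not included here.

```python
import math
import statistics


def student_t(baseline, experiment):
    """Return (t, degrees_of_freedom) for a two-sample Student's t-test.

    Assumes both groups share the same variance (the classic pooled form).
    A positive t means the experiment mean is higher than the baseline mean.
    """
    n_a, n_b = len(baseline), len(experiment)
    if n_a < 2 or n_b < 2:
        raise ValueError("need at least two runs per group to estimate variance")
    var_a = statistics.variance(baseline)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(experiment)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(experiment) - statistics.mean(baseline)) / std_err
    return t, dof


# Hypothetical validation scores from five training runs each.
baseline_runs = [0.712, 0.705, 0.718, 0.709, 0.715]
experiment_runs = [0.721, 0.716, 0.727, 0.719, 0.724]
t_value, dof = student_t(baseline_runs, experiment_runs)
print(f"t = {t_value:.2f} with {dof} degrees of freedom")
```

The resulting t is then compared against the critical value for the chosen significance level and degrees of freedom, one-sided if only an improvement matters.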

#2 Wishful thinking

Antipattern:

Assuming that a change is guaranteed to improve the metric and merging it without measuring. Both software and algorithmic changes can introduce a regression, apparent or subtle.

Pattern:

After any change, re-run the training to ensure there is no regression. An apparent regression, such as a training crash, can be easy to spot even in the early iterations of training. A subtle regression may only become visible by the end of the training cycle.

#3 Regression shadowing

Antipattern:

Making several changes in a single bundle that, in the researcher’s opinion, all improve the metric. Regression shadowing can kick in: change A gives a slight regression, whereas change B provides a significant progression. Combined, the two still show an improvement in the metric, so the slight regression is overshadowed by the substantial improvement. However, you want to keep only the change that brings an improvement and drop the change that decreases the metric.

Pattern:

Measure changes separately. If you have two potential improvements in mind, run each of them on its own and then both together:

  1. Change A
  2. Change B
  3. Changes A and B together

This will avoid regression shadowing.
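
The list of runs above can be generated mechanically. The following is a small sketch, assuming each change can be switched on independently; the helper name `ablation_plan` is hypothetical. It includes the baseline as an explicit run so that every change is compared against the same reference.

```python
from itertools import combinations


def ablation_plan(changes):
    """List every run needed to measure the given changes without shadowing.

    Returns the baseline (no change enabled) followed by every non-empty
    combination of changes, smallest combinations first.
    """
    plan = [()]  # the empty tuple is the baseline run
    for size in range(1, len(changes) + 1):
        plan.extend(combinations(changes, size))
    return plan


for enabled in ablation_plan(["A", "B"]):
    print(" + ".join(enabled) or "baseline")
```

The number of runs doubles with every additional change, so for more than a few changes it may be more practical to measure each change alone plus the full combination.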

#4 Strings attached

Antipattern:

Launching training from a Git working copy with uncommitted changes. In this case, there will be no way to repeat the experiment in the future or to analyze the precise difference between experiments.

Pattern:

All changes, including temporary ones, must be committed so that the Git working copy is clean. The commit must be tagged. As an alternative to tagging, the experiment artifacts must store the commit hash of the code run.
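
One way to enforce this is to have the training script check the working copy at startup. The sketch below is an assumption about how that could look, not a prescribed implementation: the function names are hypothetical, and it relies on the standard `git status --porcelain` and `git rev-parse HEAD` commands being available in the directory where training is launched.

```python
import subprocess


def is_clean(porcelain_output):
    """True if `git status --porcelain` printed nothing, i.e. no pending changes."""
    return porcelain_output.strip() == ""


def _git(*args):
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def commit_hash_for_run():
    """Return the HEAD commit hash, refusing to proceed on a dirty working copy."""
    if not is_clean(_git("status", "--porcelain")):
        raise RuntimeError("uncommitted changes: commit them before training")
    return _git("rev-parse", "HEAD").strip()
```

A training script could call `commit_hash_for_run()` before anything else and write the returned hash into the experiment’s output directory next to the metrics. Note that the porcelain output also lists untracked files, so this check treats them as uncommitted changes too.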

#5 Fast cycle

Antipattern:

Always launching the heavy (final) version of the training. Since the heavy version is slow, it takes a long time to get experiment results, which may discourage the study of fine-grained changes.

Pattern:

Maintaining a fast version of the training cycle. The fast version must be 2 to 10 times faster and be able to reflect the effect of changes under assessment. Often, the fast version has fewer iterations to train and a faster, reduced-in-size neural network. The fast version can act as both an integration test and a quick way to assess how promising the change is.
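
A fast version is easiest to maintain when it is derived from the full configuration rather than kept as a separate copy. The sketch below illustrates the idea; the class `TrainConfig`, its fields, and the reduction factors are all hypothetical, and the actual speedup has to be measured for the project at hand.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    # All fields and default values are hypothetical placeholders.
    epochs: int = 100
    base_channels: int = 64  # width of the network's first layer

    def fast(self):
        """Return a reduced copy of this config for quick experiments."""
        return replace(
            self,
            epochs=max(1, self.epochs // 5),
            base_channels=max(8, self.base_channels // 2),
        )


full = TrainConfig()
quick = full.fast()
print(full)
print(quick)
```

Because `fast()` starts from the full configuration, any new option added to `TrainConfig` is automatically shared by both versions.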

#6 Branchless

Antipattern:

Not using branches for separate features and keeping all modifications in the main branch.

Pattern:

Having a separate branch for each change. Keeping each change in its own branch lets us avoid merging code that shows regression, so the main branch does not get bloated with options, modes, and if-else blocks. Sometimes, it takes mental effort to scrap a lot of code that supports an experiment that, unfortunately, shows no progression.

#7 Coding habits

Antipattern:

Not using software engineering practices to develop the code. Too often, I hear arguments in favor of hacky, ad-hoc code along the lines of “we are not a software engineering team/project” and “in research, things are done differently”. However, when the struggle is for the top score, losing even a fraction of a percent due to ordinary software bugs can be fatal.

Pattern:

Write code that is architecturally well-structured, contains no code duplication, has all “magic numbers” named, and has unit and integration tests. Extensive tests and high code coverage may be unaffordable for a fast-moving applied research project. However, unit tests for compact and algorithmically dense functions, such as a feature encoder-decoder, provide significant benefits by guarding against flawed experiments, wrong conclusions, and even a researcher’s misguided intuition.
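
To illustrate the kind of test meant here, the sketch below uses a toy bounding-box encoder-decoder as a stand-in for a real feature codec. The functions `encode_box` and `decode_box` are hypothetical examples; the point is the round-trip test, which checks that decoding an encoded value returns the original.

```python
import unittest


def encode_box(x1, y1, x2, y2):
    """Encode corner coordinates as (center_x, center_y, width, height)."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def decode_box(cx, cy, w, h):
    """Invert encode_box: recover (x1, y1, x2, y2) from center and size."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


class BoxCodecTest(unittest.TestCase):
    def test_round_trip(self):
        box = (10.0, 20.0, 50.0, 80.0)
        decoded = decode_box(*encode_box(*box))
        for expected, actual in zip(box, decoded):
            self.assertAlmostEqual(expected, actual)

    def test_known_encoding(self):
        self.assertEqual(encode_box(0.0, 0.0, 4.0, 2.0), (2.0, 1.0, 4.0, 2.0))
```

Saved in a test module, these cases can be run with `python -m unittest`. A round-trip test like this is cheap to write and catches sign and off-by-half errors that would otherwise only show up as a slightly lower metric.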

#8 Non-invasive

Antipattern:

Making “invasive” changes and merging them. By an invasive change, I mean a change that alters the behavior of other modes even though it improves the targeted mode. Consider the following example. The code supports training two neural network backbones: backbone X and backbone Y. Training on each backbone is a separate mode. You want to change the learning rate for backbone X to improve its metric. You alter the learning rate for both modes, run the experiment for backbone X, and observe an improvement in the metric. Merging the code “as is” is an antipattern since the learning rate also gets altered for backbone Y, for which no experiment has been run and which may turn out to regress once one is run.

Pattern:

Make non-invasive changes that do not alter the behavior of any other mode of operation. Coding up a non-invasive change may be tedious and time-consuming. For faster experimentation, one can consider making an invasive change first, running the experiment, and, only if it shows promising results, re-writing the code to be non-invasive. Notably, the experiment must be re-run after the non-invasive change is ready. This step helps avoid a potential regression caused by bugs introduced while preparing the code for merging into the main branch.
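
For the learning rate example, a non-invasive version could look like the sketch below. The names and numeric values are hypothetical; what matters is that the new value applies only to the mode that was actually measured, while every other mode keeps the baseline default.

```python
DEFAULT_LEARNING_RATE = 1e-3  # hypothetical baseline value shared by all modes

# Per-mode overrides: only modes with a validated experiment appear here.
LEARNING_RATE_OVERRIDES = {
    "backbone_x": 3e-4,  # hypothetical value confirmed by the backbone X experiment
}


def learning_rate(backbone):
    """Return the learning rate for a backbone, falling back to the default."""
    return LEARNING_RATE_OVERRIDES.get(backbone, DEFAULT_LEARNING_RATE)


print(learning_rate("backbone_x"))
print(learning_rate("backbone_y"))
```

Backbone Y can later receive its own override, but only after an experiment for that mode has been run.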

#9 Re-run baseline

Antipattern:

Using an old, historical value of the baseline metric and comparing new experiment results with it.

Pattern:

Re-running the baseline (from the main branch) regularly, be it once a week over the weekend or even along with each experiment. Even though the baseline code may not change, an external factor can throw off the baseline. Such an external factor can be:

  1. Experiments are run on different machines.
  2. Software gets updated: CUDA toolkit, conda packages.
  3. The baseline and the experiment are run by different team members, who may have different undocumented ways of launching experiments.
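
Recording these factors with every run makes it easier to tell afterwards why two baseline numbers differ. The sketch below is one possible approach using only the Python standard library; the function name and the default package names are illustrative assumptions, and it does not capture the CUDA toolkit version, which would need a project-specific query.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=("numpy", "torch")):
    """Collect facts about the machine and software that ran an experiment.

    The package names are illustrative; list whatever the project depends on.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "hostname": platform.node(),
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "command": list(sys.argv),  # how the run was launched
        "packages": versions,
    }


print(json.dumps(capture_environment(), indent=2))
```

Storing this dictionary next to the metric of each run, baseline included, documents the machine, the software versions, and the launch command without relying on anyone’s memory.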

#10 Metric versioning

Antipattern:

Altering the validation data, procedure, or metric and then comparing the new experiment metric with historical metric values. For example, if the validation data was cleaned along with the training data, one may assume that data cleaning is always an improvement. That may be the case in general; however, the metric has changed and is no longer comparable with the historical results.

Pattern:

After changing how the metric is computed, re-run the baseline to figure out the value of the new metric.
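
One way to make this rule hard to break is to store a version number with every metric value and refuse comparisons across versions. The sketch below assumes such a convention; `MetricRecord`, `metric_version`, and `improvement` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRecord:
    value: float
    metric_version: int  # bump whenever validation data, procedure, or metric changes


def improvement(experiment, baseline):
    """Return experiment minus baseline, refusing to compare across versions."""
    if experiment.metric_version != baseline.metric_version:
        raise ValueError(
            "metric versions differ: re-run the baseline under version "
            f"{experiment.metric_version} before comparing"
        )
    return experiment.value - baseline.value
```

With this in place, comparing a new result against a historical baseline measured under an older version raises an error instead of silently producing a misleading difference.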

Conclusion

By avoiding the described ten antipatterns and by following the ten patterns, you will be able to bring your deep learning experimentation to the next level.
