
12 Factors of reproducible Machine Learning in production

Last Updated on January 25, 2021 by Editorial Team

Author(s): Ben Koller


The last two decades have taught us a great deal about software development. A big part of that is due to the emergence of DevOps and its wide adoption throughout the industry.

Leading software companies follow similar patterns: fast iterations in software development, followed by Continuous Integration, Continuous Delivery, and Continuous Deployment. Every artifact is tested on its ability to provide value, always has a state of readiness, and is deployed through automation.

As a field, Machine Learning differs from traditional software development, but we can still borrow many learnings and adapt them to "our" industry. For the last few years, we've been doing Machine Learning projects in production, so beyond proofs-of-concept, and our goal was the same as in software development: reproducibility. So we built a pipeline orchestrator, established strong automation, and set up a workflow to achieve exactly that.

Why not just Jupyter Notebooks? Well, how long does it take to construct a Notebook from scratch, with all processing steps included? And how easy is it to onboard new members to the team? Can you quickly reproduce the results you got two months ago? Can you compare today's results against historic ones? Can you trace the provenance of your data throughout training? And what happens if your model goes stale?

We've faced all of these issues, and more, and have distilled our experience into 12 factors (as a nod to the 12-factor app) that form the backbone of successful ML in production.

1. Versioning

While obvious to basically all Software Engineers, version control is not a universally accepted methodology among Data Scientists. Let me quote the folks at GitLab as a quick primer:

Version control facilitates coordination, sharing, and collaboration across the entire software development team. Version control software enables teams to work in distributed and asynchronous environments, manage changes and versions of code and artifacts, and resolve merge conflicts and related anomalies.

In short, versioning lets you safely manage the moving parts of Software Development.

As a special form of Software Development, Machine Learning has unique requirements. First, it has not one but two moving parts: Code and Data. Second, model training happens in (fast) iterations and introduces a high variance of code (e.g. splitting, preprocessing, models).

As soon as data can be subject to change, it needs to be versioned so that experiments and model training can be conducted reproducibly and repeatably. Cruder forms of versioning (read: hard copies) can go a long way, but especially in team scenarios, shared, immutable version control becomes critical.

Version control of code is even more key. Beyond the quote above, preprocessing code is relevant not just at training time but also at serving time, and it needs to be immutably correlatable with models. Serverless functions can provide an easy-access middle ground between the workflow of Data Scientists and production-ready requirements.

TL;DR: You need to version your code, and you need to version your data.
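To make the idea concrete, here is a minimal content-addressing sketch in Python. The record format and the ID length are illustrative assumptions on our part; dedicated tools (e.g. DVC or Git LFS) handle this properly at scale.

```python
import hashlib
import json

def dataset_version(records):
    """Return a short, immutable version ID derived from the data itself.

    Any change to the records (values, order, added rows) yields a new ID,
    so experiments can be pinned to an exact data snapshot.
    """
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

v1 = dataset_version([{"age": 31, "label": 1}, {"age": 45, "label": 0}])
v2 = dataset_version([{"age": 31, "label": 1}, {"age": 46, "label": 0}])
assert v1 != v2  # a single changed value produces a new version ID
```

Pinning a training run to such an ID is the "immutable" part: the same ID always refers to the same bytes.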

2. Explicit feature dependencies

In a perfect world, whatever produces your input data will forever produce exactly the same data, at least structurally. But the world is not perfect: you're consuming data from an upstream service that's built by humans and might be subject to change. Features will change eventually. At best, your models fail outright; at worst, they'll silently start to produce garbage results.

Explicitly defined feature dependencies allow for transparent failure as early as possible. Well-designed systems will accommodate feature dependencies both in continuous training and at serving time.

TL;DR: Make your feature dependencies explicit in your code.
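A minimal sketch of such an explicit feature contract in Python; the feature names and types are hypothetical, not from any particular project:

```python
# A hypothetical feature contract: the names and types the model expects.
EXPECTED_FEATURES = {"age": int, "income": float, "country": str}

def validate_features(row):
    """Fail loudly (and early) if upstream data drifts from the contract."""
    missing = set(EXPECTED_FEATURES) - set(row)
    if missing:
        raise ValueError(f"missing features: {sorted(missing)}")
    for name, expected_type in EXPECTED_FEATURES.items():
        if not isinstance(row[name], expected_type):
            raise TypeError(f"{name}: expected {expected_type.__name__}, "
                            f"got {type(row[name]).__name__}")
    return row

validate_features({"age": 31, "income": 52000.0, "country": "CH"})  # passes
```

Run this check at both training and serving time, and a silently renamed upstream column turns into an immediate, diagnosable error instead of garbage predictions.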

3. Descriptive training and preprocessing

Good software is descriptive: it can be read and understood easily without reading every line of code.

And while Machine Learning is a unique flavor of Software Development, it doesn't exempt practitioners from following established coding guidelines. A basic understanding of coding-standard essentials can be picked up with very little effort and in a short amount of time.

Code for both preprocessing and models should follow PEP8. It should use meaningful object names and contain helpful comments. Following PEP8 will improve code legibility, reduce complexity, and speed up debugging. Programming paradigms such as SOLID provide thought frameworks to make code more maintainable, understandable, and flexible for future use cases.

Configuration should be separated from code. Don't hardcode your split ratios; provide them at runtime through configuration. As known from hyperparameter tuning, well-separated configuration increases the speed of iterations significantly and makes codebases reusable.

TL;DR: Write readable code and separate code from configuration.
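As a small sketch of what "separated" means in practice (the config keys and values here are assumptions for illustration):

```python
import json

# Config lives outside the code -- a file, env var, or CLI flag in practice.
CONFIG_JSON = '{"train_split": 0.8, "random_seed": 42, "model": "logreg"}'

def load_config(raw):
    """Parse and sanity-check runtime configuration."""
    cfg = json.loads(raw)
    assert 0.0 < cfg["train_split"] < 1.0, "split ratio must be a fraction"
    return cfg

cfg = load_config(CONFIG_JSON)
n_train = int(1000 * cfg["train_split"])  # no hardcoded split ratio in code
```

Changing the split ratio or seed is now an edit to configuration, not to code, so every variation of a run stays diffable and reusable.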

4. Reproducibility of training

If you can't reproduce training results, you can't trust them. While this is somewhat the overarching theme of this blog post, there are nuances to reproducibility. Not only do you need to be able to reproduce a training run yourself, the entire team should be able to do so. Obscuring training in Jupyter Notebooks on someone's PC or on some VM on AWS is the literal opposite of reproducible training.

By using pipelines to train models, entire teams gain both access to and transparency over conducted experiments and training runs. Bundled with a reusable codebase and a separation of configuration from code, everyone can successfully relaunch any training at any point in time.

TL;DR: Use pipelines and automation.
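A toy illustration of the baseline requirement: with the seed pinned in configuration, anyone on the team gets bit-for-bit identical results. (Real training adds more sources of nondeterminism, e.g. GPU kernels, that need pinning too.)

```python
import random

def train(config):
    """A toy 'training run': same config (including seed) means the
    result is exactly reproducible by anyone, on any machine."""
    rng = random.Random(config["seed"])
    # stand-in for parameter initialisation + stochastic optimisation
    weights = [rng.gauss(0, 1) for _ in range(config["n_weights"])]
    return weights

config = {"seed": 42, "n_weights": 3}
assert train(config) == train(config)  # same config, same result
```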

5. Testing

Testing comes in many shapes and forms. To give two examples:

  • Unit testing is testing on an atomic level: every function is tested individually against its own specific criteria.
  • Integration testing takes the inverse approach: all elements of a codebase are tested as a group, in conjunction and with clones/mocks of up- and downstream services.

Both paradigms are good starting points for Machine Learning. Preprocessing code is predestined for unit testing: do transforms yield the right results given various inputs? Models are a great use case for integration tests: does your model, served in a production environment, produce results comparable to those from evaluation?
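For instance, a unit test for a preprocessing transform (the transform itself is a generic example, not from the article):

```python
def normalize(values):
    """Min-max scale a list of numbers into [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Unit tests: the transform is checked against its own specific criteria,
# including the degenerate input that would otherwise divide by zero.
assert normalize([0, 5, 10]) == [0.0, 0.5, 1.0]
assert normalize([3, 3, 3]) == [0.0, 0.0, 0.0]
```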

TL;DR: Test your code, test your models.

6. Drift / Continuous training

Drift is a legit problem for production scenarios. You need to account for drift as soon as there is even a slight possibility that data might change (e.g. user input, upstream service volatility). Two measures can mitigate risk exposure:

  • Data monitoring for production systems. Establish automated reporting mechanisms to alert teams of changing data, even beyond explicitly defined feature dependencies.
  • Continuous training on new incoming data. Well-automated pipelines can be rerun on newly recorded data, compared against historic training results to reveal performance degradation, and used to quickly promote newly trained models into production when model performance improves.

TL;DR: If your data can change, run a continuous training pipeline.
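A deliberately simple data-monitoring sketch: compare a live feature's mean against the training baseline and alert past a tolerance. The relative-mean-shift statistic and the 25% threshold are assumptions for illustration; production systems use richer statistics (distributions, population stability, per-feature tests).

```python
def mean(xs):
    return sum(xs) / len(xs)

def drift_alert(training_values, live_values, tolerance=0.25):
    """Flag drift when the live mean moves beyond a relative tolerance
    of the training mean (statistic and threshold are assumptions)."""
    baseline = mean(training_values)
    shift = abs(mean(live_values) - baseline) / (abs(baseline) or 1.0)
    return shift > tolerance

train_ages = [30, 35, 40, 45]   # mean 37.5
live_ages = [55, 60, 58, 62]    # mean 58.75 -> clear drift
assert drift_alert(train_ages, live_ages) is True
assert drift_alert(train_ages, [31, 36, 39, 44]) is False
```

An alert like this is what triggers the second measure: rerunning the continuous training pipeline on the new data.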

7. Tracking of results

Excel is not a good way to track experiment results. And not just Excel: any decentralized, manual form of tracking will yield non-authoritative and therefore untrustworthy information.

The right approach is to record training results automatically in a centralized data store. Automation ensures the reliable tracking of every training run and allows for later comparability of training runs against each other. Centralized storage of results gives transparency across teams and allows for continuous analysis.

TL;DR: Track results via automation.
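A minimal sketch of such a centralized, append-only run store; a JSON-lines file stands in here for what would be a database or tracking service in a real setup, and the recorded fields are illustrative:

```python
import json
import os
import tempfile

def record_run(store_path, run):
    """Append one training run's metadata to a shared, append-only store."""
    with open(store_path, "a") as f:
        f.write(json.dumps(run) + "\n")

def load_runs(store_path):
    with open(store_path) as f:
        return [json.loads(line) for line in f]

store = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
record_run(store, {"run_id": 1, "accuracy": 0.91, "data_version": "a1b2c3"})
record_run(store, {"run_id": 2, "accuracy": 0.93, "data_version": "a1b2c3"})
best = max(load_runs(store), key=lambda r: r["accuracy"])
assert best["run_id"] == 2  # runs stay comparable across time and teams
```

Because every run lands in the same store with the same fields (including the data version), comparisons like "best accuracy on this dataset" become one query instead of an archaeology project.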

8. Experimentation vs Production models

Understanding datasets requires effort. Commonly, this understanding is gathered through experimentation, especially when operating in fields with a lot of hidden domain knowledge. Start a Jupyter Notebook, get some or all of the data into a Pandas DataFrame, do some hours of out-of-sequence magic, train a first model, evaluate the results: job done. Well, unfortunately not.

Experiments serve a purpose in the lifecycle of Machine Learning. The results of these experiments are, however, not models but understanding. Models from explorative Jupyter Notebooks are proof of understanding, not production-ready artifacts. The understanding gained will need more molding and fitting into production-ready training pipelines.

All understanding unrelated to domain-specific knowledge can, however, be automated. Generate statistics on each data version you use to skip the one-time, ad-hoc exploratory work you might otherwise do in Jupyter Notebooks, and move straight to the first pipelines. The earlier you experiment in pipelines, the earlier you can collaborate on intermediate results, and the earlier you'll get production-ready models.

TL;DR: Notebooks are not production-ready, so experiment in pipelines early on.
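What "generate statistics on each data version" can look like as a pipeline step rather than a notebook session (the column names and chosen statistics are assumptions):

```python
def dataset_statistics(rows):
    """Automate the one-off notebook exploration: count, mean, min and
    max for every numeric column, computed per data version."""
    stats = {}
    for col in rows[0]:
        values = [r[col] for r in rows if isinstance(r[col], (int, float))]
        if values:
            stats[col] = {"count": len(values),
                          "mean": sum(values) / len(values),
                          "min": min(values), "max": max(values)}
    return stats

rows = [{"age": 30, "income": 40000.0}, {"age": 50, "income": 60000.0}]
stats = dataset_statistics(rows)
assert stats["age"]["mean"] == 40.0
```

Emitting this summary for every data version means the ad-hoc "what does this data even look like" step happens exactly once, automatically, and its output is shared.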

9. Training-Serving-Skew

The avoidance of skewed training and serving environments is often reduced to correctly embedding all data preprocessing into the model serving environment. This is absolutely correct, and you need to adhere to this rule. However, it is also too narrow an interpretation of Training-Serving-Skew.

A little detour to ancient DevOps history: in 2006, the CTO of Amazon, Werner Vogels, coined the phrase "You build it, you run it". It's a descriptive phrase for extending the responsibility of Developers to not only writing but also running the software they build.

A similar dynamic is required for Machine Learning projects: an understanding of both the upstream generation of data and the downstream usage of the generated models falls within the responsibility of Data Scientists. What system generates your data for training? Can it break? What's the system SLO (service level objective), and is it the same as for serving? How is your model served? What's the runtime environment, and how are your preprocessing functions applied during serving? These are questions Data Scientists need to understand and find answers to.

TL;DR: Correctly embed preprocessing into serving, and make sure you understand the up- and downstream of your data.
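The "correctly embed preprocessing" half of the rule can be as simple as a structural decision: one function, one module, imported by both sides. A sketch (field names are illustrative):

```python
def preprocess(raw):
    """One preprocessing function, imported by BOTH the training pipeline
    and the serving endpoint, so the two can never silently diverge."""
    return {"age_scaled": raw["age"] / 100.0,
            "is_resident": 1 if raw["country"] == "US" else 0}

# Training time and serving time apply the identical transform:
train_features = preprocess({"age": 40, "country": "US"})
serve_features = preprocess({"age": 40, "country": "US"})
assert train_features == serve_features
```

The failure mode this prevents is the classic one: preprocessing reimplemented in the serving codebase, drifting one bugfix at a time away from what the model was trained on.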

10. Comparability

From the moment a second training script is introduced to a project, comparability becomes a fundamental part of any future work. If the results of the second model cannot be compared to the first at all, waste has been generated, and at least one of the two models is superfluous, if not both.

By definition, all model trainings that try to solve the same problem need to be comparable; otherwise, they are not solving the same problem. And while iterations will change the definition of what to compare models on over time, the technical ability to compare model trainings needs to be built into the training architecture as a first-class citizen early on.

TL;DR: Build your pipelines so you can easily compare training results across pipelines.

11. Monitoring

As a very rough description, Machine Learning models are supposed to solve a problem by learning from data. To solve this problem, compute resources are allocated: first to training the model, later to serving it. The abstract entity (e.g. the person or department) responsible for spending the resources during training carries that responsibility forward to serving. Plenty of negative degradations can occur in the lifecycle of a model: data can drift, models can become bottlenecks for overall performance, and bias is a real issue.

The effect: Data Scientists and teams are responsible for monitoring the models they produce. Not necessarily for implementing that monitoring themselves, if bigger organizations are at play, but certainly for understanding and interpreting the monitoring data. At a minimum, a model needs to be monitored for input data, inference times, resource usage (read: CPU, RAM), and output data.

TL;DR: Again: you build it, you run it. Monitoring models in production is a part of data science in production.
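The minimum monitoring surface named above (inputs, latency, outputs) can be sketched as a thin wrapper around the model call; the in-memory log list stands in for whatever metrics sink your organization actually uses:

```python
import time

def monitored_predict(model_fn, features, log):
    """Wrap a model call to record its input, latency and output --
    the minimum monitoring surface (the log sink is an assumption)."""
    start = time.perf_counter()
    output = model_fn(features)
    log.append({"input": features,
                "latency_s": time.perf_counter() - start,
                "output": output})
    return output

log = []
toy_model = lambda x: 1 if x["age"] > 30 else 0
assert monitored_predict(toy_model, {"age": 42}, log) == 1
assert log[0]["output"] == 1 and log[0]["latency_s"] >= 0
```

From records like these, the drift and degradation checks from factor 6 have something to run against.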

12. Deployability ofΒ Models

On a technical level, every model training pipeline needs to produce an artifact deployable to production. The model results might be horrible, no questions asked, but it needs to end up wrapped in an artifact you can move directly towards a production environment.

This is a common theme in Software Development; it's called Continuous Delivery. Teams should be able to deploy their software at any given moment, and iteration cycles need to be quick enough to accommodate that goal.

A similar approach needs to be taken with Machine Learning. First and foremost, it will force a conversation about reality and the expectations towards models. All stakeholders need to be aware of what's even theoretically possible regarding model results. All stakeholders need to agree on how a model gets deployed and where it fits into the bigger software architecture around it. It will, however, also lead to strong automation and, by necessity, the adoption of most of the factors outlined above.

TL;DR: Every training pipeline needs to produce a deployable artifact, not "just" a model.
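A sketch of what "an artifact, not just a model" can mean: the model is bundled with the data version and configuration it came from, so the artifact carries its own provenance. The pickle format and the bundled fields are assumptions for illustration (real artifact formats vary, and pickle should only be loaded from trusted sources):

```python
import json
import os
import pickle
import tempfile

def package_artifact(path, model, data_version, config):
    """Bundle everything serving needs -- model, data version, config --
    into one deployable artifact, regardless of how good the model is."""
    with open(path, "wb") as f:
        pickle.dump({"model": model, "data_version": data_version,
                     "config": config}, f)

path = os.path.join(tempfile.mkdtemp(), "model-artifact.pkl")
package_artifact(path, {"weights": [0.1, 0.2]}, "a1b2c3", {"seed": 42})
with open(path, "rb") as f:
    artifact = pickle.load(f)
assert artifact["data_version"] == "a1b2c3"  # provenance travels with it
```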

Closing

This is by no means an exhaustive list. It's the distillation of our experience, and you're welcome to use it as a boilerplate to benchmark your production architecture, or as a blueprint to design your own.

We used these factors as the guiding principles for ZenML, our ML orchestrator. So before you start from scratch, check out ZenML on GitHub: https://github.com/maiot-io/zenml


12 Factors of reproducible Machine Learning in production was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
