
Last Updated on July 21, 2022 by Editorial Team

Author(s): Ori Abramovsky


Steps Toward MLOps Research — Software Engineering Your AI

Data scientists a minute after their models were deployed to the production environment. Photo by Artem Beliaikin on Unsplash

MLOps is an old requirement for a new field; the Machine Learning world is evolving from a niche topic in the darkest back rooms to a first-class citizen in a wide range of use cases. But with great power comes great responsibility: besides new, domain-specific requirements (such as model explainability, collaborating on research, and keeping it readable), more traditional software development needs (such as monitoring, API serving, and release management) are becoming more and more vital to AI applications as well. These procedures are well formulated in the general software development world but are quite new to the Machine Learning domain, and it will therefore take time until they are fully adopted. In the meanwhile, there is much that can be done; the initial step should be to embrace general software development concepts and fit them into the Machine Learning world: small changes to the way we interact with and manage our AI projects. Below are key points toward that target, simple but at the same time crucial.

Data scientists a minute after they realize their models also need to be maintained in production. Photo by Darius Bashar on Unsplash

Reproducible research — keep train dataset

One of the phenomena that bothers us most in the Machine Learning world is non-reproducible research papers: ones that demonstrate extreme results but don't share their code, their data, or any way to reproduce their outcomes. Now zoom in to your own organization; let's assume we have a model that has been running for a while, serving our customers and showing performance X. Since much time has passed since it was initially developed, we decide to revisit its performance in order to find out whether re-tuning is needed. The issue is that the person who created the model has left or is working on a different project, and now we need to re-research the model in order to understand its numbers better. During such processes we commonly come across edge cases like anomalous data pieces, new potential features, missing values, or simply our own unique point of view on the problem at hand. Each of these points can send us on a long journey only to answer questions like: were the model creators aware of this? And if they were, did they do anything differently? The problem is that this can easily turn into a waste of time, chasing ghosts without knowing these are dead ends. But what if we could jump back in time to when the model was initially trained and validate these questions? As great minds think alike, most likely many of our 'new fascinating' ideas were already evaluated! Until someone develops a working time machine, the best we can do is to save the original train population the model was trained on, enabling us to easily validate those questions. Systematically saving train dataset snapshots will ease many such later requirements.
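As a minimal sketch of what such a snapshot step could look like, assuming the train set is a pandas DataFrame: the directory layout, the model_id argument, and the metadata fields below are illustrative choices, not a prescribed standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd


def snapshot_train_set(df: pd.DataFrame, model_id: str, out_dir: str = "train_snapshots") -> Path:
    """Persist the exact training dataframe next to a small metadata file."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(out_dir) / f"{model_id}_{stamp}"
    target.mkdir(parents=True, exist_ok=True)

    # Writing parquet requires an engine such as pyarrow to be installed.
    data_path = target / "train.parquet"
    df.to_parquet(data_path)

    # A content hash lets us verify later that the snapshot was never modified.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    meta = {
        "model_id": model_id,
        "created_at": stamp,
        "rows": len(df),
        "columns": list(df.columns),
        "sha256": digest,
    }
    (target / "metadata.json").write_text(json.dumps(meta, indent=2))
    return target
```

Calling something like snapshot_train_set(train_df, "support_risk_v1") right before fitting the model is usually enough to answer the "were they aware of it?" questions months later.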

Explainable research — log what and what not to do

Continuing the previous point, in some cases keeping a snapshot of the training dataset is not enough, as it will lack some important context. Answering questions like "why doesn't the model use feature X for its predictions?" or "wouldn't it be better to represent the data in a different way?" is not feasible with the train dataset snapshots alone. What we lack is the context that led to some of these decisions, which is commonly not visible in the data itself. An important solution for this need is more structured research methods; tools like iPython notebooks enable us to easily understand what was done, why, and how. Such tools matter even for a single data scientist working solo within a wider R&D organization. Keeping well-documented and updated notebooks describing the research we made and the decisions we took enables us to easily handle questions and concerns that may arise in the future. As an open book of our research, they may even make our generated models self-explanatory in the face of such future concerns.
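One lightweight way to capture that context is to keep a small decision log next to the notebook. The helper below is only a sketch; the file name and the record fields are arbitrary conventions I'm assuming, not an established tool.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("research_decisions.jsonl")  # lives next to the notebook


def log_decision(question: str, decision: str, rationale: str) -> None:
    """Append one research decision as a JSON line, so it survives notebook edits."""
    record = {
        "at": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "decision": decision,
        "rationale": rationale,
    }
    with LOG_PATH.open("a") as fh:
        fh.write(json.dumps(record) + "\n")


# Example usage inside the exploration notebook:
log_decision(
    question="Why isn't feature X used for predictions?",
    decision="Dropped feature X",
    rationale="Leaks the label: it is only populated after a request is resolved.",
)
```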

Software dev research — make it a project and GIT it

Now that we have decided to embrace iPython notebooks as our main research tool, and once we have spent the effort to make them more structured and readable, it's tempting to move all our development into that ecosystem, managing and triggering even our production scripts as notebooks. Personally, I find iPython notebooks super beneficial during the initial research and POC phases, but also super counterintuitive and error-prone the minute that research is done. Code development processes commonly include versions, defects, user stories, documentation, and many other modern software engineering features. All are well suited to source code hosting tools such as Git, enabling us to easily manage our code development states. Keeping our notebooks under version control is the first step in that direction. But this is not enough. Once the research phase is over, it's important to treat our notebooks as regular software development projects and to deal with the relevant requirements that adds. One such example is making sure our code is reproducible, which is especially important in two places: the data creation phase (enabling us to easily investigate and fix data defects by making the related code clear and visible, and enabling an easy research-to-development transition) and the production phase (enabling a clear view of the app requirements, such as how to support online inference, as well as general context points such as the projected latency). Such features are less visible when our code is scattered across different notebooks. Moreover, notebook-based development can easily tolerate code anti-patterns such as code duplication. Once visualization, readability, and research are less critical, it's time to move our iPython notebook snippets into regular code projects.
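As an illustration of that research-to-development transition, assume a feature-engineering snippet that was copy-pasted across several notebooks. The module and test names below are hypothetical, but the shape (a plain, importable, unit-tested function living in the project) is the point.

```python
# features.py: lives in the project, under version control, importable from
# both the notebooks and the production pipeline.
import pandas as pd


def add_time_to_resolution(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the time-to-resolution feature used during research and serving."""
    out = df.copy()
    out["time_to_resolution_h"] = (
        (out["resolved_at"] - out["opened_at"]).dt.total_seconds() / 3600
    )
    return out


# test_features.py: a regular unit test, run by CI on every commit.
def test_add_time_to_resolution():
    df = pd.DataFrame(
        {
            "opened_at": pd.to_datetime(["2022-01-01 00:00"]),
            "resolved_at": pd.to_datetime(["2022-01-01 06:00"]),
        }
    )
    assert add_time_to_resolution(df)["time_to_resolution_h"].iloc[0] == 6.0
```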

Research software dev— notebooks context

Now that we have learned how important it is to move our code into regular projects the minute we can, we should recall why we decided to use iPython notebooks in the first place. Briefly, AI development has two main phases: the exploration / general research phase, and the more specific follow-ups, when we know what to do and start implementing it. By making visualization and explainability easy, iPython notebooks are clearly the best tool for the research phase. While previously we described why it is important not to rely too much on notebooks, the mirror mistake is relying too much on general code projects; similar to before, there will be a tendency to move all our development out of the notebook ecosystem. Such a step commonly includes refactoring the notebooks, moving code snippets into code projects, and using their imports (instead of the plain code) in our notebooks. The hidden caveat is that notebooks and general code projects have different progress and release processes: revisiting our notebooks a few months later can hide the fact that while the notebook was generated against code version X, the current code base is at version X+m, which is no longer relevant to the notebook state we're seeing (defects were fixed, APIs have changed, etc.). What we lack is a binding between the notebook and the code project state it relied on. There are some possible solutions for that, like making notebooks part of the code projects, which enables further investigation to understand what code versions they relied on. A more straightforward approach is to make it clear in the exploration notebooks what they were trying to investigate and which assets, states, and versions they relied on. In some cases, even (forgive me, god) just copying the relevant code can make sense.
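A minimal sketch of such a context cell, placed at the top of an exploration notebook. The exact fields, the asset paths, and the assumption that the notebook lives inside a git working copy are mine; the idea is simply to pin down what the notebook was built against.

```python
# First cell of the notebook: record what this exploration relied on.
import subprocess
import sys

import pandas as pd
import sklearn

context = {
    "git_commit": subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip(),
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
    # Hypothetical references to the assets this notebook was built against.
    "train_snapshot": "train_snapshots/support_risk_v1_20220701T120000Z",
    "model_artifact": "models/support_risk_v1.pkl",
}
print(context)
```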

Research monitoring — model performance

An important point that many of us tend to forget is that our work doesn't end once the research is over, the development is over, or even when the deployment is over. In some respects, it only starts once users begin interacting with our generated models. Commonly, the research we do includes a need and a related thesis. For example, in order to increase customer satisfaction (the need), we will try to improve our service, making sure customers' requests are handled and solved in time, assuming this will improve their satisfaction (the thesis). Let's assume we created a model for our company's support teams that highlights customer requests at risk of not being solved in time. Once users (the support employees) start using our app, there are a few metrics we should watch. First, whether customer requests are now being handled more properly; if not, further debugging is required; maybe there is an issue with the UX, the data we use, or the model we generated. Then it's important to verify that our thesis holds: once more requests are handled properly, satisfaction should rise. From an operations point of view, it's important to keep verifying these characteristics. A common metric to watch is class distribution, looking for sudden shifts (which can indicate an issue) and comparing it to our research baseline numbers (if our model predicted X% positive on the validation split and it's now 3X% or X/3%, that can indicate an issue). In some cases, our models will be periodic by definition, required to be retrained once in a while. For these cases, we would like to generate the models using code (and not in one-off notebooks). Such code can be part of the general CI/CD deployment process and should include automatic tests to verify whether the newly generated model should replace the existing one. Such tests will probably analyze the very same metrics we mentioned and can therefore be reused for that need.
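A minimal sketch of such a class-distribution check, assuming we stored the positive rate measured on the validation split as a baseline. The 3x threshold is taken from the example above and is a judgment call, not a universal rule.

```python
from typing import Sequence


def positive_rate_alert(
    predictions: Sequence[int],
    baseline_positive_rate: float,
    ratio_threshold: float = 3.0,
) -> bool:
    """Return True when the live positive rate drifted far from the research baseline."""
    if not predictions:
        return True  # no predictions at all is itself worth alerting on
    live_rate = sum(predictions) / len(predictions)
    ratio = live_rate / baseline_positive_rate
    return ratio >= ratio_threshold or ratio <= 1 / ratio_threshold


# Example: the validation split showed 5% positives, production shows 18%.
if positive_rate_alert([1] * 18 + [0] * 82, baseline_positive_rate=0.05):
    print("Class distribution shifted; investigate the UX, the data and the model.")
```

The same function can run both as a scheduled monitoring job and as an automatic test inside the periodic retraining pipeline.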

Research testing — data verifications

Now that we have started monitoring our model, we should review the other tests we could apply. Besides regular operational metrics (like latency, memory in use, request failures, etc.) and the intuitive performance ones (accuracy rates), we would like to use more classic testing such as unit, functional, and integration tests. Given how important data is to an AI application's success, the first area to apply these tests should be the data domain. Moreover, in many cases we are only the data consumers (the data is produced elsewhere), so we would like to unit-test and monitor key properties of our data. The naive approach would be to verify properties like data definitions in general. It's naive not because such checks lack importance but because this is a never-ending effort: every data field can be further described and analyzed along numerous dimensions (null values, density, ranges, etc.). Testing all these cases is hard to maintain and can easily lead to many false-positive alerts. A more reasonable solution is built the opposite way: first, assume everything works correctly and start by verifying only the '911' emergency cases (like whether new data arrived at all). Then, generate new data tests only after identifying data defects (to guard against them in the future). This approach makes sure you don't have too many tests, and for the ones you do have, there is a clear motivation to make sure they work properly.
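A sketch of what the first '911' checks could look like, written as regular pytest tests. The file path, the freshness window, and the get_latest_batch helper are illustrative assumptions about the data pipeline, not part of any specific tool.

```python
# test_data_freshness.py: start with the emergency checks only, and add more
# tests as concrete data defects are discovered.
import pandas as pd


def get_latest_batch() -> pd.DataFrame:
    """Hypothetical loader for the most recent ingested batch of support requests."""
    return pd.read_parquet("data/support_requests/latest.parquet")


def test_new_data_arrived():
    batch = get_latest_batch()
    assert len(batch) > 0, "No new rows ingested; the pipeline may be down."


def test_data_is_fresh():
    batch = get_latest_batch()
    # Assumes 'ingested_at' is stored as a UTC timestamp column.
    newest = batch["ingested_at"].max()
    assert pd.Timestamp.now(tz="UTC") - newest < pd.Timedelta(hours=24), (
        "Latest rows are older than 24h; the upstream producer may have stopped."
    )
```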

Software development research and vice versa

AI research can only benefit from general software development concepts. Given the rise of AI, ML engineers, and tools like iPython notebooks, it's fair to assume the opposite will happen as well, and general software development will also find AI concepts to benefit from; for example, using notebooks as a DevOps tool to describe the environment properties at hand. Exciting days are ahead.

