MLOps Guide Series, Part 2: Monitoring
Last Updated on July 26, 2023 by Editorial Team
Author(s): Rashmi Margani
Originally published on Towards AI.
Machine Learning
Concept Drift, Data Drift & Monitoring
The most exciting moment for any machine learning system is when you get to deploy your model, but deployment is hard for two reasons. The first is statistical: past model performance is no longer guaranteed in the future, and performance degrades over time as the data changes once the model is running in production, where data changes frequently. The second is systems engineering: an ML system demands frequent monitoring, which is manual and tedious by nature and needs to be automated as much as possible.
Now, how do you deal with the statistical issue of degrading model performance? How do you handle data changes once the model is deployed?
That is where concept drift and data drift come into the picture.
Concept drift refers to the desired mapping from x to y changing. The deployed, productized model keeps predicting according to the old relationship, which leads to inaccurate predictions when the underlying data distribution has shifted substantially.
For example, before COVID-19, a burst of surprising online purchases on a given account should have flagged it for possible fraud, since the credit card may have been stolen. During the pandemic, online purchases increased for everyone, so those same purchases were no longer a real cause for alarm, and the fraud-detection system's learned behaviour no longer matched reality: the mapping from purchase behaviour to the fraud label had changed.
As another example of concept drift, let's say x is the size of a house and y is its price. Because of changes in the market, houses may become more expensive over time: the same-sized house ends up with a higher price. That change in the x-to-y mapping is concept drift.
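To make that concrete, here is a toy sketch (with entirely made-up numbers) of what concept drift does to a deployed model: a regression fitted on the old size-to-price relationship stays accurate on old data but degrades once the market shifts the mapping, even though the inputs look the same.

```python
# Toy sketch of concept drift: the size -> price mapping itself shifts.
# All numbers are made up purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# "Past" market: price is roughly 300 * size (arbitrary units)
size_past = rng.uniform(50, 200, size=1000).reshape(-1, 1)
price_past = 300 * size_past.ravel() + rng.normal(0, 5000, size=1000)

model = LinearRegression().fit(size_past, price_past)

# "Current" market: the same sizes now map to higher prices (roughly 400 * size)
size_now = rng.uniform(50, 200, size=1000).reshape(-1, 1)
price_now = 400 * size_now.ravel() + rng.normal(0, 5000, size=1000)

print("MAE on past data:   ", mean_absolute_error(price_past, model.predict(size_past)))
print("MAE on current data:", mean_absolute_error(price_now, model.predict(size_now)))
# The error on current data is much larger even though the distribution of
# sizes is unchanged -- the mapping from x to y has drifted.
```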
Data drift refers to the distribution of x changing, even if the mapping from x to y does not change. Managing these changes to the data leads to a second set of software engineering issues: you will have to manage and automate the jobs that keep the deployed system working correctly.
For example, take the housing use case again. Let's say people start building larger houses, or start building smaller houses, so the input distribution of house sizes changes over time. When you deploy a machine learning system, one of the most important tasks is to detect and manage changes like these. If you are implementing a prediction service whose job is to take queries x and output predictions y, you have a lot of design choices in how to implement that piece of software.
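One common way to catch this kind of shift is to compare the distribution of a feature at training time with the distribution arriving in production. The sketch below uses a two-sample Kolmogorov-Smirnov test on the house-size feature; the window sizes and alerting threshold are illustrative assumptions, not recommendations.

```python
# Minimal sketch of data drift detection on a single numeric feature
# (house size) using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Feature distribution seen at training time
train_sizes = rng.normal(loc=120, scale=30, size=5000)

# Feature distribution arriving in production (people now build larger houses)
serving_sizes = rng.normal(loc=150, scale=30, size=5000)

stat, p_value = ks_2samp(train_sizes, serving_sizes)
if p_value < 0.01:  # alerting threshold, chosen arbitrarily here
    print(f"Possible data drift detected (KS statistic={stat:.3f}, p={p_value:.1e})")
else:
    print("No significant drift detected in this window")
```

In practice, a check like this would run on a rolling window of serving inputs for each important feature, with the results pushed to whatever alerting system the team already uses.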
Here's a checklist of questions that might help you make the appropriate decisions when dealing with these software engineering issues:
- Do you need real-time predictions or batch predictions? Real-time prediction takes one query at a time and answers immediately, for example transcribing a single audio sequence in speech recognition. Batch prediction is common in hospitals: take electronic health records and run an overnight batch process to see whether there is something associated with the patients that can be spotted.
- Does your prediction service run in the cloud, at the edge, or maybe even in a web browser?
- How many computing resources do you have, or can you allocate, to a given ML system?
- What security and privacy settings need to be supported for the ML system in production?
- What throughput, in queries per second, needs to be supported?
- What type of logging needs to be implemented to backtrack ML system failures and support reproducibility? (A minimal serving-and-logging sketch follows this list.)
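As a rough illustration of the real-time serving and logging questions above, here is a hypothetical prediction-service wrapper: it answers one query at a time and logs inputs, outputs, and latency so that drift analysis and failure backtracking are possible later. The function and log format are assumptions for illustration, not a prescribed design.

```python
# Hypothetical real-time prediction service wrapper with structured logging.
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prediction_service")

def predict(model, x):
    """Serve a single query x and log everything needed for later monitoring."""
    start = time.perf_counter()
    y = model.predict([x])[0]  # assumes a scikit-learn style model
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info(json.dumps({
        "input": x,
        "prediction": float(y),
        "latency_ms": round(latency_ms, 2),
    }))
    return y
```

For the batch case, the same kind of logging would typically be attached to the overnight job as a whole rather than to each individual query.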
Beyond these design choices, you need to monitor the system's performance and continue to maintain it, especially in the face of concept drift as well as data drift. Building a machine learning system for its very first deployment is quite different from updating or maintaining a system that has already been deployed.
Unfortunately, the first deployment usually means you are only about halfway there. The second half of the work starts only after that first deployment, because even after you have deployed there is a lot of work to feed data back, update the model, and keep maintaining it whenever the data changes.
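As a sketch of what that second half can look like, the snippet below compares recent model error against the error measured at deployment time and flags the model for retraining when it degrades past a tolerance. The metric, the threshold, and the retraining hook are assumptions for illustration only.

```python
# Hedged sketch of a periodic "should we retrain?" check.
from sklearn.metrics import mean_absolute_error

def needs_retraining(model, recent_inputs, recent_labels, baseline_mae, tolerance=1.2):
    """Return True if recent error exceeds the deployment-time baseline by `tolerance`x."""
    recent_mae = mean_absolute_error(recent_labels, model.predict(recent_inputs))
    return recent_mae > tolerance * baseline_mae

# In a scheduled job (for example nightly), something like:
# if needs_retraining(model, last_week_x, last_week_y, baseline_mae):
#     trigger_retraining_pipeline()   # hypothetical hook into your training pipeline
```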
How do you automate the monitoring process for an ML system once it is deployed?
With a few pioneering exceptions, most tech companies have only been doing ML/AI at scale for a few years, and many are only just beginning the long journey. This means that:
- The challenges are often misunderstood or completely overlooked
- The frameworks and tooling are rapidly changing (both for data science and MLOps)
- The best practices are often grey areas
A checklist of strategies for monitoring an ML system includes questions such as:
1. Do dependency changes result in a notification?
2. Do data invariants hold in training and serving inputs, i.e. is training/serving skew monitored? (See the sketch after this list.)
3. Do training and serving features always compute the same values?
4. Is the deployed model numerically stable?
5. Has the model avoided dramatic or slow-leak regressions in training speed, serving latency, throughput, or RAM usage?
6. Has the model avoided a regression in prediction quality on served data?
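For checklist item 2, a minimal training/serving skew check might compare basic per-feature statistics between the training data and a recent window of serving inputs, as sketched below. The 10% threshold and the pandas-based interface are arbitrary illustrative choices, not a standard.

```python
# Minimal sketch of a training/serving skew check on per-feature means.
import pandas as pd

def feature_skew_report(train_df: pd.DataFrame, serving_df: pd.DataFrame,
                        rel_threshold: float = 0.10) -> dict:
    """Return features whose mean shifted by more than rel_threshold between datasets.

    Assumes both DataFrames have the same numeric columns.
    """
    flagged = {}
    for col in train_df.columns:
        train_mean = train_df[col].mean()
        serving_mean = serving_df[col].mean()
        denom = abs(train_mean) if train_mean != 0 else 1e-9
        rel_change = abs(serving_mean - train_mean) / denom
        if rel_change > rel_threshold:
            flagged[col] = {
                "train_mean": train_mean,
                "serving_mean": serving_mean,
                "relative_change": rel_change,
            }
    return flagged
```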
In the next part of the series, we will look at practical implementations and various techniques for dealing with data drift and concept drift, and how to make them a practice during development so that maintenance becomes easier after deploying the model to production.
Published via Towards AI