Deployment & Serving: Exploring 6 Key MLOps Questions using AWS SageMaker
Last Updated on November 5, 2023 by Editorial Team
Author(s): Anirudh Mehta
Originally published on Towards AI.
This article is part of the AWS SageMaker series exploring “31 Questions that Shape Fortune 500 ML Strategy”.
What?
The previous blog posts, “Data Acquisition & Exploration”, “Data Transformation and Feature Engineering”, and “Experiments, Model Training & Evaluation”, explored how AWS SageMaker’s capabilities can help data scientists collaborate and accelerate the journey from data exploration to model creation.
This blog post will focus on key questions related to deploying and serving the model and explore how AWS SageMaker can help address them.
• [Automation] How do you ensure that the deployed models can scale with increasing workloads?
• [Automation] How are new versions rolled out, and how do you compare them against the running version? (A/B testing, canary, shadow, etc.)
• [Automation] Are there mechanisms to roll back or revert deployments if issues arise?
• [Collaboration] How can multiple data scientists understand the impact of their version before releasing it? (A/B testing, canary, shadow, etc.)
• [Reproducibility] How do you package your ML models for serving in the cloud or at the edge?
• [Governance & Compliance] How do you track the predicted decisions for auditability and accountability?
Use Case & Dataset
We will reuse the Fraud Detection use-case, dataset generator, and generated customer & transaction dataset.
Host with SageMaker Hosting
✓ [Automation] How are the new model versions rolled out / deployed?
In the previous article, we trained a fraud detection model, deployed it, and validated it using our test dataset. We deployed the models in an always-online, real-time inference mode, accessible via REST APIs. However, addressing enterprise challenges often requires a more tailored approach to meet specific use case requirements:
- Certain problem types may require low-latency, high-throughput output.
- Workloads for some problem types may be unpredictable or intermittent, or drop to no traffic at all.
- Cost sensitivity may be a factor in decision making, especially when managing multiple models with small resource requirements.
- Some use cases may involve offline bulk data processing.
- In certain situations, there may be a need to handle large payloads asynchronously.
- The problem type may have fluctuating workloads and require auto-scaling capabilities based on demand.
SageMaker offers various options for deploying models for inference. These options are designed to meet the diverse needs that an enterprise may encounter when deploying and managing machine learning models.
Creating an endpoint:
To deploy and serve your model using SageMaker, you need to create an endpoint. An endpoint is an operational runtime environment that serves predictions generated based on the Endpoint Configuration.
The Endpoint Configuration serves as a blueprint, defining the necessary settings such as model selection, instance types, quantity, and various runtime configurations for SageMaker endpoints.
If you need to update an endpoint, you can create or clone the endpoint configuration, make changes, and update the endpoint with the new configuration.
Steps:
1. Configure the endpoint type — provisioned, serverless, or asynchronous.
2. Attach a model to the endpoint configuration.
3. Customize the model settings — instance type, count, weight, and variant name.
4. Lastly, save the endpoint. (A minimal CLI sketch of these steps follows.)
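The same steps can be scripted with the AWS CLI. A minimal sketch; the endpoint, configuration, and model names (automl-deploy, automl-deploy-config, automl-fraud) are illustrative assumptions, and the instance type should match your workload:

# The model "automl-fraud" is assumed to already exist in SageMaker
aws sagemaker create-endpoint-config \
    --endpoint-config-name automl-deploy-config \
    --production-variants '[{
        "VariantName": "variant-1",
        "ModelName": "automl-fraud",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0
    }]'

# Create (or later update) the endpoint from the configuration
aws sagemaker create-endpoint \
    --endpoint-name automl-deploy \
    --endpoint-config-name automl-deploy-config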
Testing an endpoint:
We covered in detail how to test the endpoint using both the CLI and the UI in the previous article, “Experiments, Model Training & Evaluation”. Please visit that article for more details.
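As a quick refresher, a real-time endpoint can be invoked through the SageMaker runtime. A sketch, assuming a CSV-serialized transaction record; the feature values are placeholders:

# Send one CSV record and write the model's prediction to out.json
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name automl-deploy \
    --content-type text/csv \
    --cli-binary-format raw-in-base64-out \
    --body '<comma-separated feature values>' \
    out.json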
Picking an Endpoint Type:
As mentioned previously, depending on your specific use case, you may need to choose from the different endpoint types offered by SageMaker (a serverless configuration sketch follows the list):
Real-time inference:
- Online, real-time inference through REST APIs.
- Ideal for use cases that demand low latency or high throughput.
- Supports payload sizes of up to 6 MB and processing times of up to 60 seconds.
Serverless Inference:
- Online, near-real-time inference through REST APIs.
- Suitable for use cases with intermittent or unpredictable traffic patterns.
- SageMaker handles instances and scaling policies.
- Supports payload sizes of up to 4 MB and processing times of up to 60 seconds.
Batch Transform:
- Offline batch processing.
- Suitable for use cases with large datasets that are available upfront.
- No persistent endpoint.
Asynchronous Inference:
- Queue-based processing through REST APIs.
- Suitable for use cases with long processing times or large payloads that are not available upfront.
- Supports payload sizes of up to 1 GB and processing times of up to 1 hour.
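The endpoint type is expressed in the endpoint configuration itself. For example, a serverless endpoint replaces the instance settings with a ServerlessConfig block; the memory size and concurrency below are illustrative values:

# Serverless variant: SageMaker provisions and scales capacity on demand
aws sagemaker create-endpoint-config \
    --endpoint-config-name automl-serverless-config \
    --production-variants '[{
        "VariantName": "variant-1",
        "ModelName": "automl-fraud",
        "ServerlessConfig": {
            "MemorySizeInMB": 2048,
            "MaxConcurrency": 5
        }
    }]'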
Package with SageMaker Neo
[U+2713] [Reproducibility] How do you package your ML models for serving in the cloud or at the edge?
During AutoML training, the model was provided as a tar package, which can be downloaded and executed locally using your framework’s container image. However, when deploying models on the cloud or edge, it is essential to optimize them for the specific platform and environment, such as Amazon Inferentia or Trainium instances.
To do this, you can utilize SageMaker Neo’s compilation jobs. In these jobs, you specify the model version from the model registry created earlier and indicate the target platform for the optimization process to run. A CLI sketch is shown below.
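A compilation job can be created as follows. The job name, role ARN, S3 paths, and input shape are hypothetical; the framework and target device must match your model and deployment hardware:

# Compile the model for an Inferentia (inf1) target; paths and ARNs are placeholders
aws sagemaker create-compilation-job \
    --compilation-job-name automl-fraud-neo \
    --role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole \
    --input-config '{
        "S3Uri": "s3://my-bucket/model/model.tar.gz",
        "DataInputConfig": "{\"data\": [1, 30]}",
        "Framework": "XGBOOST"
    }' \
    --output-config '{
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "ml_inf1"
    }' \
    --stopping-condition '{"MaxRuntimeInSeconds": 900}'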
Promote with Endpoint Configurations
✓ [Automation] [cont.] How do you compare a newly rolled-out model against the running version?
As mentioned earlier, to update an endpoint in SageMaker, you can clone the current endpoint configuration, make modifications, and then update the endpoint using the new configuration.
Publish a new version
For instance, if you want to introduce a new version, you can clone the existing configuration, add the new variant, and update the existing endpoint to use the new configuration.
You can configure the weights and capacity of the new variant to control how traffic is distributed between the new and old models, allowing you to roll out changes in SageMaker without any application downtime. A sketch of such a two-variant configuration follows.
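A minimal sketch of a configuration carrying both variants; the model names are hypothetical, and the 3:1 weights match the traffic split used below:

# Two production variants; 3:1 weights route ~25% of requests to the new model
aws sagemaker create-endpoint-config \
    --endpoint-config-name a-b-shadow \
    --production-variants '[{
        "VariantName": "variant-1",
        "ModelName": "automl-fraud-v1",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 3.0
    }, {
        "VariantName": "variant-2",
        "ModelName": "automl-fraud-v2",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0
    }]'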
Migrate traffic (A/B Testing / Canary / Blue Green / Rolling Deployment)
Once you have published an endpoint configuration with multiple variants, you can update the weights assigned to each variant, selectively migrating traffic to your new model version. From the UI, you can migrate all traffic at once or manage it manually in a fixed-step or canary fashion.
In the example below, I have adjusted the variant weights to a 3:1 ratio, which means that 25% of new requests will be routed to the new version of the model.
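Weights can also be adjusted on a live endpoint without publishing a new configuration, using the update-endpoint-weights-and-capacities call; the variant names below reuse those from the sketch above:

# Shift to a 3:1 split in place, without redeploying the endpoint
aws sagemaker update-endpoint-weights-and-capacities \
    --endpoint-name automl-deploy \
    --desired-weights-and-capacities '[
        {"VariantName": "variant-1", "DesiredWeight": 3},
        {"VariantName": "variant-2", "DesiredWeight": 1}
    ]'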
To automate a gradual rollout, you can use the update-endpoint CLI and API with a deployment configuration:
# Migrate 20% of traffic every 300 seconds
aws sagemaker update-endpoint \
    --endpoint-name automl-deploy \
    --endpoint-config-name a-b-shadow \
    --deployment-config '{
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "LINEAR",
                "LinearStepSize": {
                    "Type": "CAPACITY_PERCENT",
                    "Value": 20
                },
                "WaitIntervalInSeconds": 300
            },
            "TerminationWaitInSeconds": 600,
            "MaximumExecutionTimeoutInSeconds": 1800
        }
    }'
✓ [Collaboration] How can multiple data scientists understand the impact of their version before releasing it?
As seen above, you can effectively manage the traffic directed to a variant by controlling the weights. However, there are situations where data scientists want to evaluate the impact of a model without directly exposing it to users. In such cases, Amazon SageMaker offers “shadow testing.”
During shadow testing, SageMaker deploys a variant alongside the existing production model. Instead of directing actual user traffic to it, SageMaker sends a replica of real-world traffic to this new model. This allows data scientists to observe the behavior and performance of the version without affecting users.
Additionally, similar to production variants, you can have multiple shadow variants and fine-tune traffic directed to them using weight settings.
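Shadow variants are declared in the endpoint configuration next to the production variants. A minimal sketch with hypothetical model names; the ratio of the shadow weight to the production weight determines the share of traffic that is mirrored (here roughly half):

# Shadow responses are logged for comparison, never returned to callers
aws sagemaker create-endpoint-config \
    --endpoint-config-name shadow-test-config \
    --production-variants '[{
        "VariantName": "production",
        "ModelName": "automl-fraud-v1",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0
    }]' \
    --shadow-production-variants '[{
        "VariantName": "shadow",
        "ModelName": "automl-fraud-v2",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 0.5
    }]'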
✓ [Automation] Are there mechanisms to roll back or revert deployments if issues arise?
The update-endpoint CLI / API methods provide the ability to configure rollback settings. You can set up alarms based on metrics for your endpoint. If the specified alarm condition is triggered, SageMaker will automatically initiate a rollback to a previous configuration.
# Migrate traffic as before, rolling back automatically if the alarm fires
aws sagemaker update-endpoint \
    --endpoint-name automl-deploy \
    --endpoint-config-name a-b-shadow \
    --deployment-config '{
        "BlueGreenUpdatePolicy": {
            # Removed for brevity
        },
        "AutoRollbackConfiguration": {
            "Alarms": [
                {"AlarmName": "invocation_errors"}
            ]
        }
    }'
Size with Inference Recommender
✓ [Automation] How do you ensure that the deployed models can scale with increasing workloads?
Autoscaling
Out of the box, SageMaker lets you set up alarm-based autoscaling for instance types other than the burstable t2 family. In this example, we configure autoscaling to scale out when the average number of requests per minute to an instance exceeds 100; an equivalent CLI sketch is shown below.
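Endpoint autoscaling is backed by Application Auto Scaling. A sketch registering the variant as a scalable target and attaching a target-tracking policy at 100 invocations per instance; the endpoint and variant names reuse earlier assumptions:

# Register the variant as a scalable target (1 to 4 instances)
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id endpoint/automl-deploy/variant/variant-1 \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --min-capacity 1 \
    --max-capacity 4

# Scale out when average invocations per instance exceed 100 per minute
aws application-autoscaling put-scaling-policy \
    --policy-name automl-invocations-policy \
    --service-namespace sagemaker \
    --resource-id endpoint/automl-deploy/variant/variant-1 \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration '{
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        }
    }'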
Inference Recommender
Autoscaling lets you add more instances when there’s increased workload. However, it’s important to determine the ideal size for each instance. This is where the SageMaker Inference Recommender comes into play.
It assists in instance selection by offering recommendation jobs that help you choose the most suitable instance type and configuration (such as instance count, container parameters, and model optimizations) or serverless configuration (such as max concurrency and memory size).
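A recommendation job can also be launched from the CLI. A minimal sketch; the job name, role ARN, and versioned model package ARN from the model registry are placeholders:

# A Default job benchmarks candidate instance types against the model package
aws sagemaker create-inference-recommendations-job \
    --job-name automl-fraud-recommender \
    --job-type Default \
    --role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole \
    --input-config '{
        "ModelPackageVersionArn": "arn:aws:sagemaker:us-east-1:123456789012:model-package/automl-fraud/1"
    }'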
Record with SageMaker Inference
✓ [Governance & Compliance] How do you track the predicted decisions for auditability and accountability?
The SageMaker Endpoint also enables the recording of both the prediction request and response. This facilitates debugging, monitoring, and auditing, and, combined with SageMaker Clarify, it can be utilized for model explainability and drift detection. We will explore this topic further in the upcoming article on “Monitoring & Continuous Improvement”.
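Recording is switched on through the data capture settings of the endpoint configuration. A sketch capturing 100% of requests and responses; the S3 destination is a placeholder:

# Capture every request and response to S3 for auditing and later analysis
aws sagemaker create-endpoint-config \
    --endpoint-config-name automl-capture-config \
    --production-variants '[{
        "VariantName": "variant-1",
        "ModelName": "automl-fraud",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1
    }]' \
    --data-capture-config '{
        "EnableCapture": true,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://my-bucket/data-capture/",
        "CaptureOptions": [
            {"CaptureMode": "Input"},
            {"CaptureMode": "Output"}
        ]
    }'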
⚠️ Clean-up
If you have been following along with the hands-on exercises, make sure to clean up to avoid charges.
Delete Inference Endpoint
To avoid incurring costs, make sure to delete the endpoint and any other processing jobs.
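From the CLI, the resources created in the earlier sketches can be removed as follows (names assumed from above):

# Delete the endpoint first, then its configuration, then the model
aws sagemaker delete-endpoint --endpoint-name automl-deploy
aws sagemaker delete-endpoint-config --endpoint-config-name automl-deploy-config
aws sagemaker delete-model --model-name automl-fraud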
Summary
In summary, SageMaker enables data scientists to easily deploy and serve models to generate predictions. It allows them to monitor the impact of the new versions and respond using both manual and automated deployment techniques.
In the upcoming article, I will explore how AWS SageMaker can assist with model monitoring and automated retraining.