A Guide to MLOps in Production
Last Updated on January 1, 2023 by Editorial Team
Author(s): Prithivee Ramalingam
with Azure DevOps, Azure Kubernetes Service, and Azure Machine Learning
Introduction
Countless hours of organized effort are required to bring a model to the production stage, and the effort spent on all the previous steps pays off only if the model is successfully deployed. For deploying the model to production, we must consider an architecture that is secure, highly available, fault-tolerant, and capable of autoscaling.
If you have figured out where I am going with this, then kudos to you. Yes, Kubernetes is the ideal candidate for the job. But setting up a Kubernetes cluster is not an easy task: we have to set up and configure load balancers, ingresses, role-based authentication, virtual machines, network policies, and so on. This requires a considerable amount of time that could be better spent perfecting our model. That's why Kubernetes as a Service (KaaS) is the preferred option under these circumstances.
This article illustrates the whole lifecycle of MLOps, from training the model to deploying it in a production environment. The whole process is automated: the CI pipeline is triggered whenever we make a code change, and the CD pipeline is triggered whenever a new artifact is available or the CI pipeline completes successfully. In this way, new functionality can be deployed with a single commit. The image below provides a high-level overview of the whole process.
If you need an end-to-end example of a CI/CD pipeline for development and QA environments, you can refer to my article: End-to-end example of CI-CD pipeline using Azure Machine Learning | by Prithivee Ramalingam | Apr, 2022 | Towards Dev
Content
- Prerequisites.
- Code walkthrough.
- Creating a CI pipeline in Azure DevOps to train the model and publish the artifact.
- Creating a CD pipeline to deploy the model in Azure Kubernetes Service (AKS).
- Testing the model deployed in AKS.
1. Prerequisites:
1.1 Azure DevOps account
1.2 Azure Machine Learning resource
1.3 Azure Kubernetes Service (AKS) cluster
1.4 Resource Manager connection
1.1 Azure DevOps account
Azure DevOps is a Microsoft product that provides version control, project management, automated builds, testing, and release management capabilities. To create an Azure DevOps account, please follow these instructions.
In a production setup, automation is preferred over manual intervention. We need to automate the process of creating builds and deploying them to a Kubernetes cluster, and for this we use the pipelines provided by Azure DevOps. We create a Continuous Integration (CI) pipeline to train and package the model and a Continuous Deployment (CD) pipeline to deploy the model to Kubernetes.
1.2 Azure Machine Learning resource
Azure Machine Learning is a cloud service for accelerating and managing the machine learning project lifecycle. Azure Machine Learning, together with MLflow, provides a centralized repository to host Docker images, runs, environments, and artifacts. It also provides functionality to create compute clusters and instances. For additional information about Azure Machine Learning, you can refer to my article https://medium.com/mlearning-ai/features-and-capabilities-of-azure-machine-learning-868eb3b4d333.
In this production setup, we use Azure ML for logging metrics, saving model artifacts, creating an AKS cluster, and deploying the model to that cluster. When an Azure Machine Learning workspace is created, a Blob storage account, a Key Vault, a Container Registry, and an Application Insights resource are created along with it.
1.3 Azure Kubernetes Service (AKS) cluster
Azure Kubernetes Service (AKS) offers the quickest way to start developing and deploying cloud-native apps in Azure. With the abstraction provided by Azure Machine Learning, we can manage deployment in AKS by configuring just a few variables. We can attach an existing AKS cluster or create a new one with Azure ML.
1.4 Resource Manager connection
A Resource Manager connection is essential for automating model training, model packaging, model deployment, and so on. Azure ML centralizes the whole MLOps process, so we need to access it securely to perform these steps. One way is to go to the Azure ML workspace and set up the resources manually, but to do it in an automated manner we need a Resource Manager connection, which lets us manage the resources in Azure ML from Azure DevOps.
We create the Resource Manager connection in Azure DevOps. The scope can be assigned at the subscription level, the management group level, or the Machine Learning workspace level. We can also limit which pipelines can access this service principal.
Project Settings -> Service Connections -> New Service Connection -> Azure Resource Manager -> Service Principal (automatic) -> Scope level
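If you prefer scripting over the portal, the service principal behind such a connection can also be created with the Azure CLI and then supplied to a manual (service principal) service connection. This is only a hedged sketch; the connection name, subscription ID, and resource group below are placeholders:
# Hypothetical example: create a service principal scoped to the resource group
# that holds the Azure ML workspace (name, subscription ID, and group are placeholders).
az ad sp create-for-rbac \
  --name "mlops-devops-connection" \
  --role Contributor \
  --scopes /subscriptions/<subscription-id>/resourceGroups/<ml-resource-group>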
2. Code walkthrough.
After creating all the required prerequisites, the next step is to write the code for training and inference and to push it into a version control system like Git or Azure DevOps. You can access the source code here.
2.1 Training script
2.2 Inference scripts
2.1 Training script
The training code consists of training.py and the files in the environment_setup directory. The environment_setup directory contains install-requirements.sh, conda_dependencies.yml, and the runconfig file.
install-requirements.sh -> Contains the dependencies that have to be installed on the agent.
conda_dependencies.yml -> Contains the dependencies that have to be installed on the compute target. The dependencies are abstracted as environments.
runconfig file (titanic_survival_prediction.runconfig) -> The driver file for the whole training logic. During execution, the runconfig file is read first. From there, information such as the location of the training.py file, environment details, Docker image details, and the location of the conda dependency file is obtained.
framework: Python
script: training.py
communicator: None
autoPrepareEnvironment: true
maxRunDurationSeconds:
nodeCount: 1
environment:
  name: titanic_prediction_environment
  python:
    userManagedDependencies: false
    interpreterPath: python
    condaDependenciesFile: .azureml/conda_dependencies.yml
    baseCondaEnvironment:
  docker:
    enabled: true
    baseImage: mcr.microsoft.com/azureml/base:intelmpi2018.3-ubuntu16.04
    sharedVolumes: true
    gpuSupport: false
    shmSize: 1g
    arguments: []
history:
  outputCollection: true
  snapshotProject: true
  directoriesToWatch:
  - logs
training.py -> Contains the logic for getting the data, feature engineering, feature selection, model training, and uploading the generated model to the run. The newly created model is stored in the directory provided by the 'model_path' parameter, but it exists in that folder only as long as the agent exists. So, we upload it to the artifact location provided by the 'artifact_loc' parameter.
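To make the flow concrete, below is a minimal sketch of what training.py can look like. It is not the exact code from the repository: the CSV file name, feature engineering, and model choice are illustrative assumptions, while the model_path and artifact_loc arguments mirror the CI task shown later in this article.
import argparse
import os
import joblib
import pandas as pd
from azureml.core import Run
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, default="./models/titanic_model.pkl")
parser.add_argument("--artifact_loc", type=str, default="./outputs/models/")
args = parser.parse_args()

run = Run.get_context()  # handle to the current experiment run

# Illustrative data loading and feature engineering (assumes titanic.csv is available)
df = pd.read_csv("titanic.csv")
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
df = df.dropna(subset=features + ["Survived"])

X_train, X_test, y_train, y_test = train_test_split(df[features], df["Survived"], test_size=0.2)
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
run.log("test_accuracy", model.score(X_test, y_test))  # logged to the experiment in Azure ML

# Save the model locally (exists only while the compute is alive), then upload it to the run
os.makedirs(os.path.dirname(args.model_path), exist_ok=True)
joblib.dump(model, args.model_path)
run.upload_file(name=args.artifact_loc + "titanic_model.pkl", path_or_stream=args.model_path)
run.complete()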
2.2 Inference scripts
The inference scripts are placed in the deployment folder.
conda_dependencies.yml -> Contains the dependencies needed to run the score.py file.
inferenceConfig.yml -> The driver file for inference. It holds the path to the scoring file and the dependencies file, along with the runtime used.
score.py -> Contains the inference logic. It has two mandatory functions, init() and run(). The init() function is called only once, when the service is created, and holds the model loading logic. The run() function receives the data from the API call, performs analysis and inference on the data, and returns the response. (A minimal sketch of score.py appears after the AKS deployment config below.)
aciDeploymentConfig.yml -> Contains the configuration to deploy the model as an Azure Container Instance (ACI).
aksDeploymentConfig.yml -> Contains the configuration to deploy the model to an Azure Kubernetes Service (AKS) cluster. For more information about the parameters, please refer to this.
computeType: AKS
autoScaler:
  autoscaleEnabled: True
  minReplicas: 3
  maxReplicas: 6
  refreshPeriodInSeconds: 1
  targetUtilization: 50
authEnabled: True
containerResourceRequirements:
  cpu: 0.5
  memoryInGB: 1
appInsightsEnabled: True
scoringTimeoutMs: 60000
maxConcurrentRequestsPerContainer: 2
maxQueueWaitMs: 180000
sslEnabled: False
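For orientation, here is a minimal, hedged sketch of what score.py can look like. It assumes joblib and scikit-learn are available in the inference environment and that the pickled model sits under the directory exposed by the AZUREML_MODEL_DIR environment variable; the exact sub-path depends on how the model was registered, and the feature order mirrors the test payloads used later in this article.
import json
import os
import joblib
import numpy as np

def init():
    # Called once when the service starts: load the model from the mounted model directory
    global model
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "titanic_model.pkl")
    model = joblib.load(model_path)

def run(raw_data):
    # Called for every request: parse the JSON payload, predict, and return the result
    data = json.loads(raw_data)
    features = np.array([[data["Pclass"], data["Sex"], data["Age"],
                          data["SibSp"], data["Parch"], data["Fare"], data["Embarked"]]])
    return {"prediction": int(model.predict(features)[0])}
The inferenceConfig.yml referenced above typically just ties this scoring script to its conda file; a hedged example of its shape is:
entryScript: score.py
runtime: python
condaFile: conda_dependencies.yml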
3. Creating a CI pipeline in Azure DevOps to train the model and publish the artifact.
3.1 Agents
3.2 Variables
3.3 Tasks
3.4 Triggers
3.1 Agents
For running the CI/CD pipelines, we require agents. Agents are compute instances that run the pipeline; they can be self-hosted agents or Microsoft-hosted Azure Pipelines agents. We can also choose the OS of the agent from different versions of Windows, Linux, and macOS. It is also possible to run pipelines in parallel on agents.
3.2 Variables
Inside the application, we might have to access multiple databases and services, which in turn require usernames, passwords, and endpoint credentials. Including these alongside the code is not secure, so Azure DevOps provides variables as a secure way of storing and accessing credentials.
We define the credentials as key-value pairs and access them at runtime. Variable groups are collections of variables that are specific to a particular pipeline or a group of pipelines. A variable group needs to be linked to a pipeline in order to be accessed, as shown in the example below.
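As a quick illustration, once a variable group containing keys such as ml.resourceGroup and ml.workspace is linked to the pipeline, its values are expanded with the $(variableName) macro syntax used throughout the tasks below; for example, in an inline Bash script:
# Illustrative inline script: $(...) expands linked pipeline variables at runtime
echo "Training in workspace $(ml.workspace) of resource group $(ml.resourceGroup)"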
3.3 Tasks
A task is an abstraction provided by Azure Pipelines to perform an action. Tasks can be used to build an app, run a tool, perform an installation, construct builds, and so on. Multiple tasks constitute a pipeline, and they are executed step by step based on their specifications. The CI pipeline consists of multiple tasks that end with the model being published as an artifact.
Task 1: Use Python 3.x
Explanation: We need to install a Python environment for running the Python scripts.
Task 2: Install dependencies (Bash)
Type: File path
Explanation: The install-requirements.sh file has the dependencies that are required to run the scripts.
environment_setup/install-requirements.sh
Task 3: Azure CLI ML installation (Azure CLI)
Include Resource Manager connection
Script type: Shell
Script location: Inline script
Explanation: The Azure CLI ML extension is required because all the Azure ML commands are executed through it.
az extension add -n azure-cli-ml
Task 4: Create compute target (Azure CLI)
Include Resource Manager connection
Script type: Shell
Script location: Inline script
Explanation: We need access to a compute target to run the training script. Based on our requirements, we can choose between CPU- and GPU-enabled compute targets. We can define the min and max nodes for the compute target to handle load. To save cost, we can also use the idle-seconds-before-scaledown flag, which stops the compute once it has been idle for the specified period. Finally, we assign a system identity to the compute.
az ml computetarget create amlcompute -g $(ml.resourceGroup) \
  -w $(ml.workspace) \
  -n $(ml.computeName) \
  -s $(ml.computeVMSize) \
  --min-nodes $(ml.computeMinNodes) \
  --max-nodes $(ml.computeMaxNodes) \
  --idle-seconds-before-scaledown $(ml.computeIdleSecs) \
  --assign-identity '[system]'
Task 5: Create model and metadata folders (Bash)
Type: Inline
Explanation: The 'metadata' and 'models' folders are created as dumping grounds for the metadata generated by the runs and for the models themselves.
mkdir metadata && mkdir models
Task 6: Training the model (Azure CLI)
Include Resource Manager connection
Script type: Shell
Script location: Inline script
az ml run submit-script -g $(ml.resourceGroup) \
  -w $(ml.workspace) \
  -e $(ml.experimentName) \
  --ct $(ml.computeName) \
  -c titanic_survival_prediction \
  --source-directory . \
  --path environment_setup \
  -t ./metadata/run.json \
  training.py \
  --model_path ./models/titanic_model.pkl \
  --artifact_loc ./outputs/models/
Explanation: We train the model on the compute target that we created earlier. The -c flag refers to the run configuration file, which primarily acts as the template for training. The --source-directory flag points to the location of training.py. We provide the experiment name in this step; an experiment is created accordingly, and the metrics and dependencies are logged within it.
model_path -> Path to store the model
artifact_loc -> Artifact location where the model is uploaded
Every training script is run as an experiment, and each experiment run has a run ID. We dump the model into the 'models' folder that we created in the previous step. After dumping the model at the specified model path, we upload it to the artifact location. This is done so that we can find all the related files in that particular run of the model; it will be available in the workspace, and we can download the model from outputs/models/titanic_model.pkl.
After the model is uploaded, the generated run.json file is kept in the metadata folder. The run.json file contains information such as the experiment name, the experiment ID, the details of the person who created the model, the ID of the compute target, the git URL, the MLflow ID, and the MLflow repo URL.
Task 7: Register the model in the model registry (Azure CLI)
Include Resource Manager connection
Script type: Shell
Script location: Inline script
Explanation: We register the model from the artifact location into the model registry. With the model in the model registry, we can access it as many times as we want, and versioning can be done efficiently. We provide a name for the model so that all versions of the model are placed under the same registry entry. We can also add tags to the model being registered. Finally, a model.json file is produced as output; it contains the modelId, workspaceName, and resourceGroupName.
az ml model register -g $(ml.resourceGroup) \
  -w $(ml.workspace) \
  -n Titanic \
  --asset-path outputs/models/ \
  -d "Titanic Survival Prediction" \
  --tag "model"="Titanic Survival Prediction model" \
  --model-framework Custom \
  -f ./metadata/run.json \
  -t metadata/model.json
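The resulting model.json is small. An illustrative (not verbatim) example, assuming this is version 1 of the registered Titanic model, looks roughly like this:
{
  "modelId": "Titanic:1",
  "workspaceName": "<your-workspace>",
  "resourceGroupName": "<your-resource-group>"
}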
Task 8: Copy Files to $(Build.ArtifactStagingDirectory) (Copy Files)
Source Folder: $(Build.SourcesDirectory)
Contents: **/metadata/*
**/environment_setup/*
**/deployment/*
Target Folder: $(Build.ArtifactStagingDirectory)
Explanation: All the code and intermediate folders are present in the source directory, but we need these folders in the artifact staging directory so they can be used by the CD pipeline as an artifact.
Task 9: Publish pipeline artifact
File or directory path: $(Pipeline.Workspace)
Artifact publish location: Azure Pipelines
Artifact name: titanic_pred_artifact
Keep custom properties empty
Explanation: This publishes the staged files, including the metadata, as a pipeline artifact so the CD pipeline can consume them.
Task 10: Delete compute target (Azure CLI)
Include Resource Manager connection
Script type: Shell
Script location: Inline script
Explanation: After model training, we can delete the compute target that we used to train the model.
az ml computetarget delete -g $(ml.resourceGroup) \
  -w $(ml.workspace) \
  -n $(ml.computeName)
3.4 Triggers
Triggers enable the CI pipeline to run automatically in response to an action; in other words, if there is a new commit, the CI pipeline runs and creates new models. We can select the branch that needs to be watched, and it does not have to be the main branch. To set up a trigger for the CI pipeline, we enable continuous integration on the pipeline.
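In the classic editor, this is the 'Enable continuous integration' checkbox on the Triggers tab. If the pipeline is defined in YAML instead, a hedged equivalent of the same trigger (assuming the watched branch is main) looks like this:
trigger:
  branches:
    include:
    - main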
4. Creating a CD pipeline to deploy the model in Azure Kubernetes Service (AKS).
After the CI pipeline finishes successfully, the CD pipeline can be triggered to run automatically. The diagram below gives a high-level overview of the CD pipeline. We can't deploy a model straight to production (AKS), so we first deploy it in dev (ACI). When all the tests pass in dev, we move to production.
Task 1: Use Python 3.x
Task 2: Azure CLI ML installation (Azure CLI)
Include Resource Manager connection
Script type: Shell
Script location: Inline script
az extension add -n azure-cli-ml
Task 3: Create AKS cluster (Azure CLI)
Include Resource Manager connection
Script type: Shell
Script location: Inline script
az ml computetarget create aks -g $(ml.resourceGroup) \
  -w $(ml.workspaceName) \
  -n $(aks.clusterName) \
  -s $(aks.vmSize) \
  -a $(aks.agentCount)
Explanation: We create an Azure Kubernetes Service cluster by specifying the type of VM required and the number of agents. The number of agents should be at least 3. In Azure ML parlance, agents are the equivalent of nodes. There is usually a quota of 10 virtual CPUs, so a bit of configuration is required if we want higher-end VMs with many virtual CPUs.
Alternative: Instead of creating the AKS cluster through the Azure CLI, we can also use Azure ML to create the cluster. Under the Compute option, we go to Inference clusters and create a new cluster by specifying the number of nodes and their specifications (RAM, cores, storage).
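The same cluster can also be provisioned from the Azure ML Python SDK. The sketch below uses azureml-core with placeholder names and assumes a workspace config.json is available; it is illustrative rather than the exact setup used here.
from azureml.core import Workspace
from azureml.core.compute import AksCompute, ComputeTarget

ws = Workspace.from_config()  # assumes config.json for the workspace is present

# AKS inference clusters need at least 3 agents (nodes); name and VM size are placeholders
prov_config = AksCompute.provisioning_configuration(vm_size="Standard_D3_v2", agent_count=3)
aks_target = ComputeTarget.create(workspace=ws, name="titanic-aks",
                                  provisioning_configuration=prov_config)
aks_target.wait_for_completion(show_output=True)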
Task 4: Deploy the model (Azure CLI)
Include Resource Manager connection
Script type: Shell
Script location: Inline script
Working Directory: $(System.DefaultWorkingDirectory)/_Titanic Pred CI/titanic_pred_artifact/a/deployment
az ml model deploy -g $(ml.resourceGroup) \
  -w $(ml.workspace) \
  -n $(ml.serviceName) \
  -f ../metadata/model.json \
  --dc aksDeploymentConfig.yml \
  --ic inferenceConfig.yml \
  --description "Titanic Prediction model deployed in AKS" \
  --overwrite
Explanation: We are currently in the deployment folder, as specified in the working directory. The inferenceConfig.yml and aksDeploymentConfig.yml files are present in the deployment folder, so we reference them as they are. We also need model.json, which is in the metadata folder, so we provide the relative path to it.
5. Testing the model deployed in AKS.
5.1 Functionality testing
5.2 Stress testing
After the model is deployed to the AKS cluster, we need to test the service to find any shortcomings. In this example, we perform functionality testing and stress testing.
5.1 Functionality testing
A functionality test checks whether the model is performing as expected. Multiple test cases have to be written to cover edge cases and to check execution time.
import requests
import json

url = "<Enter AKS url here>"
payload = json.dumps({
    "Pclass": 1,
    "Sex": 0,
    "Age": 38,
    "SibSp": 1,
    "Parch": 0,
    "Fare": 71.2833,
    "Embarked": 2
})
headers = {
    'Content-Type': 'application/json'
}
# Note: authEnabled is True in the AKS deployment config, so an Authorization
# header with the service key may also be required when calling the endpoint.
response = requests.post(url, headers=headers, data=payload)
response_data = json.loads(response.text)  # parse the JSON body, not the Response object
if response_data['prediction'] == 1:
    print("The prediction is correct")
else:
    print("The prediction is wrong")
5.2 Stress testing
Since we have deployed the model in a Kubernetes cluster, we should be able to handle multiple requests at the same time. So, to check that scaling (upscaling and downscaling) happens as expected and that the model is highly available, we perform stress testing. Stress testing involves sending many requests at the same time, more than we would receive under normal circumstances. If the service cannot handle all the requests, we have to change the parameters in the AKS config file to suit our needs.
import asyncio
import json
import sys
import aiohttp

async def do_post(session, url, x):
    # Send one POST request to the scoring endpoint and return the raw response body
    async with session.post(url, data=json.dumps(x), headers={"Content-Type": "application/json"}) as response:
        data = await response.text()
        return data

async def main_func():
    url = '<Enter the AKS url here>'
    resp_ls = [{"Pclass": 3, "Sex": 1, "Age": 34.5, "SibSp": 0, "Parch": 0, "Fare": 7.8292, "Embarked": 2},
               {"Pclass": 3, "Sex": 0, "Age": 40, "SibSp": 1, "Parch": 1, "Fare": 2.56, "Embarked": 0},
               {"Pclass": 3, "Sex": 0, "Age": 12, "SibSp": 1, "Parch": 2, "Fare": 300, "Embarked": 1},
               {"Pclass": 3, "Sex": 1, "Age": 18, "SibSp": 1, "Parch": 5, "Fare": 4.1, "Embarked": 0},
               {"Pclass": 3, "Sex": 0, "Age": 71, "SibSp": 0, "Parch": 1, "Fare": 72, "Embarked": 1},
               {"Pclass": 3, "Sex": 1, "Age": 26, "SibSp": 0, "Parch": 0, "Fare": 1.8, "Embarked": 2}]
    async with aiohttp.ClientSession() as session:
        post_tasks = []
        # prepare the coroutines that post
        for req in resp_ls:
            post_tasks.append(do_post(session, url, req))
        # now execute them all at once
        val = await asyncio.gather(*post_tasks)
        return val

# The selector event loop policy is only needed (and only available) on Windows
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
val = asyncio.run(main_func())

failed_request_count = 0
for resp in val:
    try:
        d = json.loads(resp)
        print(d)
    except json.JSONDecodeError:
        # Non-JSON responses (errors, timeouts) are counted as failed requests
        failed_request_count += 1
print("Number of requests sent", len(val))
print("Number of requests failed", failed_request_count)
Conclusion
This article shows the step-by-step process for handling MLOps in production. Azure DevOps was used to automate the CI and CD pipelines, Azure Machine Learning was used as a centralized resource for handling MLOps, and Azure Kubernetes Service was used to deploy the created model into production. This method is highly scalable, and new models can be rolled out to production quickly.
References
- Azure MLOps: DevOps for Machine Learning (YouTube)
- MLOps with Azure: Hands-on Session (YouTube)
- Create a deployment config for deploying an AKS web service: aks_webservice_deployment_config (azuremlsdk)
Want to Connect?
If you have enjoyed this article, please follow me here on Medium for more stories about machine learning and computer science.
LinkedIn: Prithivee Ramalingam | LinkedIn