Automate Machine Learning using Databricks AutoML — A Glass Box Approach and MLFLow

Author(s): Niranjan Kumar

Automated Machine Learning

Databricks AutoML allows you to quickly generate baseline models and notebooks

Databricks recently announced Databricks AutoML at the Data + AI Summit 2021. In this article, we will use Databricks AutoML to automatically apply machine learning to a dataset and then deploy the resulting model to production behind a REST API.

Photo by Ben White on Unsplash

Table of contents:
1. Overview of Databricks AutoML.
2. Setting up the Azure Databricks environment.
3. Configuring AutoML in Azure Databricks.
4. Exploring the notebooks generated by AutoML.
5. Registering the model to the MLflow model registry.
6. Deploy the model using REST API.

Databricks AutoML

AutoML refers to the automation of repetitive tasks in building machine learning or deep learning models. AutoML tries to automate the tasks in the ML pipeline, such as data cleaning, feature engineering, handling of categorical features, and hyper-parameter tuning, with as little manual interaction as possible.

The main aim of AutoML is to make machine learning tools accessible to people who are not machine learning experts or who lack a technical background.

Databricks AutoML allows us to quickly build machine learning models by automating tasks such as data preprocessing, feature engineering, hyper-parameter tuning, and best-model selection. Databricks AutoML integrates with MLflow to register the best-performing model in the model registry for deployment (serving the model over a REST API).

AutoML workflow (Image Source: Databricks)

Setting up Azure Databricks Environment

To follow the hands-on part of this tutorial, you need a Microsoft Azure account. Don’t worry if you don’t have one; you can create it for free.

Go to the Microsoft Azure Portal and sign up for a free account. Once you have created your account, you will be credited with approximately 200 USD to explore Azure services for 30 days.

Creating Azure Databricks Service

Before we can open Databricks, we need to create an Azure Databricks service.

Select Databricks (Author Created)
Create Azure Databricks workspace (Author Created)
Databricks Pricing Tier (Author Created)
Azure Service (Author Created)

Launching Azure Databricks Service

To launch the Azure Databricks service, click on the name of the service (TrainingDB). It will open the home page of the service.

Azure Databricks Service Home Page (Author Created)

Once the service home page opens, select the Overview tab, where you will see a “Launch Workspace” button. Clicking that button launches Azure Databricks.

Azure Databricks Home Page (Author Created)

Configuring AutoML in Azure Databricks

I hope you have been able to follow the tutorial up to this point. Next, we will set up an AutoML experiment in Azure Databricks.

Select Machine Learning (Author Created)
Machine Learning Page (Author Created)

For this AutoML experiment, I used Kaggle’s Red Wine Quality dataset to predict whether a wine is of good or bad quality. To convert this into a classification problem, I transformed the quality feature into a boolean good_quality feature based on the quality score. If the quality score is greater than or equal to 7, good_quality is set to True; otherwise it is set to False.

Transformed Dataset (Author Created)
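The labelling rule above can be sketched in a few lines of plain Python. This is only an illustration, not the exact code I ran: the three rows are made up, most of the dataset's columns are omitted, and dropping the original quality column (so the label does not leak into the features) is an assumption. In a Databricks notebook the same rule is a one-liner on a pandas or Spark DataFrame.

```python
import csv
import io

# Three made-up rows in the layout of the Kaggle file (most columns omitted).
RAW_CSV = """fixed acidity,alcohol,quality
7.4,9.4,5
7.3,10.0,7
7.9,12.8,8
"""

QUALITY_THRESHOLD = 7  # from the article: a score of 7 or more counts as good


def add_good_quality(rows, threshold=QUALITY_THRESHOLD):
    """Replace the numeric `quality` column with a boolean `good_quality` column."""
    transformed = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are left untouched
        score = int(row.pop("quality"))  # assumption: the raw score is dropped
        row["good_quality"] = score >= threshold
        transformed.append(row)
    return transformed


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
labelled = add_good_quality(rows)
print([r["good_quality"] for r in labelled])  # prints [False, True, True]
```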

Creating Cluster in Databricks

To configure an AutoML experiment, we first need to create a cluster in Databricks. To create a cluster, hover your mouse over the left sidebar in the Databricks workspace and select Clusters. This opens the clusters home page, which lists any active clusters.

Clusters HomePage (Author Created)
Creating New Cluster (Author Created)

Loading the Data into Databricks

Upload the data file (Author Created)

Setting up AutoML Experiment

AutoML Configuration (Author Created)
Advanced configuration (Author Created)
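The experiment in this tutorial is configured through the UI, but Databricks also ships a Python API for AutoML. The sketch below shows roughly how the same experiment could be started from a notebook. The settings are assumptions that mirror this tutorial (good_quality as the target, a 60-minute timeout, F1 as the metric), and the databricks.automl package is only available on a Databricks ML runtime, so the import is kept inside the function.

```python
def automl_settings(target_col="good_quality", timeout_minutes=60, primary_metric="f1"):
    """Collect the experiment settings in one place (values are assumptions)."""
    return {
        "target_col": target_col,
        "timeout_minutes": timeout_minutes,
        "primary_metric": primary_metric,
    }


def run_automl_classification(dataset, settings=None):
    """Start an AutoML classification experiment on a Databricks ML cluster."""
    # Imported lazily: the package ships with the Databricks ML runtime only.
    from databricks import automl

    # Returns a summary object describing the trials that were run.
    return automl.classify(dataset, **(settings or automl_settings()))
```

In a notebook attached to an ML cluster, this would be called as run_automl_classification(df), where df is the DataFrame holding the transformed wine data.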

Exploring the notebooks generated by AutoML

Now that an hour has passed, AutoML has finished executing different combinations of model iterations.

AutoML Notebooks (Author Created)

AutoML is integrated with MLflow to track all the model parameters and evaluation metrics associated with each run. MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry[1].

MLflow makes it easy to compare two or more model runs and select the best model run among these for further iteration or pushing the model to production.
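The same comparison can be done in code with MLflow's search API. This is a minimal sketch, not something AutoML generates for you: the metric name val_f1_score is an assumption about what the runs logged, and the experiment ID has to be copied from your own experiment page.

```python
def best_run_order(metric="val_f1_score", higher_is_better=True):
    """Build the order_by clause that puts the best run first."""
    direction = "DESC" if higher_is_better else "ASC"
    return [f"metrics.{metric} {direction}"]


def find_best_run(experiment_id, metric="val_f1_score"):
    """Return the top run of an experiment as a pandas Series, or None if empty."""
    import mlflow  # imported lazily so the helper above works without MLflow

    runs = mlflow.search_runs(
        experiment_ids=[experiment_id],
        order_by=best_run_order(metric),
        max_results=1,
    )
    return None if runs.empty else runs.iloc[0]
```

The returned row carries the run_id, which is what you need later to register the model.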

Data Exploration in AutoML

Before we take a deep dive into the model runs, note that Databricks AutoML creates a basic data exploration notebook that gives a high-level summary of the data. Click on the “view data exploration notebook” option to open the notebook.

Data Exploration Notebook (Author Created)

Without a proper understanding of the data, no matter how many advanced algorithms we use to build models, the results may not be appropriate. Data exploration is therefore a very important phase in machine learning, but because it is time-consuming, data scientists often skip it.

Databricks AutoML saves time by creating a baseline data exploration notebook; data scientists can edit this notebook to expand on the data analysis techniques.

Correlation Plot (Author Created)

Internally, AutoML uses pandas-profiling to report correlations, missing values, and descriptive statistics for the data.
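To make those three ideas concrete, here is a small plain-Python sketch of the kind of numbers such a report contains. It is not the code AutoML generates, and the column values are made up; it only shows what "missing values", "descriptive statistics", and "correlation" mean for a single column or a pair of columns.

```python
import math
import statistics


def summarize(values):
    """Missing count plus basic descriptive statistics for one numeric column."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
    }


def pearson(xs, ys):
    """Pearson correlation of two equally long columns without missing values."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


alcohol = [9.0, 10.0, None, 12.0]  # made-up values with one missing entry
print(summarize(alcohol))
```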

Exploring the AutoML Run Notebook

Now that you have a good understanding of the data from the data exploration notebook, you may want to look at the AutoML model-building code.

Databricks AutoML displays the model results and provides an editable Python notebook with the source code for each trial run so that we can review or modify the code (e.g., create a new feature and include it in the model build). Under the source column on the experiment home page, you will see the reproducible notebook for each trial run. That is why it is called a Glass Box Approach: it allows you to look under the hood.

In order to open the source code for a run, click on the notebook icon under the source column for that run.

XGBoost Training Notebook (Author Created)

Each model in the AutoML runs is constructed from open source components, such as scikit-learn and XGBoost. They can be easily edited and integrated into the machine learning pipelines.

Data scientists or analysts can utilize this boilerplate code to jump-start the model development process. Additionally, they can use their domain knowledge to edit or modify these notebooks based on the problem statement.

Registering the model to the model registry

Before we can deploy our model for serving, we need to register the model in the MLflow model registry.

Model Registry is a collaborative hub where teams can share ML models, work together from experimentation to online testing and production, integrate with approval and governance workflows, and monitor ML deployments and their performance[2].

To register a model in the model registry, click on the run of your choice (I chose the best run, i.e., the top run), scroll down to the artifacts section, and click on the model folder.

Artifacts (Author Created)
Register Model (Author Created)
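The UI steps above have a programmatic equivalent in MLflow. The sketch below assumes the model was logged under the artifact path "model" (the folder we clicked on in the artifacts section); the run ID and the registered model name are placeholders you would supply yourself.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build the runs:/ URI that points at a model logged by an MLflow run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_run_model(run_id, registered_name):
    """Register the run's model and return the new model version."""
    import mlflow  # imported lazily so the URI helper works without MLflow

    return mlflow.register_model(model_uri_for_run(run_id), registered_name)
```

Calling register_run_model again with the same name creates a new version of the registered model rather than a second model.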

Now that we have registered our model in the model registry, click on the popup icon located at the top right corner of the artifacts section to open the model registry user interface.

Registered Models (Author Created)

Exploring Model Registry

The model registry provides information about the model, including its author, creation time, its current stage, and source run link.

Using the source link you can open the MLflow run that was used to create the model. From the MLflow run UI, you can access the source notebook link to view the backend code for creating the model[3].

Model Registry (User Interface)

Deploy the model using REST API

We are in the final section of this tutorial. In this section, we will discuss how to push the model into production and serve the model using REST API.

Changing Model Stage

Model Staging (Author Created)

In an enterprise setting, you typically work with multiple teams, such as the data science team and the MLOps team. A member of the data science team requests that the model be transitioned to staging, where all the model testing takes place.

In this tutorial, I will push the model directly to production for simplicity. To do so, select Transition to -> Production, enter your comment, and press OK in the stage transition confirmation window.

Stage Transition (Author Created)
Log (Author Created)
Wine Model Registry (Author Created)
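The stage transition can also be scripted with the MLflow client, which is handy when an MLOps team automates promotions. This is a hedged sketch: the model name and version are placeholders, and recent MLflow releases mark stages as deprecated in favour of aliases, so check the version on your cluster before relying on it.

```python
VALID_STAGES = ("None", "Staging", "Production", "Archived")


def check_stage(stage):
    """Fail early on a misspelled stage instead of waiting for the server."""
    if stage not in VALID_STAGES:
        raise ValueError(f"Unknown stage {stage!r}; expected one of {VALID_STAGES}")
    return stage


def promote(name, version, stage="Production"):
    """Move one version of a registered model to the given stage."""
    from mlflow.tracking import MlflowClient  # lazy: needs MLflow installed

    client = MlflowClient()
    return client.transition_model_version_stage(
        name=name, version=version, stage=check_stage(stage)
    )
```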

Model Serving

Model serving in Databricks is performed using MLflow model serving functionality. MLflow performs real-time model serving using REST API endpoints that are updated automatically based on the availability of model versions and their stages.

Serving (Author Created)

Predictions using Served Model

Serving Cluster Status (Author Created)

Testing Model Serving

Model Response (Author Created)
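Outside the UI, the served model is called with a plain HTTP POST. The sketch below only builds the request so you can inspect it before sending. Several things here are assumptions: the host and model name are placeholders, only two of the dataset's feature columns are shown, and the dataframe_split payload shape matches newer MLflow versions (older versions expected a different JSON layout, so compare with the example shown on your serving page).

```python
import json
import urllib.request


def build_scoring_request(workspace_url, model_name, stage, token, columns, rows):
    """Build a POST request for a model served under /model/<name>/<stage>/invocations."""
    url = f"{workspace_url.rstrip('/')}/model/{model_name}/{stage}/invocations"
    payload = {"dataframe_split": {"columns": columns, "data": rows}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scoring_request(
    workspace_url="https://example.azuredatabricks.net",  # placeholder host
    model_name="wine_quality",  # hypothetical registered model name
    stage="Production",
    token="<personal-access-token>",  # placeholder, never hard-code a real token
    columns=["fixed acidity", "alcohol"],  # truncated feature list
    rows=[[7.4, 9.4]],
)

# Sending the request needs a running serving cluster:
# with urllib.request.urlopen(request) as response:
#     predictions = json.load(response)
```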

Model URL

MLflow Model Serving on Azure Databricks – Azure Databricks – Workspace

Cluster Settings

Cluster Settings (Author Created)

Note:

Remember to stop both clusters before you log out of your Azure account.

There you have it: we have successfully configured the AutoML experiment and pushed the best model into production for real-time serving.

Photo by Alasdair Elmes on Unsplash

What’s Next?

Practice Practice

In this tutorial, we covered only a classification problem, but you can also try a regression problem with AutoML to get a complete understanding of model automation and of transitioning the model to production.

Conclusion

In this article, we started with an overview of Databricks AutoML. After that, we discussed how to set up and launch the Azure Databricks service inside the Azure portal. Then we saw how to configure the AutoML experiment using the red wine quality dataset. Finally, we learned how to transition the model stages in the model registry and enable model serving using the REST API.

Feel free to reach out to me via LinkedIn or Twitter if you have any issues completing this tutorial. I hope this article has helped you in getting started with AutoML and understanding how it works.

In my next blog, we will discuss how to set up PySpark on your local computer and get started with PySpark Hands-on. So make sure you follow me on Medium to get notified as soon as it drops.

Until next time Peace 🙂

NK.

References:

  1. https://mlflow.org/
  2. https://databricks.com/product/mlflow-model-registry
  3. MLflow Model Registry example
  4. MLflow Model Serving on Azure Databricks


Automate Machine Learning using Databricks AutoML — A Glass Box Approach and MLFLow was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
