Building a Simple Linear Regression Model on Azure to Predict Car Prices
Last Updated on September 19, 2024 by Editorial Team
Author(s): Julius Nyerere Nyambok
Originally published on Towards AI.
Hello there, gentle reader!
I am Nyerere Julius, an aspiring machine learning engineer and cloud solution specialist, deeply intrigued by the vast possibilities that AI and ML offer.
In this project, I'll guide you through building a simple machine learning project on Azure that can be a great addition to your portfolio.
Introduction
In this article, we'll create a simple Azure project that predicts car sale prices based on a set of features. We'll use the Azure platform, which is available through a free account. The steps that I took are as follows:
- Setting up required resources.
- Designing a complete machine learning test pipeline that ingests data, prepares it, trains a linear regression model, and scores and evaluates it.
- Creating a real-time inference pipeline that we can test on unseen data.
- Deploying a real-time endpoint that will allow us to use our model.
Here's the GitHub repository with the resources referenced throughout this article.
Setting up the resources
I'll strive to simplify the technical jargon that we use in the cloud space at every stage of this article. Azure projects reside in a resource group. A resource group is a logical container that allows you to organize, house, and manage resources. A workspace is a collaborative environment within a resource group that allows users to work on specific resources related to their professions/tasks, like data science and analysis.
After creating your free Azure subscription account, navigate to Azure's homepage and, in the search bar at the top, type "Azure Machine Learning", which is the resource we will use for this project. Since we haven't "housed" our resource in any workspace, navigate to the top left and click + Create, then New workspace.
Choose your subscription and create a new resource group to house our workspace (and resources), then click Review + Create. Azure products take a minute or two to deploy, so go and stretch. Once creation is complete, you'll be redirected to the home page, where you can select your newly created resource group, which should look like this.
You'll notice additional workspaces/resources deployed alongside your Azure Machine Learning workspace. These are background services that support your project and don't require interaction. Click on the Azure Machine Learning workspace, scroll to the bottom, and click the Launch Studio button to access the resource.
Resources are applications provided by cloud providers. Like any application, resources require infrastructure to run, just as your web browser requires a laptop to run on. This is where compute clusters come in. Clusters provide the processing power and virtual architecture required to run our resources. (Ever heard of virtual machines?) From the dashboard, navigate down and select Add Compute.
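If you prefer code to portal clicks, the same kind of cluster can be provisioned with the Azure ML Python SDK v2 (the azure-ai-ml package). This is a minimal sketch rather than part of the Designer workflow above; the subscription, resource group, workspace, and cluster names are placeholders:

```python
# Minimal sketch: provision a compute cluster with the Azure ML SDK v2.
# Requires: pip install azure-ai-ml azure-identity
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<your-subscription-id>",     # placeholder
    resource_group_name="<your-resource-group>",  # placeholder
    workspace_name="<your-workspace>",            # placeholder
)

# A small CPU cluster that scales down to zero nodes when idle,
# so you only pay while something is actually running.
cluster = AmlCompute(
    name="cpu-cluster",
    size="Standard_DS3_v2",
    min_instances=0,
    max_instances=2,
    idle_time_before_scale_down=120,  # seconds
)
ml_client.begin_create_or_update(cluster).result()
```

Setting min_instances to 0 is a sensible default for a portfolio project, since idle nodes are billed like any other virtual machine.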
Designing the test pipeline
Now that we have the necessary infrastructure, let's design a pipeline that will ingest data, clean and prepare it, train a model, and make predictions. On your dashboard, the sidebar on the left offers useful shortcuts for our project. Click on Designer in the left sidebar. You will be redirected to your pipeline designer dashboard. In the search bar, search for Automobile price data (Raw). This test dataset comprises the prices of various cars. Take a look at the features/columns and note the ones that require cleaning. Any hints as to which features require preparation when dealing with a linear regression model?
What you've done is add a module to your pipeline. A module is a reusable building block used in a pipeline to perform various related tasks. We need to select the relevant columns, clean the data by handling missing values, normalize the data, and encode categorical columns before our data is ready for any modeling.
After our mini data exploration, we realized several things need to be taken care of before submitting the first part of the pipeline:
- Add a "Select Columns in Dataset" module: the "normalized-losses" column is missing a significant amount of data, so we will eliminate it by selecting every column and then deselecting "normalized-losses".
- Add a "Clean Missing Data" module: the "bore", "stroke", and "horsepower" columns have a few missing values, so we will remove the rows where those values are empty.
- Add a "Normalize Data" module: we will scale the numerical columns to a specific range using MinMax scaling. The columns to be normalized can be found here. (A rough pandas equivalent of these three steps is sketched below.)
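For readers who want to see the logic outside Designer, here is a rough pandas/scikit-learn equivalent of the three modules above. The file name is hypothetical, and na_values="?" reflects how the underlying UCI automobile data encodes missing values:

```python
# Rough pandas equivalent of the three Designer modules above.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical local copy of the dataset; "?" marks missing values in the raw file.
df = pd.read_csv("automobile_prices_raw.csv", na_values="?")

# "Select Columns in Dataset": drop the sparsely populated column.
df = df.drop(columns=["normalized-losses"])

# "Clean Missing Data": remove rows missing bore, stroke, or horsepower.
df = df.dropna(subset=["bore", "stroke", "horsepower"])

# "Normalize Data": MinMax-scale the numeric columns into [0, 1].
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
```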
Click Configure & Submit. This acts as the first part of our pipeline. The pipeline will be submitted as a job. A job typically refers to a task or process executed within a pipeline or workflow. Navigate to Jobs on your sidebar, and this is what you'll see.
We need to encode our categorical columns next. Navigate back to our designer and follow these steps:
- Select an "Edit Metadata" module: this allows us to convert our categorical columns from strings to categories. Select the columns shown in figure 10.
- Select a "Convert to Indicator Values" module: this converts each categorical column into indicator (one-hot) columns. Select the columns shown in figure 11. (A pandas equivalent of both steps is sketched below.)
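Continuing the pandas sketch from above, these two modules map onto a categorical cast followed by one-hot expansion. For simplicity this sketch encodes every remaining string column, whereas the article selects the specific columns shown in figures 10 and 11:

```python
# "Edit Metadata": treat the string columns as categoricals.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
df[categorical_cols] = df[categorical_cols].astype("category")

# "Convert to Indicator Values": expand each category into indicator columns.
df = pd.get_dummies(df, columns=categorical_cols)
```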
Finally:
- Select a "Split Data" module: this module splits our data into train and test sets.
- Select a "Linear Regression" module: this module instantiates a linear regression model.
- Select a "Train Model" module: this module trains the linear regression model on the training data.
- Select a "Score Model" module: this module shows us scored predictions and metrics that indicate how well our model has performed. (A scikit-learn analogue of these four modules is sketched below.)
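These four modules correspond to a familiar scikit-learn workflow. Continuing the sketch from above (the 70/30 split ratio and the metric choices here are assumptions, not taken from the article):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price"])
y = df["price"]

# "Split Data": hold out a test set (70/30 is an assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# "Linear Regression" + "Train Model": fit on the training split.
model = LinearRegression().fit(X_train, y_train)

# "Score Model": predict on the held-out data and report common metrics.
preds = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, preds))
print("R^2:", r2_score(y_test, preds))
```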
Click Configure & Submit again. Once successful, check the results of your model in the Score Model module: click on Jobs, select your successful pipeline job, and right-click on the Score Model module.
Based on these scoring metrics, our model has performed reasonably well. We can move on to building the real-time inference pipeline.
Designing a real-time inference pipeline
A real-time inference pipeline in Azure refers to the series of steps involved in deploying a machine learning model and using it to make predictions on new data promptly. It is typically kept separate from the training pipeline to simulate how the model will perform in a production environment.
Navigate to your jobs and create a real-time inference pipeline as shown.
You will be redirected to the Designer section once creation succeeds. Remember that the whole point of an inference pipeline is to simulate how the model would perform if exposed to real-world/unseen data. This means we'll delete the dataset from the pipeline to simulate a real-world scenario where the pipeline receives unseen data. A valuable point to note is that our pipeline is not an "app" yet; we need some medium that will consolidate the results and complete our pipeline. These are the steps to take:
- Remove "Automobile price data (Raw)" and the "Convert to Dataset" module from the pipeline and add an "Enter Data Manually" module in their place: this acts as the data entry point to our real-time pipeline. You can find the data to paste here.
- Edit the "Select Columns in Dataset" module: remove the price column from the selected columns. Remember that this is a real-world scenario, so price is what we are trying to predict and thus doesn't exist in our input data.
- Add an "Execute Python Script" module and place it before the "Web Service Output": this script renames the score results to price, which is what we want displayed in our hypothetical app. (A sketch of the script is shown below.)
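The module expects a function named azureml_main that receives the upstream output as a DataFrame and returns a DataFrame. A sketch of the renaming script might look like this; "Scored Labels" is the column name Score Model typically produces, so check your own job output if it differs:

```python
import pandas as pd  # kept from the module's default template

def azureml_main(dataframe1=None, dataframe2=None):
    # Rename Designer's prediction column to "price" for the web service output.
    dataframe1 = dataframe1.rename(columns={"Scored Labels": "price"})
    return dataframe1
```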
Creating a real-time endpoint
We have completed our real-world simulation, and our model is ready for use. Click the three dots at the top right and click Deploy.
Deploy it as a new real-time endpoint and select Azure Container Instance as the compute type.
This will take about 20β30 minutes so you might have to be patient.
Once the deployment state becomes Healthy, you can consume your model from anywhere. Azure provides Python and R code snippets that you can integrate into your application to consume the model, as shown in Figure 19.
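As an illustration, calling the endpoint from Python might look like the sketch below. The scoring URI, key, and input record are placeholders; the endpoint's Consume tab shows the exact request schema and a ready-made snippet for your deployment:

```python
import requests

scoring_uri = "<your-endpoint-scoring-uri>"  # from the Consume tab
api_key = "<your-endpoint-key>"              # from the Consume tab

payload = {
    "Inputs": {
        # One record with the same feature columns the pipeline expects;
        # remaining columns are omitted here for brevity.
        "WebServiceInput0": [
            {"make": "toyota", "fuel-type": "gas", "horsepower": 0.32}
        ]
    },
    "GlobalParameters": {},
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

response = requests.post(scoring_uri, json=payload, headers=headers)
print(response.json())  # should include the renamed "price" column
```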
Donβt forget to decommission your Azure resources after use. Cloud resources are not cheap.
Thank you for taking the time to read this article.
Check out other projects I've done on my portfolio and GitHub.
Portfolio: Data Science Portfolio Version 1 (nyerere-data-scientist.carrd.co)
GitHub: Jnyambok (github.com)