Machine Learning With Azure’s Free Tier
Author(s): Ranganath Venkataraman
Originally published on Towards AI.
How I Continue to Stop Relying on My CPU and Use the Cloud, up to Azure's Free Tier Limits
I signed up for a free subscription with Azure after exhausting my free Udacity Nanodegree's permitted access. The free tier gives $200 of credits or 30 days, whichever runs out first, toward compute for running machine learning models. I spent these credits first on concluding my previous work on time series forecasting and then on the SECOM dataset, following Chapter 11 of Predictive Analytics with Microsoft Azure Machine Learning, Second Edition.
This article narrates these experiences. The original code is in this GitHub repo.
SECOM dataset
This example from page 226 of the book considers data from a semiconductor manufacturing process. Various signals are monitored to predict whether the final product fails (label of 1) or passes (label of -1).
Jupyter Notebook on my computer
I imported features and labels into Python and then filled null values with column means after observing that they comprised <1% of entries.
import pandas as pd

# Features
data = pd.read_csv('secom.data', sep=' ', header=None)
data.isna().sum()
data.fillna(data.mean(), inplace=True)

# Labels / target
labels = pd.read_csv('secom_labels.data', sep=' ', header=None)
labels.columns = ['Class', 'Time']
target = labels['Class']
Since the TPOT AutoML package combines convenient preprocessing and algorithm selection, I installed it with pip and used it to fit the given data and labels with 7-fold cross-validation.
%pip install tpot

from tpot import TPOTClassifier
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=7, shuffle=True)
model = TPOTClassifier(generations=5, population_size=50, cv=cv, scoring='f1_weighted',
                       verbosity=2, random_state=1, n_jobs=-1)

# perform the search
model.fit(data, target)

# export the best model
model.export('tpot_secom_best.py')
TPOT produced a pipeline of classifiers. After splitting the data and target dataframes into training and testing sets, I trained TPOT's pipeline and evaluated performance on the testing data; the exported pipeline code itself is in the repo.
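As a minimal sketch of that evaluation step, assuming a plain RandomForest as a stand-in for the exported pipeline (the actual pipeline lives in tpot_secom_best.py):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix

# split features and labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.3, random_state=1, stratify=target)

# stand-in for TPOT's exported pipeline, which ultimately selected a
# tree-based classifier; swap in the code from tpot_secom_best.py
pipeline = RandomForestClassifier(random_state=1)
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

print(f1_score(y_test, preds))                      # binary F1 for the failure class (label 1)
print(f1_score(y_test, preds, average='weighted'))  # weighted F1
print(confusion_matrix(y_test, preds))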
That F1 score of 0.11 is not good; interestingly, the weighted F1 score is 0.94. This tells us that there's significant room for improvement when predicting failures, a conclusion reinforced by the confusion matrix.
Disappointing TPOT performance drove me to try hyperparameter tuning using GridSearchCV on an XGBoost classifier, using scale_pos_weight to account for the imbalance in labels.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# base estimator, instantiated here for completeness
xgb = XGBClassifier()

xgb_params = {'reg_alpha': [0.1, 1, 10],
              'reg_lambda': [0.1, 1, 10],
              'scale_pos_weight': [7, 20, 40]}

cv = StratifiedKFold(n_splits=9, shuffle=True)
gsXGB = GridSearchCV(xgb, xgb_params, cv=cv, scoring='f1',
                     refit=True, n_jobs=5, return_train_score=True)
gsXGB.fit(data, target)

XGB_best = gsXGB.best_estimator_
gsXGB.best_score_
This led to a marginal improvement: an F1 score of 0.21. Tuning the learning rate offered some further improvement, based on the training and testing scores.
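As a sketch of that follow-up sweep, reusing the tuned estimator and cross-validation object from above (the learning-rate grid here is an assumption, not my exact settings):

from sklearn.model_selection import GridSearchCV

lr_params = {'learning_rate': [0.01, 0.05, 0.1, 0.3]}
gsLR = GridSearchCV(XGB_best, lr_params, cv=cv, scoring='f1',
                    refit=True, return_train_score=True)
gsLR.fit(data, target)

# compare training and testing scores across the grid
print(gsLR.best_params_)
print(gsLR.cv_results_['mean_train_score'])
print(gsLR.cv_results_['mean_test_score'])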
How does Azure ML compare with this? Let's follow the example on page 226 of the book.
Azure Designer
The book uses the Designer in Azure’s Machine Learning workspace to drag and drop datasets and modules.
This user-friendly low-code experience largely mirrors the workflow that I executed using a Jupyter notebook on my computer:
- remove missing data
- select the most salient features, as measured by Pearson correlation with the target labels. I tried anywhere from 40–50 features
- split the data with 70% used for training, the rest for test
- train boosted trees and SVMs using the training data; recall that TPOT ultimately selected a RandomForest classifier, which is also tree-based
- run hyperparameter tuning and score the best model using test data
- evaluate performance using metrics of choice: I picked the F1 score, though Azure's default appears to be the weighted score rather than the standard binary one. Final performance was slightly below that of the full-code experience because of minor variations in hyperparameter settings. A rough full-code analogue of these steps is sketched below.
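Here is that sketch, assuming ANOVA F-scores as a stand-in for the Designer's Pearson-correlation filter and scikit-learn estimators in place of the Designer modules:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# keep the 40-50 most salient features (k=45 is illustrative)
selector = SelectKBest(f_classif, k=45)
X_sel = selector.fit_transform(data, target)

# 70/30 train/test split, mirroring the Designer's split module
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sel, target, train_size=0.7, stratify=target, random_state=1)

# boosted trees and an SVM, as in the Designer pipeline
for model in (GradientBoostingClassifier(), SVC()):
    model.fit(X_tr, y_tr)
    score = f1_score(y_te, model.predict(X_te), average='weighted')
    print(type(model).__name__, score)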
Key take-away: this exercise and others in the book gave me some opportunities to continue trying different tools in Azure, seen next with time series forecasting.
What can be frustrating are the limits placed on the power of free compute (CPUs only) in the free tier and the fact that any compute becomes off-limits after 30 days or $200 spent. Given Microsoft's vast resources, it's not unreasonable to expect some marginal CPU or GPU capability on a sustained basis.
Time Series Forecasting with Azure ML notebook
The conclusion of my previous article on time series forecasting summarized the performance of various approaches including ARIMA, H2O.ai AutoML, and LSTM with/without feature extraction. Below are my observations with Azure ML's AutoML forecasting; the code used in Azure ML's Jupyter Notebook is in this repo.
After importing the data into Azure ML and making the necessary adjustments to get a dataframe that could be manipulated (see the Feature Engineering section of this article), I combined labels and features and then created a time series array.
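The combining step amounts to something like the line below; 'features' and 'labels' are hypothetical names for the engineered dataframes:

import pandas as pd

# concatenate the engineered features and labels column-wise
together = pd.concat([features, labels], axis=1)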
Since the original problem in the UCI repository doesn't include any timestamps, I used Python's datetime to create a time series that starts on January 1, 2021 and advances by one second per row, and called the new column 'time'.
import datetime
import numpy as np
import pandas as pd

base = datetime.datetime(2021, 1, 1)
# one timestamp per row: 132,240 rows, one second apart
arr = np.array([base + datetime.timedelta(seconds=i) for i in range(132240)])
together['time'] = pd.Series(arr, index=together.index)
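pandas offers an equivalent one-liner for the same index:

# same result with pandas' date_range helper
together['time'] = pd.date_range(start='2021-01-01', periods=len(together), freq='S')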
The time series forecasting option in Azure AutoML requires a little more work than conventional ML, as outlined in the helpful Microsoft docs. I first set the forecasting parameters by identifying the 'time' column created earlier and setting an arbitrarily chosen forecasting horizon of 100 seconds.
from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters_hydraulic_accum = ForecastingParameters(time_column_name='time',
                                                               forecast_horizon=100)
The next steps are to configure the experiment for the target variables and then run the experiment to observe performance. I completed these steps for hydraulic accumulation and pump leakage; the example below is for the latter. It establishes the target metric, a timeout to prevent excessive resource usage, and 5-fold cross-validation, and disables ensembling techniques, something I would change in retrospect.
import logging

from azureml.train.automl import AutoMLConfig

automl_config_leak = AutoMLConfig(task='forecasting',
                                  primary_metric='normalized_mean_absolute_error',
                                  experiment_timeout_minutes=25,
                                  enable_early_stopping=True,
                                  training_data=together,
                                  label_column_name="pump_leak",
                                  n_cross_validations=5,
                                  enable_ensembling=False,
                                  verbosity=logging.INFO,
                                  forecasting_parameters=forecasting_parameters_hydraulic_accum)

# Having set experiment parameters above, create a workspace for running the experiment.
from azureml.core.experiment import Experiment
from azureml.core import Workspace

ws = Workspace.from_config()

# Choose a name for the experiment and specify the project folder.
experiment_name = 'AutoTSForecasting_hyd'
project_folder = './sample_projects/automl-classification'
experiment = Experiment(ws, experiment_name)

from azureml.widgets import RunDetails

# Submit the pump-leakage configuration and monitor the run.
hydrun = experiment.submit(automl_config_leak, show_output=True)
RunDetails(hydrun).show()
hydrun.wait_for_completion(show_output=True)
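Once the run finishes, the best model can be pulled from the run and used to forecast over the 100-second horizon. A sketch, where 'test_data' is a hypothetical held-out dataframe with the same schema as the training data:

# retrieve the best child run and its fitted model
best_run, fitted_model = hydrun.get_output()

# forecast over the horizon; 'test_data' is a hypothetical held-out dataframe
y_pred, X_trans = fitted_model.forecast(test_data)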
Here's a snapshot of the results for pump leakage: AutoML achieved a mean absolute error of 0.06, unable to outperform the LSTM.
This exercise continues to emphasize the value of cloud compute, especially as my work demands increasingly large computing capability.