Revolutionising Machine Learning: Achieving Top 4% in Kaggle with AutoGluon in Just 7 Lines of Code
Last Updated on December 11, 2023 by Editorial Team
Author(s): Daniel Voyce
Originally published on Towards AI.
Since starting a new Data Engineering role at Slalom _build, I realized I needed to refresh my ML experience, as it was a couple of years out of date. A couple of years in Data Engineering / Data Science is an eternity, and I figured there would have been a whole load of automation created for some of the more arduous ML tasks. I was delighted to see that AutoML is now being used regularly in ML workflows, so I wanted to try it out for myself.
Exploring AutoML: Simplifying Machine Learning
AutoML revolutionizes how machine learning models are developed. In traditional ML, experts engage in time-intensive tasks like data preprocessing, feature selection, and model tuning. AutoML automates these complex processes, significantly reducing the time and expertise needed to build effective models for tasks such as classification, forecasting, and regression. This innovation not only accelerates model development but also makes advanced ML more accessible to a wider audience, as anyone who knows Python can implement a model and get predictions very quickly!
Why is AutoML needed?
In short: efficiency. Even if there is an accuracy tradeoff (which, in my experience, is not the case, as I will explain later in my example), there is something to be said for being able to get predictions in a few lines of code. Additionally, because the process is automated, if you find the predictions are accurate enough for your use case, it can adapt to changing data and features as it is retrained, without any extra human input.
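To make that concrete: retraining with AutoGluon is just the original fit call run again on the newer data. Here is a minimal sketch, where the file name and the target label are placeholders rather than anything from the example later in this article.

import pandas as pd
from autogluon.tabular import TabularPredictor

# Placeholder: however you load the refreshed training data.
new_df = pd.read_csv("latest_snapshot.csv")

# Re-fitting repeats the whole automated pipeline (cleaning, feature
# engineering, model selection, ensembling) on the new data, with no
# manual changes to the code.
predictor = TabularPredictor(label="target").fit(new_df, presets="best_quality")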
How does it work?
AutoGluon performs a number of steps to automatically process and prepare your data for training. At a high level, these steps are:
- Data Loading and Inspection: AutoGluon expects the data to be in a tabular format, in this case a Pandas DataFrame with a target column specified. It also runs a few initial checks on the data, such as size, format, and the types of variables (numerical, categorical, text, etc.).
- Data Cleaning: AutoGluon identifies and imputes missing values using a number of strategies, such as filling them with the mean, median, or mode for numerical columns, or a special token for categorical columns. It also removes duplicates.
- Feature Engineering: It automatically performs one-hot, label, or more sophisticated encoding for categorical data, uses NLP to convert text into a numerical format that ML models can use, and expands date and time features into more useful items such as day of the week, month, etc.
- Data Transformation: It performs transformations like normalisation and scaling if necessary, depending on the algorithm, and might transform specific features using things like log transformations to fix skewed distributions.
- Feature Selection & Dimensionality Reduction: It automatically calculates feature correlation, removes features that do not contribute to the predictive power of the model, and reduces dimensionality using techniques like PCA (Principal Component Analysis).
- Data Splitting: AutoGluon can automatically split the data within itself to produce training, validation, and test sets.
- Model Selection: AutoGluon automatically tests various models on the split datasets it has created and ranks them according to the selected evaluation metric.
- Ensemble model creation: More often than not, it will create an ensemble model by stacking every model that doesn't have zero weight; you can read more about model stacking here: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
This list should be recognizable to many Data Engineers and Data Scientists: these are the steps usually taken when cleansing and preparing data for ML, which should give you an idea of how powerful AutoGluon is.
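If you want to run just the preprocessing part of this list yourself, AutoGluon exposes it as a standalone feature generator you can apply to a DataFrame. Below is a minimal sketch using AutoMLPipelineFeatureGenerator from autogluon.features; it illustrates the steps above rather than exactly reproducing what fit() does internally, and the file path is a placeholder.

import pandas as pd
from autogluon.features.generators import AutoMLPipelineFeatureGenerator

df = pd.read_csv("train.csv")  # placeholder: any tabular dataset
X, y = df.drop(columns=["SalePrice"]), df["SalePrice"]

# Runs the automated type inference, missing-value handling, and categorical
# encoding described above, returning a model-ready DataFrame.
feature_generator = AutoMLPipelineFeatureGenerator()
X_transformed = feature_generator.fit_transform(X=X, y=y)

print(X_transformed.dtypes.value_counts())  # see what the columns became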
What about the results? Is it better than "Manual ML"?
As a newly minted consultant, I have the perfect answer for this:
"It depends."
I will use the example of the Kaggle competition I entered:
House Prices - Advanced Regression Techniques
Predict sales prices and practice feature engineering, RFs, and gradient boosting
www.kaggle.com
The competition is about predicting house prices based on a number of property features; the data is presented as an 81-column CSV.
Testing "out of the box" performance with AutoGluon
To test the out-of-the-box performance of AutoGluon, I followed the documentation to build up the structure it requires and then trained the model using the best_quality preset.
1. Loading training data
I read the training data into a pandas DataFrame so I could view its structure; this is required for a tabular predictor (which we are using here).
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor
df = pd.read_csv("/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
df.head(20)
--- out ---
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
5 6 50 RL 85.0 14115 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv Shed 700 10 2009 WD Normal 143000
6 7 20 RL 75.0 10084 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 307000
7 8 60 RL NaN 10382 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Shed 350 11 2009 WD Normal 200000
8 9 50 RM 51.0 6120 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2008 WD Abnorml 129900
9 10 190 RL 50.0 7420 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 WD Normal 118000
10 11 20 RL 70.0 11200 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 129500
11 12 60 RL 85.0 11924 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 7 2006 New Partial 345000
12 13 20 RL NaN 12968 Pave NaN IR2 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 144000
13 14 20 RL 91.0 10652 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 New Partial 279500
14 15 20 RL NaN 10920 Pave NaN IR1 Lvl AllPub ... 0 NaN GdWo NaN 0 5 2008 WD Normal 157000
15 16 45 RM 51.0 6120 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv NaN 0 7 2007 WD Normal 132000
16 17 20 RL NaN 11241 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Shed 700 3 2010 WD Normal 149000
17 18 90 RL 72.0 10791 Pave NaN Reg Lvl AllPub ... 0 NaN NaN Shed 500 10 2006 WD Normal 90000
18 19 20 RL 66.0 13695 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 159000
19 20 20 RL 70.0 7560 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 5 2009 COD Abnorml 139000
20 rows × 81 columns
2. Specifying the label to target
The column we are predicting is SalePrice, so we set the label and look at some basic stats for that column:
label = 'SalePrice'
df[label].describe()
--- out ---
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
3. Train the model(s)
Training a model on tabular data in AutoGluon is very simple: you pass it the label and the dataframe and set it to work. It will then train and evaluate whatever models are in the preset group you specify (best_quality in this case):
predictor = TabularPredictor(label=label, path="/kaggle/working").fit(df, presets='best_quality')
--- out ---
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=1, num_bag_folds=8, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "/kaggle/working/"
AutoGluon Version: 0.8.0
Python Version: 3.10.10
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Sat Jun 24 10:55:41 UTC 2023
Disk Space Avail: 20.94 GB / 20.96 GB (99.9%)
Train Data Rows: 1460
Train Data Columns: 80
Label Column: SalePrice
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == int and many unique label-values observed).
Label info (max, min, mean, stddev): (755000, 34900, 180921.19589, 79442.50288)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 32451.21 MB
Train Data (Original) Memory Usage: 4.06 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 3 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 3 | ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
('int', []) : 34 | ['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', ...]
('object', []) : 43 | ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 40 | ['MSZoning', 'Alley', 'LotShape', 'LandContour', 'LotConfig', ...]
('float', []) : 3 | ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
('int', []) : 34 | ['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', ...]
('int', ['bool']) : 3 | ['Street', 'Utilities', 'CentralAir']
0.7s = Fit runtime
80 features in original data used to generate 80 features in processed data.
Train Data (Processed) Memory Usage: 0.52 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.73s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
...
AutoGluon training complete, total runtime = 1423.48s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/kaggle/working/")
The process takes around 30 minutes to train all of the different models.
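Once training finishes, it is worth looking at the leaderboard to see which models were fitted and how the stacked WeightedEnsemble_L3 from the log compares to its base models. A quick sketch (column names can vary slightly between AutoGluon versions):

# Rank every trained model by validation score. The score is RMSE with its
# sign flipped so that higher is better, as noted in the training log.
leaderboard = predictor.leaderboard(silent=True)
print(leaderboard[["model", "score_val", "fit_time"]].head(10))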
4. Running predictions
As Kaggle already provides the test data set for submission, we can simply use this and then submit the results:
test_data = TabularDataset('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
y_pred = predictor.predict(test_data)
y_pred.head()
--- out ---
Loaded data from: /kaggle/input/house-prices-advanced-regression-techniques/test.csv | Columns = 80 / 80 | Rows = 1459 -> 1459
WARNING: Int features without null values at train time contain null values at inference time! Imputing nulls to 0. To avoid this, pass the features as floats during fit!
WARNING: Int features with nulls: ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'GarageCars', 'GarageArea']
0 121030.359375
1 164247.546875
2 186548.234375
3 193435.875000
4 184883.546875
Name: SalePrice, dtype: float32
5. Submitting results
Submitting the results to Kaggle is quite easy (this was my first Kaggle competition, so it was all pretty new to me):
submission = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv')
submission['SalePrice'] = y_pred
submission
--- out ---
Id SalePrice
0 1461 121030.359375
1 1462 164247.546875
2 1463 186548.234375
3 1464 193435.875000
4 1465 184883.546875
... ... ...
1454 2915 79260.718750
1455 2916 82355.296875
1456 2917 165794.281250
1457 2918 111802.210938
1458 2919 212835.937500
1459 rows × 2 columns
submission.to_csv("/kaggle/working/submission.csv", index=False, header=True)
Results
Out of the box, AutoGluon scored 0.12082, placing 252nd out of 4847: top 5%, which is incredible for about 7 lines of actual code.
If we exclude the top 50 to 75 or so entrants (most of whom have cheated and gamed the system via data leakage)…
…then this is in the top 4%.
*All figures correct at the time of writing.
Can I improve on it?
My best entry came in at 0.11985, which placed 117th out of 4847: top 2.5%.
This used AutoGluon and some fairly basic normalization, feature engineering, and various things like excluding non-correlating pairs.
So my (admittedly rusty) expertise only improved the score by a measly 0.00097; I'm not sure this would be noticeable in a real-life project.
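For context, the tweaks were along these lines. This is a hypothetical reconstruction rather than the exact notebook code, and the correlation threshold and log transform are illustrative only.

import numpy as np

# Log-transform the skewed target so the models work on a more symmetric scale.
df["SalePrice"] = np.log1p(df["SalePrice"])

# Drop numeric features that barely correlate with the (log) sale price.
corr = df.corr(numeric_only=True)["SalePrice"].abs()
weak_features = corr[corr < 0.05].index.tolist()  # illustrative threshold
df = df.drop(columns=weak_features)

If you do log-transform the target, remember to apply np.expm1 to the predictions before building the submission file.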
Conclusion
You can view the notebook here to see for yourself: Version 8 had the best score and Version 1 was the out-of-the-box version (you can browse the code versions by selecting the three-dots menu and choosing the version you want).
notebookc8be23bae5
Explore and run machine learning code with Kaggle Notebooks | Using data from House Prices - Advanced Regression…
www.kaggle.com
There are obviously several people in that competition who have produced much more performant models and used the full spectrum of data science techniques to get there, but for sheer results vs time, AutoGluon is a winner in its own class.
AutoGluon is an incredibly powerful tool in your ML arsenal. Its ability to provide highly accurate forecasts with only a few lines of code is unmatched, in my opinion, as proven by the Kaggle competition. It won't outdo a very talented data scientist/engineer who is skilled in their own tools, but it also won't take weeks to get a very accurate prediction on data either.
About the author
Dan is a Principal of Data Engineering at Slalom, focusing on modernizing customers' data landscapes, machine learning, and AI.
He is a start-up veteran of over 20 years, with a specialty in building high-performance development teams and leading them to successfully deliver solutions for some of the largest names in Australia and the UK.
Published via Towards AI