Two’s Company, Three’s an Ensemble
Author(s): Moshe Sipper, Ph.D.
Originally published on Towards AI.
Ensemble techniques — wherein a predictive model is composed of multiple (possibly) weaker models — are prevalent nowadays within the field of machine learning (ML) and deep learning (DL). Well-known methods such as bagging, boosting, and stacking are ML mainstays, widely (and fruitfully) deployed on a daily basis.
Generally speaking, there are two types of ensemble methods:
- Sequential methods, which generate models one after another; AdaBoost and gradient boosting are examples.
- Parallel methods, which generate models independently of one another; random forests and evolutionary algorithms are examples.
I’ve written about some of these techniques in a number of Medium stories. If you’re interested in evolutionary algorithms, you’re welcome to read:
Evolutionary Algorithms, Genetic Programming, and Learning
Evolutionary algorithms are a family of search algorithms inspired by the process of (Darwinian) evolution in Nature…
And if you’re interested in gradient boosting, I’ve penned this story:
Strong(er) Gradient Boosting
The idea of boosting in machine learning is based on the question posed by Michael Kearns and Leslie Valiant in…
Here I’d like to discuss two papers of mine with some neat ensemble “tricks”.
In “Conservation Machine Learning: A Case Study of Random Forests” we took a look at… random forests.
A random forest (RF) is an oft-used ensemble technique that employs a forest of decision trees, each trained on a different sub-sample of the dataset, and with random subsets of features for node splits. An RF uses majority voting (for classification problems) or averaging (for regression problems) to make a final prediction. The use of a forest of trees, rather than a single one, significantly helps improve predictive accuracy and mitigate over-fitting.
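To make the voting idea concrete, here is a minimal sketch comparing a single decision tree with a 100-tree forest, assuming scikit-learn and a stock dataset chosen purely for illustration:

```python
# Minimal sketch: a lone decision tree vs. a random forest of 100 trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)
# Each tree in the forest sees a bootstrap sample and a random feature subset at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("single tree:", cross_val_score(tree, X, y, cv=5).mean())
print("forest of 100:", cross_val_score(forest, X, y, cv=5).mean())
```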
We took RFs and basically pushed the ensembling notion one step further, presenting conservation machine learning, wherein models are saved across multiple runs, users, and experiments. Conservation ML is essentially an “add-on” meta-algorithm that can be applied to any collection of models (or even sub-models), however they were obtained: via ensemble or non-ensemble methods, collected over multiple runs, gathered from different modelers, intended a priori to be used in conjunction with others, or simply plucked a posteriori.
So how does conservation ML work with random forests?
Well, begin by amassing a collection of models — through whatever means. In our case, we collected models over multiple runs of RF training. For example, assume we perform 100 runs of RFs of size 100 (that is, each random forest contains 100 trees). We are then in possession of 10,000 trees in toto — we called this a jungle.
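Here is a minimal sketch of jungle-building with scikit-learn (an assumed toolkit here), scaled down to 10 runs of 100 trees rather than the paper's 100 × 100:

```python
# Conserve every tree from several independent random-forest runs into one "jungle".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

jungle = []
for run in range(10):                          # 10 runs of 100-tree forests -> 1,000 trees
    rf = RandomForestClassifier(n_estimators=100, random_state=run).fit(X_train, y_train)
    jungle.extend(rf.estimators_)              # conserve every fitted tree from this run
classes = rf.classes_                          # the label set is identical across runs
print("jungle size:", len(jungle))
```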
We might use the entire jungle to make predictions — essentially, treating it like one giant ensemble.
Or, we can choose only some of the models.
Choose how? Good question, to which we presented a number of answers in the form of model-selection algorithms. Such an algorithm judiciously picks, say, a few hundred trees out of the total 10,000 in the jungle, and declares its selection as the “winning” ensemble.
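Continuing the sketch above, here is whole-jungle prediction (averaging the trees' class probabilities) alongside one plausible selection strategy, greedy forward selection on a held-out validation split; this is illustrative and not necessarily one of the paper's algorithms:

```python
# Whole-jungle prediction: average class probabilities over all conserved trees.
proba_all = np.mean([t.predict_proba(X_test) for t in jungle], axis=0)
print("whole-jungle accuracy:", np.mean(classes[proba_all.argmax(axis=1)] == y_test))

# Greedy forward selection (illustrative): repeatedly add the tree that most
# improves the growing ensemble's accuracy on the validation split.
def greedy_select(jungle, X_val, y_val, classes, k=50):
    probas = [t.predict_proba(X_val) for t in jungle]
    selected, summed = [], np.zeros_like(probas[0])
    for _ in range(k):
        best_i, best_acc = -1, -1.0
        for i, p in enumerate(probas):
            if i in selected:
                continue
            acc = np.mean(classes[(summed + p).argmax(axis=1)] == y_val)
            if acc > best_acc:
                best_i, best_acc = i, acc
        selected.append(best_i)
        summed += probas[best_i]
    return [jungle[i] for i in selected]

ensemble = greedy_select(jungle, X_val, y_val, classes, k=50)
proba_sel = np.mean([t.predict_proba(X_test) for t in ensemble], axis=0)
print("selected-ensemble accuracy:", np.mean(classes[proba_sel.argmax(axis=1)] == y_test))
```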
We tested our various ensemble-generating algorithms over 31 classification datasets, taken from various sources.
The main result to come out of our study was that conserving models and using them wisely is a good idea that definitely improves performance.
By the way, speaking of conservation, our experiments in this paper alone produced almost 50 million models… As rapper and songwriter will.i.am said, “Waste is only waste if we waste it.”
In another recent paper of mine, “Combining Deep Learning with Good Old‑Fashioned Machine Learning”, I combined machine learning and deep learning using a basic idea known as Stacking, or Stacked Generalization.
Stacking is an ensemble method that uses multiple models to tackle classification or regression problems. The main idea is to first train different models on the original problem. The outputs of these models are considered to be a first level, which is then passed on to a second level to perform the final prediction. The inputs to the second-level model are thus the outputs of the first-level models.
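As a quick illustration of the general recipe (not Deep GOld itself), here is a minimal scikit-learn stacking sketch with arbitrary first-level models:

```python
# Generic two-level stacking: first-level models feed a second-level (final) estimator.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
level_one = [("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
             ("knn", KNeighborsClassifier())]
# The final estimator is trained on the first-level models' outputs.
stack = StackingClassifier(estimators=level_one,
                           final_estimator=LogisticRegression(max_iter=1000))
print("stacked accuracy:", cross_val_score(stack, X, y, cv=3).mean())
```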
My algorithm, Deep GOld (for Deep Learning and Good Old-Fashioned Machine Learning), involved deep networks as first-level models and ML methods as second-level models. I used 51 pretrained deep networks available through PyTorch, running each over four image-classification datasets.
So now we’ve got 51 models and their outputs over those four datasets. This is the first level of my stacked approach. For the second level, first pick n networks at random from the 51 (I looked at three values of n: 3, 7, and 11). Now concatenate the outputs of the n networks to form an input dataset for the second level. This is done for both the training and test sets of each dataset.
Consider an example: Suppose n=7, that is, the ensemble contains 7 networks, and the dataset in question is CIFAR100. The first level will create two datasets: a training set with 50,000 samples and 701 features, and a test set with 10,000 samples and 701 features. Why 701? Each network has an output layer of size 100 (100 classes). Multiply that number by 7 networks, and add 1 for the target class. 7 networks × 100 classes + 1 target class = 701.
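Here is a minimal sketch of how such a first-level dataset can be assembled, assuming torchvision pretrained networks with their final layers swapped for 100-way heads; the specific networks, preprocessing, and (omitted) fine-tuning are illustrative rather than the paper's exact setup, and n=3 here rather than 7:

```python
# First level: run n networks over CIFAR100 and concatenate their 100-way outputs.
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
tfm = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train_set = torchvision.datasets.CIFAR100("data", train=True, download=True, transform=tfm)
loader = DataLoader(train_set, batch_size=256, shuffle=False)

def with_cifar100_head(net):
    net.fc = nn.Linear(net.fc.in_features, 100)   # 100-way output layer (fine-tuning omitted)
    return net.to(device).eval()

nets = [with_cifar100_head(torchvision.models.resnet18(weights="IMAGENET1K_V1")),
        with_cifar100_head(torchvision.models.resnet34(weights="IMAGENET1K_V1")),
        with_cifar100_head(torchvision.models.resnet50(weights="IMAGENET1K_V1"))]

feats, labels = [], []
with torch.no_grad():
    for x, y in loader:
        outs = [net(x.to(device)).cpu() for net in nets]   # each output: (batch, 100)
        feats.append(torch.cat(outs, dim=1))               # (batch, n * 100)
        labels.append(y)
X_level2 = torch.cat(feats).numpy()    # 50,000 x 300 here; 50,000 x 700 for n = 7
y_level2 = torch.cat(labels).numpy()   # the "+1" target column, kept separately here
print(X_level2.shape, y_level2.shape)
```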
Now that we have the datasets produced by the first level, we move on to the second level, where we use an ML algorithm selected from a slew of options (logistic regression, k-nearest neighbors, LightGBM, etc.). The job of the ML algorithm is to make the final classification by training on the dataset generated by the first level.
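Continuing the sketch above, the second level might then look like this (logistic regression shown; ridge, k-nearest neighbors, LightGBM, and so on slot in the same way, and the simple split here is only illustrative):

```python
# Second level: an ordinary ML classifier trained on the concatenated network outputs.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X_level2, y_level2, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("second-level validation accuracy:", clf.score(X_val, y_val))
```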
And how did Deep GOld fare? Across 120 experiments, an ML algorithm won in all but ten. That is, our combined DL-ML stacking approach proved quite successful.
Interestingly, we noted that simpler ML algorithms, notably ridge regression and k-nearest neighbors, worked best. They also happen to be fast, scalable, and amenable to quick hyperparameter tuning.
Thus, Deep GOld leverages a wealth of existing models to attain better overall performance.
Published via Towards AI