Customizing sk-learn Models and Pipelines
Author(s): Reinhard Sellmair
Originally published on Towards AI.
Sk-learn offers a wide variety of models that can easily be plugged in and tested thanks to their modular design. This modularity also allows models to be combined with multiple pre-processing steps into a pipeline that transforms raw data into features, which are then fed to the model.
In addition to combining prepackaged modules, we can create customized transformers specifically designed for the problem we want to solve. I will show how to:
- Integrate customized labeling and feature engineering
- Create a multiclass classification wrapper that can integrate any classification model
- Combine modules in a customized pipeline
Problem
As an example, I'm using sk-learn's California housing dataset and creating a multiclass classification model that predicts adjustable price categories.
Labeling
The dataset contains continuous prices that are converted into categories with respect to the provided thresholds. Each category is labeled by an ID, which the multiclass model is supposed to predict. One reason for reframing a regression problem as a classification problem could be that the user wants to focus on a specific price range and requires a model that can predict this range with high accuracy.
The class below uses thresholds to define price categories, converts prices into categories, and reverts categories back to price ranges.
This class is initialized by providing a list of thresholds, which is converted to a list of labels that describe the price ranges and a list of IDs. The method price_to_id converts a price into a category and returns the corresponding category ID, and id_to_label reverts a category ID to the corresponding label. The imports below are shared by all code snippets in this article.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

class Price_LabelHandler():
    def __init__(self, thresholds):
        # sort thresholds
        self.thresholds = sorted(thresholds)
        # convert thresholds to labels
        self.labels = [f'price <= {self.thresholds[0]}']
        for low, high in zip(self.thresholds[:-1], self.thresholds[1:]):
            self.labels.append(f'{low} < price <= {high}')
        self.labels.append(f'{self.thresholds[-1]} < price')
        # initialise ids for each class
        self.ids = list(range(len(self.labels)))

    def price_to_id(self, price):
        # return the ID of the first category whose threshold covers the price
        for threshold, id_ in zip(self.thresholds, self.ids[:-1]):
            if price <= threshold:
                return id_
        return self.ids[-1]

    def id_to_label(self, id_):
        return self.labels[id_]
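For illustration (a small usage sketch, not from the original article), with thresholds 1 and 2 the labeler behaves as follows:

labeler = Price_LabelHandler([1, 2])
print(labeler.labels)            # ['price <= 1', '1 < price <= 2', '2 < price']
print(labeler.price_to_id(1.5))  # 1
print(labeler.id_to_label(2))    # 2 < price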
Predictor
The price label handler is integrated in the predictor so that transformations from prices to categories and vice versa can be done seamlessly.
To initialize the price classifier, a multiclass classifier object and the price thresholds need to be provided. To integrate this predictor into an sk-learn pipeline, the following methods are implemented: fit converts the prices to IDs and fits the classifier; predict applies the classifier to predict IDs and converts the IDs to labels; and predict_proba calculates the probabilities, where the column names of the returned dataframe are taken from the price labeler.
class Price_Classifier():
    def __init__(self, thresholds, classifier):
        self.classifier = classifier
        # initialize labeler
        self.labeler = Price_LabelHandler(thresholds)

    def fit(self, X, y):
        # convert prices to IDs
        id_ = y.map(self.labeler.price_to_id)
        # fit classifier
        self.classifier.fit(X, id_)
        return self

    def predict(self, X):
        # predict IDs
        id_ = self.classifier.predict(X)
        # convert to labels
        return np.array([self.labeler.id_to_label(i) for i in id_])

    def predict_proba(self, X):
        # predict probabilities
        probas = self.classifier.predict_proba(X)
        # get labels
        labels = [self.labeler.id_to_label(i)
                  for i in self.classifier.classes_]
        # return as dataframe
        return pd.DataFrame(probas, columns=labels)
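As a quick illustration (a toy sketch with made-up data and a logistic-regression classifier, not part of the original pipeline), the predictor can also be used on its own:

from sklearn.linear_model import LogisticRegression

# toy features and prices (illustrative values only)
X_toy = pd.DataFrame({'f1': [0.2, 1.5, 3.0, 0.7],
                      'f2': [1.0, 0.5, 2.0, 1.2]})
y_toy = pd.Series([0.8, 1.5, 2.6, 0.9])

clf = Price_Classifier(thresholds=[1, 2], classifier=LogisticRegression())
clf.fit(X_toy, y_toy)
print(clf.predict(X_toy))  # e.g. ['price <= 1' '1 < price <= 2' '2 < price' 'price <= 1']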
Feature Engineering
The following class is a transformer that engineers the features. Accordingly, it has a fit method to set up the clustering and encoder, and a transform method to convert the input data into the features that are fed to the predictor.
class FeatureEngineering():
    def __init__(self, n_cluster=10):
        # set number of clusters
        self.n_cluster = n_cluster

    def fit(self, data_df, _):
        # fit kmeans clustering on the house coordinates
        self.kmeans = KMeans(n_clusters=self.n_cluster, random_state=0,
                             n_init='auto')
        self.kmeans.fit(data_df[['Latitude', 'Longitude']])
        # predict and one-hot-encode cluster labels
        geo_labels = self.kmeans.predict(data_df[['Latitude', 'Longitude']])
        self.enc = OneHotEncoder().fit(geo_labels.reshape(-1, 1))
        return self

    def transform(self, data_df):
        # bypass features
        feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                        'Population', 'AveOccup']
        feature_df = data_df[feature_cols].reset_index(drop=True)
        # number of bedrooms over total number of rooms
        feature_df['bedrms_per_room'] = (feature_df['AveBedrms'] /
                                         feature_df['AveRooms'])
        # encode geo-location
        geo_labels = self.kmeans.predict(data_df[['Latitude', 'Longitude']])
        geo_matrix = self.enc.transform(geo_labels.reshape(-1, 1)).toarray()
        col_names = [f'Cluster_{i}' for i in range(self.n_cluster)]
        cluster_df = pd.DataFrame(geo_matrix, columns=col_names)
        return feature_df.join(cluster_df)
The latitude and longitude of the houses are clustered into n_cluster clusters via k-means. These clusters are then one-hot-encoded and added as features. All other input columns are passed through as features; additionally, the ratio of bedrooms to the total number of rooms is added as another feature.
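As a quick check (an illustrative sketch, assuming the California housing dataframe data_df is loaded as in the demo below), the transformer produces the bypassed columns, the bedroom ratio, and one cluster indicator per k-means cluster:

fe = FeatureEngineering(n_cluster=3)
fe.fit(data_df, None)
print(fe.transform(data_df.head()).columns.tolist())
# ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',
#  'bedrms_per_room', 'Cluster_0', 'Cluster_1', 'Cluster_2']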
Pipeline
The last step is to integrate all classes into a pipeline.
class PricePipeline(Pipeline):
    def __init__(self, thresholds, classifier, n_cluster=10):
        # set attributes
        self.thresholds = thresholds
        self.n_cluster = n_cluster
        self.classifier = classifier
        # initialize feature engineering and price classifier
        fe = FeatureEngineering(n_cluster)
        price_classifier = Price_Classifier(thresholds, classifier)
        # define pipeline steps
        steps = [('transformer', fe), ('model', price_classifier)]
        # initialize super class
        super().__init__(steps=steps)
Here, a customized pipeline is created by inheriting from sk-learn's Pipeline. To initialize the pipeline, the list of thresholds, a classifier object, and the number of location clusters need to be provided. Based on these, the feature engineering class and the price classifier are initialized. Next, these objects are combined in a list that defines the processing steps of the pipeline. Finally, these steps are passed to the initializer of the super-class (Pipeline). Note that no methods like transform, fit, or predict need to be implemented because they are inherited from the super-class.
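Because PricePipeline is a regular sk-learn Pipeline, the usual pipeline attributes remain available. For example (a small illustration, assuming a fitted pipeline named xgb_pipe as in the demo below), the named_steps mapping gives access to the price labeler:

print(xgb_pipe.named_steps['model'].labeler.labels)
# ['price <= 1', '1 < price <= 2', '2 < price']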
Demo
In this section, I show how the pricing pipeline is initialized, trained, and used to predict price categories.
First, the California housing price dataset is imported and split into a training and test set.
data = datasets.fetch_california_housing(as_frame=True)
data_df, target = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(data_df, target)
Next, I define two thresholds (1, 2) for the pricing categories. The housing prices are expressed in hundreds of thousands of dollars, hence the defined pricing categories are:
- <= $100,000
- $100,000 – $200,000
- > $200,000
We divide the house locations into 10 clusters and use XGBoost as a classifier. These objects are given as input to initialize the pricing pipeline, which is then fitted on the training data.
thresholds = [1, 2]
n_cluster = 10
classifier = XGBClassifier(objective='multi:softprob')
# train XGB
xgb_pipe = PricePipeline(thresholds, classifier, n_cluster)
xgb_pipe.fit(X_train, y_train)
The predict method returns price categories that are easy to interpret.
print(xgb_pipe.predict(X_test.head()))
['price <= 1' 'price <= 1' '1 < price <= 2' '1 < price <= 2' '2 < price']
These labels are also provided as column names of the prediction probabilities.
print(xgb_pipe.predict_proba(X_test.head()))
price <= 1 1 < price <= 2 2 < price
0 0.921129 0.077801 0.001070
1 0.581553 0.409677 0.008770
2 0.169140 0.728899 0.101961
3 0.005441 0.652810 0.341748
4 0.000224 0.020619 0.979157
A different classifier, like Random Forest, can easily be evaluated by initializing another pricing pipeline with a different classifier object as input.
classifier = RandomForestClassifier()
rf_pipe = PricePipeline(thresholds, classifier, n_cluster)
rf_pipe.fit(X_train, y_train)
print(rf_pipe.predict_proba(X_test.head()))
price <= 1 1 < price <= 2 2 < price
0 0.88 0.12 0.00
1 0.65 0.32 0.03
2 0.13 0.68 0.19
3 0.03 0.50 0.47
4 0.00 0.03 0.97
Conclusion
This was a simple example of how to integrate multiclass classification and feature engineering in a customized pipeline. The custom pipeline class can be further extended by, for example:
- Recording statistics on training and serving data
- Integrated saving and loading of pipeline objects (see the sketch after this list)
- Returning feature importance
- Optimizing feature selection
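For instance, integrated saving and loading could be sketched with joblib; this is a minimal sketch, and save_pipeline and load_pipeline are hypothetical helper names, not part of the original code:

import joblib

def save_pipeline(pipe, path):
    # persist the fitted pipeline, including the fitted KMeans and encoder (hypothetical helper)
    joblib.dump(pipe, path)

def load_pipeline(path):
    # restore a previously saved pipeline (hypothetical helper)
    return joblib.load(path)

save_pipeline(xgb_pipe, 'price_pipeline.joblib')
restored = load_pipeline('price_pipeline.joblib')
print(restored.predict(X_test.head()))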