Customizing sk-learn Models and Pipelines

Last Updated on January 29, 2024 by Editorial Team

Author(s): Reinhard Sellmair

Originally published on Towards AI.

Sk-learn offers a wide variety of models that are easy to plug in and test thanks to their modular design. This modularity also makes it possible to combine a model with several pre-processing steps in a pipeline that turns raw data into features and feeds them to the model.

In addition to combining prepackaged modules, we can create customized transformers specifically designed for the problem we want to solve. I will show how to:

Integrate customized labeling and feature engineering
Create a multiclass classification wrapper that can integrate any classification model
Combine modules in a customized pipeline

Problem

As an example, I’m using sk-learn’s California housing dataset and creating a multiclass-classification model that predicts adjustable price categories.

Labeling

The dataset contains continuous prices that are converted into categories with respect to the provided thresholds. Each category is labeled by an ID, which the multiclass model is supposed to predict. One reason for rephrasing a regression problem into a classification problem could be that the user wants to focus on a specific price range and requires a model that can predict this range with high accuracy.

The class below uses thresholds to define price categories, converts prices into categories, and reverts categories back to price ranges.

This class is initialized with a list of thresholds, which are converted to a list of labels describing the price ranges and a list of IDs. The method price_to_id converts a price into its category and returns the corresponding category ID, and id_to_label converts a category ID back to the corresponding label.

class Price_LabelHandler():
    def __init__(self, thresholds):
        # sort thresholds
        self.thresholds = sorted(thresholds)
        # convert thresholds to labels
        self.labels = [f'price <= {self.thresholds[0]}']
        for low, high in zip(self.thresholds[:-1], self.thresholds[1:]):
            self.labels.append(f'{low} < price <= {high}')
        # use the largest threshold directly so a single threshold also works
        self.labels.append(f'{self.thresholds[-1]} < price')
        # initialise ids for each class
        self.ids = range(len(self.labels))

    def price_to_id(self, price):
        # return the ID of the first category whose upper threshold covers the price
        for threshold, id in zip(self.thresholds, self.ids[:-1]):
            if price <= threshold:
                return id
        # price exceeds all thresholds
        return self.ids[-1]

    def id_to_label(self, id):
        return self.labels[id]
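
The threshold lookup in price_to_id can also be written as a binary search. The following is a minimal sketch, not part of the original class, that uses Python's bisect module; the function name price_to_id_bisect is hypothetical. Because bisect_left returns the index of the first threshold that is greater than or equal to the price, it follows the same "price <= threshold" rule.

```python
from bisect import bisect_left

def price_to_id_bisect(price, thresholds):
    """Return the category ID of `price` for sorted `thresholds` (hypothetical helper)."""
    # bisect_left gives the index of the first threshold >= price,
    # which matches the "price <= threshold" rule of price_to_id
    return bisect_left(thresholds, price)

thresholds = [1, 2]
labels = ['price <= 1', '1 < price <= 2', '2 < price']
for price in [0.5, 1, 1.5, 2, 3.2]:
    print(price, '->', labels[price_to_id_bisect(price, thresholds)])
```

A price that equals a threshold lands in the lower category, so a price of 1 is labeled 'price <= 1'.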

Predictor

The price label handler is integrated in the predictor so that transformations from prices to categories and vice versa can be done seamlessly.

To initialize the price classifier, a multiclass classifier object and the price thresholds need to be provided. To integrate this predictor into a sk-learn pipeline, the following methods are implemented. The fit method converts the prices to IDs and fits the classifier. The predict method applies the classifier to predict IDs and converts the IDs to labels. Probabilities are calculated with predict_proba, where the column names of the returned dataframe are taken from the price labeler.

import numpy as np
import pandas as pd

class Price_Classifier():
    def __init__(self, thresholds, classifier):
        self.classifier = classifier
        # initialize labeler
        self.labeler = Price_LabelHandler(thresholds)

    def fit(self, X, y):
        # convert prices to IDs (y is expected to be a pandas Series)
        id = y.map(self.labeler.price_to_id)

        # fit classifier
        self.classifier.fit(X, id)
        return self

    def predict(self, X):
        # predict IDs
        id = self.classifier.predict(X)
        # convert to labels
        return np.array([self.labeler.id_to_label(i) for i in id])

    def predict_proba(self, X):
        # predict probabilities
        probas = self.classifier.predict_proba(X)
        # get labels in the order of the classes seen during fitting
        labels = [self.labeler.id_to_label(i) for i
                  in self.classifier.classes_]
        # return as dataframe
        return pd.DataFrame(probas, columns=labels)
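
The fit method above maps prices to IDs one value at a time. For a large target column, the same conversion can be vectorized. The following is a minimal sketch, assuming pandas and NumPy are available; it is an alternative and not what the class above does. It uses pd.cut, whose bins are closed on the right by default, so a price equal to a threshold falls into the lower category, as in price_to_id. The example prices are made up.

```python
import numpy as np
import pandas as pd

thresholds = [1, 2]
y = pd.Series([0.5, 1.0, 1.5, 2.0, 3.2])  # made-up prices

# open-ended outer bins so that every price falls into exactly one category
bins = [-np.inf, *thresholds, np.inf]
# labels=False returns the integer index of the bin instead of an interval
ids = pd.cut(y, bins=bins, labels=False)
print(ids.tolist())
```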

Feature Engineering

The following class is a transformer that engineers the features. Thus, this class has a fit method to set up the encoders and a transform method to convert the input data to features that are fed to the predictor.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

class FeatureEngineering():

    def __init__(self, n_cluster=10):
        # set number of clusters
        self.n_cluster = n_cluster

    def fit(self, data_df, y=None):
        # fit kmeans clustering (the target y is not used)
        self.kmeans = KMeans(n_clusters=self.n_cluster, random_state=0,
                             n_init='auto')
        self.kmeans.fit(data_df[['Latitude', 'Longitude']])

        # predict and one-hot-encode labels
        geo_labels = self.kmeans.predict(data_df[['Latitude', 'Longitude']])
        self.enc = OneHotEncoder().fit(geo_labels.reshape(-1, 1))

        return self

    def transform(self, data_df):
        # bypass features
        feature_cols = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                        'Population', 'AveOccup']
        feature_df = data_df[feature_cols].reset_index(drop=True)
        # number of bedrooms over total number of rooms
        feature_df['bedrms_per_room'] = (feature_df['AveBedrms'] /
                                         feature_df['AveRooms'])

        # encode geo-location
        geo_labels = self.kmeans.predict(data_df[['Latitude', 'Longitude']])
        geo_matrix = self.enc.transform(geo_labels.reshape(-1, 1)).toarray()
        col_names = [f'Cluster_{i}' for i in range(self.n_cluster)]
        cluster_df = pd.DataFrame(geo_matrix, columns=col_names)

        return feature_df.join(cluster_df)

The latitude and longitude of the houses are grouped into n_cluster clusters via k-means. These clusters are then one-hot-encoded and added as features. All other input columns are passed through as features. In addition, the ratio of bedrooms to the total number of rooms is added as another feature.
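
The clustering and encoding step can be tried in isolation. The following is a minimal sketch on synthetic coordinates, assuming scikit-learn and NumPy are installed; the three centers and the noise level are made up for illustration, and n_init=10 is used instead of 'auto' only so that the sketch also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# synthetic (latitude, longitude) pairs around three made-up centers
rng = np.random.default_rng(0)
centers = np.array([[34.0, -118.2], [37.8, -122.4], [38.6, -121.5]])
coords = np.vstack([c + rng.normal(scale=0.05, size=(20, 2)) for c in centers])

# cluster the coordinates, then one-hot-encode the cluster labels
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(coords)
geo_labels = kmeans.predict(coords)
enc = OneHotEncoder().fit(geo_labels.reshape(-1, 1))
geo_matrix = enc.transform(geo_labels.reshape(-1, 1)).toarray()

print(geo_matrix.shape)        # one row per point, one column per cluster
print(geo_matrix.sum(axis=1))  # every row has exactly one active column
```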

Pipeline

The last step is to integrate all classes into a pipeline.

from sklearn.pipeline import Pipeline

class PricePipeline(Pipeline):
    def __init__(self, thresholds, classifier, n_cluster=10):
        # set attributes
        self.thresholds = thresholds
        self.n_cluster = n_cluster
        self.classifier = classifier

        # initialize feature engineering and price classifier
        fe = FeatureEngineering(n_cluster)
        price_classifier = Price_Classifier(thresholds, classifier)

        # define pipeline steps
        steps = [('transformer', fe), ('model', price_classifier)]
        # initialize super class
        super().__init__(steps=steps)

Here, a customized pipeline is created by inheriting from sk-learn's Pipeline class. To initialize the pipeline, the list of thresholds, a classifier object, and the number of location clusters need to be provided. Based on these, the feature engineering class and the price classifier are initialized. Next, these objects are combined in a list that defines the processing steps of the pipeline. Finally, these steps are passed to the initializer of the super-class (Pipeline). Note that methods like transform, fit, or predict do not need to be implemented because they are inherited from the super-class.
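
To illustrate how the inherited methods chain the steps, here is a minimal sketch with a plain Pipeline on toy data, assuming scikit-learn and NumPy are installed. The transformer AddRatio is hypothetical and only stands in for a feature engineering step; it inherits from BaseEstimator and TransformerMixin, which is the convention scikit-learn documents for custom transformers. LogisticRegression is used as an arbitrary classifier.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class AddRatio(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: appends column 0 divided by column 1."""

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy step

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.column_stack([X, X[:, 0] / X[:, 1]])

# toy data, made up for illustration
X = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 4.0], [4.0, 1.0]])
y = np.array([0, 1, 0, 1])

pipe = Pipeline(steps=[('transformer', AddRatio()),
                       ('model', LogisticRegression())])
pipe.fit(X, y)           # fits and applies AddRatio, then fits the classifier
preds = pipe.predict(X)  # applies AddRatio.transform, then the classifier's predict
print(preds.shape)
```

The pipeline itself defines no feature logic: fit and predict only pass the data through the steps in order.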

Demo

In this section, I show how the pricing pipeline is initialized, trained, and used to predict price categories.

First, the California housing price dataset is imported and split into a training and test set.

from sklearn import datasets
from sklearn.model_selection import train_test_split

data = datasets.fetch_california_housing(as_frame=True)
data_df, target = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(data_df, target)

Next, I define two thresholds (1, 2) to define the pricing categories. The housing prices are expressed in hundreds of thousands of dollars, hence the defined pricing categories are:

<= $100,000
$100,000 to $200,000
> $200,000

We divide the house locations into 10 clusters and use XGBoost as a classifier. These objects are given as input to initialize the pricing pipeline, which is then fitted on the training data.

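from xgboost import XGBClassifier

thresholds = [1, 2]
n_cluster = 10
# 'multi:softprob' is XGBoost's multiclass objective that outputs class probabilities
classifier = XGBClassifier(objective='multi:softprob')
# train XGB
xgb_pipe = PricePipeline(thresholds, classifier, n_cluster)
xgb_pipe.fit(X_train, y_train)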

The predict function returns the price categories that can be easily interpreted.

print(xgb_pipe.predict(X_test.head()))

['price <= 1' 'price <= 1' '1 < price <= 2' '1 < price <= 2' '2 < price']

These labels are also provided as column names of the prediction probabilities.

print(xgb_pipe.predict_proba(X_test.head()))

   price <= 1  1 < price <= 2  2 < price
0    0.921129        0.077801   0.001070
1    0.581553        0.409677   0.008770
2    0.169140        0.728899   0.101961
3    0.005441        0.652810   0.341748
4    0.000224        0.020619   0.979157

A different classifier, like Random Forest, can easily be evaluated by initializing another pricing pipeline with a different classifier object as input.

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier()
rf_pipe = PricePipeline(thresholds, classifier, n_cluster)
rf_pipe.fit(X_train, y_train)
print(rf_pipe.predict_proba(X_test.head()))

   price <= 1  1 < price <= 2  2 < price
0        0.88            0.12       0.00
1        0.65            0.32       0.03
2        0.13            0.68       0.19
3        0.03            0.50       0.47
4        0.00            0.03       0.97
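
To compare two pipelines numerically, their predicted labels can be scored against the true categories. The following is a minimal sketch, assuming scikit-learn is installed; the true and predicted labels are made up for illustration and are not results from the models above. In practice, the true labels would come from mapping the test prices through the label handler. Passing the labels explicitly keeps the confusion matrix in price order instead of alphabetical order.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ['price <= 1', '1 < price <= 2', '2 < price']

# made-up true and predicted categories, for illustration only
y_true = ['price <= 1', '1 < price <= 2', '2 < price', '1 < price <= 2', 'price <= 1']
y_pred = ['price <= 1', '1 < price <= 2', '2 < price', '2 < price', 'price <= 1']

acc = accuracy_score(y_true, y_pred)
# rows are true categories, columns are predicted categories, both in `labels` order
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(acc)
print(cm)
```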

Conclusion

This was a simple example of how to integrate multiclass-classification and feature engineering in a customized pipeline. The custom pipeline class can be further extended by, for example,

Recording statistics on training and serving data
Integrated saving and loading of pipeline objects
Returning feature importance
Optimizing feature selection
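
As a sketch of the saving and loading extension, the helpers below use Python's pickle module. The function names save_pipeline and load_pipeline are hypothetical, and a plain dictionary stands in for a fitted pipeline so that the example runs on its own. Unpickling a real pipeline requires the custom classes to be importable in the loading process, and pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

def save_pipeline(pipeline, path):
    """Hypothetical helper: serialize a fitted pipeline to `path`."""
    with open(path, 'wb') as f:
        pickle.dump(pipeline, f)

def load_pipeline(path):
    """Hypothetical helper: load a pipeline written by save_pipeline."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# stand-in object; in practice this would be a fitted pipeline such as xgb_pipe
stand_in = {'thresholds': [1, 2], 'n_cluster': 10}
path = os.path.join(tempfile.mkdtemp(), 'pipeline.pkl')
save_pipeline(stand_in, path)
restored = load_pipeline(path)
print(restored == stand_in)
```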
