Regression/Classification Basic Pipeline

Last Updated on June 28, 2020 by Editorial Team

This article walks through the main stages of a regression/classification pipeline, noting where the two differ. You can add further steps depending on the domain and industry you work in. Model deployment and cloud integration usually follow this process, but they are outside the scope of this article.

One step that is not called out separately below is data cleaning, which has to happen before the data is wrangled and mined. It is probably the most important part of any analytics work: clean and transform the data so that it makes sense and all the anomalies are caught before it enters the pre-processing stage.

Here’s the Pipeline:

1. Collect data from varied sources, and combine (concatenate/merge) the datasets if there are several.

2. Read the dataset and review its features. Understand the features first: for example, how each one relates to the target (say, “credit score”).

3. Check for null values and “describe” the dataset, i.e., understand the data types and how the values are distributed in the data you have collected.

4. Decide how to treat null values. This depends heavily on the business case at hand: columns with more than 70% nulls are often not imputed but are still kept as valuable information by turning them into dummies (indicators of whether a value is present).

5. Check for outliers. “Outlier” is a context-dependent term: a value that looks extreme to the naked eye is not necessarily an outlier, because your understanding of the feature determines whether you are looking at a distant (rare or special) value or a genuine outlier. Take the classic case of house prices, where extremely high prices can be perfectly real.

6. After outlier treatment (i.e., removing outliers or deciding to work with them), move on to feature transformation if required. Some algorithms become biased toward features whose values are much larger than those of other features; this mostly affects a few classification algorithms, so features sometimes need to be rescaled. Another reason to transform a feature is to keep its outliers in the data while compressing their range (a log transformation, for example).

7. Now, finally, we move on to model building. Start by splitting the data into training and test sets, and then fit the model on the training set. For linear regression, I prefer to start with statistical modeling (to understand the features through their p-values), and for classification with a decision tree (again, to visualize the important features, i.e., the ones used to split nodes at each depth).

8. After the initial algorithms, you can either try other algorithms (to improve the accuracy/score) or try feature selection. Useful techniques include correlation heat maps and VIF (in short, remove highly correlated variables, since they give the model the same information), Backward Elimination (drop features based on the p-values obtained from the statistical model), and Recursive Feature Elimination (repeatedly drop the features the fitted model ranks as least important).

9. At this point the model-building process is really only beginning, because most of the time from here on goes into comparing R-squared and RMSE values for regression, and confusion matrices, sensitivity, specificity, F1 score, the ROC curve, and AUC for classification.

10. At this stage, some analytics professionals also try “polynomial features”, a powerful technique for capturing interactions within and across the features in the dataset. Running a feature-elimination algorithm on the expanded set of interaction features yields a varied set of candidates from which you can select the strongest. Many of the selected features will be interactions, which often explain much more about the data than the raw features alone.

11. Another very important technique is regularization, which helps manage the bias-variance trade-off. Lasso penalizes the beta coefficients so that they shrink toward zero, and some can be reduced to exactly zero (similar in effect to feature elimination, though the mechanism is very different). Ridge also penalizes the coefficients but does not remove any variable, so it is useful when you have relatively few features, or in a domain where every feature must be retained for the business case: the coefficients shrink, but all of them stay in the model.
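The difference between the two penalties in step 11 can be sketched for the simplest possible case: a single centred feature with no intercept, where both estimators have a closed form. This is only an illustration under those assumptions, not how a library fits multi-feature models. The function names, the penalty values, and the toy data are all made up for the example.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ridge_coefficient(x, y, penalty):
    """Minimise 0.5 * sum((y - b*x)^2) + 0.5 * penalty * b^2 for one feature."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + penalty)


def lasso_coefficient(x, y, penalty):
    """Minimise 0.5 * sum((y - b*x)^2) + penalty * |b| for one feature."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return soft_threshold(sxy, penalty) / sxx


# Hypothetical, centred toy data: one feature weakly related to the target.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-0.5, 0.1, 0.0, -0.1, 0.5]

# A penalty of 0.0 reproduces ordinary least squares for both estimators.
for penalty in (0.0, 1.0, 5.0):
    ridge = ridge_coefficient(x, y, penalty)
    lasso = lasso_coefficient(x, y, penalty)
    print(f"penalty={penalty}: ridge={ridge:.4f}, lasso={lasso:.4f}")
```

With the largest penalty the lasso coefficient is exactly zero, so the feature drops out, while the ridge coefficient is smaller than the unpenalized one but still non-zero.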

The base of the pipeline will remain the same, but you can add further methods as you acquire domain knowledge in the field you work in or want to work in. In classification, hyper-parameter tuning is also very important: it lets you build and compare several instances of a base algorithm, each with different settings that control how it processes the data and responds to it.
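As a rough illustration of hyper-parameter tuning, the sketch below builds several instances of one base algorithm (a tiny k-nearest-neighbours classifier written from scratch) and keeps the value of k that scores best on a held-out validation set. The classifier, the candidate values of k, and the data are assumptions made for this example. In practice a library routine such as scikit-learn's `GridSearchCV` would usually do this job with cross-validation.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict by majority vote among the k nearest training points (1-D)."""
    by_distance = sorted(
        zip(train_points, train_labels), key=lambda pair: abs(pair[0] - query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_points, train_labels, val_points, val_labels, k):
    """Fraction of validation points that k-NN classifies correctly."""
    hits = sum(
        knn_predict(train_points, train_labels, query, k) == label
        for query, label in zip(val_points, val_labels)
    )
    return hits / len(val_points)


# Hypothetical 1-D data: class 0 sits low, class 1 sits high,
# and one low point (2.5) carries a noisy label.
train_x = [1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 6.5, 7.0, 7.5, 8.0]
train_y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
val_x = [2.4, 2.6, 6.8, 7.2]
val_y = [0, 0, 1, 1]

# Grid search: one model instance per candidate value of k (odd values avoid ties).
scores = {k: accuracy(train_x, train_y, val_x, val_y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

Here k = 1 follows the noisy label and misclassifies the two nearby validation points, while the larger values of k smooth it out. Because the validation set was used to choose k, the final model should still be evaluated on a separate test set.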

Regression/Classification Basic Pipeline was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Author(s): Shaurya Lalwani
