Regression/Classification Basic Pipeline
Data Science

Last Updated on June 28, 2020 by Editorial Team

Author(s): Shaurya Lalwani

Photo by Andy Kelly on Unsplash

This article explains the important parts of a regression/classification pipeline, noting the differences between the two wherever required. Additional steps can be added depending on the domain and industry you work in. Model deployment and cloud integration generally follow this process, but they are not covered here.

One point that is not highlighted below is the "data cleaning" that must happen before the data is wrangled and mined. It is probably the most important part of diving into analytics: clean and transform the data so that it makes sense and all anomalies are caught before it enters the pre-processing stage.

Here's the pipeline:

1. Collect data from varied sources, and combine (concatenate/merge) the datasets if there are several.

2. Read the dataset and check the features. Understand the features first: for example, how each feature is related to the target (say, "credit score").

3. Check for null values, and "describe" the dataset, that is, understand the data types and how the values are distributed in the dataset you have obtained or collected.

4. Decide how to treat null values. This depends heavily on the business case at hand: columns with more than 70% nulls are often not imputed but are still kept as valuable information by turning them into dummies (indicators of whether data is present).

5. Check for outliers. "Outlier" is a context-dependent term: a value that looks like an outlier to the naked eye may not be one, because your understanding of the feature determines whether you are seeing a distant (rare or special) value or a true outlier. The classic case is house prices, where extremely high prices appear yet may be perfectly valid.

6. After outlier treatment (removing outliers or deciding to work with them), move on to feature transformation if required. Some algorithms become biased toward features with much larger values than the other features; this mostly happens in a few classification algorithms, so features sometimes need to be transformed. Another reason to transform features is to keep outliers in the data while reducing their influence (a log transformation, for example).

7. Now, finally, we move on to model building. Start by splitting the data into train and test sets, then fit the model on the training set. For linear regression, I prefer to start with statistical modeling (to understand the features through their p-values), and for classification with a decision tree (again, to visualize the important features, which are the ones used to split nodes at each depth).

8. After the initial algorithms, you can either try other algorithms (to improve accuracy/score) or try feature selection. Correlation heat maps and VIF help remove highly correlated variables, which provide the same information to the model. Backward elimination and recursive elimination select important features directly, for example based on the p-values obtained from the statistical model.

9. The model-building process is really only starting at this point, because much of the remaining time is spent comparing R-squared and RMSE values for regression, and confusion matrices, sensitivity, specificity, F1 score, the ROC curve, and AUC for classification.

10. At this point, some analytics professionals also try "polynomial features," a very powerful technique for capturing interactions within and across the features in the dataset. Running a feature elimination algorithm on the expanded set of interaction features yields a varied set of candidates from which you can select the strongest. Best of all, most of the selected features will be interactions, which explain much more about the data.

11. Another very important tool is regularization, which helps manage the bias-variance trade-off. Lasso penalizes the beta coefficients so that they shrink, and some can be reduced all the way to zero (somewhat like a feature elimination technique, though it works very differently). Ridge also penalizes the coefficients but does not remove any variable, so it is useful when there are very few features, or in a domain where every feature must be retained for the business case.
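Several of the steps above (null checks, a missingness indicator, a log transformation, the train/test split, and Lasso versus Ridge) can be sketched in a few lines. This is only a minimal illustration, assuming pandas, NumPy, and scikit-learn are available. The data is synthetic, and the column names (income, age, noise, target) and the alpha values are invented for the example; they do not come from the article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: hypothetical columns, made up purely for illustration.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=n),  # right-skewed feature
    "age": rng.integers(18, 70, size=n).astype(float),
    "noise": rng.normal(size=n),                        # unrelated to the target
})
df["target"] = (3.0 * np.log(df["income"]) + 0.05 * df["age"]
                + rng.normal(scale=0.5, size=n))
# Inject nulls into 50 randomly chosen rows of "age".
df.loc[rng.choice(n, size=50, replace=False), "age"] = np.nan

# Step 3: count nulls and describe the dataset.
null_counts = df.isna().sum()
summary = df.describe()

# Step 4: keep the fact that a value was missing as a dummy, then impute.
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())

# Step 6: log-transform the skewed feature instead of dropping extreme values.
df["log_income"] = np.log1p(df["income"])

# Step 7: split into train and test sets.
features = ["log_income", "age", "age_missing", "noise"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["target"], test_size=0.2, random_state=0
)

# Scale using statistics from the training set only, so the penalty
# treats all coefficients on a comparable scale.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 11: Lasso (L1) can zero out coefficients; Ridge (L2) only shrinks them.
lasso = Lasso(alpha=0.1).fit(X_train_s, y_train)
ridge = Ridge(alpha=1.0).fit(X_train_s, y_train)

lasso_r2 = lasso.score(X_test_s, y_test)  # R-squared on the test set
ridge_r2 = ridge.score(X_test_s, y_test)

print(null_counts)
print(dict(zip(features, lasso.coef_.round(3))))
print(dict(zip(features, ridge.coef_.round(3))))
```

Comparing the two printed coefficient dictionaries shows the practical difference: Ridge keeps every feature with a nonzero weight, while Lasso may drop weak ones entirely, depending on the chosen alpha.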

The base of the pipeline will remain the same, but additional methods can be added as you acquire domain knowledge in the field you work in or want to work in. In classification, hyper-parameter tuning is also very important: it lets you build various instances of a base algorithm by changing the settings that control how the algorithm handles the data it is given.
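As a rough sketch of hyper-parameter tuning for classification, the example below searches over two settings of a decision tree and then reports the classification metrics mentioned in the pipeline. It assumes scikit-learn; the synthetic dataset, the parameter grid, and the choice of F1 as the scoring metric are illustrative assumptions, not recommendations from the article.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each combination in the grid is one "instance" of the base algorithm.
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # assumed metric; choose one that fits the business case
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the best instance on the held-out test set.
best_tree = search.best_estimator_
y_pred = best_tree.predict(X_test)
y_score = best_tree.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print("best params:", search.best_params_)
print("confusion matrix:\n", cm)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} "
      f"f1={f1:.3f} auc={auc:.3f}")
```

The grid is searched with cross-validation on the training set only, so the test-set metrics remain an honest estimate of how the selected instance behaves on unseen data.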


Regression/Classification Basic Pipeline was originally published in Towards AI - Multidisciplinary Science Journal on Medium.

Published via Towards AI
