Feature Selection in Machine Learning

Last Updated on July 10, 2020 by Editorial Team

In the real world, data is rarely as clean as it’s often assumed to be. That’s where data mining and wrangling come in: to build insights out of data that has been structured using queries, probably contains missing values, and exhibits patterns that are invisible to the naked eye. And that’s where Machine Learning comes in: to find those patterns and use the newly understood relationships in the data to predict outcomes.

To understand a model in depth, you need to read through the variables in the data and know what they represent. This matters because you will eventually have to justify your outcomes based on your understanding of the data. If your data contains five, or even fifty variables, let’s say you’re able to go through them all. But what if it contains 200 variables? You don’t have the time to go through each one. On top of that, many algorithms do not work with categorical data directly, so you have to convert all the categorical columns to numeric variables before pushing them into the model (the encoded columns look quantitative, but they still represent categories). Encodings such as one-hot encoding add a column per category, so the number of variables grows, and now you’re hanging around with 500 variables. How do you deal with them? You might think right away that dimensionality reduction is the answer. Dimensionality reduction algorithms will reduce the dimensions, but the new dimensions are combinations of the original variables, so they are hard to interpret. What if I tell you that there are other techniques that eliminate features while keeping the retained features easy to understand and interpret?

Feature selection techniques can differ depending on whether the problem is regression or classification, but the general idea of how to apply them remains the same.

Here are some Feature Selection techniques to tackle this issue:

1. Highly Correlated Variables

Variables that are highly correlated with each other give the model largely the same information, so it is unnecessary to include all of them in the analysis. For example, if a dataset contains a feature “Browsing Time” and another called “Data Used while Browsing”, you can imagine that these two variables will be correlated, and we would see this high correlation even in an unbiased sample of the data. In such a case, only one of these variables needs to be present as a predictor in the model. If we use both, the model gains little new information, and in linear models the coefficient estimates become unstable and hard to interpret (multicollinearity), which can also encourage over-fitting.
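A minimal sketch of a correlation filter, assuming pandas and NumPy are available. The helper name `drop_highly_correlated`, the 0.9 threshold, and the synthetic columns are illustrative assumptions, not something prescribed above. For every pair of features whose absolute correlation exceeds the threshold, it keeps the first column and drops the second.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic data (hypothetical): data used grows almost linearly with browsing time.
rng = np.random.default_rng(0)
browsing_time = rng.uniform(0, 120, size=500)
df = pd.DataFrame({
    "browsing_time": browsing_time,
    "data_used": 2.5 * browsing_time + rng.normal(0, 5, size=500),
    "age": rng.integers(18, 70, size=500),
})

reduced, dropped = drop_highly_correlated(df, threshold=0.9)
print(dropped)                 # columns removed by the filter
print(list(reduced.columns))   # columns that remain
```

Which column of a correlated pair to keep is a judgment call; this sketch simply keeps whichever appears first, so you may want to reorder the columns by business relevance beforehand.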

2. P-Values

In algorithms like Linear Regression, fitting an initial statistical model is a good idea, because the P-values it reports help gauge the importance of each feature. After setting a level of significance (commonly 0.05), we check each feature’s P-value: if it is less than the level of significance, the feature is statistically significant, i.e. there is evidence that a change in this feature is associated with a change in the value of the Target.
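As a rough illustration, the P-values of an ordinary least squares fit can be computed with NumPy and SciPy alone. The helper `ols_p_values`, the synthetic data, and the 0.05 level of significance are assumptions made for this sketch; in practice a statistics package would report the same numbers in its model summary.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test P-values for OLS coefficients (intercept excluded)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])            # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least-squares coefficients
    residuals = y - A @ beta
    dof = n - A.shape[1]                            # residual degrees of freedom
    sigma2 = residuals @ residuals / dof            # estimated noise variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    t_stats = beta / std_err
    p = 2 * stats.t.sf(np.abs(t_stats), dof)
    return p[1:]                                    # drop the intercept's P-value

# Synthetic data (hypothetical): x0 and x1 drive the target, x2 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

alpha = 0.05  # level of significance
p_values = ols_p_values(X, y)
for name, p in zip(["x0", "x1", "x2"], p_values):
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name}: p = {p:.4f} ({verdict})")
```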

3. Forward Selection

Forward Selection is a form of step-wise regression. The model starts from ground zero, i.e. an empty model, and each iteration adds the variable that improves the model the most. The variable to add is chosen by its significance, which can be measured with various metrics; a common one is the P-value of each candidate variable when it is added to the current model. At times, Forward Selection can cause an over-fit, because it can add highly correlated variables even when they provide the same information to the model (yet the model still shows an improvement).
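One possible sketch of forward selection uses scikit-learn’s `SequentialFeatureSelector`. Note the swap: it scores each candidate by cross-validated model performance rather than by P-values, so it illustrates the step-wise idea, not the exact P-value procedure. The synthetic dataset and the choice of three features are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): 8 features, only 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Start from an empty model and greedily add the feature that most
# improves the cross-validated score, until 3 features are selected.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",
    cv=5,
)
selector.fit(X, y)

selected = selector.get_support(indices=True)  # column indices that were added
X_selected = selector.transform(X)             # data restricted to those columns
print(selected)
```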

4. Backward Elimination

Backward Elimination also involves step-wise feature selection, but works in the opposite direction to Forward Selection. The initial model starts out with all the independent variables, and these variables are eliminated one per iteration if they don’t provide value to the regression model. This is again based on P-values: at each iteration, the feature with the highest P-value above the level of significance is removed and the model is refit on the remaining features. With this method too, there is no guarantee that highly correlated variables are removed.
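The loop can be sketched as follows, assuming an ordinary least squares model and a 0.05 level of significance. Both helper functions and the synthetic data are hypothetical and written only for this illustration.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided P-values for the OLS coefficients of X (intercept excluded)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    dof = len(y) - A.shape[1]
    cov = (residuals @ residuals / dof) * np.linalg.inv(A.T @ A)
    t_stats = beta / np.sqrt(np.diag(cov))
    return (2 * stats.t.sf(np.abs(t_stats), dof))[1:]

def backward_elimination(X, y, names, alpha=0.05):
    """Remove the least significant feature until all P-values are <= alpha."""
    keep = list(range(X.shape[1]))        # start with every feature
    while keep:
        p = ols_p_values(X[:, keep], y)   # refit on the surviving features
        worst = int(np.argmax(p))
        if p[worst] <= alpha:             # everything left is significant
            break
        keep.pop(worst)                   # eliminate one feature per iteration
    return [names[i] for i in keep]

# Synthetic data (hypothetical): only x0 and x2 influence the target.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=300)
names = [f"x{i}" for i in range(5)]

selected = backward_elimination(X, y, names)
print(selected)
```

Because a noise feature can fall below the threshold by chance, the surviving set may occasionally include a variable with no real effect.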

5. Recursive Feature Elimination (RFE)

RFE is a widely used technique for selecting an exact number of significant features, sometimes to explain a particular number of “most important” features impacting the business, and sometimes to reduce a very high number of features (say around 200–400) down to the ones that have at least some impact on the model. RFE repeatedly fits a model, ranks the features by their significance in that model (for example by coefficient size or feature importance), and eliminates the lowest-ranked ones in a recursive loop until the requested number remains. Apart from ranking the features, RFE flags which of them were selected. Keep in mind that the number chosen by the user may not be the optimal number of important features; the optimal number may be higher or lower.
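A short sketch with scikit-learn’s `RFE`, assuming a linear regression as the underlying estimator and a synthetic dataset. Since the number of features to keep is a guess, the sketch also runs `RFECV`, which picks that number by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): 8 features, only 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

# Keep exactly 3 features, eliminating one feature per iteration.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask: True for the selected features
print(rfe.ranking_)   # rank 1 = selected; larger ranks were eliminated earlier

# Let cross-validation decide how many features to keep instead.
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=5)
rfecv.fit(X, y)
print(rfecv.n_features_)  # number of features chosen by cross-validation
```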

6. Charted Feature Importance

When we talk about the interpretability of machine learning algorithms, we usually discuss linear regression (where feature importance can be analyzed using P-values) and decision trees (which practically show feature importance in the form of a tree, including its hierarchy). For algorithms such as Random Forest Classifier, Light Gradient Boosting Machine, and XGBoost, we often use a variable importance chart instead, which plots each variable against the “amount of its importance”. This is particularly useful when a well-structured view of feature importance needs to be presented to the business being analyzed.
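As an illustration, a Random Forest exposes its importances through the `feature_importances_` attribute in scikit-learn. The sketch below sorts them and prints a crude text bar chart; the dataset and the feature names are made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): 6 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

# A text stand-in for the variable importance chart.
for name, value in importances.items():
    print(f"{name:<10} {value:.3f} {'#' * int(value * 50)}")
```

With matplotlib installed, `importances.plot.barh()` would draw the same ranking as an actual chart.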

7. Regularization

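Regularization is done to manage the trade-off between bias and variance. Bias is the error that comes from a model being too simple to capture the underlying pattern (under-fitting). Variance tells us how sensitive the model is to the particular training data it saw; a high-variance model over-fits, so its predictions on the training and testing data-sets differ a lot. Ideally, both bias and variance need to be low. Regularization comes to save the day here! It penalizes large coefficients, which cuts variance at the cost of a little bias. There are mainly two types of regularization techniques: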

L1 Regularization – Lasso: Lasso penalizes the absolute size of the model’s beta coefficients, shrinking them and possibly turning some of them into exact zeros, which removes those variables from the final model. Generally, Lasso is used when your data-set has a large number of variables and you need to remove some of them to better understand how the important features (i.e. the ones Lasso finally retains, along with their coefficients) affect your model.
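A small sketch of this effect, assuming scikit-learn’s `Lasso` and a synthetic dataset in which only three of ten features matter. The penalty strength `alpha=1.0` is an arbitrary choice for illustration; in practice it would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (hypothetical): 10 features, only 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)   # unpenalized baseline
lasso = Lasso(alpha=1.0).fit(X, y)   # L1-penalized model

n_zero_ols = int(np.sum(ols.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
kept = np.flatnonzero(lasso.coef_)   # indices of the features Lasso retained

print("zero coefficients (OLS):  ", n_zero_ols)
print("zero coefficients (Lasso):", n_zero_lasso)
print("features kept by Lasso:   ", kept)
```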

L2 Regularization – Ridge: Ridge keeps all the variables, i.e. it uses every variable to build the model, but shrinks their coefficients toward zero (without setting them exactly to zero) so that model performance improves. Ridge is a great choice when the number of variables in the data-set is low, and hence all of those variables are required to interpret the insights and predicted target results obtained.
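For comparison, a sketch with scikit-learn’s `Ridge` on a synthetic dataset: the coefficients shrink relative to an unpenalized fit, but none of them becomes exactly zero. The value `alpha=10.0` is an assumption chosen only to make the shrinkage visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (hypothetical): 10 features, only 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)    # unpenalized baseline
ridge = Ridge(alpha=10.0).fit(X, y)   # L2-penalized model

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"coefficient norm, OLS:   {ols_norm:.2f}")
print(f"coefficient norm, Ridge: {ridge_norm:.2f}")    # smaller: shrunk toward zero
print("zero coefficients (Ridge):", n_zero_ridge)      # every variable is still in use
```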

Since Ridge keeps all the variables intact and Lasso can remove the unimportant ones, a combination of both penalties, known as Elastic-Net, was developed to combine the best features of Ridge and Lasso. Elastic-Net is often a sensible default when you are unsure which of the two suits your data better.
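In scikit-learn, the mix between the two penalties is controlled by the `l1_ratio` parameter of `ElasticNet`. The sketch below, on a synthetic dataset with arbitrary `alpha` and `l1_ratio` values, fits an even blend and also checks that `l1_ratio=1.0` reproduces a plain Lasso fit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic data (hypothetical): 10 features, only 3 of which carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# l1_ratio=0.5 weights the L1 (Lasso) and L2 (Ridge) penalties equally.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("non-zero coefficients:", int(np.count_nonzero(enet.coef_)))

# l1_ratio=1.0 leaves only the L1 penalty, i.e. the Lasso objective.
enet_as_lasso = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
same_as_lasso = bool(np.allclose(enet_as_lasso.coef_, lasso.coef_))
print("matches Lasso when l1_ratio=1.0:", same_as_lasso)
```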

There are more ways to select features while performing machine learning, but the base idea usually remains the same: showcase the feature importance and then eliminate variables based on the obtained “importance”. Importance here is a subjective term, since it is not one metric but a collection of metrics and graphs that can be used to check for the most important features.

Thank you for reading! Happy learning!

Feature Selection in Machine Learning was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

