Feature Transformation
Last Updated on April 24, 2022 by Editorial Team
Author(s): Parth Gohil
Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.
When and which feature transformation to use according toΒ data.
The life cycle of the Machine Learning model can be broken down into the following steps.
- Data Collection
- Data Preprocessing
- Feature Engineering
- Feature Selection
- Model Building
- Hyper Parameter Tuning
- Model Deployment
To build a model we have to preprocess data. Feature Transformation is one of the most important tasks in this process. In the dataset, there will be data with different magnitudes most of the time. So to make our predictions better we have to scale down different features to the same range of magnitude or some specific data distribution. because most of the algorithms will give more importance to the features with high volume rather than giving the same importance to all features. This will lead to wrong predictions and faulty models and we donβt wantΒ that.
When to use Feature Transformation
- In algorithms like K-Nearest- Neighbors, SVM, and K-means which are Distance-based algorithms, They will give more weightage to features with large values because the distance is calculated with values of data points to find out similarities between them. If we feed the algorithm unscaled features the prediction will suffer dramatically.
- Because we are scaling features down, the area of calculations will also be brought down to a smaller range and a little less time will be spent on calculating. Hence faster model training.
- In the algorithms like linear regression and logistic regression which work upon gradient-descent optimization, Feature scaling becomes crucial because if we feed data with different magnitudes, it will be way too hard to converge to Global Minima. With values of the same range, it becomes less burden for algorithms toΒ learn.
When not to use Feature Transformation
Most of the ensemble method doesnβt require feature scaling because even if we perform feature transformation, the depth of distribution wouldnβt change much. So in such algorithms, there is no need for scaling unless specifically needed.
There are so many types of feature transformation methods, we will talk about the most useful and popularΒ ones.
- Standardization
- MinβββMax Scaling/ Normalization
- Robust Scaler
- Logarithmic Transformation
- Reciprocal Transformation
- Square Root Translation
- Box-Cox Transformation
- Johnson transformation
1. Standardization
Standardization should be used when the features of the input dataset have large differences between ranges or when they are measured in different measurements units like Height, Weight, Meters, Miles,Β etc.
We bring all the variables or features to a similar scale. Where the mean is 0 and the Standard Deviation isΒ 1.
In Standardization, we subtract feature values by their mean and then divide by standard deviation which gives exactly standard normal distribution.
2. MinβββMax Scaling / Normalization
In simple terms, min-max scaling brings down feature values to a range of 0 to 1. Until we specify the range we want it to be scaled downΒ to.
In Normalization, we subtract the feature value by its minimum value and then divide it by the range of features (range of feature= maximum value of featureβββminimum value of feature).
3. RobustΒ Scaler
If the dataset has too many outliers, both Standardization and Normalization can be hard to depend on, in such case you can use Robust Scaler for featureΒ scaling.
You can also say Robust Scaler is robust to outliersΒ π.
It scales values using median and interquartile range therefore it doesnβt get affected by very large or very small values of features.
The robust scaler subtracts feature values by their median and then divides by itsΒ IQR.
- 25th percentile = 1stΒ quartile
- 50th percentile = 2nd quartile (also called theΒ median)
- 75th percentile = 3rdΒ quartile
- 100th percentile = 4th quartile (also called theΒ maximum)
- IQR= Inter QuartileΒ Range
- IQR= 3rd quartileβββ1stΒ quartile
4. Gaussian Transformations
Some Machine Learning algorithms like linear and logistic regression assumes that the data we are feeding them is normally distributed. If the data is normally distributed such algorithms tend to perform better and give higher accuracy. Normalizing a skewed distribution becomes an important partΒ here.
But most of the time data would be skewed so You have to transform it into Gaussian distribution using the following algorithms, you might have to try out a few methods before selecting one since different datasets tend to have different requirements and we canβt fit one method to all theΒ data.
We will use this function to plot data throughout the rest of the article. I will be using the age feature only from the titanic dataset to plot histogram and QQΒ Plot.
Below the graph is of age feature before featureΒ scaling
A. Logarithmic Transformation
In the Logarithmic Transformation, we will apply log to all values of features using NumPy and store it in the newΒ feature.
Using Log Transformation doesnβt seem to fit very well in this dataset, it even worsens the distribution by making data left-skewed. So we have to rely on other methods to achieve normal distribution.
B. Reciprocal Transformation
In Reciprocal Transformation, we divide each value of a feature by 1(reciprocal) and store it in the newΒ feature.
Reciprocal Transformation doesnβt work well with this data, It doesnβt give normal distribution instead it made data even more right-skewed.
C. Square Root Translation
In square root transformation, we raise the values of feature to the power of fraction(1/2) to achieve the square root of a value. We can also use NumPy for this transformation.
Square root transformation seems to perform better than reciprocal and log transformation with this data but yet it is a bit left-skewed.
D. Box-Cox Transformation
Box-Cox transformation is one of the most useful scaling techniques to transfer data distribution in a normal distribution.
The Box-Cox transformation can be definedΒ as:
T(Y)=(Y exp(Ξ»)β1)/Ξ»
Where Y is the response variable and Ξ» is the transformation parameter. Ξ» varies from -5 to 5. In the transformation, all values of Ξ» are considered and the optimal value for a given variable is selected.
We can calculate box cox transformation using stats from the SciPyΒ module.
So far box cox transformation seems to be the best fit for the age feature to transform.
Conclusion
There are other techniques also to perform to get Gaussian distribution but most of the time one of these methods seems to fit well on theΒ dataset.
Feature Transformation was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Join thousands of data leaders on the AI newsletter. Itβs free, we donβt spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI