
# From Raw to Refined: A Journey Through Data Preprocessing — Part 1: Feature Scaling

Last Updated on August 7, 2023 by Editorial Team

#### Author(s): Shivamshinde

Originally published on Towards AI.

The data we receive for machine learning tasks is often not in a format that Scikit-Learn or other machine learning libraries can use directly. As a result, we have to preprocess the data to transform it into a suitable format.

Raw data can have various issues, and each kind of issue calls for an appropriate method to deal with it. This part focuses on one such issue: features that are on unsuitable scales or follow unsuitable distributions.

Let’s look at some of these methods and how to implement them in code.

## Mean Removal and Variance Scaling (Standardization)

Many Scikit-Learn estimators (the Scikit-Learn classes used to train machine learning models) work best when each feature looks roughly like standard normally distributed data, i.e., a Gaussian distribution with zero mean and unit variance.

Raw data does not always look like this, so models trained on it might give sub-optimal results. Standardization addresses the location and scale part of the problem: it centers each feature at zero and rescales it to unit variance. Note that it does not change the shape of the distribution.

Standardization is performed using the formula below:

$$z = \frac{x - \mu}{\sigma}$$

Here, $\mu$ is the mean and $\sigma$ is the standard deviation of the column. First, the mean and standard deviation are calculated for a column. Then, the mean is subtracted from every data point in that column. Lastly, the result of each subtraction is divided by the standard deviation.
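The three steps described above can be checked by hand. The following is a minimal sketch that uses a small made-up column (the values are an assumption for illustration, not taken from the article's dataset) and compares the manual result with `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column, used only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)

# Step 1: compute the mean and standard deviation of the column
mean = x.mean()
std = x.std()  # population standard deviation (ddof=0), as StandardScaler uses

# Steps 2 and 3: subtract the mean, then divide by the standard deviation
z_manual = (x - mean) / std

# The same operation performed by Scikit-Learn
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.mean(), z_sklearn.std())
```

After the transformation, the column should have a mean of approximately zero and a standard deviation of approximately one.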

Let’s see how to implement this in code.

For the demonstration, let’s use the famous ‘Tips’ dataset. The ‘Tips’ dataset is used to predict the tips received by waiters based on different factors such as total bill, gender of the customer, day of the week, time of the day, etc.

```python
## Importing required libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline
```

```python
## Loading the data
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

## Checking out some of the rows of data
df.head()
```

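First, we need to separate the dependent variable (the target) from the independent features. In this walkthrough, ‘total_bill’ is treated as the target, which leaves ‘tip’ among the features that we will scale. After that, we need to split the data into training and testing sets.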

```python
## Importing required method
from sklearn.model_selection import train_test_split

## Separating dependent and independent features
X, y = df.drop('total_bill', axis=1), df['total_bill']

## Separating training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```

Let us perform standard scaling on the ‘tip’ column.

```python
from sklearn.preprocessing import StandardScaler

## calculating mean and standard deviation of the column (training data only)
scaler = StandardScaler().fit(np.array(X_train['tip']).reshape(-1, 1))

## transforming the test data using the mean and standard deviation learned above
tips_transformed = scaler.transform(np.array(X_test['tip']).reshape(-1, 1))
tips_transformed[:10]
```

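The ‘reshape’ method is used to convert the 1D array into a 2D array, since the fit and transform methods require a 2D array (samples by features) as input. Note also that the scaler is fitted on the training data only and then applied to the test data, so no information from the test set leaks into the preprocessing step.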
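As an alternative to reshaping, a single column can be selected with double brackets, which returns a 2D DataFrame instead of a 1D Series. The sketch below uses a small hypothetical DataFrame (`X_demo` and its values are assumptions for illustration) to show that both routes give the same scaled values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the training features
X_demo = pd.DataFrame({"tip": [1.0, 2.5, 3.0, 5.0], "size": [2, 3, 2, 4]})

as_series = X_demo["tip"]                           # 1D, shape (4,)
as_frame = X_demo[["tip"]]                          # 2D, shape (4, 1)
as_array = np.array(X_demo["tip"]).reshape(-1, 1)   # 2D, shape (4, 1)

# Both 2D inputs are accepted by the scaler and yield identical results
scaled_frame = StandardScaler().fit_transform(as_frame)
scaled_array = StandardScaler().fit_transform(as_array)

print(as_series.shape, as_frame.shape, as_array.shape)
print(np.allclose(scaled_frame, scaled_array))
```

The double-bracket form keeps the column name attached to the input, which can be convenient when several columns are scaled at once.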

## Scaling features to a range (MinMaxScaler and MaxAbsScaler)

This is another approach to feature scaling, in which each feature is scaled to lie within a given range, often between zero and one (MinMaxScaler), or so that the maximum absolute value of each feature is scaled to unit size (MaxAbsScaler).

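If we want to scale our data to lie between ‘min’ and ‘max’ values using MinMaxScaler, the following formula is used:

$$X_{\text{std}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, \qquad X_{\text{scaled}} = X_{\text{std}} \cdot (\text{max} - \text{min}) + \text{min}$$

Here, $X_{\min}$ and $X_{\max}$ are the minimum and maximum of the column in the data the scaler was fitted on.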

MinMaxScaler:

```python
from sklearn.preprocessing import MinMaxScaler

## learning the minimum and maximum of the column (training data only)
mm_scaler = MinMaxScaler().fit(np.array(X_train['tip']).reshape(-1, 1))

## transforming the test data using the learned minimum and maximum
tips_transformed = mm_scaler.transform(np.array(X_test['tip']).reshape(-1, 1))
tips_transformed[:10]
```
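To make the min-max arithmetic concrete, here is a small sketch with made-up numbers (the `train` and `test` arrays are assumptions for illustration). It also shows a practical consequence of fitting on the training data only: test values outside the training range are mapped outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test columns
train = np.array([1.0, 2.0, 4.0, 5.0]).reshape(-1, 1)
test = np.array([0.0, 3.0, 6.0]).reshape(-1, 1)

# The minimum (1.0) and maximum (5.0) are learned from the training data
scaler = MinMaxScaler(feature_range=(0, 1)).fit(train)

# Manual computation for the default (0, 1) range
train_manual = (train - train.min()) / (train.max() - train.min())
train_scaled = scaler.transform(train)

# Test values below 1.0 or above 5.0 fall outside the [0, 1] range
test_scaled = scaler.transform(test)

print(np.allclose(train_manual, train_scaled))
print(test_scaled.ravel())
```

If values outside the target range are a problem for a downstream model, `MinMaxScaler` accepts a `clip=True` argument that truncates transformed values to the requested range.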

MaxAbsScaler works in a similar way, but it scales the data such that each value lies in the range [-1, 1]. This is done by dividing each value by the maximum absolute value of its feature. Because the data is only divided and never shifted, zeros stay zeros.

Centering sparse data destroys its sparseness, so it is rarely a sensible thing to do. However, when the features are on different scales, it still makes sense to scale sparse inputs. MaxAbsScaler was designed specifically for scaling sparse data and is the recommended way to do so.

MaxAbsScaler:

```python
from sklearn.preprocessing import MaxAbsScaler

## learning the maximum absolute value of the column (training data only)
ma_scaler = MaxAbsScaler().fit(np.array(X_train['tip']).reshape(-1, 1))

## transforming the test data using the learned maximum absolute value
tips_transformed = ma_scaler.transform(np.array(X_test['tip']).reshape(-1, 1))
tips_transformed[:10]
```
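The ‘tip’ column is dense, so it does not show the sparse-data benefit. The following sketch, which assumes a small hand-written SciPy sparse matrix (not part of the original article), illustrates that MaxAbsScaler keeps the zero entries untouched.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse matrix: most entries are zero
X_sparse = sparse.csr_matrix(np.array([
    [0.0, -4.0, 0.0],
    [2.0,  0.0, 0.0],
    [0.0,  8.0, 5.0],
]))

scaler = MaxAbsScaler().fit(X_sparse)
X_scaled = scaler.transform(X_sparse)

# Each column is divided by its maximum absolute value
print(scaler.max_abs_)
print(X_scaled.toarray())

# The number of stored non-zero entries is unchanged
print(X_sparse.nnz, X_scaled.nnz)
```

Since no value is shifted, the output remains a sparse matrix with the same non-zero pattern as the input.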

## Scaling data with outliers (RobustScaler)

If the data contains many outliers, the mean and standard deviation can be heavily distorted by them. In that case, these values no longer represent the center or the spread of the data correctly, so scaling with the mean and standard deviation does not work well.

To circumvent this issue, we can use RobustScaler, which uses more robust estimates for the center and range of the data: by default, it subtracts the median and divides by the interquartile range (the difference between the 75th and 25th percentiles).

```python
from sklearn.preprocessing import RobustScaler

## calculating the median and interquartile range of the column (training data only)
r_scaler = RobustScaler().fit(np.array(X_train['tip']).reshape(-1, 1))

## transforming the test data using the learned median and interquartile range
tips_transformed = r_scaler.transform(np.array(X_test['tip']).reshape(-1, 1))
tips_transformed[:10]
```
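To see why robust statistics matter, here is a minimal sketch on a made-up column with one extreme value (the numbers are an assumption for illustration). It reproduces RobustScaler's default behaviour by hand and compares how the non-outlier values end up under both scalers.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical column: five ordinary values and one extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]).reshape(-1, 1)

# Manual robust scaling: subtract the median, divide by the interquartile range
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
robust_manual = (x - median) / (q3 - q1)

robust_scaled = RobustScaler().fit_transform(x)
standard_scaled = StandardScaler().fit_transform(x)

print(np.allclose(robust_manual, robust_scaled))

# Spread of the five ordinary values after each kind of scaling
print(np.ptp(robust_scaled[:5]), np.ptp(standard_scaled[:5]))
```

With StandardScaler, the outlier inflates the standard deviation, so the ordinary values are squeezed into a narrow band. With RobustScaler, they keep a usable spread while the outlier simply remains far away.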

## Mapping to a Uniform Distribution (QuantileTransformer)

QuantileTransformer can be used to map the data to a uniform distribution with values between 0 and 1.

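```python
from sklearn.preprocessing import QuantileTransformer

## learning the quantiles of the column from the training data
q_transformer = QuantileTransformer().fit(np.array(X_train['tip']).reshape(-1, 1))

## mapping the training data to a uniform distribution
xtrain_transformed = q_transformer.transform(np.array(X_train['tip']).reshape(-1, 1))
```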

Now, let’s visualize the ‘tip’ column before and after the transformation.

```python
## collecting the original and transformed values side by side
dataframe = pd.DataFrame()
dataframe['tips_given'] = X_train['tip']
dataframe['tips_uniformDist'] = xtrain_transformed.ravel()

sns.set_style('darkgrid')
plt.figure(figsize=(10, 5))
for index, feature in enumerate(dataframe.columns):
    plt.subplot(1, 2, index + 1)
    ## histplot is used because distplot is deprecated in recent seaborn releases
    sns.histplot(dataframe[feature], kde=True, color='b')
    plt.xlabel(feature)
    plt.ylabel('distribution')
    plt.title(f"{feature} distribution")
plt.tight_layout()
```
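The effect can also be checked numerically rather than visually. The sketch below assumes synthetic, right-skewed data drawn from a log-normal distribution (not the ‘tip’ column) and sets `n_quantiles` explicitly to the number of samples, since the default of 1000 exceeds the size of small datasets.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Hypothetical right-skewed data
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=200).reshape(-1, 1)

# n_quantiles should not exceed the number of samples
qt = QuantileTransformer(n_quantiles=200, output_distribution="uniform", random_state=0)
uniform = qt.fit_transform(skewed)

# A uniform distribution puts roughly equal counts in equal-width bins
counts, _ = np.histogram(uniform, bins=4, range=(0.0, 1.0))

print(uniform.min(), uniform.max())
print(counts)
```

Each of the four equal-width bins should hold roughly a quarter of the samples, which is what a uniform distribution on [0, 1] implies.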

## Mapping to a Gaussian Distribution (PowerTransformer)

We can use the PowerTransformer to map the data to a distribution that is as close as possible to a Gaussian distribution.

We can choose from two methods that are used to make this transformation.

1. Box-Cox transform
2. Yeo-Johnson transform

Note that the Box-Cox transformation can only be applied to strictly positive data, whereas the Yeo-Johnson transformation also handles zero and negative values.
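The positivity requirement is easy to demonstrate. In this minimal sketch, the input array is a made-up example containing zero and negative values: the Box-Cox fit raises a `ValueError`, while Yeo-Johnson runs without complaint.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical column that includes negative values
data_with_negatives = np.linspace(-2.0, 7.0, 30).reshape(-1, 1)

# Box-Cox requires strictly positive input and raises a ValueError otherwise
try:
    PowerTransformer(method="box-cox").fit(data_with_negatives)
    box_cox_ok = True
except ValueError as err:
    box_cox_ok = False
    print(f"Box-Cox failed: {err}")

# Yeo-Johnson accepts any real-valued input
yeo_johnson = PowerTransformer(method="yeo-johnson").fit_transform(data_with_negatives)
print(yeo_johnson.shape)
```

For the ‘tip’ column, both methods are applicable because every tip is a positive amount.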

Box-Cox transform:

```python
from sklearn.preprocessing import PowerTransformer

## learning the Box-Cox parameter from the training data and transforming it
p_transformer = PowerTransformer(method='box-cox').fit(np.array(X_train['tip']).reshape(-1, 1))
xtrain_transformed = p_transformer.transform(np.array(X_train['tip']).reshape(-1, 1))
```

```python
## collecting the original and transformed values side by side
dataframe = pd.DataFrame()
dataframe['tips_given'] = X_train['tip']
dataframe['tips_NormalDist'] = xtrain_transformed.ravel()

sns.set_style('darkgrid')
plt.figure(figsize=(10, 5))
for index, feature in enumerate(dataframe.columns):
    plt.subplot(1, 2, index + 1)
    ## histplot is used because distplot is deprecated in recent seaborn releases
    sns.histplot(dataframe[feature], kde=True, color='y')
    plt.xlabel(feature)
    plt.ylabel('distribution')
    plt.title(f"{feature} distribution")
plt.tight_layout()
```

Yeo-Johnson transform:

```python
from sklearn.preprocessing import PowerTransformer

## learning the Yeo-Johnson parameter from the training data and transforming it
p_transformer = PowerTransformer(method='yeo-johnson').fit(np.array(X_train['tip']).reshape(-1, 1))
xtrain_transformed = p_transformer.transform(np.array(X_train['tip']).reshape(-1, 1))
```

```python
## collecting the original and transformed values side by side
dataframe = pd.DataFrame()
dataframe['tips_given'] = X_train['tip']
dataframe['tips_NormalDist'] = xtrain_transformed.ravel()

sns.set_style('darkgrid')
plt.figure(figsize=(10, 5))
for index, feature in enumerate(dataframe.columns):
    plt.subplot(1, 2, index + 1)
    ## histplot is used because distplot is deprecated in recent seaborn releases
    sns.histplot(dataframe[feature], kde=True, color='g')
    plt.xlabel(feature)
    plt.ylabel('distribution')
    plt.title(f"{feature} distribution")
plt.tight_layout()
```
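Besides plotting, one rough way to judge how Gaussian the result looks is to compare the skewness before and after the transformation. The sketch below assumes synthetic log-normal data and uses `scipy.stats.skew`; neither appears in the original walkthrough. Skewness near zero suggests a more symmetric distribution, although it is not a full test of normality.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Hypothetical strictly positive, right-skewed data
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=1.0, sigma=0.6, size=500).reshape(-1, 1)
skew_before = float(skew(skewed.ravel()))

results = {}
for method in ("box-cox", "yeo-johnson"):
    transformed = PowerTransformer(method=method).fit_transform(skewed)
    results[method] = {
        "skew": float(skew(transformed.ravel())),
        "mean": float(transformed.mean()),
        "std": float(transformed.std()),
    }

print(f"skewness before: {skew_before:.3f}")
for method, stats in results.items():
    print(method, stats)
```

By default, PowerTransformer also standardizes its output (`standardize=True`), so the transformed column should have a mean close to zero and a standard deviation close to one.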

Thanks for reading! If you have any thoughts on the article, then please let me know.


## Shivam Shinde

Have a great day!


Published via Towards AI