An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding…

Last Updated on August 5, 2021 by Editorial Team

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding, Label Binarizer, Standardization, Normalization, Robust Scaling, and more

Useful data preprocessing methods with python implementation in one view

The dataset should render suitable for the data trained in Machine Learning and the prediction made by the algorithm to yield more successful results. Each column has a different characteristic and separate methods should be applied to each if necessary. For example, better results are obtained when categorical data is converted to numeric data. The distinction to be made here is to choose the data preprocessing method suitable for the model and the project. This article contains different angles to look at the dataset to make it easier for algorithms to learn the dataset. All studies are made more understandable with python applications.

Table of Contents (TOC)
1. Categorical Variables
1.1. Column Transformers, OneHotEncoding, Dummy Coding
1.2. Label Encoder
1.3. Ordinal Encoder
1.4. Label Binarizer
2. Numeric Variables
2.1. Standardization
2.2. Normalization
2.3. L² Regularization
2.4. Robust Scaler

1. Categorical Variables

The columns in the dataset are ready to be processed by the algorithm, they can be presented continuously (continuous features), or they can be presented without variation continuously, for example, when we consider the iris dataset, a flower is either Iris Setosa, Iris Versicolor or Iris Virginia. It cannot be of any value in the middle of these types. Such datasets are called categorical features and it is necessary to make these columns suitable for the algorithm (convert categorical to numeric data). These methods are explained below and implemented in python.

1.1. Column Transformers, One Hot Encoding, and Dummy Coding

The processes for the categorical column and the numerical column are not the same. With Column Transformers, multiple different preprocessing operations can be performed on the columns in the dataset at the same time.

OneHotEncoding turns each type of categorical data into a column and makes the dataset numeric. For example, in the dataset, it adds 2 new columns, male and female, instead of the male/female option in the sex column, and for female data: the male option assigns 0 and female option 1 and vice versa.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

IN[1]
data=pd.read_csv('toy_dataset.csv')
data

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

IN[2]
ohe=OneHotEncoder(sparse=False)
scaler = MinMaxScaler()
ct=make_column_transformer((ohe,['City','Gender','Illness']),
                      (scaler  ['Income']),remainder='passthrough')
data_ct=ct.fit_transform(data)
data_ct

IN[3]
ohe_True=OneHotEncoder(sparse=True)
ohe_with_sparse=ohe_True.fit_transform(data['Gender'].values.reshape(-1,1)).toarray()
ohe_with_sparse
OUT[3]
array([[0., 1.],
       [0., 1.],
       [0., 1.],
       ...,
       [0., 1.],
       [0., 1.],
       [1., 0.]])

The same procedure is done with pandas.get_dummies in Pandas. However, OneHotEncoder is preferred for machine learning studies. Because OneHotEncoder is from the transformer class and is applied with the fit command. Once it is applied, it can transform the next data according to the categories it has learned, and it is easy to use with the pipeline.

IN[4]
dum=pd.get_dummies(data,columns=['Gender','City','Illness'],prefix='',prefix_sep='')

Now let’s look at the OneHotEncoder hyperparameters.

Sparse: Due to the dataset include a lot of “0” values sparse is defined as 0. However, when Sparse=False is set and .toarray() is added, the dataset can be readable.
Handle_unknown: When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.

1.2. Label Encoder

Categorical data is turned into numeric starting from 0 and increasing arithmetically. For example, a column containing red, white, blue data; Change 0 corresponds red, 1 corresponds white and 3 corresponds blue.

IN[5]
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
encoded_data_city=le.fit_transform(data['City'])
print("Cities",np.unique(encoded_data_city))
encoded_data_gender=le.fit_transform(data['Gender'])
print("Gender",np.unique(encoded_data_gender))
OUT[5]
Cities [0 1 2 3 4 5 6 7]
Gender [0 1]

1.3. Ordinal Encoder

Ordinal Encoder does the same as Label Encoder. The difference between Ordinal Encoder is used for features although Label Encoder is used Labels(targets) so that Ordinal Encoder fit the data with (number_of_samples, number_of_features) while Label Encoder (number_of_samples)

IN[6]
from sklearn.preprocessing import OrdinalEncoder
oe=OrdinalEncoder(categories=[dictionary])
encoded_data_city=oe.fit_transform(data[['City']])
print("Cities", np.unique(encoded_data_city))
OUT[6]
Cities [0. 1. 2. 3. 4. 5. 6. 7.]

1.4. Label Binarizer

Label Binarizer has the same functionality as OneHotEncoding. The difference is that OHE is used for multicolumn data while LabelBinarizer is used for a single column only. It is used for a single column in LabelEncoder.

IN[7]
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
label_city=lb.fit_transform(data['City'])
label_city
OUT[7]
array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

2. Numeric Variables

Rescaling Numeric Variables is very important for the algorithm to give more accurate results. Thanks to Rescale, all numeric data are sorted between a certain range according to certain rules. Let’s take a look at these methods now.

2.1. Standardization

The average value of the column is assigned 0 and all other numeric values are surrounded by variance 1.

IN[8]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_standardscaler=scaler.fit_transform(data[['Income']])
data_standardscaler
OUT[8]
array([[-2.03629388],
       [-1.84753398],
       [-1.55144865],
       ...,
       [ 0.82563756],
       [ 0.82535744],
       [-0.16013972]])

2.2. Normalization

The numeric data in the column are distributed between 0 and 1, with the minimum data being 0 and the maximum data being 1.

IN[9]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_minmaxscaler=scaler.fit_transform(data[['Income']])
data_minmaxscaler
OUT[9}
array([[0.23070001],
       [0.25722818],
       [0.29883978],
       ...,
       [0.6329136 ],
       [0.63287423],
       [0.49437324]])

2.3. L² Regularization

L2 normalization is also known as the Euclidian norm. All data is placed between -1 and 1.

2.4. Robust Scaler

Median value and Inter Quantile Range(IQR) is calculated.

new_x(i) = [x(i) — median_value]/IQR

IN[10]
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
data_robust=scaler.fit_transform(data[['Income']])
data_robust

OUT[10]
array([[-2.2530733 ],
       [-2.05363353],
       [-1.74079594],
       ...,
       [ 0.77078379],
       [ 0.77048782],
       [-0.27076793]])

Back to the guideline click here.

Machine Learning Guideline

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding… was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding…

Author(s): Ibrahim Kovan

Data Analysis

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding, Label Binarizer, Standardization, Normalization, Robust Scaling, and more

1. Categorical Variables

1.1. Column Transformers, One Hot Encoding, and Dummy Coding

1.2. Label Encoder

1.3. Ordinal Encoder

1.4. Label Binarizer

2. Numeric Variables

2.1. Standardization

2.2. Normalization

2.3. L² Regularization

2.4. Robust Scaler

Back to the guideline click here.

Towards AI Team

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

🔎 Decoding LLM Pipeline — Step 1: Input Processing & Tokenization

Meta to Launch Its Own In-House AI Chip

I Built an AI Money Coach in Python — Here’s How You Can Too (Step-by-Step Guide!)

ChatGPT Now Works Natively in Xcode and VS Code

TAI #143: New Scaling Laws Incoming? Ilya’s SSI Raises at $30bn, Manus Takes AI Agents Mainstream

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding…

Author(s): Ibrahim Kovan

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding, Label Binarizer, Standardization, Normalization, Robust Scaling, and more

1. Categorical Variables

1.1. Column Transformers, One Hot Encoding, and Dummy Coding

1.2. Label Encoder

1.3. Ordinal Encoder

1.4. Label Binarizer

2. Numeric Variables

2.1. Standardization

2.2. Normalization

2.3. L² Regularization

2.4. Robust Scaler

Back to the guideline click here.

Towards AI Team

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement