Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Take our 85+ lesson From Beginner to Advanced LLM Developer Certification: From choosing a project to deploying a working product this is the most comprehensive and practical LLM course out there!

Publication

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding…
Latest

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding…

Last Updated on August 5, 2021 by Editorial Team

Author(s): Ibrahim Kovan

Data Analysis

An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding, Label Binarizer, Standardization, Normalization, Robust Scaling, and more

Useful data preprocessing methods with python implementation in one view

The dataset should render suitable for the data trained in Machine Learning and the prediction made by the algorithm to yield more successful results. Each column has a different characteristic and separate methods should be applied to each if necessary. For example, better results are obtained when categorical data is converted to numeric data. The distinction to be made here is to choose the data preprocessing method suitable for the model and the project. This article contains different angles to look at the dataset to make it easier for algorithms to learn the dataset. All studies are made more understandable with python applications.

Table of Contents (TOC)
1. Categorical Variables
1.1. Column Transformers, OneHotEncoding, Dummy Coding
1.2. Label Encoder
1.3. Ordinal Encoder
1.4. Label Binarizer
2. Numeric Variables
2.1. Standardization
2.2. Normalization
2.3. L² Regularization
2.4. Robust Scaler
Photo by Gaelle Marcel on Unsplash

1. Categorical Variables

The columns in the dataset are ready to be processed by the algorithm, they can be presented continuously (continuous features), or they can be presented without variation continuously, for example, when we consider the iris dataset, a flower is either Iris Setosa, Iris Versicolor or Iris Virginia. It cannot be of any value in the middle of these types. Such datasets are called categorical features and it is necessary to make these columns suitable for the algorithm (convert categorical to numeric data). These methods are explained below and implemented in python.

1.1. Column Transformers, One Hot Encoding, and Dummy Coding

The processes for the categorical column and the numerical column are not the same. With Column Transformers, multiple different preprocessing operations can be performed on the columns in the dataset at the same time.

OneHotEncoding turns each type of categorical data into a column and makes the dataset numeric. For example, in the dataset, it adds 2 new columns, male and female, instead of the male/female option in the sex column, and for female data: the male option assigns 0 and female option 1 and vice versa.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
IN[1]
data=pd.read_csv('toy_dataset.csv')
data
Figure 1. OUT[1], Image by author
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
IN[2]
ohe=OneHotEncoder(sparse=False)
scaler = MinMaxScaler()
ct=make_column_transformer((ohe,['City','Gender','Illness']),
(scaler ['Income']),remainder='passthrough')
data_ct=ct.fit_transform(data)
data_ct
Figure 2. OUT[2], Image by author
IN[3]
ohe_True=OneHotEncoder(sparse=True)
ohe_with_sparse=ohe_True.fit_transform(data['Gender'].values.reshape(-1,1)).toarray()
ohe_with_sparse
OUT[3]
array([[0., 1.],
[0., 1.],
[0., 1.],
...,
[0., 1.],
[0., 1.],
[1., 0.]])

The same procedure is done with pandas.get_dummies in Pandas. However, OneHotEncoder is preferred for machine learning studies. Because OneHotEncoder is from the transformer class and is applied with the fit command. Once it is applied, it can transform the next data according to the categories it has learned, and it is easy to use with the pipeline.

IN[4]
dum=pd.get_dummies(data,columns=['Gender','City','Illness'],prefix='',prefix_sep='')
Figure 3. OUT[4], Image by author

Now let’s look at the OneHotEncoder hyperparameters.

  • Sparse: Due to the dataset include a lot of “0” values sparse is defined as 0. However, when Sparse=False is set and .toarray() is added, the dataset can be readable.
  • Handle_unknown: When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.

1.2. Label Encoder

Categorical data is turned into numeric starting from 0 and increasing arithmetically. For example, a column containing red, white, blue data; Change 0 corresponds red, 1 corresponds white and 3 corresponds blue.

IN[5]
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
encoded_data_city=le.fit_transform(data['City'])
print("Cities",np.unique(encoded_data_city))
encoded_data_gender=le.fit_transform(data['Gender'])
print("Gender",np.unique(encoded_data_gender))
OUT[5]
Cities [0 1 2 3 4 5 6 7]
Gender [0 1]

1.3. Ordinal Encoder

Ordinal Encoder does the same as Label Encoder. The difference between Ordinal Encoder is used for features although Label Encoder is used Labels(targets) so that Ordinal Encoder fit the data with (number_of_samples, number_of_features) while Label Encoder (number_of_samples)

IN[6]
from sklearn.preprocessing import OrdinalEncoder
oe=OrdinalEncoder(categories=[dictionary])
encoded_data_city=oe.fit_transform(data[['City']])
print("Cities", np.unique(encoded_data_city))
OUT[6]
Cities [0. 1. 2. 3. 4. 5. 6. 7.]

1.4. Label Binarizer

Label Binarizer has the same functionality as OneHotEncoding. The difference is that OHE is used for multicolumn data while LabelBinarizer is used for a single column only. It is used for a single column in LabelEncoder.

IN[7]
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
label_city=lb.fit_transform(data['City'])
label_city
OUT[7]
array([[0, 0, 1, ..., 0, 0, 0],
[0, 0, 1, ..., 0, 0, 0],
[0, 0, 1, ..., 0, 0, 0],
...,
[1, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0]])

2. Numeric Variables

Rescaling Numeric Variables is very important for the algorithm to give more accurate results. Thanks to Rescale, all numeric data are sorted between a certain range according to certain rules. Let’s take a look at these methods now.

2.1. Standardization

The average value of the column is assigned 0 and all other numeric values are surrounded by variance 1.

Figure 4. Standardization, source
IN[8]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_standardscaler=scaler.fit_transform(data[['Income']])
data_standardscaler
OUT[8]
array([[-2.03629388],
[-1.84753398],
[-1.55144865],
...,
[ 0.82563756],
[ 0.82535744],
[-0.16013972]])

2.2. Normalization

The numeric data in the column are distributed between 0 and 1, with the minimum data being 0 and the maximum data being 1.

Figure 5. Normalization, source
IN[9]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
data_minmaxscaler=scaler.fit_transform(data[['Income']])
data_minmaxscaler
OUT[9}
array([[0.23070001],
[0.25722818],
[0.29883978],
...,
[0.6329136 ],
[0.63287423],
[0.49437324]])

2.3. L² Regularization

L2 normalization is also known as the Euclidian norm. All data is placed between -1 and 1.

Figure 6. L2 normalization, source

2.4. Robust Scaler

Median value and Inter Quantile Range(IQR) is calculated.

new_x(i) = [x(i) — median_value]/IQR

IN[10]
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
data_robust=scaler.fit_transform(data[['Income']])
data_robust
OUT[10]
array([[-2.2530733 ],
[-2.05363353],
[-1.74079594],
...,
[ 0.77078379],
[ 0.77048782],
[-0.27076793]])

Back to the guideline click here.

Machine Learning Guideline


An Overview of Data Preprocessing: Converting Variables — Column Transformers, OneHotEncoding… was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Feedback ↓