

Encoding Categorical Data- The Right Way

Last Updated on October 18, 2022 by Editorial Team

Author(s): Gowtham S R

Originally published on Towards AI.

Photo from Unsplash by Icons8 Team
Table of Contents
· Types of Data
∘ Continuous Data
∘ Discrete Data
∘ Nominal Data
∘ Ordinal Data
· How to Encode Categorical data?
∘ Ordinal Encoding
∘ Nominal Encoding
∘ OneHotEncoding using Pandas
∘ Dummy Variable Trap
∘ OneHotEncoding using Sklearn

Types of Data

In statistics and machine learning, data is broadly classified into two types: numerical and categorical.

Numerical data is further divided into discrete and continuous data. Categorical data is divided into nominal and ordinal data.

Image by the author explaining Types of Data

Continuous Data

Continuous data can take any numeric value, including decimals. Within a given range, it has infinitely many possible values.

Example: Weight and height of a person, temperature, age, distance.

Image by author

Discrete Data

Discrete data can take only distinct, separate values, typically whole numbers rather than decimals. The possible values are countable and cannot be subdivided meaningfully.

Example: Number of people in a family, number of bank accounts a person can have, number of students in a classroom, number of goals scored in a football match.

Image by author


Nominal Data

Nominal data is data used for naming or labeling variables, without any quantitative value. It is sometimes called “named” data.

Nominal data has no intrinsic ordering. For example, gender is a nominal variable with 2 categories, but there is no meaningful way to rank them from highest to lowest or vice versa.

Examples: letters, symbols, words, gender, color, etc.

Image by author

Ordinal Data

Ordinal data is categorical data whose categories follow a natural order.

Examples: Customer rating (Good, Average, Bad), medal category in the Olympics (Gold, Silver, Bronze)

Image by author
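The difference between ordinal and nominal data can be made concrete in pandas. The sketch below is only an illustration: the rating and color values are invented, and using an ordered categorical dtype is one possible way to record the order, not something the rest of this article relies on.

```python
import pandas as pd

# Hypothetical customer ratings; the values are invented for illustration.
ratings = pd.Series(["Good", "Bad", "Average", "Good"])

# An ordered categorical dtype records Bad < Average < Good explicitly.
rating_dtype = pd.CategoricalDtype(categories=["Bad", "Average", "Good"], ordered=True)
ordinal_ratings = ratings.astype(rating_dtype)

# min/max and comparisons respect the declared order.
print(ordinal_ratings.min(), ordinal_ratings.max())
print((ordinal_ratings >= "Average").tolist())

# A nominal variable has no such order, so its categorical dtype stays unordered.
colors = pd.Series(["red", "blue", "green"]).astype("category")
print(colors.cat.ordered)
```

With an ordered dtype, asking for the lowest rating or filtering on "at least Average" is meaningful, whereas the same questions make no sense for the unordered color column.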


How to Encode Categorical data?

Encoding categorical data is the process of converting categorical data into numeric form so that it can be fed to different models.

Categorical data usually comes as strings or the object data type. However, most machine learning and deep learning algorithms work only on numbers, so as machine learning engineers we need to convert categorical data to numeric form.
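Before encoding, it helps to find which columns are non-numeric. The following is a minimal sketch on a small hypothetical table (the column names echo the customer data used later, but the rows are invented); it simply excludes numeric columns rather than relying on a specific string dtype.

```python
import pandas as pd

# Hypothetical miniature of a customer table; the values are invented.
df = pd.DataFrame({
    "age": [30, 68, 16],
    "gender": ["Female", "Male", "Female"],
    "review": ["Average", "Poor", "Good"],
})

# Columns that are not numeric are the candidates for encoding.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```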

Ordinal Encoding

We need to preserve the intrinsic order while converting ordinal data to numeric data. So, each category is assigned an integer from 0 to (number of categories - 1). If we have 3 categories in the data, such as 'bad', 'average', and 'good', then 'bad' is encoded as 0, 'average' as 1, and 'good' as 2, so that the order is maintained.

When we train a machine learning model on this encoding, it can make use of the intrinsic order, which often gives better results.
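The numbering described above can be sketched by hand with a plain dictionary. This is only an illustration of the idea on invented values; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical reviews; the values are invented for illustration.
reviews = pd.Series(["bad", "good", "average", "bad"])

# List the categories from lowest to highest, then number them 0, 1, 2.
order = ["bad", "average", "good"]
mapping = {category: code for code, category in enumerate(order)}

encoded = reviews.map(mapping)
print(mapping)
print(encoded.tolist())
```

Because the codes follow the declared order, comparisons such as `encoded >= 1` correspond to "average or better".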

We use the OrdinalEncoder class from sklearn to encode ordinal data.


Let us see how we can encode ordinal data practically.

import pandas as pd
import numpy as np
df = pd.read_csv('customer.csv')
df.head()
df['age'].unique()
# array([30, 68, 70, 72, 16, 31, 18, 60, 65, 74, 98, 51, 57, 15, 75, 59, 22, 19, 97, 32, 96, 53, 69, 48, 83, 73, 92, 89, 86, 34, 94, 45, 76, 39, 23, 27, 77, 61, 64, 38, 25], dtype=int64)
df['gender'].unique()
# array(['Female', 'Male'], dtype=object)
df['review'].unique()
# array(['Average', 'Poor', 'Good'], dtype=object)
df['education'].unique()
# array(['School', 'UG', 'PG'], dtype=object)
df['purchased'].unique()
# array(['No', 'Yes'], dtype=object)

‘age’: numerical data

‘gender’: nominal categorical data

‘review’: ordinal categorical data

‘education’: ordinal categorical data

‘purchased’: the target variable

Let us separate the ordinal columns and learn how to encode them.

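# Drop the first two columns (assumed to be age and gender); keep review, education, purchased
df = df.iloc[:,2:]
df.head()
# Features: the two ordinal columns; target: the last column (purchased)
X = df.iloc[:,0:2]
y = df.iloc[:,-1]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.2, random_state=1)
from sklearn.preprocessing import OrdinalEncoder
# List each column's categories from lowest to highest
ordinal_encoder = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])
# Fit on the training split only, then apply the same mapping to both splits
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)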

In the above code, we pass the categories to the OrdinalEncoder class in order, with Poor being the lowest and Good the highest for the review feature.

Similarly, for the education column, School is the lowest and PG is the highest.

After the transformation, each row is a pair of integer codes. The examples below come from the first 5 rows of the result.

The first row had review = Average and education = UG, so it is transformed to [1,1].

The second row had review = Poor and education = PG, so it is transformed to [0,2].
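The two rows discussed above can be reproduced without the CSV file. The sketch below uses a small hand-made DataFrame (the rows are invented to mirror those examples, plus one extra row) and also shows `inverse_transform`, which maps the codes back to labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows mirroring the two examples discussed above, plus one more.
X = pd.DataFrame({
    "review": ["Average", "Poor", "Good"],
    "education": ["UG", "PG", "School"],
})

# Categories are listed from lowest to highest for each column.
encoder = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]])
encoded = encoder.fit_transform(X)
print(encoded)

# inverse_transform recovers the original labels from the integer codes.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Note that OrdinalEncoder returns floating-point codes by default, so [1,1] appears as [1., 1.].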

Nominal Encoding

Nominal data has no intrinsic order, so we use the OneHotEncoder class from sklearn to encode it.

If we encoded the categories as 0, 1, 2, and so on, a machine learning algorithm could treat 2 as greater than 0 and 1, which is misleading because nominal data has no intrinsic order. So, ordinal encoding does not work for nominal data.

With one-hot encoding, each category is transformed into a new column that holds 0 or 1 depending on whether that category occurs in the row. So, the number of features increases after the transformation.
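To see what this transformation does before reaching for a library, here is a hand-rolled sketch in NumPy. The color values are hypothetical, and this is an illustration of the idea rather than how pandas or sklearn implement it.

```python
import numpy as np

# Hypothetical nominal feature; the values are invented for illustration.
colors = np.array(["red", "green", "blue", "green"])

# np.unique returns the distinct categories in sorted order.
categories = np.unique(colors)

# Compare every value against every category: one column per category.
one_hot = (colors[:, None] == categories[None, :]).astype(int)

print(categories.tolist())
print(one_hot)
```

One input column becomes three columns here, and each row contains exactly one 1.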

Let us see how we can encode nominal data practically. Let us take the Titanic dataset with the features ‘Sex’ and ‘Embarked’ (both are nominal features).

import pandas as pd
import numpy as np
df = pd.read_csv('titanic.csv')
df.head()
df['Sex'].unique()
# array(['male', 'female'], dtype=object)
df['Embarked'].unique()
# array(['S', 'C', 'Q'], dtype=object)
df['Survived'].unique()
# array([0, 1], dtype=int64)

OneHotEncoding using Pandas

pd.get_dummies(df,columns=['Sex','Embarked'])

The output shows how each category is transformed into a column. The column ‘Sex’ is transformed into ‘Sex_female’ and ‘Sex_male’, and the column ‘Embarked’ is transformed into ‘Embarked_C’, ‘Embarked_Q’, and ‘Embarked_S’.

These new attributes/columns created are called Dummy Variables.

Dummy Variable Trap:

The dummy variable trap is a scenario in which attributes are highly correlated (multicollinear), so that one variable predicts the value of the others. When we use one-hot encoding for categorical data, any one dummy variable (attribute) can be predicted from the other dummy variables of the same feature. Hence, the dummy variables are highly correlated with one another.

Using all the dummy variables in a model leads to the dummy variable trap. Some machine learning algorithms, such as Linear Regression and Logistic Regression, suffer from multicollinearity.


For example, consider gender with two values, male and female, encoded as two dummy variables that each take the value 0 or 1. Including both dummy variables is redundant, because in this encoding a person who is not male is female. Hence, we do not need both variables in regression models, and dropping one protects us from the dummy variable trap.
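The redundancy can be checked numerically. In the sketch below (invented data, hypothetical variable names), the two gender dummies always sum to 1, which equals the intercept column of a regression, so the design matrix has fewer independent columns than it has columns.

```python
import numpy as np
import pandas as pd

# Hypothetical gender column; the values are invented for illustration.
sex = pd.Series(["male", "female", "female", "male", "female"], name="Sex")

# Both dummy variables: columns 'female' and 'male'.
full = pd.get_dummies(sex, dtype=int)
# Keep only one of them.
reduced = full.drop(columns="female")

# A regression design matrix usually also contains an intercept column of ones.
intercept = np.ones((len(sex), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# Rank below the column count means some column is a combination of the others.
print(np.linalg.matrix_rank(X_full), X_full.shape[1])
print(np.linalg.matrix_rank(X_reduced), X_reduced.shape[1])
```

With both dummies the matrix has 3 columns but rank 2; after removing one dummy, the 2 remaining columns are independent.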


So, the models should be designed to exclude one dummy variable. Usually, we drop the first variable as shown below.

pd.get_dummies(df,columns=['Sex','Embarked'],drop_first=True)

The result shows that the first variable of each feature has been dropped: Sex_female and Embarked_C are the two columns dropped from the result.
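The effect of `drop_first` is easy to verify on a tiny stand-in for the Titanic data. The three rows below are invented for illustration; only the column names follow the article.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic columns used in this article.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Survived": [0, 1, 1],
})

all_dummies = pd.get_dummies(df, columns=["Sex", "Embarked"])
reduced = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True)

# Compare the column names with and without drop_first.
print(all_dummies.columns.tolist())
print(reduced.columns.tolist())
```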

OneHotEncoding using Sklearn

We can use the OneHotEncoder class from sklearn to perform the above steps.

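# Features: Sex and Embarked (assumed to be the first two columns); target: the last column
X = df.iloc[:,0:2]
y = df.iloc[:,-1]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)
from sklearn.preprocessing import OneHotEncoder
# drop='first' removes the first category of each feature to avoid the dummy variable trap.
# scikit-learn 1.2 renamed sparse to sparse_output; on older versions pass sparse=False instead.
ohe = OneHotEncoder(drop='first',sparse_output=False,dtype=np.int32)
# Fit on the training split only, then reuse the fitted encoder on the test split
X_train_new = ohe.fit_transform(X_train)
X_test_new = ohe.transform(X_test)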

Note that the new features are created in alphabetical order of the categories. In this example, the ‘Sex’ column is transformed into Sex_female and Sex_male, and Sex_female is dropped because it is the first column.

In the ‘Embarked’ feature, Embarked_C, Embarked_Q, and Embarked_S are the new columns in alphabetical order, and category ‘C’ is dropped. So the array [0,0] represents ‘C’, [1,0] represents ‘Q’, and [0,1] represents ‘S’.



Encoding Categorical Data- The Right Way was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
