
Effective Categorical Variable Encoding for Machine Learning

Last Updated on January 7, 2023 by Editorial Team

Author(s): Filipe Filardi

Image by DCStudio on Freepik

Categorical variables are a common type of data found in many machine learning datasets. Handling them effectively can be crucial for building successful models, since they often carry rich information that can be used to predict outcomes.

However, working with categorical variables can be challenging, as many models are designed to handle only numerical data. As a result, practitioners are often unsure how to process categorical data correctly, which leads to confusion and potentially suboptimal model performance.

This article aims to provide a clear and comprehensive overview of the most popular approaches to handling categorical data in machine learning. By understanding the available options and their implications, I hope to give readers the knowledge and tools they need to handle categorical data in their own machine learning projects.

Categorical Data in Machine Learning

Categorical data consists of values that fall into a limited set of discrete categories. In machine learning, it is common to encounter categorical data in variables such as gender, race, nationality, genre, or occupation. Categorical data is ubiquitous in real-world datasets, so it is vital to handle it properly.

One of the main challenges of working with categorical data is that most machine learning algorithms are designed to work with numerical data. This means that categorical data must be transformed into a numerical format before it can be used as input to the model.

Dealing with Categorical Data

This section explores some popular methods for dealing with categorical data in machine learning, illustrated with a small music-genres example.
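
The original post introduces the example dataset with an image. As a stand-in, here is a minimal, assumed setup: the column name genres and the values rock, jazz, and blues come from the snippets below, while the exact rows are hypothetical. Each snippet that follows is assumed to start from a fresh copy of this frame.

import pandas as pd

# Toy dataset: one categorical column containing three music genres
df = pd.DataFrame({'genres': ['rock', 'jazz', 'blues', 'jazz', 'rock']})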

What is “Replacing with Numbers”?

Replacing with numbers refers to the process of replacing each category of a categorical variable with a numerical value.

For example, if we replace each genre in our dataset with a numerical value, we get the following:

Example of Replacing | Image by Author

Here's the Python code, using replace on a pandas DataFrame:

# Map each genre to an integer and assign the result back to the column
df['genres'] = df['genres'].replace({'rock': 0, 'jazz': 1, 'blues': 2})

What is a “Label Encoder”?

Label Encoder is another method for encoding categorical variables. It assigns a unique numerical value to each category in the categorical variable.

Using a Label Encoder on the previous example produces an equivalent result: each genre is mapped to a unique integer (scikit-learn assigns the codes in alphabetical order of the classes, so the exact numbers may differ from a manual mapping). While replace can be a suitable approach for a small number of categories, it becomes impractical when dealing with many categories.

Example of Label Encoder | Image by Author

Here's the Python code using the Label Encoder:

from sklearn import preprocessing

# Learn the category-to-integer mapping from the column
le = preprocessing.LabelEncoder()
le.fit(df['genres'])

# Replace the genre labels with their integer codes
df['genres'] = le.transform(df['genres'])
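
One advantage of the scikit-learn encoder over a hand-written mapping is that it remembers the mapping, so the original labels can be recovered later. A short sketch, continuing from the code above (the printed values assume the toy frame from earlier):

# The learned classes, in the (alphabetical) order of their codes
print(le.classes_)  # e.g. ['blues' 'jazz' 'rock']

# Recover the original labels from the integer codes
print(le.inverse_transform(df['genres']))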

What is converting to a “dummy variable”?

It is the process of creating a new binary column for each category in a categorical variable, with a 0 or 1 indicating the presence or absence of that category, such as:

Example of Dummy | Image by Author

There are two common ways of doing this. The first uses get_dummies() from the pandas library:

import pandas as pd

# Create one binary column per genre; the original 'genres' column is replaced
X_encoded = pd.get_dummies(df, columns=['genres'])

The other uses OneHotEncoder() from scikit-learn (sklearn):

from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input, hence the double brackets
enc = OneHotEncoder()
enc.fit(df[['genres']])

# transform() returns a sparse matrix by default; densify it for readability
X_encoded = enc.transform(df[['genres']]).toarray()
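
To keep track of which output column corresponds to which category, recent versions of scikit-learn can report the generated column names. A small sketch continuing from the code above:

# Names of the generated binary columns,
# e.g. ['genres_blues' 'genres_jazz' 'genres_rock']
print(enc.get_feature_names_out(['genres']))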

Dummifying and one-hot encoding are essentially the same thing. The main difference is that “dummify” is the more colloquial term, while “one-hot encoding” is the technical term used in the machine learning literature.

Why are Dummies Preferred Over the Other Solutions?

There are several reasons why dummies are generally preferred over other encoding methods:

Avoiding implied ordinal relationships and preventing bias

Dummies create separate columns for each category, allowing the model to learn the relationship between each individual category and the target variable. Replacing with numbers and label encoding, on the other hand, imply an ordinal relationship between the categories and do not create separate columns for each category, which can lead to misleading results if the categories have no inherent order.

For example, suppose you replace “rock” with 1, “jazz” with 2, and “blues” with 3 in your dataset. The model may then assume that “jazz” is twice as important as “rock” and “blues” three times as important, introducing bias based purely on the arbitrary order in which the numbers were assigned.
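
To make the effect concrete, here is a small illustrative sketch (all numbers are made up): a linear model fit on ordinal codes is forced to place the three genre effects on a single line, while the one-hot version can fit each genre's effect independently.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical targets: jazz has the highest value, blues the lowest
y = np.array([10.0, 30.0, 5.0])  # rock, jazz, blues

# Ordinal encoding: predictions must lie on one line over the codes
X_ordinal = np.array([[1], [2], [3]])  # rock=1, jazz=2, blues=3
print(LinearRegression().fit(X_ordinal, y).predict(X_ordinal))  # ~[17.5 15. 12.5]

# One-hot encoding: each genre gets its own coefficient, so the fit is exact
X_onehot = np.eye(3)  # one column per genre
print(LinearRegression().fit(X_onehot, y).predict(X_onehot))  # [10. 30. 5.]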

Dummies allow the model to learn more complex relationships

Because dummy encoding creates separate columns for each category, the model can learn more complex relationships between the categories and the target variable.

The other encoders mentioned, on the other hand, only allow the model to learn an overall relationship between the numerical value and the target variable, which may not capture the full complexity of the data.

When to Avoid Dummies

There are certain situations in which Dummies may not be the best approach. Here are the most important ones:

  • High cardinality: One-hot encoding creates a separate column for each category of the categorical variable. This can lead to a very large number of columns, especially if the variable has many unique values. In such cases, one-hot encoding may produce a sparse and unwieldy dataset that is challenging to work with.
  • Memory constraints: One-hot encoding can also be problematic if the dataset is large and requires a lot of memory to store. The resulting dataset can take up considerable space, which may not be feasible when memory is limited.
  • Multicollinearity: This occurs when the dummy variables are highly correlated, which can make the model's coefficients unstable and difficult to interpret. Dummy variables are naturally correlated because they are created from the same categorical variable: any one column is fully determined by the others. A common mitigation is shown in the sketch after this list.
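
The standard fix for this perfect correlation is to drop one dummy column, since it carries no extra information. A minimal sketch, continuing with the toy frame from earlier:

# pandas: drop the first dummy column
X_encoded = pd.get_dummies(df, columns=['genres'], drop_first=True)

# scikit-learn: the equivalent behavior via the drop parameter
enc = OneHotEncoder(drop='first')
X_encoded = enc.fit_transform(df[['genres']]).toarray()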

In these situations, alternative encoding methods that handle high cardinality more efficiently, such as label encoding or target encoding, may be more appropriate.
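
Target encoding, for instance, replaces each category with a statistic of the target variable (typically its mean) computed over that category, producing a single numeric column regardless of cardinality. A rough illustration only: it assumes a hypothetical numeric target column on the toy frame, and it skips the smoothing and out-of-fold computation a real pipeline needs to avoid target leakage.

# Hypothetical numeric target for the five toy rows
df['target'] = [10.0, 30.0, 5.0, 25.0, 12.0]

# Replace each genre by the mean target value observed for that genre
df['genres_te'] = df.groupby('genres')['target'].transform('mean')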

If you are interested in learning more about multicollinearity and target encoding, there are many other resources available. You might want to check out the following articles:

Beware of the Dummy variable trap in pandas

This article discusses the issue of multicollinearity in detail and provides tips on how to deal with it, going deeper into the parameters of the OneHotEncoder() and get_dummies() functions.

Target-encoding Categorical Variables

This article provides a comprehensive analysis of target encoding as a solution to the dimensionality problem.

I hope this article has helped you build confidence in deciding how to handle categorical variables in your dataset, and in knowing when to step out of your one-hot encoding comfort zone.

It is essential to carefully consider the data's characteristics and the model's requirements when deciding which encoding method to use. The two articles referenced above are excellent further reading. Check them out!

If you're interested in reading other articles written by me, check out my repo with all the articles I've written so far, organized by category.

Thanks for reading

