
From Raw to Refined: A Journey Through Data Preprocessing — Part 4: Data Encoding

Last Updated on December 21, 2023 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

Photo by Héctor J. Rivas on Unsplash

Why data encoding is necessary

Humans can understand textual information directly, but machines and the algorithms they run cannot. Algorithms work only with numbers and with mathematical operations on numbers. Therefore, before textual information can be passed to a machine or fed into an algorithm, it must first be converted into a numerical format that represents the same information.

Data encountered in machine learning and deep learning tasks

Usually, the textual data we encounter in machine learning is discrete: columns that take only a limited set of textual values. For example, the data could contain a 'day of the week' column with only seven possible values, or a 'Seasons' column with only four. This kind of textual data can be encoded using ordinal or one-hot encoding, and both methods are easy to apply with the built-in classes in the Scikit-Learn library.

On the other hand, in deep learning, and especially in natural language processing, we encounter textual data in the form of whole sentences or paragraphs. For this kind of data, we take a different approach: we first clean the text to make it suitable for encoding, and then apply encoding methods such as CountVectorizer, TfidfVectorizer, and HashingVectorizer from the Scikit-Learn library. A more sophisticated and more important approach is known as word embedding; we can use Python libraries such as TensorFlow and gensim to encode the data with word embedding methods.

Types of Textual Data (Image by Author)

Methods used to encode the textual data that is usually found in machine learning tasks

Ordinal Encoding and One-hot Encoding are the two most popular techniques for dealing with this type of textual data. Other methods exist, of course, but they are used far less frequently.

Let’s use the ‘tips’ dataset in the Seaborn library to demonstrate these methods.

  1. Ordinal Encoding

This encoding type is often used with data that can be ranked. For example, the days of the week can be ranked from Sunday as 0 up to Saturday as 6.

Let’s see how it is done.

## Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## Importing the tips dataset using seaborn
tips = sns.load_dataset('tips')
tips.head()

Let’s use the ordinal encoding method on the ‘day’ column.

from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
transformed_day_column = enc.fit_transform(tips[['day']])
transformed_day_column

The output contains many more rows, but I am showing only a few of them to save space.
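
One caveat worth noting: OrdinalEncoder assigns the codes in alphabetical order of the category values by default, so if a particular ranking matters, the desired order can be passed explicitly. A minimal sketch, assuming the same tips DataFrame loaded above (its 'day' column contains Thur, Fri, Sat, and Sun):

## Supplying an explicit category order to OrdinalEncoder
enc_ordered = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
day_codes = enc_ordered.fit_transform(tips[['day']]).ravel()

print(enc_ordered.categories_) ## the order used for the codes
print(day_codes[:10]) ## first few encoded values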

2. One-hot Encoding

We use the one-hot encoding method when a column's textual values have no inherent order. One example is the 'sex' column in the tips dataset: since its values have no order or hierarchy, one-hot encoding is appropriate here.

Let’s see how to perform one-hot encoding on the ‘sex’ column of the tips dataset.

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
transformed_sex_column = ohe.fit_transform(tips[['sex']])
transformed_sex_column

Again, the output contains many more rows, but I am showing only a few of them to save space.
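
As a small aside, the fitted encoder can tell us which column each position corresponds to, and the sparse output can be densified for inspection. A minimal sketch, continuing the code above:

## Inspecting the one-hot encoded output
print(ohe.get_feature_names_out(['sex'])) ## e.g. ['sex_Female' 'sex_Male']
print(transformed_sex_column.toarray()[:5]) ## first few one-hot rows as a dense array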

Methods used to encode the textual data that is usually found in deep learning tasks

Scikit-Learn provides a few classes for very basic encoding of free text, such as CountVectorizer, TfidfVectorizer, and HashingVectorizer. These classes can encode the text, but they fall well short of capturing the meaning or intent of the words.
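
To make this concrete, here is a minimal sketch of CountVectorizer and TfidfVectorizer on a toy corpus; the sentences are made up purely for illustration:

## Bag-of-words and TF-IDF encoding of a toy corpus
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the food was great",
    "the service was slow",
    "great food and great service",
]

## CountVectorizer: each column is a vocabulary word, each value a raw count
cv = CountVectorizer()
counts = cv.fit_transform(corpus)
print(cv.get_feature_names_out())
print(counts.toarray())

## TfidfVectorizer: counts re-weighted so words common to every document matter less
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())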

The most effective technique for encoding this kind of textual data is called word embedding.

Word embeddings are a type of word representation that allows words with similar meanings to have a similar representation. Word embeddings are techniques where individual words are represented as real-valued vectors in a predefined vector space.

There are three common approaches that can be used to learn word embeddings from a text corpus.

Learning the word embeddings (Image by Author)
  1. Embedding layer

An embedding layer is a word embedding learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification.

It requires that the document text be cleaned and prepared such that each word is one-hot encoded. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network and is fit in a supervised way using the backpropagation algorithm.

The one-hot encoded words are mapped to the word vectors. If a multilayer Perceptron model is used, then the word vectors are concatenated before being fed as input to the model. If a recurrent neural network is used, each word may be taken as one input in a sequence.

This approach of learning an embedding layer requires a lot of training data and can be slow, but it will learn an embedding both targeted to the specific text data and the NLP task.
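
To make the idea concrete, here is a minimal sketch of an embedding layer trained jointly with a small classifier in Keras. The vocabulary size, sequence length, embedding dimension, and data are placeholder values chosen for illustration, not taken from the notebook below:

## An embedding layer learned jointly with a simple classifier
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10000 ## number of distinct word indices
max_len = 100 ## padded sequence length
embed_dim = 50 ## size of each word vector

model = keras.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    layers.GlobalAveragePooling1D(), ## average the word vectors in each document
    layers.Dense(1, activation='sigmoid') ## e.g. a binary sentiment label
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

## Dummy integer-encoded sequences, just to show the expected input shape
x = np.random.randint(0, vocab_size, size=(32, max_len))
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, verbose=0)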

Check out the following Jupyter notebook for a full implementation of an embedding layer inside a neural network.

IMDB sentiment analysis – keras TextVectorization

Explore and run machine learning code with Kaggle Notebooks | Using data from [Private Datasource]

www.kaggle.com

2. Word2Vec

Word2Vec is a method for effectively learning a standalone word embedding from a text corpus.

These standalone word embeddings can then be used to do arithmetic on word representations.

For example, subtracting the 'man-ness' from 'king' and adding 'woman-ness' gives us roughly the meaning of the word 'queen.' The analogy here is:

king − man + woman ≈ queen

Word2Vec introduced two learning models that can be used to learn the embedding:

a. continuous bag-of-words or CBOW model

b. continuous skip-gram model

The continuous bag-of-words model learns the embedding by predicting the current word from its context (the surrounding words). The continuous skip-gram model, on the other hand, learns by predicting the context (the surrounding words) from the current word.

Word2Vec Training Models
Taken from “Efficient Estimation of Word Representations in Vector Space”, 2013
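
As a rough illustration, here is a minimal sketch of training a standalone Word2Vec embedding with gensim. The toy sentences are made up for demonstration; a real corpus would be vastly larger:

## Learning standalone Word2Vec embeddings with gensim
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "ruled", "the", "kingdom"],
    ["the", "queen", "ruled", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walked"],
]

## sg=0 selects the CBOW objective, sg=1 the skip-gram objective
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=10)

print(model.wv["king"][:5]) ## first few dimensions of a learned word vector
print(model.wv.most_similar("king", topn=3)) ## meaningful only with a real corpus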

3. Global Vectors for Word Representation (GloVe)

This is another approach to learning word embeddings; instead of a local context window, GloVe builds the vectors from global word co-occurrence statistics computed over the whole corpus.
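
GloVe vectors are usually consumed pre-trained rather than trained from scratch. Here is a minimal sketch of loading a published GloVe model through gensim's downloader, assuming the 'glove-wiki-gigaword-50' model available in gensim-data (the first call downloads the vectors):

## Loading pre-trained GloVe vectors via gensim's downloader
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

print(glove["queen"][:5]) ## first few dimensions of the 'queen' vector
## the king - man + woman analogy from the Word2Vec section
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))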

Ways to use the embedding methods

Ways to use the embedding (Image by Author)
  1. Learning the embedding

In this approach, we learn the embedding for our problem from the textual data available to us. Learning an accurate embedding from scratch, one that truly captures the nature of the words in our corpus, requires a large amount of text, possibly billions of words. There are two ways to learn the embedding:

learning the embedding without any other networks

This approach is used when we need the same embedding for many tasks: we learn the embedding once, save it, and then reuse it for as many tasks as we require.

learning the embedding along with the neural networks used for our specific task

This approach is used when the embedding is not needed for more than one task: we learn the embedding for our data while training the neural network for the task itself.

2. Reusing the embedding

Not everyone can afford to collect the large amount of data needed to train their own embedding vectors. The solution in that case is to use freely available pre-trained embeddings from the internet. Once we obtain these embeddings, we can make use of them in the following two ways.

Use the downloaded embeddings as they are

Once we download the embeddings from the internet, we can use them directly, without modification, to train our neural network for our task. You can check out the following Kaggle notebook to see how to do this in code.

IMDB Sentiment analysis – pretrained embeddings

Explore and run machine learning code with Kaggle Notebooks | Using data from [Private Datasource]

www.kaggle.com

Updating the downloaded embedding

We can update the downloaded embedding along with the neural network to tailor the embedding for our task at hand.
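
A minimal sketch of both options in Keras follows. It assumes an embedding_matrix of shape (vocab_size, embed_dim) has already been filled by looking up each vocabulary word in a downloaded embedding file; here it is faked with random numbers purely for illustration:

## Plugging a pre-trained embedding matrix into a Keras embedding layer
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim = 10000, 50
embedding_matrix = np.random.rand(vocab_size, embed_dim) ## placeholder for real vectors

## Option 1: use the downloaded embeddings as they are (weights stay frozen)
frozen = layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

## Option 2: update the downloaded embeddings during training (fine-tuning)
finetuned = layers.Embedding(
    vocab_size, embed_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=True,
)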

References:

User guide: contents

User Guide: Supervised learning- Linear Models- Ordinary Least Squares, Ridge regression and classification, Lasso…

scikit-learn.org

What Are Word Embeddings for Text?

NLP Starter 📋 Continuous Bag of Words (CBOW)

Explore and run machine learning code with Kaggle Notebooks | Using data from U.S. Patent Phrase to Phrase Matching

www.kaggle.com

Implementing Deep Learning Methods and Feature Engineering for Text Data: The Continuous Bag of…

The CBOW model architecture tries to predict the current target word (the center word) based on the source context…

www.kdnuggets.com

Thanks for reading! If you have any thoughts on the article, then please let me know.

Are you struggling to choose what to read next? Don’t worry, I have got you covered.

From Raw to Refined: A Journey Through Data Preprocessing — Part 3: Duplicate Data

This article will explain how to identify duplicate records in the data and the different ways to deal with the…

pub.towardsai.net

and one more…

From Raw to Refined: A Journey Through Data Preprocessing — Part 2: Missing Values

Why deal with missing values?

pub.towardsai.net

Shivam Shinde

Have a great day!


Published via Towards AI
