
From Raw to Refined: A Journey Through Data Preprocessing — Part 3: Duplicate Data

Last Updated on August 28, 2023 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

This article will explain how to identify duplicate records in data and the different ways to deal with the problem of having duplicate records.

Photo by Pineapple Supply Co. on Unsplash

Why is the presence of duplicate records in data a problem?

Many programmers ignore duplicate values in their data, but dealing with duplicate records is quite important.

Having duplicate records can lead to incorrect data analysis and decision-making.

For example, what happens when you impute missing values with the mean in data that contains duplicate records?

In this scenario, the incorrect mean value may be used for imputation. Let’s take an example.

Consider the following data. It contains two columns, namely Name and Weight. Note that the record for ‘John’, with a weight of 50, appears twice. Also, the weight value for ‘Steve’ is missing.

Name    Weight
Mark    98
John    50
Jotaro  67
Mark    66
John    50
Steve   NaN

If one wishes to impute the missing value of weight for Steve with the mean of all the weight values, then the imputation would be done using the incorrect mean, i.e.,

(98 + 50 + 67 + 66 + 50)/5 = 66.2

But the actual mean of the data, ignoring the duplicate value, is

(98 + 50 + 67 + 66)/4 = 70.25

Therefore, the missing value will be incorrectly imputed if we do not do something about the duplicate records.
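
To make this concrete, here is a minimal, self-contained pandas sketch that reproduces the weight data above and compares the two means (the weights dataframe name is just for illustration; note that pandas skips NaN when computing a mean):

## a minimal sketch reproducing the weight data above
import numpy as np
import pandas as pd

weights = pd.DataFrame({
    'Name': ['Mark', 'John', 'Jotaro', 'Mark', 'John', 'Steve'],
    'Weight': [98, 50, 67, 66, 50, np.nan],
})

print(weights['Weight'].mean())                    ## 66.2  (duplicate included)
print(weights.drop_duplicates()['Weight'].mean())  ## 70.25 (duplicate removed)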

Moreover, duplicate values may even affect the business decisions that are made using such faulty data.

In summary, duplicate records in the data should be dealt with to keep analyses and the decisions based on them reliable.

Now, let’s look at the different methods for dealing with duplicate records in the data.

Identifying the duplicate values in the data

We can use the pandas duplicated() method to identify the rows that are duplicated in the data.

Now, let’s understand this method using an example.

## Importing required libraries
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## Creating a dataframe
Name = ['Mark', 'John', 'Jotaro', 'Mark', 'John', 'Steve']
Weight = [98, 50, 67, 66, 50, np.nan]
Height = [170, 175, 172, 170, 175, 166]

df = pd.DataFrame()
df['Name'] = Name
df['Weight'] = Weight
df['Height'] = Height
df
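
For reference, the constructed dataframe should look like this (rows 1 and 4 are identical):

     Name  Weight  Height
0    Mark    98.0     170
1    John    50.0     175
2  Jotaro    67.0     172
3    Mark    66.0     170
4    John    50.0     175
5   Steve     NaN     166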

Identifying duplicate values:

## Identifying the duplicated values (default behavior)
df.duplicated()
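
For the dataframe we just created, this should produce the following boolean Series:

0    False
1    False
2    False
3    False
4     True
5    False
dtype: bool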

We get True where a duplicate record is present and False where the record is unique.

Note that by default, the duplicated() method uses all the columns to find duplicate records. But we can also use a subset of the columns to find duplicates. To do this, the duplicated() method has a parameter named subset, which takes the list of column names that we want to use for finding duplicates.

## subset parameter of duplicated() method
df.duplicated(subset=['Name','Height'])
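
With only the Name and Height columns considered, Mark’s second row (index 3) and John’s second row (index 4) should both be flagged:

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool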

Additionally, the duplicated() method has one more important parameter named keep. The value of the keep parameter decides whether the first or the last record in a set of duplicates is considered unique. We also have the option of considering all the duplicate records as non-unique.

keep = ‘first’: The first record in each set of duplicate records is considered unique.

keep = ‘last’: The last record in each set of duplicate records is considered unique.

keep = False: All the duplicate records are considered non-unique.

## keep parameter of duplicated() method
df.duplicated(keep='first')
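
For our dataframe, this should give:

0    False
1    False
2    False
3    False
4     True
5    False
dtype: bool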

Notice here that the first occurrence (at index 1) is considered unique, and the later occurrence (at index 4) is marked as a duplicate.

## keep parameter of duplicated() method
df.duplicated(keep='last')
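
This time, the expected output is:

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool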

Notice here that the last occurrence (at index 4) is considered unique, and the earlier occurrence (at index 1) is marked as a duplicate.

## keep parameter of duplicated() method
df.duplicated(keep=False)
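
With keep=False, both of John’s rows should be flagged:

0    False
1     True
2    False
3    False
4     True
5    False
dtype: bool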

Notice here that all the duplicate records (at index 1 and index 4) are flagged.
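
As a quick aside, this boolean mask can be combined with indexing to inspect the duplicate rows themselves:

## viewing the duplicate rows via boolean indexing
df[df.duplicated(keep=False)]

For our dataframe, this should return John’s two identical rows:

   Name  Weight  Height
1  John    50.0     175
4  John    50.0     175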

How to deal with duplicate records in data

The next step after identifying the duplicate records is to deal with them.

There are two ways to deal with the duplicate records in the data.

  1. Removing the duplicate records
  2. Updating the duplicate records

Removing the duplicate records

Let’s start with the approach where we remove the duplicate records.

We can make use of the pandas drop_duplicates() method for this.

By default, the drop_duplicates() method keeps the first record from each set of duplicate records and drops the rest from the data. Also, by default, the drop_duplicates() method uses all the columns to identify duplicate records.

But this default behavior can be changed using two parameters of the drop_duplicates() method. They are

  1. keep
  2. subset

They work exactly like the keep and subset parameters of the duplicated() method.

"""dropping duplicate values using the pandas drop_duplicates() method 
(default behavior)"""


df1 = df.drop_duplicates()
df1
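
Since the default behavior keeps the first of the duplicate records, John’s second row (index 4) should be dropped:

     Name  Weight  Height
0    Mark    98.0     170
1    John    50.0     175
2  Jotaro    67.0     172
3    Mark    66.0     170
5   Steve     NaN     166
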
"""dropping duplicate values using the pandas drop_duplicates() 
method with subset and keep parameters (custom behavior)"""


df2 = df.drop_duplicates(subset=['Weight'], keep='last')
df2
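
Here the duplicate is detected on the Weight column alone (the value 50 appears at indices 1 and 4), and keep='last' retains the later row, so index 1 should be dropped:

     Name  Weight  Height
0    Mark    98.0     170
2  Jotaro    67.0     172
3    Mark    66.0     170
4    John    50.0     175
5   Steve     NaN     166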

Updating the duplicate records

Sometimes we want to replace a duplicate record with some other value. Say we find two duplicate records, and then we learn that the person who collected the data accidentally entered the wrong name in one of them. In such a case, we would want to put in the correct person’s name, which would also resolve the duplication.

df.duplicated(keep=False)

Here, we have duplicate records at indices 1 and 4. Now, if we change the value of the ‘Name’ column at index 1, we won’t have duplicate values anymore.

## changing the 'Name' value for the first duplicate record
df.iloc[1, 0] = 'Dio'
df
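
The updated dataframe should now look like this:

     Name  Weight  Height
0    Mark    98.0     170
1     Dio    50.0     175
2  Jotaro    67.0     172
3    Mark    66.0     170
4    John    50.0     175
5   Steve     NaN     166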

We changed the ‘Name’ value for the first duplicate record. Now, let’s check again whether any duplicate records are present in the data.

df.duplicated()
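
With the name at index 1 changed to ‘Dio’, rows 1 and 4 no longer match, so every entry should be False:

0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool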

Now, we don’t have any duplicate records.

Thanks for reading! If you have any thoughts on the article, then please let me know.

Are you struggling to choose what to read next? Don’t worry; I know of an article that I believe you will find interesting.

From Raw to Refined: A Journey Through Data Preprocessing — Part 2 : Missing Values

Why deal with missing values?

pub.towardsai.net

and one more…

From Raw to Refined: A Journey Through Data Preprocessing — Part 1: Feature Scaling

Sometimes, the data we receive for our machine learning tasks isn’t in a suitable format for coding with Scikit-Learn…

pub.towardsai.net

Shivam Shinde

Have a great day!

References:

Handling Duplicate Values in a Pandas DataFrame (stackabuse.com)

Find duplicate rows in a Dataframe based on all or selected columns (GeeksforGeeks, www.geeksforgeeks.org)

Pandas DataFrame duplicated() Method (W3Schools, www.w3schools.com)


