How To Compare 2 Datasets With Pandas-profiling

Last Updated on January 6, 2023 by Editorial Team

Last Updated on November 24, 2022 by Editorial Team

Author(s): Fabiana Clemente

Originally published on Towards AI the World’s Leading AI and Technology News and Media Company. If you are building an AI-related product or service, we invite you to consider becoming an AI sponsor. At Towards AI, we help scale AI and technology startups. Let us help you unleash your technology to the masses.

A data quality use case with advanced EDA

How To Compare 2 Datasets With Pandas-profiling — Pandas-profiling compare report (screenshot by the author)

Visualization is the cornerstone of EDA. When facing a new, unknown dataset, visual inspection allows us to get a feel of the available information, draw some patterns regarding the data, and diagnose several issues that we might need to address. In that regard, Pandas Profiling has been the indispensable swiss-knife in every data scientist’s tool belt. In my past articles, I’ve mentioned how pandas profiling can be helpful while performing time-series EDA, but what if we could compare two datasets?

How many of us have started the development of a data science project and struggle to understand how much are we getting from our data transformations and engineering?

And that’s exactly what I’ll be covering in today’s blog post — how to leverage the most famous single line of code EDA to boost the process of data science development and data quality improvements. I’ll give you a tour of how to leverage Pandas-Profiling comparison report functionality to boost your EDA process and illustrate its potential in producing faster and smarter transformations on our data.

The dataset used in this article can be found in Kaggle, the HCC Dataset by Miriam Santos (License: CC0: Public Domain). For this particular use case, I’ve artificially introduced some additional data quality issues to show you how visualization can help us detect them and guide us toward their efficient mitigation. All code and examples are available on GitHub and in case you need a little refresher, make sure to check this blog to dust off your pandas-profiling skills. So, on with our use case!

Pandas Profiling: EDA at your fingertip

We’ll start by profiling the HCC dataset and investigating the data quality issues suggested in the report:

pip install pandas-profiling==3.5.0

*Alerts shown in Pandas Profiling Report (scheenshot by author)*

According to the “Alerts” overview, there are four main types of potential issues that need to be addressed:

Duplicates: 4 duplicate rows in data;
Constant: Constant value “999” in ‘O2’;
High Correlation: Several features marked as highly correlated;
Missing: Missing Values in ‘Ferritin’.

The validity of each potential problem (as well as the need to find a mitigation strategy for it) depends on the specific use case and domain knowledge. In our case, with the exception of the “high correlation” alerts, which would require further investigation, the remaining alerts seem to reflect true data quality issues and can be tackled using a few practical solutions:

Removing Duplicate Rows: Depending on the nature of the domain, there might be records that have the same values without it being an error. However, considering that some of the features in this dataset are quite specific and refer to an individual’s biological measurements (e.g., “Hemoglobin”, “MCV”, “Albumin”), it’s unlikely that several patients report the same exact values for all features. Let’s start by dropping these duplicates from the data:

Removing Irrelevant Features: The constant values in O2 also reflect a true inconsistency in data and do not seem to hold valuable information for model development. In real use case scenarios, it would be a good standard to iterate with a domain or a business experts, but for the purpose of this use case example, we’ll go ahead and drop them from the analysis:

Missing Data Imputation: HCC dataset also seems extremely susceptible to missing data. A simple way to address this issue (avoiding removing incomplete records or entire features) is resorting to data imputation. We’ll use mean imputation to fill in the absent observations, as it is the most common and simple of statistical imputation techniques and often serves as a baseline method:

Side-by-side comparison: faster and smarter iterations on your data

Now for the fun part! After implementing the first batch of transformations to our dataset, we’re ready to assess their impact on the overall quality of our data. This is where the pandas-profiling compare report functionality comes in handy. The code below depicts how to get started:

Here’s how both reports are shown in the comparison:

Comparing Original Data *and* Transformed Data (screencast by the author)

What can we right away understand from our dataset overview? The transformed dataset contains one less categorical feature (“O2” was removed), 165 observations (versus the original 171 containing duplicates), and no missing values (in contrast with the 79 missing observations in the original dataset).

But how have this transformations impacted the quality of our data? And how good were those decisions?

Let’s dive deep into that. In what concerns duplicate records, there was no particular impact in what concerns variables distributions and dataset patterns after the drop. The missing values imputation that was done is a different story.

As expected, there are no missing observations after the data imputation was performed. Note how both the nullity count and matrix show the differences between both versions of the data: in the transformed data, “Ferritin” now has 165 complete values, and no blanks can be found in the nullity matrix.

Comparison Report: Missing Values (screencast by Author)

However, we can infer something else from the comparison report. If we were to inspect the “Ferritin” histogram, we’d see how imputing values with the mean has distorted the original data distribution, which is undesirable.

*Comparison Report: Ferritin — imputed values seem to distort the original feature distribution (Screenshot by author)*

This is also observed through the visualization of interactions and correlations, where daft interaction patterns and higher correlation values emerge in the relationship between “Ferritin” and the remaining features.

*Comparison Report: Interactions between Ferritin and Age: imputed values are shown in a vertical line corresponding to the mean (Screenshot by author)*

*Comparison Report: Correlations — Ferritin correlation values seem to increase after data imputation (screenshot by author)*

This comes to show that the comparison report is not only useful for highlighting the differences introduced after data transformations, but it provides several visual cues that lead us towards important insights regarding those transformations: in this case, a more specialized data imputation strategy should be considered.

Final Thoughts

Throughout this small use case, we’ve covered the usefulness of comparing two sets of data within the same profiling report to highlight the data transformations performed during EDA and evaluate their impact on data quality.

Nevertheless, the applications of this functionality are endless, as the need to (re)iterate on feature assessment and visual inspection is vital for data-centric solutions. From comparing train, validation, and test sets distributions or data quality control to more advanced use cases such as for the process of synthetic data generation.

Fabiana Clemente, CDO at YData

Accelerating AI with improved data.

How To Compare 2 Datasets With Pandas-profiling was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. It’s free, we don’t spam, and we never share your email address. Keep up to date with the latest work in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

How To Compare 2 Datasets With Pandas-profiling

Author(s): Fabiana Clemente

A data quality use case with advanced EDA

Pandas Profiling: EDA at your fingertip

Side-by-side comparison: faster and smarter iterations on your data

Final Thoughts

Towards AI Team

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

No Code, No Limits: The Best Open-Source AI UIs in 2025

LLMs Don’t Need Search Engines: They Can Search Their Own Brains

This Plug-and-Play AI Memory Works With Any Model

From Prompts to RAG to RAGAs: Evaluating Retrieval-Augmented Generation Systems the Right Way

“BIOREASON” Makes DNA Analysis Simple Using AI

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

How To Compare 2 Datasets With Pandas-profiling

Author(s): Fabiana Clemente

A data quality use case with advanced EDA

Pandas Profiling: EDA at your fingertip

Side-by-side comparison: faster and smarter iterations on your data

Final Thoughts

Towards AI Team

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

GDPR CCPA Statement