How To Compare 2 Datasets With Pandas-profiling
Last Updated on January 6, 2023 by Editorial Team
Author(s): Fabiana Clemente
A data quality use case with advanced EDA
Visualization is the cornerstone of EDA. When facing a new, unknown dataset, visual inspection allows us to get a feel for the available information, spot patterns in the data, and diagnose issues that we might need to address. In that regard, Pandas Profiling has been the indispensable Swiss Army knife in every data scientist's tool belt. In my past articles, I've mentioned how pandas-profiling can be helpful when performing time-series EDA, but what if we could compare two datasets?
How many of us have started a data science project and struggled to understand how much we are actually gaining from our data transformations and feature engineering?
And that's exactly what I'll be covering in today's blog post: how to leverage the most famous one-line-of-code EDA tool to speed up data science development and data quality improvements. I'll give you a tour of the Pandas-Profiling comparison report functionality, show how it can boost your EDA process, and illustrate its potential for producing faster and smarter transformations on our data.
The dataset used in this article can be found on Kaggle: the HCC Dataset by Miriam Santos (License: CC0: Public Domain). For this particular use case, I've artificially introduced some additional data quality issues to show you how visualization can help us detect them and guide us toward their efficient mitigation. All code and examples are available on GitHub, and in case you need a little refresher, make sure to check this blog to dust off your pandas-profiling skills. So, on with our use case!
Pandas Profiling: EDA at your fingertips
We'll start by profiling the HCC dataset and investigating the data quality issues suggested in the report:
pip install pandas-profiling==3.5.0
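With the package installed, generating the report takes just a few lines. Here is a minimal sketch, assuming the Kaggle file has been saved locally as hcc.csv (the file name is an assumption):

import pandas as pd
from pandas_profiling import ProfileReport

# Load the HCC dataset (adjust the path to wherever you saved the Kaggle file)
df = pd.read_csv("hcc.csv")

# Generate the profiling report and save it as an HTML file
original_report = ProfileReport(df, title="Original Data")
original_report.to_file("original_report.html")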
According to the "Alerts" overview, there are four main types of potential issues that need to be addressed:
- Duplicates: 4 duplicate rows in the data;
- Constant: constant value "999" in "O2";
- High Correlation: several features marked as highly correlated;
- Missing: missing values in "Ferritin".
The validity of each potential problem (as well as the need to find a mitigation strategy for it) depends on the specific use case and domain knowledge. In our case, with the exception of the "high correlation" alerts, which would require further investigation, the remaining alerts seem to reflect true data quality issues and can be tackled with a few practical solutions:
Removing Duplicate Rows: Depending on the nature of the domain, there might be records that share the same values without this being an error. However, considering that some of the features in this dataset are quite specific and refer to an individual's biological measurements (e.g., "Hemoglobin", "MCV", "Albumin"), it's unlikely that several patients would report exactly the same values for all features. Let's start by dropping these duplicates from the data.
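A minimal sketch of this step (the variable names are illustrative):

# Drop exact duplicate rows, keeping the first occurrence of each
df_transformed = df.drop_duplicates()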
Removing Irrelevant Features: The constant values in "O2" also reflect a true inconsistency in the data and do not seem to hold valuable information for model development. In real use-case scenarios, it would be good practice to iterate with domain or business experts, but for the purpose of this example, we'll go ahead and drop the feature from the analysis.
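Again, a one-line sketch, assuming the column is named "O2" as reported in the alerts:

# Drop the constant "O2" feature, which carries no information
df_transformed = df_transformed.drop(columns=["O2"])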
Missing Data Imputation: The HCC dataset also seems extremely susceptible to missing data. A simple way to address this issue (while avoiding the removal of incomplete records or entire features) is to resort to data imputation. We'll use mean imputation to fill in the missing observations, as it is the most common and simplest of the statistical imputation techniques and often serves as a baseline method.
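A sketch of the imputation; filling with the column means covers the numeric features (categorical columns would need a separate strategy, such as mode imputation):

# Fill missing values in numeric columns with each column's mean
df_transformed = df_transformed.fillna(df_transformed.mean(numeric_only=True))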
Side-by-side comparison: faster and smarter iterations on your data
Now for the fun part! After implementing the first batch of transformations to our dataset, we're ready to assess their impact on the overall quality of our data. This is where the pandas-profiling compare report functionality comes in handy.
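A sketch of how to get started, using the compare method shipped with the comparison report functionality (the report titles are illustrative):

from pandas_profiling import ProfileReport

# Profile both versions of the data
original_report = ProfileReport(df, title="Original Data")
transformed_report = ProfileReport(df_transformed, title="Transformed Data")

# Build the side-by-side comparison report and save it as HTML
comparison_report = original_report.compare(transformed_report)
comparison_report.to_file("comparison_report.html")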
Here's how both reports are shown in the comparison:
What can we understand right away from our dataset overview? The transformed dataset contains one less categorical feature ("O2" was removed), 165 observations (versus the original 171 containing duplicates), and no missing values (in contrast with the 79 missing observations in the original dataset).
But how have these transformations impacted the quality of our data? And how good were those decisions?
Let's dive deeper into that. Where the duplicate records are concerned, dropping them had no particular impact on variable distributions or dataset patterns. The missing value imputation, however, is a different story.
As expected, there are no missing observations after the data imputation was performed. Note how both the nullity count and matrix show the differences between both versions of the data: in the transformed data, "Ferritin" now has 165 complete values, and no blanks can be found in the nullity matrix.
However, we can infer something else from the comparison report. If we were to inspect the "Ferritin" histogram, we'd see how imputing values with the mean has distorted the original data distribution, which is undesirable.
This is also visible in the interactions and correlations visualizations, where artificial interaction patterns and higher correlation values emerge in the relationships between "Ferritin" and the remaining features.
This goes to show that the comparison report is not only useful for highlighting the differences introduced by data transformations; it also provides several visual cues that lead us toward important insights regarding those transformations. In this case, a more specialized data imputation strategy should be considered.
Final Thoughts
Throughout this small use case, we've covered the usefulness of comparing two sets of data within the same profiling report to highlight the data transformations performed during EDA and evaluate their impact on data quality.
Nevertheless, the applications of this functionality are endless, as the need to (re)iterate on feature assessment and visual inspection is vital for data-centric solutions. They range from comparing the distributions of train, validation, and test sets and performing data quality control to more advanced use cases such as synthetic data generation.
Fabiana Clemente, CDO at YData

Accelerating AI with improved data.