The Traps of Blindly Trusting Data: Simpson’s Paradox
Last Updated on February 15, 2024 by Editorial Team
Author(s): Renu Khandelwal
Originally published on Towards AI.
Demystifying Simpson’s Paradox for Reliable Data Insights
Data speaks volumes, but it needs to be understood to truly be heard
In the fall of 1973, the University of California, Berkeley released admissions data for its graduate class.
The news broke: UC Berkeley sued for gender discrimination!
At first glance, the numbers appeared damning: 44.3% of male applicants were admitted, compared with only 34.6% of female applicants.
“Data doesn’t lie,” some declared. But does it tell the whole story?
Researchers then looked more closely, breaking the admission rates down by major.
[Figure: admission rates by gender across majors. Image generated using matplotlib based on the data.]
A fascinating twist emerged, as shown above: the apparent gender bias vanished, and in some instances, it even reversed!
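The same pattern is easy to reproduce on a toy dataset. The sketch below uses made-up counts for two hypothetical majors (they are not the Berkeley figures) and compares the pooled admission rate with the per-major rates. The names `applications`, `rates_by_major`, and `overall_rates` are illustrative assumptions, not part of the original analysis.

```python
# Hypothetical (applied, admitted) counts, invented purely for illustration.
# These are NOT the actual 1973 UC Berkeley figures.
applications = {
    "Major A": {"men": (80, 50), "women": (20, 14)},
    "Major B": {"men": (20, 4), "women": (80, 20)},
}


def admission_rate(applied, admitted):
    """Return the admission rate as a percentage."""
    return 100.0 * admitted / applied


def rates_by_major(data):
    """Admission rate per gender within each major."""
    return {
        major: {gender: admission_rate(*counts) for gender, counts in groups.items()}
        for major, groups in data.items()
    }


def overall_rates(data):
    """Admission rate per gender after pooling all majors together."""
    totals = {}
    for groups in data.values():
        for gender, (applied, admitted) in groups.items():
            running = totals.setdefault(gender, [0, 0])
            running[0] += applied
            running[1] += admitted
    return {gender: admission_rate(a, b) for gender, (a, b) in totals.items()}


per_major = rates_by_major(applications)
pooled = overall_rates(applications)

for major, rates in per_major.items():
    print(f"{major}: men {rates['men']:.1f}%, women {rates['women']:.1f}%")
print(f"Overall: men {pooled['men']:.1f}%, women {pooled['women']:.1f}%")
```

In this toy data, women have the higher admission rate within each major, yet the lower rate once the majors are pooled, because most of the women applied to the major that admits far fewer applicants.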
How can the same data yield seemingly conflicting conclusions?
The answer lies in Simpson’s Paradox, a statistical quirk named after Edward H. Simpson, who described it in 1951; Karl Pearson and Udny Yule had reported similar effects decades earlier.
Simpson’s Paradox is a statistical phenomenon where an association between two variables within a population emerges, disappears, or reverses when the population is divided into subpopulations.
In UC Berkeley’s case, the paradox shows up as follows: the association between two variables, X (gender) and Y (acceptance rate), vanishes or reverses once we condition on a third variable, Z (major). In the strictest form of the paradox, the association reverses sign for every value of Z.
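One way to see why pooling can flip the picture, sketched here with the same X, Y, Z notation and writing Y = 1 for "admitted": by the law of total probability, a group's overall admission rate is a weighted average of its per-major rates, where the weights are the shares of that group applying to each major.

```latex
P(Y = 1 \mid X = x) \;=\; \sum_{z} P(Y = 1 \mid X = x, Z = z)\, P(Z = z \mid X = x)
```

If one group applies mostly to majors with low admission rates, the weights P(Z = z | X = x) pull its overall rate down, even when each of its per-major rates matches or exceeds the other group's.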
The paradox highlights the importance of analyzing data at different levels of granularity.