Overview of Exploratory Data Analysis (EDA) With Haberman Dataset
Last Updated on July 19, 2023 by Editorial Team
Author(s): Rajvi Shah
Originally published on Towards AI.
A practical guide to getting started with gaining insights from your data.
"Data will talk, if you are willing to listen." - Jim Bergeson
With the proper use of data, one can gain insights and use them for numerous purposes. Raw data on its own has no story to tell. So, once the data has been collected, exploratory data analysis comes into the picture. It is a crucial step for recognizing patterns and understanding the data before building a model.
This article is divided into the following sections:
- Overview of Exploratory Data Analysis (EDA)
- EDA for Haberman's dataset
Overview of Exploratory Data Analysis (EDA):
What is EDA?
EDA is the process of exploring and understanding data to gain insights from it. It can be thought of as the "first look" at the data when solving any data science problem, and it brings you a step closer to solving the problem at hand.
Why apply EDA?
To summarize important features, recognize patterns and distribution curves, detect outliers and anomalies, find the number of classes and how the data is distributed across them, and test underlying assumptions, all by analyzing and visualizing the data. Basically, to know what the data is trying to say!
EDA process
- Question
- Verify
- Write
- Repeat
EDA for Haberman's Survival dataset:
This dataset is a good starting point for EDA. To download the dataset, click on the link. While performing EDA, the main thing is to keep the goal of the project in mind. In this case, the goal is to predict whether or not a patient will survive after surgery (in this dataset, whether a breast cancer patient survived five years or longer after the operation).
In the first step, we will try to understand the dataset and what it contains by asking the following questions:
What kind of data is it?
How are the output data (y-labels) represented?
Are the classes (survival state or non-survival state) uniformly distributed?
How many features are available to predict the class?
After asking the right questions, we write some code to find the answers. Often, one can guess some of the answers just by looking at the data, so it is important to use this step to verify those answers, intuitions, and assumptions.
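As a rough sketch of how these first questions could be answered with pandas (this is not the original notebook code): the rows below are made up in the dataset's four-column layout so the snippet runs on its own, and the column names `age`, `year`, `nodes`, and `status` are my own labels, since the raw file has no header row. With the real file, pass its path to `pd.read_csv` instead of the in-memory buffer.

```python
import io

import pandas as pd

# Made-up rows in the dataset's four-column layout (NOT the real file).
SAMPLE_CSV = """\
30,64,1,1
34,59,0,2
38,69,21,2
41,60,23,2
43,58,52,2
45,66,0,1
50,63,13,2
52,62,3,1
57,64,1,1
61,65,0,1
65,58,0,1
70,66,14,1
"""

# Assumed column names; the raw file ships without a header row.
COLUMNS = ["age", "year", "nodes", "status"]

df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS)

# What kind of data is it?
print(df.shape)
print(df.dtypes)

# How are the class labels represented?
print(df["status"].unique())

# Are the classes uniformly distributed?
class_counts = df["status"].value_counts()
print(class_counts)

# How many features are available to predict the class?
n_features = df.shape[1] - 1  # every column except the label
print(n_features)
```

On the real dataset the counts will differ from this toy sample; the point is only the sequence of checks.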
Next, examine the output and note down your observations.
For example, from the class counts, one can conclude:
Observations:
- The number of patients who survived (status 1) is more than 2.5 times the number who did not (status 2).
- The classes are imbalanced, that is, the data is not uniformly distributed across them.
For the next step, ask:
Which features of the dataset are important?
How are the features related to each of the classes?
Is there any pattern between a class and a feature or combination of features?
These questions can be answered with univariate and bivariate plots, or with other graphs suited to the dataset such as box plots, histograms, and violin plots. Here, I have shown some of the solutions; to know more, check GitHub (it contains the detailed EDA on Haberman's dataset).
Univariate plotting:
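One possible way to draw per-class univariate plots is with overlaid matplotlib histograms, sketched below. The original plots may well have been made differently; the sample rows are made up and the column names are assumptions, as noted in the comments.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows in the dataset's four-column layout (NOT the real file).
SAMPLE_CSV = """\
30,64,1,1
34,59,0,2
38,69,21,2
41,60,23,2
43,58,52,2
45,66,0,1
50,63,13,2
52,62,3,1
57,64,1,1
61,65,0,1
65,58,0,1
70,66,14,1
"""
COLUMNS = ["age", "year", "nodes", "status"]  # assumed names
df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS)

features = ["age", "year", "nodes"]
fig, axes = plt.subplots(1, len(features), figsize=(12, 3.5))

# One panel per feature, one semi-transparent histogram per class,
# so that overlap between the classes is easy to see.
for ax, feature in zip(axes, features):
    for status, group in df.groupby("status"):
        ax.hist(group[feature], bins=5, alpha=0.5, density=True,
                label=f"status {status}")
    ax.set_xlabel(feature)
    ax.set_ylabel("density")
    ax.legend()

fig.tight_layout()
# In a script, save the figure with fig.savefig("univariate.png").
```

Heavy overlap between the two histograms in a panel suggests that the feature alone does not separate the classes.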
Output:
Observations:
- The two classes overlap each other, so the age attribute alone should not be given much importance when predicting survival or non-survival.
- The range 58 to 70 matches the year-of-operation feature (1958 to 1969) rather than age; patient ages in this dataset run from about 30 to 83.
Output:
Observations:
- Most of the values of both classes overlap, so we cannot predict anything on the basis of axillary nodes alone.
- Most of the survival and non-survival cases have between 0 and 25 nodes.
Bi-Variate plotting:
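A bivariate view can be sketched as a scatter plot of two features colored by class. This is only one option (a pair plot over all features is another); the rows are made up and the column names are assumed.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows in the dataset's four-column layout (NOT the real file).
SAMPLE_CSV = """\
30,64,1,1
34,59,0,2
38,69,21,2
41,60,23,2
43,58,52,2
45,66,0,1
50,63,13,2
52,62,3,1
57,64,1,1
61,65,0,1
65,58,0,1
70,66,14,1
"""
COLUMNS = ["age", "year", "nodes", "status"]  # assumed names
df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS)

colors = {1: "tab:blue", 2: "tab:orange"}
fig, ax = plt.subplots()

# One scatter call per class so each class gets its own color and legend entry.
for status, group in df.groupby("status"):
    ax.scatter(group["age"], group["nodes"], color=colors[status],
               label=f"status {status}")

ax.set_xlabel("age")
ax.set_ylabel("nodes")
ax.legend()
```

If no straight line or simple curve separates the two colors, the pair of features does not give a clean decision boundary.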
Output:
Observations:
- It is impossible to separate the two survival states with a decision boundary, as the points of both classes are scattered across the whole graph.
- In the age vs. axillary nodes graph, as age and the number of nodes increase, the chances of survival appear to decrease, with a few exceptions.
To detect outliers, basic statistical concepts such as the mean, median, mode, percentiles, and quartiles are used:
For example:
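A minimal sketch of such a check, using made-up rows and assumed column names: it compares the mean with the median and flags outliers with the common 1.5 × IQR rule. The IQR rule is my choice for illustration; the original notebook may have used a different criterion.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows in the dataset's four-column layout (NOT the real file).
SAMPLE_CSV = """\
30,64,1,1
34,59,0,2
38,69,21,2
41,60,23,2
43,58,52,2
45,66,0,1
50,63,13,2
52,62,3,1
57,64,1,1
61,65,0,1
65,58,0,1
70,66,14,1
"""
COLUMNS = ["age", "year", "nodes", "status"]  # assumed names
df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS)

nodes = df["nodes"]

# Central tendency: the mean is pulled upward by extreme values,
# while the median is not.
mean_nodes = nodes.mean()
median_nodes = nodes.median()

# Quartiles and the interquartile range (IQR).
q1, q3 = np.percentile(nodes, [25, 75])
iqr = q3 - q1

# 1.5 * IQR rule: values outside these fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = nodes[(nodes < lower_fence) | (nodes > upper_fence)]

# Per-class medians show how the typical node count differs by class.
class_medians = df.groupby("status")["nodes"].median()

print(mean_nodes, median_nodes)
print(q1, q3, iqr)
print(outliers.tolist())
print(class_medians)
```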
Output:
Observation:
1. As the number of affected nodes increases, non-survival becomes more likely, though not necessarily in every case.
Likewise, try applying various statistical concepts to get to know your dataset. I have applied some of them in the final ipynb.
Note that the mean is sensitive to extreme values, so it is not a reliable tool for detecting outliers; the median is more robust and is often more useful for that.
Box-and-whisker and violin plots:
These help in getting an idea of the range and spread of values of a particular feature.
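Both plot types are available directly in matplotlib; the sketch below draws them side by side for the node count of each class. The rows are made up and the column names are assumed, so the shapes will not match the real dataset.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows in the dataset's four-column layout (NOT the real file).
SAMPLE_CSV = """\
30,64,1,1
34,59,0,2
38,69,21,2
41,60,23,2
43,58,52,2
45,66,0,1
50,63,13,2
52,62,3,1
57,64,1,1
61,65,0,1
65,58,0,1
70,66,14,1
"""
COLUMNS = ["age", "year", "nodes", "status"]  # assumed names
df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS)

# One array of node counts per class, in a fixed order.
statuses = [1, 2]
groups = [df.loc[df["status"] == s, "nodes"].to_numpy() for s in statuses]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(9, 4))

# Box plot: median, quartiles, whiskers, and individual outlier points.
ax_box.boxplot(groups)
# Violin plot: the same range plus a smoothed view of the distribution shape.
ax_violin.violinplot(groups, showmedians=True)

for ax in (ax_box, ax_violin):
    ax.set_xticks([1, 2])
    ax.set_xticklabels([f"status {s}" for s in statuses])
    ax.set_ylabel("nodes")
```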
Output:
CDF and PDF:
The cumulative distribution function (CDF) and probability density function (PDF) show the percentage of values at or below a given point and how the values are distributed, which helps to estimate an error percentage.
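One simple way to approximate both curves is from a histogram: normalize the bin counts to get the fraction of values per bin, then take the running sum. Strictly speaking, this gives a per-bin probability estimate rather than a true density. The sketch below does this per class for the node count, with made-up rows and assumed column names.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up rows in the dataset's four-column layout (NOT the real file).
SAMPLE_CSV = """\
30,64,1,1
34,59,0,2
38,69,21,2
41,60,23,2
43,58,52,2
45,66,0,1
50,63,13,2
52,62,3,1
57,64,1,1
61,65,0,1
65,58,0,1
70,66,14,1
"""
COLUMNS = ["age", "year", "nodes", "status"]  # assumed names
df = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None, names=COLUMNS)

fig, ax = plt.subplots()
curves = {}

for status, group in df.groupby("status"):
    counts, bin_edges = np.histogram(group["nodes"], bins=5)
    pdf = counts / counts.sum()  # fraction of values falling in each bin
    cdf = np.cumsum(pdf)         # fraction of values up to each bin's right edge
    curves[status] = (bin_edges, pdf, cdf)

    # Plot both curves against the right edge of each bin.
    ax.plot(bin_edges[1:], pdf, label=f"PDF, status {status}")
    ax.plot(bin_edges[1:], cdf, linestyle="--", label=f"CDF, status {status}")

ax.set_xlabel("nodes")
ax.set_ylabel("fraction of values")
ax.legend()
```

Reading the CDF at a chosen threshold gives the share of a class that falls at or below it, which is how a simple cutoff rule can be checked against each class.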
Output:
Observation:
1. Both classes overlap, so they cannot be told apart.
Now, plotting the CDF and PDF feature by feature:
Output:
Observations:
- It is tough to isolate either patient state, and the data is non-uniformly distributed.
- It appears that as the number of axillary nodes and age increase, the rate of survival decreases.
That's all.
I have covered the basic approach to exploring data. There are many different ways to perform EDA; it all depends on the dataset and on your observations.
You can find the source code on GitHub.
If you are confused about any function or class of a library, please check its documentation. Also, in the code I have used many concepts that I haven't explained here. I recommend learning some basics of statistics and probability, as both topics are at the core of data analysis.
For more, check out:
https://www.youtube.com/watch?v=YXLVjCKVP7U
https://medium.com/analytics-vidhya/pdf-pmf-and-cdf-in-machine-learning-225b41242abe
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba
https://medium.com/@srimalashish/why-eda-is-necessary-for-machine-learning-233b6e4d5083
If there's any correction or scope for improvement, or if you have any queries, feel free to connect with me via LinkedIn.