Surprisingly Powerful Dataset Exploration Techniques For Rookies
Last Updated on July 20, 2023 by Editorial Team
Author(s): Beltus Nkwawir
Originally published on Towards AI.
Let's explore the COVID-19 dataset and others with interestingly simple yet powerful techniques.
COVID-19 dataset download complete!
"What now?" you must be thinking.
That is exactly what I used to think when I started this long yet exciting journey into the planet of data science.
I would stare at a dataset folder on my computer screen like a sheep that had just lost its shepherd. Lost, confused, and utterly frustrated, I would find myself watching a YouTube clip of Sheldon from The Big Bang Theory.
Slowly but surely, mission failed.
If you are new to data science, it is daunting and overwhelming to get started with understanding the contents of any dataset amid the ocean of datasets existing today. Every second that ocean expands further.
It is imperative that every rookie in data science becomes familiar with the very basic techniques of exploring a dataset, because data is the fuel that powers AI.
Working on a machine learning project without first understanding the dataset you are working with is like trying to build a skyscraper without a blueprint. Think about that for a second.
In today's article, I will share a couple of basic yet powerful exploration techniques that you can apply to almost any dataset.
These techniques will help you break through datasets. Not only that, but you will also have a holistic understanding of the dataset you are working with.
This should be enough to activate the flow of dopamine in your system so that you can frictionlessly complete your machine learning project while having fun.
Okay, enough of the talk. Let's get to it.
Prerequisite
For this article, we will be using Pandas, which is one of the most popular data analysis libraries used in machine learning. It has many interesting built-in functions that we will be exploring shortly.
Some people use Pandas together with Matplotlib and Seaborn for statistical data visualization. Feel free to check them out.
If you have never used the Pandas library before, do not break a sweat over it. I wrote this article with you in mind.
In order to use the Pandas library, we first need to install it.
pip install pandas
I am using a Jupyter notebook to run all the code in this article. If you don't have a Jupyter notebook set up yet, I got you covered. Check out the DataCamp site.
At the end of this article, I will share a link to my GitHub page where you can access the complete notebook. So stay frosty till the end.
Import Relevant Libraries
To have access to the amazing Pandas functions, we need to import the library into our code. I imported it with the alias pd. Pandas is built upon NumPy, which is very useful for mathematical operations.
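The import step described above might look like the following. This is a minimal sketch; the `np` alias for NumPy is the usual convention rather than something stated in the article.

```python
import numpy as np   # numerical arrays and math routines
import pandas as pd  # DataFrame and Series data structures

# Quick sanity check that both libraries are importable.
print("pandas", pd.__version__)
print("numpy", np.__version__)
```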
Load Dataset into Memory.
The Pandas read_csv() function reads the dataset file and returns it as a Pandas DataFrame object. Think of a DataFrame as a table with labeled rows and columns. Here I am using the COVID-19 dataset from Kaggle.
At the end of the article, I will provide links to all the different datasets I used in this article. So don't fret about it.
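Here is a rough sketch of the loading step. To keep it runnable without the Kaggle download, it reads from a small in-memory CSV. The column names and every value in it are invented for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Toy stand-in for a CSV file on disk; the rows are invented for illustration.
csv_text = """Country,Lat,Long,Date,Confirmed
A,10.0,20.0,2020-01-22,0
A,10.0,20.0,2020-01-23,2
B,-5.5,30.1,2020-01-22,1
"""

# read_csv accepts a file path or any file-like object.
# With a real download you would pass the path to your CSV file instead.
df = pd.read_csv(io.StringIO(csv_text))
print(type(df))
print(df)
```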
Display the Number of Samples (Observations) in Dataset.
For any dataset you come across, it is important to know the number of data points or training examples present in it, as well as the number of features. The shape of a DataFrame gives you both: the number of rows (samples) and the number of columns (features).
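One way to get both numbers is the `shape` attribute, sketched here on a tiny made-up DataFrame (the column names and values are assumptions for illustration). The figures reported next in the article come from the author's full COVID-19 dataset, not from this toy table.

```python
import pandas as pd

# Tiny made-up table: 4 rows, 3 columns.
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Confirmed": [0, 2, 1, 5],
    "Deaths": [0, 0, 0, 1],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print("Number of Samples or observations:", n_rows)
print("Number of Attributes or Features:", n_cols)
```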
Number of Samples or observations: 18056
Number of Attributes or Features: 8
If you are interested in knowing just the number of rows, do this.
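A short sketch of counting only the rows, assuming the built-in `len()` is used (the article's own snippet may differ). The toy data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Confirmed": [0, 2, 1, 5],
})

# len() on a DataFrame returns the number of rows; df.shape[0] is equivalent.
n_samples = len(df)
print("Number of Samples:", n_samples)
```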
Number of Samples: 18056
Take a Quick Peek at the Contents of your Dataset
What exactly is the content of your dataset? The Pandas head() and tail() methods help you with that. By default, head() prints the first 5 rows and tail() prints the last 5 rows of your dataset.
You can also specify the number of rows you want to display by passing it as an argument to either method.
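The two calls can be sketched like this on a small invented table. The `day` and `Confirmed` columns are placeholders, not real data.

```python
import pandas as pd

# Eight made-up rows so that head() and tail() return different slices.
df = pd.DataFrame({
    "day": range(1, 9),
    "Confirmed": [0, 1, 3, 4, 7, 9, 12, 20],
})

first_rows = df.head()    # first 5 rows by default
last_rows = df.tail()     # last 5 rows by default
first_three = df.head(3)  # pass a number to change how many rows you get

print(first_rows)
print(last_rows)
print(first_three)
```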
Sometimes your dataset will have so many columns that you won't be able to view them all, unless you are working with a gigantic computer screen. Luckily, with the code snippet below, you can scroll left and right to see everything. Cool, right?
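One setting that has this effect is the `display.max_columns` option. This is an assumption about which setting the article used; treat it as one possible way to show every column.

```python
import pandas as pd

# None removes the column limit, so wide DataFrames are no longer truncated
# with "..." in the middle. In Jupyter the output becomes horizontally scrollable.
pd.set_option("display.max_columns", None)

# A wide toy DataFrame with 30 made-up columns to try it on.
wide = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})
print(wide.head())
```

To go back to the default behavior, call `pd.reset_option("display.max_columns")`.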
Display Interesting Statistical Information.
Let's go a step further and gain some statistical insights into our dataset. Personally, I consider the .describe() method a *must-know* function.
It provides some invaluable statistical information such as count, mean, standard deviation, etc., for each of the numeric columns found in our dataset.
The count row can be especially useful because it counts only non-missing values. A count lower than the number of rows is a clue that the column has missing values, which could negatively affect the performance of your machine learning model.
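A minimal sketch of `describe()` on invented data. One value in the hypothetical `Deaths` column is deliberately missing to show how the count row exposes it.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],           # non-numeric, skipped by default
    "Confirmed": [0, 2, 1, 5],
    "Deaths": [0.0, float("nan"), 0.0, 1.0],   # one missing value on purpose
})

summary = df.describe()
print(summary)

# count reflects non-missing values only, so Deaths reports 3 for 4 rows.
print(summary.loc["count"])
```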
Removing Irrelevant Columns
When building a machine learning model, some features or columns might contribute nothing to the performance of the prediction. You can eliminate any unwanted column with the drop() method. Passing axis=1 tells Pandas that the labels refer to columns rather than rows.
I dropped the Lat and Long columns in the COVID-19 dataset.
Note: It is good practice to assign the modified dataset to a new variable.
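As a sketch, dropping the Lat and Long columns and keeping the result in a new variable could look like this. The other columns and all values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["A", "B"],
    "Lat": [10.0, -5.5],
    "Long": [20.0, 30.1],
    "Confirmed": [2, 5],
})

# axis=1 means "drop columns"; the original df is left untouched.
df_trimmed = df.drop(["Lat", "Long"], axis=1)
print(df_trimmed.columns.tolist())
```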
Number of Unique Samples in Dataset
This next code snippet is often very useful in machine learning, especially when we are dealing with classification problems. It tells you the number of samples that belong to each category. An example is shown below.
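A common way to get per-category counts is `value_counts()`. This is an assumption about the snippet the article used, shown on an invented `species` column.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor",
                "virginica", "virginica", "virginica"],
})

# value_counts() returns the number of rows per category, largest first.
counts = df["species"].value_counts()
print(counts)

# nunique() gives just the number of distinct categories.
print(df["species"].nunique())
```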
Renaming Weird Column Names
Datasets can be messy at times, with weird column names given by their creators. You don't have to stick with these. If you don't like them, just change them like so.
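One way to replace every column name at once is to assign a new list to `df.columns`. This is a sketch with hypothetical old and new names.

```python
import pandas as pd

df = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Lat": [10.0, -5.5],
    "Long": [20.0, 30.1],
})

# The new list must have exactly one name per existing column, in order.
df.columns = ["country", "latitude", "longitude"]
print(df.columns.tolist())
```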
Sometimes, you just need to change that specific column name that sounds like gibberish to your ears. Just do this.
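To rename a single column, `rename()` with a mapping from old name to new name is the usual tool. The column names below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "Confirmed": [2, 5],
})

# Only the columns listed in the mapping are renamed; the rest stay as they are.
df_renamed = df.rename(columns={"Country/Region": "country"})
print(df_renamed.columns.tolist())
```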
Histogram Plot.
Let's spice things up a little with visualizations. The first and most commonly used visualization tool is the histogram. Ensure that the column whose histogram you are plotting contains only numeric values. We can optionally pass the number of bins as the only argument. That's it.
For demonstration purposes, I am also using the Iris dataset to show interesting plots. The link is included at the end of the article.
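A minimal histogram sketch, assuming Matplotlib is installed alongside Pandas. The `sepal_length` values are invented and only mimic the kind of column found in the Iris dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this line out in Jupyter
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.0, 5.7],
})

# bins sets how many equal-width intervals the value range is split into.
ax = df["sepal_length"].hist(bins=4)
ax.set_xlabel("sepal_length")
ax.set_ylabel("count")
```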
If you are curious and interested in visualizing the histogram plots of all your features, you can generate multiple plots as indicated below. The subplots option creates a separate histogram for each column, and layout specifies the number of plots per row and column.
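One possible form of this call is `DataFrame.plot.hist` with `subplots=True` and a `layout` tuple. This is a sketch on invented values; the four column names follow the usual Iris naming and are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this line out in Jupyter
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1],
    "sepal_width": [3.5, 3.0, 3.3, 2.7, 3.0],
    "petal_length": [1.4, 1.4, 6.0, 5.1, 5.9],
    "petal_width": [0.2, 0.2, 2.5, 1.9, 2.1],
})

# One histogram per column, arranged in a grid of 2 rows by 2 columns.
axes = df.plot.hist(subplots=True, layout=(2, 2), bins=5, figsize=(8, 6))
```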
Bar Chart
This is an interesting visualization tool. However, in order to plot a bar chart using the plot.bar() method, we first need to count the occurrences using the value_counts() method and then order the result by its index labels using the sort_index() method. If you would rather order the bars by count, use sort_values() instead.
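The count, sort, and plot chain can be sketched as follows. The `rating` column is a hypothetical categorical feature with made-up values, and Matplotlib is assumed to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this line out in Jupyter
import pandas as pd

df = pd.DataFrame({"rating": [3, 1, 2, 3, 3, 1]})

# value_counts() counts each category; sort_index() then orders the result
# by category label (1, 2, 3), not by how often each label occurs.
counts = df["rating"].value_counts().sort_index()
ax = counts.plot.bar()
ax.set_xlabel("rating")
ax.set_ylabel("occurrences")
```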
Here is a great example of the bar plot in the COVID-19 dataset. First group the data by country, then calculate the mean of confirmed cases and sort in ascending order. Finally, plot a bar chart of the 10 countries with the highest number of confirmed cases.
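These steps might be written as below. The `Country` and `Confirmed` column names and all values are invented, and the toy table has only three countries, so the "top 10" slice simply returns all of them.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this line out in Jupyter
import pandas as pd

df = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Confirmed": [1, 3, 10, 20, 100, 300],
})

top = (
    df.groupby("Country")["Confirmed"]
      .mean()          # mean confirmed cases per country
      .sort_values()   # ascending order
      .tail(10)        # the last 10 entries are the 10 largest
)
ax = top.plot.bar()
ax.set_ylabel("mean confirmed cases")
```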
Scatter plot
Last but far from least is the scatter plot. It is generally used to show the relationship between two columns or features in your dataset. Note that only numeric columns can be plotted.
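A scatter plot of two numeric columns can be sketched with `DataFrame.plot.scatter`. The two column names and their values are placeholders, and Matplotlib is assumed to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; leave this line out in Jupyter
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, 7.1],
    "petal_length": [1.4, 1.4, 6.0, 5.1, 5.9],
})

# x and y are the names of the two numeric columns to compare.
ax = df.plot.scatter(x="sepal_length", y="petal_length")
```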
Other Plots
There are other interesting plots that I decided not to cover in this post. However, I will mention them here with appropriate links for easy exploration.
* Line Chart
* Box plots
* Heatmap
* Pairplot
Bonus Tip
I came across an article published in Towards Data Science, written by Brenda Hali and titled My Pandas Cheat Sheet, and I found it very helpful. I think it contains very useful and handy Pandas functions that you will frequently find yourself in need of.
The Takeaway
No matter which dataset you come across on the internet, these simple exploratory techniques should be the very first idea that comes to your mind. They will give you a clue of what you are dealing with.
Understanding the contents and relationships of a dataset will help you decide whether it is going to be useful for the project you are working on or whether it is a total waste of time.
When you implement the Pandas techniques outlined in this post, the battle is already half won before it has even started. Trust me, it will save you a lot of time and headaches.
Thanks for reading, and hopefully it was helpful. If you have other interesting functions you think I and others should know about, please do not hesitate to leave a comment. Let's learn and grow together.
If you loved this post, feel free to read some amazing posts on my blog.
References
Link to GitHub Jupyter notebook with complete code
Link to Kaggle COVID-19 dataset
Link to Iris dataset
https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools
https://realpython.com/pandas-python-explore-dataset/
https://towardsdatascience.com/exploring-the-data-using-python-47c4bc7b8fa2
https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184
https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed