Surprisingly Powerful Dataset Exploration Techniques For Rookies

Last Updated on July 20, 2023 by Editorial Team

Author(s): Beltus Nkwawir

Originally published on Towards AI.

Let's explore the COVID-19 dataset and others with interestingly simple yet powerful techniques.

COVID-19 dataset download complete!

What now? You must be thinking.

That is exactly what I used to think when I started this long yet exciting journey into the planet of data science.

I would stare at a dataset folder on my computer screen like a sheep that had just lost its shepherd. Lost, confused, and utterly frustrated, I would find myself watching a YouTube clip of Sheldon from The Big Bang Theory.

Slowly but surely, mission failed.

If you are new to data science, it is daunting and overwhelming to get started with understanding the contents of any dataset amid the ocean of datasets existing today. Every second that ocean expands further.

It is imperative that every rookie in data science become familiar with the basic techniques for exploring a dataset, because data is the fuel that powers AI.

Working on a machine learning project without first understanding the dataset you are working with is like trying to build a skyscraper without a blueprint. Think about that for a second.

In today’s article, I will share a couple of basic yet powerful exploration techniques that you can apply to almost any dataset.

These techniques will help you break through datasets. Not only that, but you will also have a holistic understanding of the dataset you are working with.

This should be enough to activate the flow of dopamine in your system so that you can frictionlessly complete your machine learning project while having fun.

Okay, enough of the talk. Let's get to it.

Prerequisite

For this article, we will be using Pandas, which is one of the most popular data analysis libraries for machine learning. It has many interesting built-in functions that we will be exploring shortly.

Some people use Pandas together with Matplotlib and Seaborn for statistical data visualization. Feel free to check them out.

If you have never used the Pandas library before, do not break a sweat over it. I wrote this article with you in mind.

In order to use the Pandas library, we first need to install it.

pip install pandas

I am using Jupyter Notebook to run all the code in this article. If you don’t have a Jupyter notebook set up yet, I've got you covered. Check out the DataCamp site.

At the end of this article, I will share a link to my GitHub page where you can access the complete notebook. So stay frosty till the end.

Import Relevant Libraries

To have access to the amazing Pandas functions, we need to import the library into our code. I imported it with the alias pd. Pandas is built upon NumPy, which is very useful for mathematical operations.

Import the pandas and NumPy libraries
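A minimal sketch of the import step, using the conventional aliases pd and np. Printing the versions is just my own quick check that both libraries are installed.

```python
import numpy as np
import pandas as pd

# Print the installed versions to confirm both imports succeeded.
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```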

Load Dataset into Memory.

The pandas .read_csv() function reads the dataset file and returns it as a Pandas DataFrame object. Think of a DataFrame as a table with labeled rows and columns. Here I am using the COVID-19 dataset from Kaggle.
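Here is a rough sketch of the loading step. So that it runs anywhere, it reads a tiny in-memory CSV instead of the real download. The column names are my assumption about the file's layout and the values are made up, so treat both as placeholders.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: made-up rows, assumed columns.
csv_text = """Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
,Country A,1.0,2.0,2020-04-01,100,5,20
,Country B,3.0,4.0,2020-04-01,200,10,40
Region X,Country C,5.0,6.0,2020-04-01,300,15,60
"""

# With a real file you would pass its path instead of a StringIO object.
df = pd.read_csv(io.StringIO(csv_text))
print(type(df))
```

Empty fields, such as the missing Province/State values here, are loaded as NaN.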

At the end of the article, I will provide links to all the different datasets I used in this article. So don’t fret about it.

Display the Number of Samples (Observations) in Dataset.

For any dataset you come across, it is important to know the number of data points or training examples it contains, along with the number of features. These are simply the number of rows and the number of columns, respectively.

Number of Samples or observations: 18056
Number of Attributes or Features 8
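Counts like the two above can be read from the DataFrame's shape attribute. This is a minimal sketch on a tiny made-up frame, so the numbers it prints will differ from the 18056 rows and 8 columns of the COVID-19 dataset.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset.
df = pd.DataFrame(
    {
        "Country": ["A", "B", "C"],
        "Confirmed": [10, 20, 30],
        "Deaths": [1, 2, 3],
    }
)

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print("Number of Samples or observations:", n_rows)
print("Number of Attributes or Features:", n_cols)
```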

If you are interested in knowing just the number of rows, do this.

Number of Samples: 18056
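One way to get the row count on its own, sketched on made-up data, is the built-in len(), which matches the first element of shape.

```python
import pandas as pd

# Made-up data; only the number of rows matters here.
df = pd.DataFrame({"Country": ["A", "B", "C", "D"], "Confirmed": [10, 20, 30, 40]})

n_samples = len(df)  # same value as df.shape[0]
print("Number of Samples:", n_samples)
```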

Take a Quick Peek at the Contents of your Dataset

What exactly is the content of your dataset? The Pandas head() and tail() methods help you with that. By default, head() returns the first 5 rows and tail() the last 5 rows of your dataset.

Displaying the First and Last five rows of Covid-19 dataset
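A small sketch of both calls on a made-up 12-row frame (the column names are placeholders, not the real COVID-19 columns):

```python
import pandas as pd

# Made-up frame with 12 rows so the first and last five rows differ.
df = pd.DataFrame({"day": range(1, 13), "confirmed": [d * 10 for d in range(1, 13)]})

first_five = df.head()  # rows for day 1 to 5
last_five = df.tail()   # rows for day 8 to 12

print(first_five)
print(last_five)
```

In a Jupyter notebook, the last expression in a cell is displayed automatically, so df.head() on its own line is enough.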

You can also specify the number of rows you want to display by passing it as an argument to these functions.

Prints the first and last 10 rows respectively
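The same idea with an explicit row count, again on a made-up 12-row frame:

```python
import pandas as pd

# Made-up frame with 12 rows.
df = pd.DataFrame({"day": range(1, 13), "confirmed": [d * 10 for d in range(1, 13)]})

first_ten = df.head(10)  # day 1 to 10
last_ten = df.tail(10)   # day 3 to 12

print(first_ten)
print(last_ten)
```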

Sometimes your dataset will have so many columns that you won’t be able to view them all, unless you are working with a gigantic computer screen. Luckily, with the code snippet below, you can scroll left and right to see everything. Cool, right?

screenshot of random dataset showing the scroll button
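A sketch of one common way to get this behavior, assuming the goal is to stop Pandas from hiding the middle columns: set the display.max_columns option to None. The 30-column frame is made up purely to have something wide to look at.

```python
import pandas as pd

# Show every column instead of collapsing the middle ones into "...".
pd.set_option("display.max_columns", None)

# Made-up wide frame: one row, 30 columns.
wide_df = pd.DataFrame([list(range(30))], columns=[f"col_{i}" for i in range(30)])
print(wide_df.head())
```

In a notebook, the rendered table then gets a horizontal scroll bar when it is wider than the cell.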

Display Interesting Statistical Information.

Let's go a step further and gain some statistical insight into our dataset. Personally, I consider the .describe() method a *must-know* function.

It provides some invaluable statistical information, such as the count, mean, and standard deviation, for each of the numeric columns in our dataset.

screenshot of statistical information of COVID-19

The count row can be especially useful, as it gives a clue about any missing values in your data that could negatively affect the performance of your machine learning model.
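A minimal sketch of describe() on made-up data. I left one value out of the Deaths column on purpose, so you can see how the count row gives the gap away.

```python
import pandas as pd

# Made-up data; Deaths has one missing value on purpose.
df = pd.DataFrame(
    {
        "Country": ["A", "B", "C", "D"],
        "Confirmed": [10, 20, 30, 40],
        "Deaths": [1, 2, None, 4],
    }
)

stats = df.describe()  # numeric columns only by default
print(stats)

# A count lower than the number of rows hints at missing values.
print("Rows:", len(df), "| non-missing Deaths:", stats.loc["count", "Deaths"])
```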

Removing Irrelevant Columns

When building a machine learning model, some features or columns might contribute nothing to the quality of the predictions. You can eliminate any unwanted column by using the snippet below. The axis argument specifies that a column, not a row, is being dropped.

I dropped the Lat and Long columns in the COVID-19 dataset.

Screenshot of Latitude and Longitude Columns Dropped

Note: It is good practice to assign the modified dataset to a new variable.
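A minimal sketch of the drop, assuming the columns are named Lat and Long as in the text. The rest of the frame is made up.

```python
import pandas as pd

# Made-up rows; Lat and Long are the columns we want to get rid of.
df = pd.DataFrame(
    {
        "Country": ["A", "B"],
        "Lat": [1.0, 3.0],
        "Long": [2.0, 4.0],
        "Confirmed": [100, 200],
    }
)

# axis=1 means "drop columns"; the result goes into a new variable,
# so the original df is left untouched.
df_trimmed = df.drop(["Lat", "Long"], axis=1)
print(df_trimmed.columns.tolist())
```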

Number of Unique Samples in Dataset

This next code snippet is often very useful in machine learning, especially when we are dealing with classification problems. It helps you find out how many samples belong to each category. An example is shown below.
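A sketch of this count using value_counts(), which I am assuming is the function meant here. The species column is a made-up stand-in for a class label.

```python
import pandas as pd

# Made-up class labels.
df = pd.DataFrame(
    {"species": ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]}
)

# Number of samples per category, most frequent first.
counts = df["species"].value_counts()
print(counts)
```

A heavily lopsided result is an early warning that your classes are imbalanced.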

Renaming Weird Column Names

Datasets can be messy at times, with weird column names given by their creators. You don’t have to stick with these. If you don't like them, just change them like so.
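One way to replace every column name at once, sketched on made-up data, is to assign a new list to df.columns. The list must have one name per column, in order.

```python
import pandas as pd

# Made-up frame with awkward column names.
df = pd.DataFrame({"Country/Region": ["A", "B"], "Cnfrmd": [100, 200]})

# Assign new names in the same order as the existing columns.
df.columns = ["country", "confirmed"]
print(df.columns.tolist())
```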

Sometimes, you just need to change that specific column name that sounds like gibberish to your ears. Just do this.
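For a single column, a sketch with rename() and a mapping from the old name to the new one (both names here are made up):

```python
import pandas as pd

# Made-up frame; only one name needs fixing.
df = pd.DataFrame({"Country/Region": ["A", "B"], "Confirmed": [100, 200]})

# rename() returns a new DataFrame; unlisted columns keep their names.
renamed = df.rename(columns={"Country/Region": "Country"})
print(renamed.columns.tolist())
```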

Histogram Plot.

Let's spice things up a little with visualizations. The first and most commonly used visualization tool is the histogram. Ensure that the column you are plotting contains only numeric values. We can optionally pass the number of bins as the only argument. That’s it.

For demonstration purposes, I am also using the Iris dataset to show interesting plots. The link is included at the end of the article.
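A minimal histogram sketch. The values are a handful of made-up, Iris-like measurements rather than the real dataset, the column name is my assumption, and Matplotlib needs to be installed for plotting to work.

```python
import pandas as pd

# Made-up, Iris-like measurements for one numeric column.
iris_like = pd.DataFrame(
    {"sepal_length": [5.1, 4.9, 4.7, 6.4, 6.9, 5.5, 6.3, 5.8]}
)

# bins sets how many bars the value range is split into.
ax = iris_like["sepal_length"].plot.hist(bins=5)
ax.set_xlabel("sepal_length")
```

In a notebook the plot appears under the cell. In a plain script you would also call matplotlib.pyplot.show().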

If you are curious and interested in visualizing the histograms of all your features, you can generate multiple plots as indicated below. The subplots option creates a histogram for each column, and the layout specifies the number of rows and columns in the grid of plots.

screenshot of subplots of four features from the Iris Dataset
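A sketch of the multi-plot version on a made-up frame with four numeric columns. The column names imitate the Iris features but are assumptions.

```python
import pandas as pd

# Made-up, Iris-like frame with four numeric features.
iris_like = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
        "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.3, 2.7],
        "petal_length": [1.4, 1.4, 4.5, 4.9, 6.0, 5.1],
        "petal_width": [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
    }
)

# One histogram per column, arranged in a 2-by-2 grid.
axes = iris_like.plot.hist(subplots=True, layout=(2, 2), bins=5, figsize=(8, 6))
```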

Bar Chart

This is an interesting visualization tool. However, in order to plot a bar chart using the plot.bar() method, we first need to count the occurrences using the value_counts() method and then order the categories by their labels using the sort_index() method (use sort_values() instead if you want the counts ordered from smallest to largest).

Here is a great example of the bar plot in the COVID-19 dataset. First group the data by country, then calculate the mean of confirmed cases and sort in ascending order. Finally, plot a bar chart of the 10 countries with the highest number of confirmed cases.
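A minimal sketch of that chain on made-up labels. Note that sort_index() orders the bars by category label, so swap in sort_values() if you want them ordered by count.

```python
import pandas as pd

# Made-up class labels.
df = pd.DataFrame(
    {"species": ["virginica", "setosa", "virginica", "versicolor", "setosa", "virginica"]}
)

# Count each category, then order the bars by label before plotting.
counts = df["species"].value_counts().sort_index()
ax = counts.plot.bar()
ax.set_ylabel("count")
```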

screenshot of bar chart with the first 10 countries with highest confirmed cases
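The group, mean, sort, and plot steps can be sketched like this. The data is generated on the spot: 12 hypothetical countries with two rows each. The column names Country/Region and Confirmed are my assumptions about the dataset.

```python
import pandas as pd

# Made-up data: 12 hypothetical countries, two observations each.
rows = []
for i in range(12):
    rows.append({"Country/Region": f"Country_{i:02d}", "Confirmed": i * 10})
    rows.append({"Country/Region": f"Country_{i:02d}", "Confirmed": i * 10 + 10})
df = pd.DataFrame(rows)

# Mean confirmed cases per country, sorted from smallest to largest.
mean_confirmed = df.groupby("Country/Region")["Confirmed"].mean().sort_values()

# After an ascending sort, the 10 largest values are the last 10 entries.
top10 = mean_confirmed.tail(10)
ax = top10.plot.bar()
ax.set_ylabel("mean confirmed cases")
```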

Scatter plot

Last but far from least is the scatter plot. It is generally used to show the relationship between two columns or features in your dataset. Note that only numeric columns can be plotted.
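A minimal scatter plot sketch on made-up, Iris-like values. The two column names are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up, Iris-like measurements; both columns are numeric.
iris_like = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
        "petal_length": [1.4, 1.4, 4.5, 4.9, 6.0, 5.1],
    }
)

# One point per row: x from one column, y from the other.
ax = iris_like.plot.scatter(x="sepal_length", y="petal_length")
```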

Other Plots

There are other interesting plots that I decided not to cover in this post. However, I will mention them here with appropriate links for easy exploration.

* Line Chart

* Box plots

* Heatmap

* Pairplot

Bonus Tip

I came across an article published by Towards Data Science, written by Brenda Hali and titled My Pandas Cheat Sheet, and I found it very helpful. It contains handy pandas functions you will frequently find yourself needing.

The Takeaway

No matter what dataset you come across on the internet, these simple exploratory techniques should be the very first thing that comes to mind. They will give you a clue of what you are dealing with.

Understanding the contents and relationships of a dataset will help you decide whether it is going to be useful for the project you are working on or a total waste of time.

When you implement the pandas techniques outlined in this post, the battle is already half won before it has even started. Trust me, it will save you a lot of time and headaches.

Thanks for reading, and hopefully it was helpful. If you have other interesting functions you think others and I should know about, please do not hesitate to leave a comment. Let’s learn and grow together.

If you loved this post, feel free to read some other amazing posts on my blog.

References

Link to GitHub Jupyter notebook with complete code

Link to Kaggle COVID-19 dataset

Link to Iris dataset

https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools

https://realpython.com/pandas-python-explore-dataset/
https://towardsdatascience.com/exploring-the-data-using-python-47c4bc7b8fa2
https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184
https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed


Published via Towards AI
