Towards AI

The leading AI community and content platform focused on making AI accessible to all. Check out our new course platform: https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev


Image by the Author | All images are from the author(s) unless stated otherwise.


DATA SCIENCE, EDITORIAL, PROGRAMMING

Handling Missing Values in Pandas

A hands-on visual tutorial on how to detect and handle missing data in pandas

Towards AI Editorial Team
Published in Towards AI
Apr 29, 2021

Author(s): Pratik Shukla, Roberto Iriondo

The most crucial and time-consuming part of any data science project is data cleansing and preparation. Thankfully, there are many powerful tools available that help us expedite this process.

The pandas library is one of the most widely used data analysis libraries in Python. Before we run any analysis or fit any model, it is critical to find the missing values that may affect our outputs.

Missing data occurs, for example, when a survey respondent does not share some of their information. This tutorial covers several methods that help us identify, remove, and fill such missing data with pandas.

The companion materials for this tutorial can be found under our resources section.

Table of Contents:

  1. pd.isna()
  2. pd.notna()
  3. pd.isnull()
  4. pd.notnull()
  5. pd.DataFrame.dropna()
  6. pd.DataFrame.fillna()
  7. pd.DataFrame.bfill()
  8. pd.DataFrame.backfill()
  9. pd.DataFrame.ffill()
  10. pd.DataFrame.pad()
  11. Closing Remarks
  12. Resources
  13. References

pd.isna( ):

We use the pd.isna() function of the pandas library to detect missing values in a scalar or an array-like object. Let us first see the syntax of pd.isna() and then understand it with examples.

Syntax and Parameter Explanation of pandas.isna( ) Function

Before we move on to understand how the pd.isna() function works, let us first import some required libraries.

Importing Required Libraries

A. Example — 1:

Scalar Arguments (Number)

B. Example — 2:

Scalar Arguments (String)

C. Example — 3:

Please note that empty strings are not considered NA values. That is why the output of pd.isna("") is False.

Empty String

D. Example — 4:

NaN — Not a Number

E. Example — 5:

inf — Infinity

F. Example — 6:

None

G. Example — 7:

NA — Not Available

H. Example — 8:

NaT — Not a Timestamp
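The scalar examples above can be sketched in a few lines. This is a minimal illustration with values we picked ourselves (the original code images are not available), assuming the default pandas options:

```python
import numpy as np
import pandas as pd

# Each scalar is passed to pd.isna(); only the "missing" markers report True.
scalar_checks = {
    "number": pd.isna(7),
    "string": pd.isna("pandas"),
    "empty string": pd.isna(""),
    "NaN": pd.isna(np.nan),
    "inf": pd.isna(np.inf),
    "None": pd.isna(None),
    "NA": pd.isna(pd.NA),
    "NaT": pd.isna(pd.NaT),
}

for label, result in scalar_checks.items():
    print(f"{label:>12}: {result}")
```

Under these assumptions, only NaN, None, NA, and NaT are reported as missing. Numbers, strings (including the empty string), and infinity are treated as valid values.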

N-Dimensional Arrays:

I. Example — 9:

1-Dimensional Array

J. Example — 10:

2-Dimensional Array
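For arrays, pd.isna() returns a boolean array of the same shape as its input. The sketch below uses small NumPy arrays that we made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sample arrays with a few NaN entries.
one_dim = np.array([1.0, np.nan, 3.0])
two_dim = np.array([[1.0, np.nan], [np.nan, 4.0]])

mask_1d = pd.isna(one_dim)  # boolean array, one flag per element
mask_2d = pd.isna(two_dim)  # boolean array with the same 2 x 2 shape

print(mask_1d)
print(mask_2d)
```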

Index Values:

K. Example — 11:

Index values (1-Dimensional)

L. Example — 12:

Timestamp Values (1-Dimensional)
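The same check works on index objects. The following sketch assumes a numeric index and a DatetimeIndex with one missing entry each (both are illustrative values, not the ones from the original figures):

```python
import numpy as np
import pandas as pd

# Hypothetical index objects, each with a missing value in the middle.
numeric_index = pd.Index([1.0, np.nan, 3.0])
date_index = pd.DatetimeIndex(["2021-04-27", None, "2021-04-29"])

# None in a DatetimeIndex is stored as NaT, which pd.isna() reports as missing.
numeric_mask = pd.isna(numeric_index)
date_mask = pd.isna(date_index)

print(numeric_mask)
print(date_mask)
```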

Pandas Series:

M. Example — 13:

Pandas Series

Pandas DataFrame:

N. Example — 14:

Pandas DataFrame
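For a Series or a DataFrame, pd.isna() returns an object of the same type filled with booleans. Here is a minimal sketch with made-up column names and values; summing the mask is a common way to count missing entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
series = pd.Series([10, np.nan, 30], name="score")
frame = pd.DataFrame({"name": ["Ann", None, "Cy"], "age": [31, 45, np.nan]})

series_mask = pd.isna(series)         # boolean Series
frame_mask = pd.isna(frame)           # boolean DataFrame
missing_per_column = frame_mask.sum() # True counts as 1

print(series_mask)
print(frame_mask)
print(missing_per_column)
```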

pd.notna( ):

The pd.notna() function of the pandas library is used to detect non-missing (valid) values in a scalar or an array-like object. Please note that pd.notna() is the boolean inverse of the pd.isna() function. Let us first see the syntax of pd.notna() and then understand it with examples.

Syntax and Parameter Explanation for pandas.notna( ) Function

Before we move on to understand how the pd.notna() function works, let us first import some required libraries.

Importing Required Libraries

A. Example — 1:

Scalar Argument (Number)

B. Example — 2:

Scalar Argument (String)

C. Example — 3:

Empty String

D. Example — 4:

inf — Infinity

E. Example — 5:

NaN — Not a Number

F. Example — 6:

None

G. Example — 7:

NA — Not Available

H. Example — 8:

NaT — Not a Timestamp

I. Example — 9:

1-Dimensional Array

J. Example — 10:

2-Dimensional Array

K. Example — 11:

Index Values

L. Example — 12:

Timestamp Values

M. Example — 13:

Pandas Series

N. Example — 14:

Pandas DataFrame
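The examples above all follow from pd.notna() being the inverse of pd.isna(). The sketch below checks that on a small, made-up DataFrame and then uses the mask to keep only the fully populated rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
frame = pd.DataFrame({"name": ["Ann", None, "Cy"], "age": [31, 45, np.nan]})

valid = pd.notna(frame)                        # True where a value is present
is_inverse = valid.equals(~pd.isna(frame))     # notna is the negation of isna
complete_rows = frame[valid.all(axis=1)]       # rows with no missing value

print(is_inverse)
print(complete_rows)
```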

Important Note:

  1. The pd.isnull() function is an alias of the pd.isna() function and gives exactly the same results. It is recommended to use pd.isna() instead of pd.isnull().

pd.isnull( ) Function

  2. The pd.notnull() function is an alias of the pd.notna() function and gives the same results. It is recommended to use pd.notna() instead of pd.notnull().

pd.notnull( ) Function
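A quick way to convince ourselves that the aliases behave identically is to compare their outputs on the same input. The Series below is a made-up example:

```python
import numpy as np
import pandas as pd

# Hypothetical values; None is stored as NaN in a float Series.
values = pd.Series([1.0, np.nan, None, 4.0])

same_missing = pd.isnull(values).equals(pd.isna(values))
same_valid = pd.notnull(values).equals(pd.notna(values))

print(same_missing, same_valid)
```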

To replicate this tutorial, please run the Google Colab notebook.

pd.DataFrame.dropna( ):

We use the dropna() method of a pandas DataFrame to remove missing values. Let us first see its syntax and parameters to get a better idea of what it can do.

Syntax and Parameter Explanation of pandas.dropna( ) Function

Let’s take a few examples to understand exactly how the parameters of the dropna() method affect the output.

Before diving deeper into the dropna() method, let us first create a DataFrame to work with.

A. Create a DataFrame:

Main DataFrame

Python Implementation:

Python Code
Output

B. Example — 1:

If we do not specify any parameter for the dropna() method, it will delete all the rows with at least one missing element.

Parameters Used:

None

pd.dropna( ) without any parameters

Python Implementation:

Python Code

C. Example — 2:

If we specify the parameter axis=0, it will delete all the rows with at least one missing element. This is the default behavior.

Parameters Used:

axis = 0

pd.dropna( ) with axis=0

Python Implementation:

Python Code

D. Example — 3:

Instead of specifying axis=0, we can also specify axis="rows" as a parameter (the name used in the pandas documentation is "index"). It works the same way.

Parameters Used:

axis = “rows”

pd.dropna( ) with axis=rows

Python Implementation:

Python Code
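The three row-wise variants above can be compared side by side. This is a minimal sketch on a hypothetical DataFrame (the column names and values are ours, since the original figure is not available):

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 each contain one missing element.
data = pd.DataFrame({
    "Person": ["Alice", "Bob", None, "Dana"],
    "Age": [29, np.nan, 41, 35],
    "Country": ["US", "IN", "ES", "BR"],
})

default_drop = data.dropna()            # no parameters
axis_zero = data.dropna(axis=0)         # explicit default
axis_rows = data.dropna(axis="rows")    # string spelling used in the article

print(default_drop)
```

All three calls return a new DataFrame that keeps only the rows without any missing element, and they leave `data` itself unchanged.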

E. Example — 4:

If we specify the parameter axis=1, it will delete all the columns with at least one missing element.

Parameters Used:

axis = 1

pd.dropna( ) with axis=1

Python Implementation:

Python Code

F. Example — 5:

Instead of specifying axis=1, we can also specify axis="columns" as a parameter. It works the same way.

Parameters Used:

axis = “columns”

pd.dropna( ) with axis=columns

Python Implementation:

Python Code
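Dropping along the other axis removes columns instead of rows. The sketch below assumes the same kind of hypothetical DataFrame, where only one column is complete:

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Person" and "Age" each contain a missing element.
data = pd.DataFrame({
    "Person": ["Alice", "Bob", None, "Dana"],
    "Age": [29, np.nan, 41, 35],
    "Country": ["US", "IN", "ES", "BR"],
})

by_number = data.dropna(axis=1)         # drop columns with any missing value
by_name = data.dropna(axis="columns")   # equivalent string spelling

print(by_number)
```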

G. Example — 6:

If we specify how="any" as a parameter, it will remove every row that has at least one missing element. To apply the same rule to columns, we also have to set the axis parameter.

Parameters Used:

how = “any”

pd.dropna( ) with how=”any”

Python Implementation:

Python Code

H. Creating a DataFrame:

Creating a New DataFrame

Python Implementation:

Python Code
Output

I. Example — 7:

If we specify how="all" as a parameter, it will remove only the rows in which every element is missing. To apply the same rule to columns, we also have to set the axis parameter.

Parameters Used:

how = “all”

pd.dropna( ) with how=”all”

Python Implementation:

Python Code
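The difference between how="any" and how="all" is easiest to see on a DataFrame that contains both a partially missing row and a fully missing row. The data below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is entirely missing, row 2 is partially missing.
data = pd.DataFrame({
    "Person": ["Alice", None, "Cara"],
    "Age": [29, np.nan, np.nan],
    "Country": ["US", None, "ES"],
})

any_drop = data.dropna(how="any")   # drops rows 1 and 2
all_drop = data.dropna(how="all")   # drops only row 1

print(any_drop)
print(all_drop)
```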

J. Example — 8:

If we specify the thresh parameter, it will keep only the rows that have at least that many non-missing elements. In the following example, we have specified thresh=5, which means that only the rows with 5 or more non-missing elements are kept.

Parameters Used:

thresh = 5

pd.dropna( ) with thresh=5

Python Implementation:

Python Code
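As a small sketch of the thresh rule, the hypothetical DataFrame below has four columns, and its rows contain 4, 3, and 1 non-missing elements respectively. We use thresh=3 here only because this sample has fewer columns than the article's example:

```python
import numpy as np
import pandas as pd

# Hypothetical data with 4, 3, and 1 non-missing elements per row.
data = pd.DataFrame({
    "Person": ["Alice", "Bob", None],
    "Age": [29, 35, np.nan],
    "Degree": ["MSc", None, None],
    "Country": ["US", "IN", "ES"],
})

at_least_three = data.dropna(thresh=3)  # keeps rows 0 and 1
at_least_four = data.dropna(thresh=4)   # keeps only row 0

print(at_least_three)
```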

K. Create a DataFrame:

Creating a New DataFrame

Python Implementation:

Python Code
Output

L. Example — 9:

If we only want to consider a subset of columns to find and drop the missing elements, we can use the subset parameter to specify the column names in which we want to look for the missing elements. In the following example, it will only look in “Person”, “Degree”, “Country” columns to find the missing values. Missing values in other columns will not affect the final output.

Parameters Used:

subset = [“Person”, “Degree”, “Country”]

pd.dropna( ) with subset=[“Person”,”Degree”,”Country”]

Python Implementation:

Python Code
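The subset behavior can be sketched as follows. The column names match the ones used in this example, while the values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing only "Age", row 2 is missing "Person".
data = pd.DataFrame({
    "Person": ["Alice", "Bob", None],
    "Age": [29, np.nan, 41],
    "Degree": ["MSc", "BSc", "PhD"],
    "Country": ["US", "IN", "ES"],
})

# Only the listed columns are inspected, so the gap in "Age" is ignored.
result = data.dropna(subset=["Person", "Degree", "Country"])

print(result)
```

Row 1 survives even though its Age is missing, because Age is not part of the subset.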

M. Create a DataFrame:

Create a New DataFrame

Python Implementation:

Python Code
Output

N. Example — 10:

If we want the changes to occur in our original DataFrame, we have to specify inplace=True as a parameter. Note that the call then does not return a new DataFrame. After execution, the original DataFrame itself holds the result of the dropna() call.

Parameters Used:

inplace = True

pd.dropna( ) with inplace=True

Python Implementation:

Python Code
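A minimal sketch of the in-place variant, using a hypothetical two-row DataFrame. We do not use the return value here and only inspect the original object after the call:

```python
import pandas as pd

# Hypothetical data: the second row has a missing "Person".
data = pd.DataFrame({"Person": ["Alice", None], "Age": [29, 41]})
rows_before = len(data)

data.dropna(inplace=True)   # modifies `data` directly
rows_after = len(data)

print(rows_before, rows_after)
```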

pd.DataFrame.fillna( ):

The fillna() method of a pandas DataFrame is used to fill missing values, either with a value we provide or with a fill method. Let us first see its syntax and parameters to understand it better.

Syntax and Parameter Explanation for pandas.fillna( ) Function

Let’s take a few examples to understand how the parameter values affect the output.

Before we dive deeper into the fillna() method, let us first create a DataFrame to work with.

A. Create a DataFrame:

Create a DataFrame

Python Implementation:

Python Code
Output

B. Example — 1:

We can use the value parameter to specify the value with which we want to fill the missing elements. In the following example, we specify value=0, so all the missing elements are filled with 0.

Parameters Used:

value = 0

pd.fillna( ) with value=0

Python Implementation:

Python Code
Output
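Filling with a single value can be sketched like this, on a hypothetical numeric DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
data = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, np.nan]})

filled = data.fillna(value=0)   # every missing element becomes 0

print(filled)
```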

C. Example — 2:

We can also specify different values to fill the missing elements for different columns by using the value parameter. The following example demonstrates how we can perform this operation.

Parameters Used:

value = dictionary

pd.fillna( ) with a dictionary of values

Python Implementation:

Python Code
Output
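When value is a dictionary, its keys are column names and each column gets its own fill value. In this sketch (hypothetical data), the "Score" column is deliberately left out of the dictionary to show that unlisted columns are not touched:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in all three columns.
data = pd.DataFrame({
    "Person": ["Alice", None, "Cara"],
    "Age": [29, np.nan, 41],
    "Score": [np.nan, 7.5, np.nan],
})

# One fill value per column; "Score" is not listed and keeps its gaps.
filled = data.fillna(value={"Person": "Unknown", "Age": 0})

print(filled)
```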

D. Example — 3:

To fill the missing elements, we can use the method parameter. If we specify method="ffill", each gap is filled with the last valid observation. If we do not specify the axis value, the fill runs along axis=0, that is, down each column, so a missing element takes the value of the closest valid row above it. Please note that by default there is no limit on how far the last valid observation is propagated: if there are multiple consecutive missing elements, all of them are filled with the last valid observation.

Important Note:

If we specify method=”ffill” and the axis=0, and if the elements in the first row are missing, they will never get filled.

Parameters Used:

method = “ffill”

data.fillna( ) with method=”ffill”

Python Implementation:

Python Code
Output

E. Example — 4:

If we specify method=”pad”, it works the same way as method=”ffill”.

Parameters Used:

method = “pad”

pd.fillna( ) with method=”pad”

Python Implementation:

Python Code
Output

F. Example — 5:

By default, the missing elements are filled along axis=0, that is, down each column from one row to the next.

Important Note:

If we specify method=”ffill” and the axis=0, then if the elements in the first row are missing, they will never get filled.

Parameters Used:

method = “ffill”

axis = 0

pd.fillna( ) with method=”ffill” and axis=0

Python Implementation:

Python Code
Output
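The forward fill along axis=0 can be sketched as follows. We call DataFrame.ffill() here, which this article describes as equivalent to fillna() with method="ffill"; we make that choice because, to our knowledge, newer pandas releases deprecate the method keyword of fillna(). The data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "A" starts with a gap, and both columns end with gaps.
data = pd.DataFrame({
    "A": [np.nan, 2.0, np.nan, np.nan],
    "B": [1.0, np.nan, 3.0, np.nan],
})

filled = data.ffill()            # default direction: down each column (axis=0)
explicit = data.ffill(axis=0)    # same result with the axis spelled out

print(filled)
```

The first element of column "A" stays missing because there is no earlier row to copy from, while the two consecutive gaps at the end of "A" are both filled with 2.0.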

G. Example — 6:

If we instead want to fill the missing elements across the columns of each row, so that a gap takes the value from the column to its left, we can set axis=1.

Important Note:

If we specify method=”ffill” and the axis=1, then if the elements in the first column are missing, they will never get filled.

Parameters Used:

method = “ffill”

axis = 1

pd.fillna( ) with method=”ffill” and axis=1

Python Implementation:

Python Code
Output
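With axis=1, the last valid observation is carried from left to right within each row. A minimal sketch on hypothetical data, again using DataFrame.ffill() as the stand-in for method="ffill":

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 starts with a gap, row 1 has gaps after column "A".
data = pd.DataFrame({
    "A": [np.nan, 2.0],
    "B": [1.0, np.nan],
    "C": [np.nan, np.nan],
})

filled = data.ffill(axis=1)   # propagate values across columns, left to right

print(filled)
```

The gap in the first column of row 0 remains, since there is no column to its left.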

H. Example — 7:

To fill the missing elements, we can use the method parameter. If we specify method="bfill", each gap is filled with the next valid observation. If we do not specify the axis value, the fill runs along axis=0, that is, up each column, so a missing element takes the value of the closest valid row below it. Please note that by default there is no limit on how far the next valid observation is propagated: if there are multiple consecutive missing elements, all of them are filled with the next valid observation.

Important Note:

If we specify method=”bfill” and the axis=0, then if the elements in the last row are missing, they will never get filled.

Parameters Used:

method = “bfill”

axis = 0

pd.fillna( ) with method=”bfill”

Python Implementation:

Python Code
Output

I. Example — 8:

If we specify method=”backfill”, it works the same way as method=”bfill”.

Parameters Used:

method = “backfill”

pd.fillna( ) with method=”backfill”

Python Implementation:

Python Code
Output

J. Example — 9:

By default, the missing elements are filled along axis=0, that is, within each column, taking the value from the next valid row.

Important Note:

If we specify method=”bfill” and the axis=0, then if the elements in the last row are missing, they will never get filled.

Parameters Used:

method = “bfill”

axis = 0

pd.fillna( ) with method=”bfill” and axis=0

Python Implementation:

Python Code
Output

K. Example — 10:

If we instead want to fill the missing elements across the columns of each row, so that a gap takes the value from the column to its right, we can set axis=1.

Important Note:

If we specify method=”bfill” and the axis=1, then if the elements in the last column are missing, they will never get filled.

Parameters Used:

method = “bfill”

axis = 1

pd.fillna( ) with method=”bfill” and axis=1

Python Implementation:

Python Code
Output
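The backward-fill examples above can be sketched in both directions. We use DataFrame.bfill(), which this article describes as equivalent to fillna() with method="bfill"; the data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps at the start and at the end.
data = pd.DataFrame({
    "A": [np.nan, 2.0, np.nan],
    "B": [1.0, np.nan, np.nan],
})

down = data.bfill()           # axis=0: take the next valid value below
across = data.bfill(axis=1)   # axis=1: take the next valid value to the right

print(down)
print(across)
```

Gaps in the last row (for axis=0) or in the last column (for axis=1) stay missing, because there is no later value to copy from.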

L. Example — 11:

If we specify the limit parameter, it restricts the maximum number of consecutive missing values that the forward or backward fill methods will fill. In other words, if a gap of consecutive missing elements is longer than the number given by the limit parameter, the gap is only filled partially. Here we are using the forward fill method with axis=0 and a limit of 1 element.

Parameters Used:

method = “ffill”

axis = 0

limit = 1

pd.fillna( ) with method=”ffill” and axis=0 and limit=1

Python Implementation:

Python Code
Output

M. Example — 12:

In this example, we will use the fill forward method with axis=1 and a limit of 1 element.

Parameters Used:

method = “ffill”

axis = 1

limit = 1

pd.fillna( ) with method=”ffill” and axis=1 and limit=1

Python Implementation:

Python Code
Output

N. Example — 13:

In this example, we will use the backward fill method with axis=0 and a limit of 1 element.

Parameters Used:

method = “bfill”

axis = 0

limit = 1

pd.fillna( ) with method=”bfill” and axis=0 and limit=1

Python Implementation:

Python Code
Output

O. Example — 14:

In this example, we will use the backward fill method with axis=1 and a limit of 1 element.

Parameters Used:

method = “bfill”

axis = 1

limit = 1

pd.fillna( ) with method=”bfill” and axis=1 and limit=1

Python Implementation:

Python Code
Output
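The effect of limit is easiest to see on a single column with a gap of three consecutive missing values. The sketch below uses hypothetical data and the ffill() and bfill() methods, which accept the same limit parameter:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a gap of three missing values between 1.0 and 5.0.
data = pd.DataFrame({"A": [1.0, np.nan, np.nan, np.nan, 5.0]})

forward = data.ffill(limit=1)    # only the first element of the gap is filled
backward = data.bfill(limit=1)   # only the last element of the gap is filled

print(forward)
print(backward)
```

In both cases two of the three gaps remain, because only one consecutive missing value may be filled per run.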

P. Creating a DataFrame:

Creating a New DataFrame

Python Implementation:

Python Code
Output
Datatypes

Q. Example — 15:

We can use the downcast parameter to downcast the datatype if possible. The string value "infer" tries to downcast the result to a more specific type when no information is lost, for example from float64 to int64 when all values are whole numbers.

Parameters Used:

downcast = “infer”

pd.fillna( ) with value=0 and downcast=”infer”

Python Implementation:

Python Code
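To our knowledge, newer pandas releases deprecate the downcast keyword, so the sketch below does not call it. Instead, it shows an alternative that reaches the same end result described above (a float64 column becoming int64 after the gaps are filled) with an explicit conversion. The data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the NaN forces the column to be float64.
data = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

filled = data.fillna(0)             # still float64 after filling
as_int = filled.astype("int64")     # explicit conversion once no gaps remain

print(filled.dtypes)
print(as_int.dtypes)
```

The explicit astype() call is our substitution, not the article's original code; it is only safe here because every filled value is a whole number.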

R. Example — 16:

If we want the changes to take place in our original DataFrame, then we have to specify inplace=True as a parameter. Note that the call then does not return a new DataFrame. After execution, the original DataFrame itself holds the result of the fillna() call.

Parameters Used:

inplace = True

pd.fillna( ) with value=0 and inplace=True

Python Implementation:

Python Code
Output
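A minimal sketch of the in-place fill on a hypothetical DataFrame. As with the dropna() example, we only inspect the original object after the call:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column.
data = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})
missing_before = int(data.isna().sum().sum())

data.fillna(value=0, inplace=True)   # modifies `data` directly
missing_after = int(data.isna().sum().sum())

print(missing_before, missing_after)
```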

pd.DataFrame.bfill( ):

The pd.DataFrame.bfill() function works exactly the same way as the pd.DataFrame.fillna() function with the parameter method="bfill".

Let us take an example to understand it.

A. Create a DataFrame:

Create a DataFrame

Python Implementation:

Python Code
Output

B. Example — 1:

Parameters Used:

None

pd.bfill( )

Python Implementation:

Python Code
Output

pd.DataFrame.backfill( ):

The pd.DataFrame.backfill() function works the same way as the pd.DataFrame.fillna() function with the parameter method="backfill". Note that newer pandas releases deprecate backfill() in favor of bfill().

Syntax and Parameter Explanation for the pd.DataFrame.backfill( ) Function

Let us take an example to understand how it works.

A. Create a DataFrame:

Create a DataFrame

Python Implementation:

Python Code
Output

B. Example — 1:

Parameters Used:

None

pd.backfill( )

Python Implementation:

Python Code
Output

pd.DataFrame.ffill( ):

The pd.DataFrame.ffill() function works exactly the same way as the pd.DataFrame.fillna() function with the parameter method="ffill".

Syntax and Parameter Explanation for pd.DataFrame.ffill( ) Function

Let’s take an example to understand it better.

A. Create a DataFrame:

Create a DataFrame

Python Implementation:

Python Code
Output

B. Example — 1:

Parameters Used:

None

pd.ffill( ) Function

Python Implementation:

Python Code
Output

pd.DataFrame.pad( ):

The pd.DataFrame.pad() function works the same way as the pd.DataFrame.fillna() function with the parameter method="pad". Note that newer pandas releases deprecate pad() in favor of ffill().

Syntax and Parameter Explanation for pd.DataFrame.pad( ) Function

Let’s take an example to understand it better.

A. Create a DataFrame:

Create a DataFrame

Python Implementation:

Python Code
Output

B. Example — 1:

Parameters Used:

None

pd.pad( ) Function

Python Implementation:

Python Code
Output

Closing Remarks:

We hope you enjoyed reading this piece and learned something new about handling missing data.


DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of any company (directly or indirectly) associated with the author(s). This work does not intend to be a final product, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

All images are from the author(s) unless stated otherwise.

Published via Towards AI

Resources

References

  1. “Pandas.Dataframe.Backfill — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.backfill.html.
  2. “Pandas.Dataframe.Dropna — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html.
  3. “Pandas.Dataframe.Pad — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pad.html.
  4. “Pandas.Dataframe.Notnull — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.notnull.html.
  5. “Pandas.Dataframe.Notna — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.notna.html.
  6. “Pandas.Dataframe.Isnull — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isnull.html.
  7. “Pandas.Dataframe.Isna — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html.
  8. “Pandas.Dataframe.Fillna — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html.
  9. “Pandas.Dataframe.Ffill — Pandas 1.2.4 Documentation”. 2021. Pandas.Pydata.Org. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html.
