Handle Missing Data in Pyspark

Last Updated on July 24, 2023 by Editorial Team

This article walks through several ways to handle missing or null values in a dataset. A null represents an unknown, missing, or irrelevant value. In machine learning and data science work, nulls need to be handled deliberately, because dropping every incomplete record can leave an ML engineer with too little data to work with.

Let's look at the ways to handle missing data, or nulls, in a Spark DataFrame.

1. PySpark connection and application creation

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('NULL_Handling').getOrCreate()
print('NULL_Handling')

2. Import Dataset

null_df = spark.read.csv(r'D:\python_coding\pyspark_tutorial\Nulls.csv', header=True, inferSchema=True)
null_df.show()

3. Dropping Null values

#na functions drop rows that contain null values
#rows having at least one null value are dropped
null_df.na.drop().show()

#rows having fewer than 2 non-null values are dropped
null_df.na.drop(thresh=2).show()

4. Drop Nulls with ‘HOW’ argument

#drop rows having nulls using the how parameter
#records having at least one null will be dropped
null_df.na.drop(how='any').show()

#records having all nulls will be dropped
null_df.na.drop(how='all').show()

5. Drop nulls based on a column

#dropping rows that have a null in the Sales column
null_df.na.drop(subset=['Sales']).show()

#records having both Name and Sales as nulls are dropped
null_df.na.drop(how='all', subset=['Name', 'Sales']).show()

#records having either Name or Sales as null are dropped
null_df.na.drop(how='any', subset=['Name', 'Sales']).show()

6. Fill the Nulls

#filling null values in the dataset
#a string fill value is applied only to string columns
null_df.na.fill('NA').show()

#a numeric fill value is applied only to numeric columns
null_df.na.fill(0).show()

7. Filling nulls based on a column

#filling nulls in a specific column
null_df.na.fill('Name Missing', subset=['Name']).show()

#filling multiple columns with one value per column, matching each column's datatype
null_df.na.fill({'Name': 'Missing Name', 'Sales': 0}).show()

8. Filling null columns with another column value

#fill null values in the Name column with the Id value
from pyspark.sql.functions import when
#Id is cast to string so both branches of the expression have the same type
name_fill_df = null_df.select('Id', 'Name',
 when(null_df.Name.isNull(), null_df.Id.cast('string')).otherwise(null_df.Name).alias('Name_Filled'), 'Sales')
name_fill_df.show()

9. Filling nulls with mean or average

#filling a numeric column's nulls with the mean of that column
from pyspark.sql.functions import mean
mean_val = null_df.select(mean(null_df.Sales)).collect()
print(type(mean_val)) #mean_val is a list of Row objects
print('mean value of Sales', mean_val[0][0])
mean_sales = mean_val[0][0]
#now using the mean_sales value to fill the nulls in the Sales column
null_df.na.fill(mean_sales, subset=['Sales']).show()

Summary:

· Drop null values
· Drop nulls with the how argument
· Drop nulls with the subset argument
· Fill the null values
· Fill a null column with another column's value or with an average value

Hurray, we have now covered several ways to deal with null values in a Spark DataFrame.


Published via Towards AI


Author(s): Vivek Chaudhary
