
Publication


Handle Missing Data in Pyspark

Last Updated on July 24, 2023 by Editorial Team

Author(s): Vivek Chaudhary

Originally published on Towards AI.

Programming, Python

The objective of this article is to understand the various ways to handle missing or null values in a dataset. A null represents an unknown, missing, or irrelevant value. From a machine learning or data science perspective, it is essential to deal with nulls efficiently, because an ML engineer cannot afford to come up short on data.

Nulls

Let's check out various ways to handle missing data, or nulls, in a Spark DataFrame.

1. PySpark connection and application creation

import pyspark
from pyspark.sql import SparkSession

# create (or reuse) a SparkSession for the application
spark = SparkSession.builder.appName('NULL_Handling').getOrCreate()
print('NULL_Handling')

2. Import Dataset

# read the dataset; header and schema inference enabled
null_df = spark.read.csv(r'D:\python_coding\pyspark_tutorial\Nulls.csv',
                         header=True, inferSchema=True)
null_df.show()

3. Dropping Null values

# na.drop() removes rows that contain null values
# by default, a row with at least one null is dropped
null_df.na.drop().show()

# thresh=2 keeps only rows with at least 2 non-null values
null_df.na.drop(thresh=2).show()

4. Drop nulls with the 'how' argument

# drop rows using the how parameter
# how='any': a row with at least one null is dropped
null_df.na.drop(how='any').show()

# how='all': only rows where every value is null are dropped
null_df.na.drop(how='all').show()

5. Drop nulls on the basis of a column

# drop rows where the Sales column is null
null_df.na.drop(subset=['Sales']).show()

# drop rows where both Name and Sales are null
null_df.na.drop(how='all', subset=['Name', 'Sales']).show()

# drop rows where either Name or Sales is null
null_df.na.drop(how='any', subset=['Name', 'Sales']).show()

6. Fill the Nulls

# na.fill() replaces nulls; Spark matches the fill value's type,
# so a string value fills only string columns
null_df.na.fill('NA').show()

# an integer fill value applies only to numeric columns
null_df.na.fill(0).show()

7. Fill nulls on the basis of a column

# fill nulls in the Name column only
null_df.na.fill('Name Missing', subset=['Name']).show()

# fill multiple columns with a dict of column -> value
null_df.na.fill({'Name': 'Missing Name', 'Sales': 0}).show()

8. Fill null columns with another column's value

# fill null values in the Name column with the Id value
from pyspark.sql.functions import when

name_fill_df = null_df.select(
    'Id', 'Name',
    when(null_df.Name.isNull(), null_df.Id)
        .otherwise(null_df.Name).alias('Name_Filled'),
    'Sales',
)
name_fill_df.show()

9. Fill nulls with the mean or average

# fill a numeric column with the mean of that column
from pyspark.sql.functions import mean

mean_val = null_df.select(mean(null_df.Sales)).collect()
print(type(mean_val))  # mean_val is a list of Row objects
print('mean value of Sales:', mean_val[0][0])

mean_sales = mean_val[0][0]
# use mean_sales to fill the nulls in the Sales column
null_df.na.fill(mean_sales, subset=['Sales']).show()

Summary:

· Drop null values

· Drop nulls with the how argument

· Drop nulls with the subset argument

· Fill the null values

· Fill a null column with another column's value or with an average value

Hurray! Here we have discussed several ways to deal with null values in a Spark DataFrame.
