Exploratory Data Analysis (EDA) using Pyspark

Last Updated on July 24, 2023 by Editorial Team

Author(s): Vivek Chaudhary

Originally published on Towards AI.

Data Analytics, Python

The objective of this article is to perform analysis on a dataset and answer some questions to gain insight into the data. We will learn how to connect to an Oracle DB, create a Pyspark DataFrame, and perform different operations to understand various aspects of the dataset.
Exploratory Data Analysis (EDA) is the process of understanding a dataset by summarizing its main characteristics.
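
As a minimal sketch of what summarizing the main characteristics can look like in Pyspark (using a small hypothetical DataFrame, not the Oracle tables used in this article), describe() reports the count, mean, stddev, min, and max of each numeric column:

from pyspark.sql import SparkSession

#hypothetical toy data, only to illustrate describe(); the real datasets are loaded from Oracle below
spark = SparkSession.builder.appName('EDA_Sketch').getOrCreate()
toy_df = spark.createDataFrame([(1, 120.50), (2, 75.00), (3, 210.25)], ['CUST_ID', 'AMOUNT_SOLD'])
toy_df.describe().show()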

Data Analysis

As this is my first blog on EDA, I have tried to keep the content simple just to make sure it resonates with my readers. So, without wasting another minute, let's get started with the analysis.

1. Pyspark connection and Application creation

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Data_Analysis').getOrCreate()

2. Pyspark DB connection and Import Datasets

#Import Sales Data
sales_df = spark.read.format("jdbc").option("url", "jdbc:oracle:thin:sh/sh@//localhost:1521/orcl").option("dbtable", "sales").option("user", "sh").option("password", "sh").option("driver", "oracle.jdbc.driver.OracleDriver").load()

#Import Customer Data
cust_df = spark.read.format("jdbc").option("url", "jdbc:oracle:thin:sh/sh@//localhost:1521/orcl").option("dbtable", "customers").option("user", "sh").option("password", "sh").option("driver", "oracle.jdbc.driver.OracleDriver").load()

#Import Products Data
prod_df = spark.read.format("jdbc").option("url", "jdbc:oracle:thin:sh/sh@//localhost:1521/orcl").option("dbtable", "products").option("user", "sh").option("password", "sh").option("driver", "oracle.jdbc.driver.OracleDriver").load()

#Import Channels Data
chan_df = spark.read.format("jdbc").option("url", "jdbc:oracle:thin:sh/sh@//localhost:1521/orcl").option("dbtable", "channels").option("user", "sh").option("password", "sh").option("driver", "oracle.jdbc.driver.OracleDriver").load()

#Import Countries Data
country_df = spark.read.format("jdbc").option("url", "jdbc:oracle:thin:sh/sh@//localhost:1521/orcl").option("dbtable", "countries").option("user", "sh").option("password", "sh").option("driver", "oracle.jdbc.driver.OracleDriver").load()

The ojdbc6 driver jar is required for the Oracle DB connection.
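
If the driver jar is not already on Spark's classpath, one common way to supply it (an assumption on my part, not shown in the original setup) is to point the spark.jars config at the jar before the SparkSession is first created, for example when building the session in step 1; the path below is hypothetical:

from pyspark.sql import SparkSession

#hypothetical path to the Oracle JDBC driver jar; adjust to wherever ojdbc6.jar actually lives
spark = SparkSession.builder \
    .appName('Data_Analysis') \
    .config('spark.jars', '/path/to/ojdbc6.jar') \
    .getOrCreate()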

3. Display the data

#sales data
sales_df.show(10)
print('Total Records in dataset',sales_df.count())
Sales Data
#customer data
cust_df.show(5)
print('Total Records in dataset',cust_df.count())
Customer Data
#product data
prod_df.show(5)
print('Total Records in dataset',prod_df.count())
Products Data
#channels data
chan_df.show()
print('Total Records in dataset',chan_df.count())
Channels data
#Country data
country_df.show(5)
print('Total Records in dataset',country_df.count())
Country Data

4. Display schema and columns of DataFrame

#dataframe schema
sales_df.printSchema()
#display list of columns
sales_df.columns
Columns

5. Select and filter condition on DataFrame

#select some columns from product dataframe
prod_df.select('PROD_ID',
               'PROD_NAME',
               'PROD_DESC', 'PROD_STATUS',
               'PROD_LIST_PRICE',
               'PROD_MIN_PRICE', 'PROD_EFF_FROM',
               'PROD_EFF_TO',
               'PROD_VALID').show(7)
select query
#filter condition with selective columns
country_df.select('COUNTRY_ID',
                  'COUNTRY_ISO_CODE',
                  'COUNTRY_NAME').filter(country_df.COUNTRY_NAME == 'India').show()
filter

Typecast the COUNTRY_ID column to convert Decimal data to Integer data.

from pyspark.sql.types import IntegerType
country_df.select(country_df['COUNTRY_ID'].cast(IntegerType()).alias('COUNTRY_ID'),
                  'COUNTRY_ISO_CODE',
                  'COUNTRY_NAME').filter(country_df.COUNTRY_NAME == 'India').show()
Typecasting

6. GroupBy and Aggregation

Let's find out how much a customer spends in a year: over the span of four years, from 1998 to 2002, we will find out each customer's spending in each individual year.

from pyspark.sql.functions import dayofyear, year
from pyspark.sql.functions import round, col

sale_sum_df = sales_df.select('CUST_ID', 'TIME_ID', 'AMOUNT_SOLD')
cust_wise_df = sale_sum_df.groupBy(round('CUST_ID', 0).alias('CUST_ID'), year(sale_sum_df['TIME_ID']).alias('YEAR')).sum('AMOUNT_SOLD')
cust_wise_df.show(10)
Aggregated Cust Sales Data

7. Data Sorting

#Sort the records on the basis of CUST_ID
cust_wise_df.orderBy(cust_wise_df.CUST_ID).show(15)
Sorted Data

Let's check year-wise customer spending.

cust_wise_df.filter(cust_wise_df.CUST_ID==3261).show()
Year-wise spending

8. Number of Customer visits over time

sale_sum_df.groupBy(sale_sum_df['CUST_ID'].cast(IntegerType()).alias('CUST_ID')).count().show(10)
Cust visits

9. Total Sale of a Product over the span of 4 years


s_df = sales_df.select(round('PROD_ID', 0).alias('PROD_ID'), year(sales_df['TIME_ID']).alias('YEAR'), 'AMOUNT_SOLD')
#withColumnRenamed changes the column name
s_df = s_df.withColumnRenamed('PROD_ID', 'PROD_ID_S')
p_df = prod_df.select('PROD_ID', 'PROD_NAME')
p_df = p_df.withColumnRenamed('PROD_ID', 'PROD_ID_P')
#join the above two dataframes created
prod_sales_df=s_df.join(p_df,p_df.PROD_ID_P==s_df.PROD_ID_S,how='inner')
#perform groupby and aggregation to sum the sales amount productwise
prod_sales=prod_sales_df.groupBy('PROD_ID_S','PROD_NAME').sum('AMOUNT_SOLD')
prod_sales=prod_sales.select(col('PROD_ID_S').alias('PROD_ID'),'PROD_NAME',col('sum(AMOUNT_SOLD)').alias('TOTAL_SALES'))
prod_sales.show(10)
Product Wise Total sales
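
As a small follow-up sketch (not part of the original analysis), the same prod_sales DataFrame can be sorted by TOTAL_SALES in descending order to surface the top-selling products first:

#rank products by total sales, highest first
prod_sales.orderBy(col('TOTAL_SALES').desc()).show(10)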

10. Channel wise Total Sales

#find out which channel contributed most to the sales
c_df = chan_df.select(col('CHANNEL_ID').alias('CHANNEL_ID_C'), col('CHANNEL_DESC').alias('CHANNEL_NAME'))
sa_df = sales_df.select(col('CHANNEL_ID').alias('CHANNEL_ID_S'), 'AMOUNT_SOLD')
chan_sales_df = sa_df.join(c_df, c_df.CHANNEL_ID_C == sa_df.CHANNEL_ID_S, how='inner')
chan_sale = chan_sales_df.groupBy(round('CHANNEL_ID_C', 0).alias('CHANNEL_ID')).sum('AMOUNT_SOLD')
chan_top_sales = chan_sale.withColumnRenamed('sum(AMOUNT_SOLD)', 'TOT_AMOUNT')
chan_top_sales.orderBy('CHANNEL_ID').show()
Channel wise sales

Summary

· Pyspark DB connectivity

· Data display using show()

· Schema and columns of DataFrame

· Apply select and filter conditions on DataFrames

· GroupBy and Aggregation

· Column renaming

· Some Data Insights

Hurray! With this, we have completed Exploratory Data Analysis using Pyspark and tried to make the data look sensible. In upcoming articles on Data Analysis, I will share some more Pyspark functionality.


Published via Towards AI
