Towards AI

Exploratory Data Analysis: Baby Steps

Author(s): Swetha Lakshmanan


Steps in Data Exploration and Preprocessing:

Dataset:

Variable identification:

Classification of Variables
Numerical: Unique ID, disbursed_amount, asset_cost, ltv, Current_pincode_ID, PERFORM_CNS.SCORE, PERFORM_CNS.SCORE.DESCRIPTION, PRI.NO.OF.ACCTS, PRI.ACTIVE.ACCTS, PRI.OVERDUE.ACCTS, PRI.CURRENT.BALANCE, PRI.SANCTIONED.AMOUNT, PRI.DISBURSED.AMOUNT, NO.OF_INQUIRIES
Categorical: branch_id, supplier_id, manufacturer_id, Date.of.Birth, Employment.Type, DisbursalDate, State_ID, Employee_code_ID, MobileNo_Avl_Flag, Aadhar_flag, PAN_flag, VoterID_flag, Driving_flag, Passport_flag, loan_default

Importing Libraries:

#importing libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

Importing Dataset:

train = pd.read_csv("train.csv")

Identification of data types:

train.dtypes
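
The dtype pandas reports is not always the same as the kind of variable. As a rough sketch (the rows below are invented, only the column names come from the dataset), `select_dtypes` splits the columns by dtype, and a 0/1 column such as loan_default still lands on the numeric side even though it is categorical in meaning:

```python
import pandas as pd

# Toy rows; column names follow the article, values are invented.
train = pd.DataFrame({
    "disbursed_amount": [50578, 47145, 53278],
    "ltv": [89.55, 73.23, 89.63],
    "Employment.Type": ["Salaried", "Self employed", "Self employed"],
    "loan_default": [0, 1, 0],
})

# Columns stored as numbers versus everything else
numeric_cols = train.select_dtypes(include="number").columns.tolist()
other_cols = train.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

So the dtype listing is a starting point, and the numerical/categorical split still needs a manual look.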

Size of the dataset:

train.shape

Statistical Summary of Numeric Variables:

train.describe()
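
By default `describe()` only summarises numeric columns. A small sketch on invented rows (column names taken from the dataset) shows the difference when `include="all"` is passed:

```python
import pandas as pd

# Toy rows; column names come from the article, values are invented.
train = pd.DataFrame({
    "disbursed_amount": [50578, 47145, 53278, 57513],
    "asset_cost": [58400, 65550, 61360, 66113],
    "Employment.Type": ["Salaried", "Self employed", "Self employed", "Salaried"],
})

numeric_summary = train.describe()            # numeric columns only
full_summary = train.describe(include="all")  # also count/unique/top/freq for text columns

print(numeric_summary.loc[["mean", "50%"]])
print(full_summary["Employment.Type"])
```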

Non-Graphical Univariate Analysis:

To get the count of unique values:

train['loan_default'].value_counts()
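
For a target column like loan_default, the proportions are often more telling than the raw counts. A minimal sketch on made-up values, using the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Invented target values, only to illustrate the call
train = pd.DataFrame({"loan_default": [0, 0, 0, 1, 0, 1, 0, 0]})

counts = train["loan_default"].value_counts()                 # rows per class
shares = train["loan_default"].value_counts(normalize=True)   # fraction per class

print(counts)
print(shares)
```

A heavily lopsided split here is an early hint that the classes are imbalanced.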

To get the list & number of unique values:

train['branch_id'].nunique()
train['branch_id'].unique()
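
The same idea can be applied to every column at once. The sketch below (toy rows, column names from the dataset) uses `DataFrame.nunique()` to flag columns where every row has a different value, which usually means an identifier rather than a feature:

```python
import pandas as pd

# Toy rows; values are invented for illustration.
train = pd.DataFrame({
    "Unique ID": [420825, 537409, 417566, 624493],
    "branch_id": [67, 67, 78, 67],
    "loan_default": [0, 1, 0, 0],
})

cardinality = train.nunique()  # number of distinct values per column

# Columns with as many distinct values as rows behave like IDs
id_like = cardinality[cardinality == len(train)].index.tolist()

print(cardinality)
print(id_like)
```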

Filtering based on Conditions:

train[(train['Employment.Type'] == "Salaried")]
train[(train['Employment.Type'] == "Salaried") & (train['branch_id'] == 100)]
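
Each condition needs its own parentheses because `&` binds more tightly than `==`. As a sketch on invented rows, the mask can also be stored in a variable and reused with `.loc` to pick columns at the same time:

```python
import pandas as pd

# Toy rows; column names follow the article, values are invented.
train = pd.DataFrame({
    "Employment.Type": ["Salaried", "Self employed", "Salaried", "Salaried"],
    "branch_id": [100, 100, 67, 100],
    "loan_default": [0, 1, 0, 1],
})

# Boolean mask: one True/False per row
mask = (train["Employment.Type"] == "Salaried") & (train["branch_id"] == 100)

# Rows where the mask is True, restricted to two columns
subset = train.loc[mask, ["branch_id", "loan_default"]]

print(mask.sum())                      # number of matching rows
print(subset["loan_default"].mean())   # default rate inside the subset
```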

Finding null values:

train.isnull().sum()
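
Counts alone are hard to judge on a large table, so it can help to look at the percentage of missing values as well. A minimal sketch on invented rows (the gaps are placed by hand for illustration):

```python
import numpy as np
import pandas as pd

# Toy rows with hand-placed gaps; values are invented.
train = pd.DataFrame({
    "Employment.Type": ["Salaried", None, "Self employed", None],
    "ltv": [89.55, 73.23, np.nan, 88.48],
    "branch_id": [67, 67, 78, 100],
})

null_counts = train.isnull().sum()       # missing values per column
null_pct = train.isnull().mean() * 100   # same, as a percentage of rows

# Keep only the columns that actually have gaps, worst first
summary = pd.DataFrame({"missing": null_counts, "percent": null_pct})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)

print(summary)
```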
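
Converting data types:

train['Date.of.Birth'] = pd.to_datetime(train['Date.of.Birth'])
# astype('int64') drops the fractional part of ltv (it truncates, it does not round)
train['ltv'] = train['ltv'].astype('int64')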
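
Two details of these conversions are worth checking before relying on them. The sketch below uses invented values, and the ISO date format is an assumption made for the example, not the format of the actual file. It shows that `astype('int64')` truncates while `round()` followed by `astype` rounds, and that `errors="coerce"` turns unparseable dates into `NaT` instead of raising:

```python
import pandas as pd

# Invented ltv values
ltv = pd.Series([89.55, 73.23, 60.99])

truncated = ltv.astype("int64")          # fractional part is dropped
rounded = ltv.round().astype("int64")    # nearest whole number

# Invented date strings; the format is an assumption for this example
dob = pd.Series(["1984-01-01", "not a date"])
parsed = pd.to_datetime(dob, format="%Y-%m-%d", errors="coerce")

print(truncated.tolist())
print(rounded.tolist())
print(parsed)
```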

Graphical Univariate Analysis:

Histogram:

train['ltv'].hist(bins=25)
train['asset_cost'].hist(bins=200)
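
The `bins` argument sets how many equal-width intervals the value range is cut into. To see the numbers behind the bars, here is a small sketch with `numpy.histogram` on invented ltv values:

```python
import numpy as np

# Invented ltv values
ltv = np.array([60, 65, 70, 72, 75, 78, 80, 85, 88, 90])

# Three equal-width bins between the minimum and the maximum
counts, edges = np.histogram(ltv, bins=3)

print(edges)   # bin boundaries
print(counts)  # values falling in each bin (the last bin includes its right edge)
```

Too few bins hide the shape of the distribution and too many make it noisy, so it is worth trying a few values.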

Box Plots:

train.boxplot(column='disbursed_amount')
train.boxplot(column='disbursed_amount', by='Employment.Type')
sns.boxplot(x=train['asset_cost'])
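
With default settings, the points drawn beyond the whiskers of a box plot are the values lying more than 1.5 times the interquartile range outside the box. As a sketch on invented amounts, the same rule can be applied directly to list those values:

```python
import pandas as pd

# Invented disbursed amounts with one obviously large value
amounts = pd.Series([40000, 45000, 47000, 50000, 52000, 55000, 58000, 150000])

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Fences used by the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
print(outliers.tolist())
```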

Count Plots:

sns.countplot(x='loan_default', data=train)
sns.countplot(x='manufacturer_id', data=train)