Exploratory Data Analysis: Baby Steps

Last Updated on November 18, 2020 by Editorial Team

Author(s): Swetha Lakshmanan

Steps in Data Exploration and Preprocessing:

Dataset:

Variable identification:

Classification of Variables
Unique ID, disbursed_amount, asset_cost, ltv, Current_pincode_ID, PERFORM_CNS.SCORE, PERFORM_CNS.SCORE.DESCRIPTION, PRI.NO.OF.ACCTS, PRI.ACTIVE.ACCTS, PRI.OVERDUE.ACCTS, PRI.CURRENT.BALANCE, PRI.SANCTIONED.AMOUNT, PRI.DISBURSED.AMOUNT, NO.OF_INQUIRIES
branch_id, supplier_id, manufacturer_id, Date.of.Birth, Employment.Type, DisbursalDate, State_ID, Employee_code_ID, MobileNo_Avl_Flag, Aadhar_flag, PAN_flag, VoterID_flag, Driving_flag, Passport_flag, loan_default

Importing Libraries:

#importing libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

Importing Dataset:

train = pd.read_csv("train.csv")

Identification of data types:

train.dtypes
A snippet of output for the above code

Size of the dataset:

train.shape

Statistical Summary of Numeric Variables:

train.describe()
A snippet of output for the above code
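
By default, describe() only summarises numeric columns. As a rough sketch of how the text columns could be summarised too, the snippet below uses a small made-up DataFrame (the `sample` name and its values are invented for illustration and are not taken from train.csv).

```python
import pandas as pd

# Hypothetical miniature stand-in for train.csv; all values are invented.
sample = pd.DataFrame({
    "disbursed_amount": [50578, 47145, 53278, 57513],
    "ltv": [89.55, 73.23, 89.63, 88.48],
    "Employment.Type": ["Salaried", "Self employed", "Self employed", "Salaried"],
})

# Default behaviour: count, mean, std, min, quartiles and max for numeric columns.
numeric_summary = sample.describe()

# Excluding numeric columns summarises the remaining (text) columns instead:
# count, number of unique values, most frequent value (top) and its frequency.
categorical_summary = sample.describe(exclude="number")

print(numeric_summary)
print(categorical_summary)
```

On the real dataset the same two calls would be made on train instead of sample.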

Non-Graphical Univariate Analysis:

To get the count of unique values:

train['loan_default'].value_counts()
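
Raw counts are easier to judge as proportions, especially for a target column such as loan_default. A minimal sketch, assuming a tiny invented series rather than the real column:

```python
import pandas as pd

# Hypothetical target column; the values are invented for illustration.
loan_default = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="loan_default")

# Absolute number of rows per class.
counts = loan_default.value_counts()

# normalize=True returns the share of each class instead of the count.
shares = loan_default.value_counts(normalize=True)

print(counts)
print(shares)
```

A strongly lopsided share is an early hint that the classes are imbalanced.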

To get the list & number of unique values:

train['branch_id'].nunique()
train['branch_id'].unique()
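
nunique() also works on a whole DataFrame, which gives a quick view of how many distinct values each column holds. The sketch below assumes a small invented frame; the column names mirror the dataset but the values do not come from it.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration.
sample = pd.DataFrame({
    "branch_id": [67, 67, 2, 100, 2],
    "manufacturer_id": [45, 45, 86, 45, 86],
    "loan_default": [0, 1, 0, 0, 1],
})

# Number of distinct values per column, largest first.
cardinality = sample.nunique().sort_values(ascending=False)

print(cardinality)
```

Columns with very few distinct values are usually candidates for treatment as categorical variables.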

Filtering based on Conditions:

train[(train['Employment.Type'] == "Salaried")]
A snippet of output for the above code
train[(train['Employment.Type'] == "Salaried") & (train['branch_id'] == 100)]
A snippet of output for the above code
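
The filters above can be extended with a few more building blocks: named boolean masks, isin() for matching several values, ~ for negation, and .loc for picking columns at the same time. This is only a sketch on an invented frame (the `sample` data and the mask names are made up for illustration).

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration.
sample = pd.DataFrame({
    "branch_id": [100, 100, 67, 2, 67],
    "Employment.Type": ["Salaried", "Self employed", "Salaried", "Salaried", "Self employed"],
    "loan_default": [0, 1, 0, 1, 0],
})

# Naming the masks keeps compound conditions readable.
is_salaried = sample["Employment.Type"] == "Salaried"
in_branches = sample["branch_id"].isin([100, 67])  # matches any value in the list

# & is AND, | is OR, ~ is NOT. When conditions are written inline,
# each comparison needs its own parentheses.
salaried_in_branches = sample[is_salaried & in_branches]
not_salaried = sample[~is_salaried]

# .loc filters rows and selects columns in a single step.
salaried_subset = sample.loc[is_salaried, ["branch_id", "loan_default"]]

print(salaried_in_branches)
print(not_salaried)
print(salaried_subset)
```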

Finding null values:

train.isnull().sum()
A snippet of output for the above code
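
Null counts are often easier to read as a percentage of rows, restricted to the columns that actually have gaps. A minimal sketch, assuming an invented frame with a few missing entries (the `sample` and `report` names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some missing entries; values are invented.
sample = pd.DataFrame({
    "Employment.Type": ["Salaried", None, "Self employed", None],
    "ltv": [89.55, 73.23, np.nan, 88.48],
    "branch_id": [67, 67, 2, 100],
})

# isnull() gives a True/False frame; sum() counts the True values per column,
# and mean() gives the fraction of True values per column.
missing_count = sample.isnull().sum()
missing_pct = sample.isnull().mean() * 100

# Keep only the columns that have at least one missing value.
report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
report = report[report["missing"] > 0].sort_values("missing", ascending=False)

print(report)
```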
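
Converting data types:

# parse the date of birth strings into datetime values
train['Date.of.Birth'] = pd.to_datetime(train['Date.of.Birth'])
# cast ltv to integer; if ltv holds decimals, the fractional part is dropped
train['ltv'] = train['ltv'].astype('int64')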
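
Two details of these conversions are worth checking on a small example. The sketch below assumes the dates are stored as day-month-two-digit-year strings; that format is an assumption for illustration, so check the real column before reusing the format string. The `sample` frame and its values are invented.

```python
import pandas as pd

# Hypothetical sample; the date format and all values are assumptions.
sample = pd.DataFrame({
    "Date.of.Birth": ["01-01-84", "31-07-85", "not a date"],
    "ltv": [89.55, 73.23, 88.48],
})

# An explicit format avoids day/month ambiguity.
# errors="coerce" turns entries that cannot be parsed into NaT instead of raising.
sample["Date.of.Birth"] = pd.to_datetime(
    sample["Date.of.Birth"], format="%d-%m-%y", errors="coerce"
)

# astype("int64") truncates the decimals; rounding first gives the nearest integer.
ltv_truncated = sample["ltv"].astype("int64")
ltv_rounded = sample["ltv"].round().astype("int64")

print(sample["Date.of.Birth"])
print(ltv_truncated.tolist(), ltv_rounded.tolist())
```

Whether to truncate, round, or keep ltv as a float depends on how much precision the later analysis needs.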

Graphical Univariate Analysis:

Histogram:

train['ltv'].hist(bins=25)
train['asset_cost'].hist(bins=200)

Box Plots:

train.boxplot(column='disbursed_amount')
train.boxplot(column='disbursed_amount', by='Employment.Type')
sns.boxplot(x=train['asset_cost'])
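
With default settings, the points a box plot draws beyond the whiskers are the values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The same rule can be applied numerically to list those points. A minimal sketch on an invented series (the values are not from the dataset):

```python
import pandas as pd

# Hypothetical asset costs; values are invented for illustration.
asset_cost = pd.Series(
    [61000, 65000, 66000, 68000, 70000, 72000, 75000, 160000],
    name="asset_cost",
)

# Quartiles and interquartile range.
q1 = asset_cost.quantile(0.25)
q3 = asset_cost.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a box plot would mark as outliers.
outliers = asset_cost[(asset_cost < lower_fence) | (asset_cost > upper_fence)]

print(lower_fence, upper_fence)
print(outliers)
```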

Count Plots:

sns.countplot(x=train['loan_default'])
sns.countplot(x=train['manufacturer_id'])
