Descriptive Analysis

Last Updated on June 11, 2024 by Editorial Team

Author(s): Greg Postalian-Yrausquin

Originally published on Towards AI.

Descriptive analytics is a big part of working with data. It can come before more complicated tasks, such as machine learning, or stand on its own.

In this example I illustrate an in-depth descriptive analysis, later using a regression model to measure variable importance, on a dataset describing billionaires and characteristics of their countries of origin.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import klib
import geopandas as geo
import dabl
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

maindataset = pd.read_csv("Billionaires_Statistics_Dataset.csv")
maindataset

The first step is to clean up the data: strip the currency formatting so that numeric values parse as numbers, and fix a few column types.

# Strip "$", commas and spaces in one pass so the GDP strings can be parsed as numbers
maindataset["gdp_country"] = maindataset["gdp_country"].astype(str).str.replace(r"[$, ]", "", regex=True)
maindataset["gdp_country"] = pd.to_numeric(maindataset["gdp_country"], errors='coerce')
maindataset["selfMade"] = maindataset["selfMade"].astype(str)
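To make the cleanup step easier to check in isolation, here is a minimal sketch on a few made-up GDP strings. The helper name `clean_currency` and the sample values are my own illustrations, not part of the dataset.

```python
import pandas as pd


def clean_currency(series: pd.Series) -> pd.Series:
    """Strip '$', commas and whitespace, then convert to numbers (NaN if unparsable)."""
    stripped = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")


# Hypothetical sample values in the same style as the gdp_country column
sample = pd.Series(["$21,427,700,000,000 ", "$ 2,715,518,274,227", None, "n/a"])
cleaned = clean_currency(sample)
print(cleaned)
```

Missing or unparsable entries end up as NaN instead of raising an error, which is what `errors='coerce'` is for.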

Now I am ready to display the distributions of the numerical variables, using the klib package (available on PyPI).

klib.dist_plot(maindataset['finalWorth'])
klib.dist_plot(maindataset['age'])
klib.dist_plot(maindataset['cpi_country'])
klib.dist_plot(maindataset['cpi_change_country'])
klib.dist_plot(maindataset['gdp_country'])
klib.dist_plot(maindataset['gross_tertiary_education_enrollment'])
klib.dist_plot(maindataset['gross_primary_education_enrollment_country'])
klib.dist_plot(maindataset['life_expectancy_country'])
klib.dist_plot(maindataset['tax_revenue_country_country'])
klib.dist_plot(maindataset['population_country'])

These results are useful because they let me explain a couple of things.

Looking at the plots as they are, we could conclude that billionaires appear in countries that are either very rich or very poor, or either very small or very big, but not in countries with medium population or revenue.

That conclusion would be drawn from incomplete data, and so it is probably inaccurate. This dataset does not contain the information needed to obtain the distribution across all countries, and many billionaires can come from the same country, which skews the data toward that country's values (we will see that when analyzing the categorical variables).

It is important not to jump to conclusions based on incomplete data. The value of our analysis is always limited by the availability of data, and we need to make that point clear to the stakeholders.

It is the responsibility of a good analyst to always describe the limits of the data and the assumptions made when writing their reports.
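One way to see the skew described above is to compare a per-billionaire statistic with a per-country one. The sketch below uses a small made-up table (the names and GDP figures are hypothetical) in which four of six people come from the same country.

```python
import pandas as pd

# Hypothetical data: four billionaires share country "X"
people = pd.DataFrame({
    "personName": ["A", "B", "C", "D", "E", "F"],
    "country": ["X", "X", "X", "X", "Y", "Z"],
    "gdp_country": [20e12, 20e12, 20e12, 20e12, 3e12, 0.5e12],
})

# One row per billionaire: country X is counted four times
per_person_median = people["gdp_country"].median()

# One row per country: each country is counted once
per_country = people.drop_duplicates(subset="country")
per_country_median = per_country["gdp_country"].median()

print(f"median GDP, one row per billionaire: {per_person_median:.3g}")
print(f"median GDP, one row per country:     {per_country_median:.3g}")
```

The two medians differ a lot, so a distribution plotted on the billionaire-level rows describes where billionaires are, not what countries look like.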

Moving on to the description of the categorical variables:

klib.cat_plot(maindataset[['category','country','selfMade','gender']], top=5, bottom=5)

This map shows where billionaires are located, colored by the number of billionaires in each country.

# Count the billionaires at each country's coordinates
locationdf = maindataset.groupby(["longitude_country","latitude_country"])["personName"].count()
locationdf = locationdf.reset_index(drop=False)

fig, ax = plt.subplots(figsize=(20,10))
# Note: geo.datasets was deprecated in geopandas 0.14 and removed in 1.0;
# on newer versions, download the Natural Earth file and pass its path to geo.read_file
countries = geo.read_file(geo.datasets.get_path("naturalearth_lowres"))
countries.plot(color="lightgrey", ax=ax)
locationdf.plot(x="longitude_country", y="latitude_country", kind="scatter", c="personName", colormap="YlOrRd", title="Location of billionaires", ax=ax)
plt.show()

It is also interesting to look at the billionaires' combined wealth as a percentage of their country's GDP.
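A minimal sketch of that calculation, on made-up numbers, could look like the following. It assumes `finalWorth` is expressed in millions of US dollars and `gdp_country` in US dollars; please verify both units against the dataset documentation before relying on the result.

```python
import pandas as pd

# Hypothetical rows; finalWorth is ASSUMED to be in millions of USD, gdp_country in USD
toy = pd.DataFrame({
    "country": ["X", "X", "Y"],
    "finalWorth": [100_000, 50_000, 2_000],
    "gdp_country": [20e12, 20e12, 4e11],
})

by_country = toy.groupby("country").agg(
    total_worth_musd=("finalWorth", "sum"),  # combined wealth, millions of USD
    gdp_usd=("gdp_country", "first"),        # GDP is constant within a country
)

# Convert millions to dollars, then express as a percentage of GDP
by_country["worth_pct_gdp"] = by_country["total_worth_musd"] * 1e6 / by_country["gdp_usd"] * 100
print(by_country)
```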

Supposing the target is the billionaire’s wealth, let’s correlate the numerical features with it and see how they rank.

maindatasetC = maindataset[['finalWorth', 'category', 'age', 'country',
 'selfMade', 'gender', 'cpi_country', 'cpi_change_country',
 'gdp_country', 'gross_tertiary_education_enrollment',
 'gross_primary_education_enrollment_country', 'life_expectancy_country',
 'tax_revenue_country_country', 'total_tax_rate_country',
 'population_country', 'latitude_country', 'longitude_country']]

klib.corr_plot(maindatasetC, target='finalWorth')

The fact that latitude ranks fifth suggests that there is a difference between the global north and south. Again, this difference is probably even larger in real life, since here we are missing the data for the countries with no billionaires.
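For readers who want the ranking as numbers rather than a plot, the sketch below is a plain pandas approximation on a made-up table. It assumes Pearson correlation and is not a reimplementation of what klib does internally.

```python
import pandas as pd

# Hypothetical data: one perfectly correlated, one perfectly anti-correlated, one unrelated feature
toy = pd.DataFrame({
    "finalWorth": [1.0, 2.0, 3.0, 4.0, 5.0],
    "age": [2.0, 4.0, 6.0, 8.0, 10.0],
    "latitude_country": [5.0, 4.0, 3.0, 2.0, 1.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
    "country": list("abcde"),  # non-numeric, skipped by numeric_only=True
})

# Pearson correlation of every numeric column with the target
corr_to_target = toy.corr(numeric_only=True)["finalWorth"].drop("finalWorth")

# Rank by absolute value so strong negative correlations are not pushed to the bottom
order = corr_to_target.abs().sort_values(ascending=False).index
ranked = corr_to_target.reindex(order)
print(ranked)
```

Ranking by absolute value matters here: a feature such as latitude can be informative even when its correlation with wealth is negative.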

I am making box plots for the categorical values

plt.figure(figsize=(15,15))
ax = sns.boxplot(data=maindatasetC, x="finalWorth", y="category")
plt.figure(figsize=(15,15))
ax = sns.boxplot(data=maindatasetC, x="finalWorth", y="country")

This shows the countries and industries where the billionaires, if they exist, tend to be richer.

It does not mean that people in those industries and countries are richer in general, nor does it say anything about the number of billionaires. Those are different data points.
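To keep those data points apart, it can help to put the count and the typical wealth side by side. The following is a small sketch on made-up values; the column names in the summary are my own.

```python
import pandas as pd

# Hypothetical data: country X has many billionaires, country Y has a single very rich one
toy = pd.DataFrame({
    "country": ["X", "X", "X", "X", "Y"],
    "finalWorth": [1000, 1200, 1500, 2000, 30000],
})

summary = toy.groupby("country")["finalWorth"].agg(
    n_billionaires="count",  # how many billionaires the country has
    median_worth="median",   # how rich a typical one is
)
print(summary)
```

Country Y tops the box plot with a single person, while country X has four times as many billionaires with a much lower median.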

To find the predictors of a billionaire’s wealth, I can fit a random forest and then inspect the weight of each variable in the model. First, though, the columns need to be transformed (encoded) so that the model can read them.

num_columns = ['age', 'cpi_country', 'cpi_change_country', 'gdp_country', 'gross_tertiary_education_enrollment', 'gross_primary_education_enrollment_country',
 'life_expectancy_country', 'tax_revenue_country_country', 'total_tax_rate_country', 'population_country', 'latitude_country', 'longitude_country']
cat_columns = ['category', 'country', 'selfMade', 'gender']
cat_preprocessor = OneHotEncoder(handle_unknown="ignore")
num_preprocessor = StandardScaler()
preprocessor = ColumnTransformer(
 [
 ("one-hot-encoder", cat_preprocessor, cat_columns),
 ("standard_scaler", num_preprocessor, num_columns),
 ])

train = maindatasetC.copy()
train = train.dropna()

X_train = train[['category', 'age', 'country',
 'selfMade', 'gender', 'cpi_country', 'cpi_change_country',
 'gdp_country', 'gross_tertiary_education_enrollment',
 'gross_primary_education_enrollment_country', 'life_expectancy_country',
 'tax_revenue_country_country', 'total_tax_rate_country',
 'population_country', 'latitude_country', 'longitude_country']]
Y_train = train['finalWorth']  # 1-D target, avoids scikit-learn's DataConversionWarning

X = pd.DataFrame.sparse.from_spmatrix(preprocessor.fit_transform(X_train))
catnames = preprocessor.transformers_[0][1].get_feature_names_out(cat_columns).tolist()
numnames = preprocessor.transformers_[1][1].get_feature_names_out(num_columns).tolist()
featnames = catnames + numnames

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X, Y_train)

Finally, I plot the importance of each variable in the model.

imp = rf.feature_importances_
imp = pd.Series(imp, index=featnames)
# Spread of each feature's importance across the individual trees
std = pd.Series(np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0), index=featnames)
fig, ax = plt.subplots()
# Horizontal bars: the error bars and the value label belong on the x axis
imp.plot(kind='barh', xerr=std, ax=ax, figsize=(15,15))
ax.set_title("Feature importances using MDI")
ax.set_xlabel("Mean decrease in impurity")
fig.tight_layout()
plt.show()
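Because one-hot encoding splits each categorical column into many features, the chart can get long. One possible way to read it, sketched here on synthetic data, is to sum the importances back to the original columns. The toy data, the `origin` list and the `grouped` series are my own illustrations, and this summing step is a suggestion rather than part of the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data in which the target depends almost entirely on age
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "country": rng.choice(["X", "Y", "Z"], size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "age": rng.integers(30, 90, size=n).astype(float),
})
y = 100 * toy["age"] + rng.normal(0, 50, size=n)

cat_columns = ["country", "gender"]
num_columns = ["age"]
pre = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), cat_columns),
    ("standard_scaler", StandardScaler(), num_columns),
])
X = pre.fit_transform(toy)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Map every encoded feature back to the column it came from
# (output order follows the transformer order: one-hot columns first, then numeric)
encoder = pre.named_transformers_["one-hot-encoder"]
origin = []
for col, cats in zip(cat_columns, encoder.categories_):
    origin.extend([col] * len(cats))
origin.extend(num_columns)

# Sum the impurity-based importances per original column
grouped = (
    pd.Series(rf.feature_importances_, index=origin)
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)
print(grouped)
```

Keep in mind that impurity-based importances tend to favor high-cardinality features, so the grouped totals are a reading aid, not a causal ranking.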
