Descriptive Analysis
Last Updated on June 11, 2024 by Editorial Team
Author(s): Greg Postalian-Yrausquin
Originally published on Towards AI.
Descriptive analytics is a big part of working with data. It can come before more complicated tasks, such as machine learning, or stand on its own.
In this example I illustrate a deeper descriptive analysis, later using a regression model to measure variable importance, on a dataset of billionaires and the characteristics of their countries of origin.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import klib
import geopandas as geo
import dabl
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
maindataset = pd.read_csv("Billionaires_Statistics_Dataset.csv")
maindataset
The first step is some cleanup: converting values that should be numeric into actual numbers, plus some other formatting.
# Strip dollar signs, commas and spaces in one pass (regex=True is needed for the character class)
maindataset["gdp_country"] = maindataset["gdp_country"].astype(str).str.replace(r'[$, ]', '', regex=True)
maindataset["gdp_country"] = pd.to_numeric(maindataset["gdp_country"], errors='coerce')
maindataset["selfMade"] = maindataset["selfMade"].astype(str)
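To see what this cleanup does, here is a minimal sketch on a few made-up values. The sample strings are invented for illustration and only assumed to resemble the formatting of the gdp_country column.

```python
import pandas as pd

# Hypothetical sample values, formatted like currency strings
raw = pd.Series(["$21,427,700,000,000 ", "$2,715,518,274,227 ", None])

# Remove dollar signs, commas and whitespace, then convert to numbers.
# Anything that still is not a number (here the missing value) becomes NaN.
cleaned = pd.to_numeric(
    raw.astype(str).str.replace(r"[$,\s]", "", regex=True),
    errors="coerce",
)
print(cleaned)
```

Using errors="coerce" means unparseable entries turn into missing values instead of raising an error, so it is worth checking how many NaN values appear after the conversion.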
Now I am ready to display the distributions of the numerical variables, using the klib package (klib on PyPI).
klib.dist_plot(maindataset['finalWorth'])
klib.dist_plot(maindataset['age'])
klib.dist_plot(maindataset['cpi_country'])
klib.dist_plot(maindataset['cpi_change_country'])
klib.dist_plot(maindataset['gdp_country'])
klib.dist_plot(maindataset['gross_tertiary_education_enrollment'])
klib.dist_plot(maindataset['gross_primary_education_enrollment_country'])
klib.dist_plot(maindataset['life_expectancy_country'])
klib.dist_plot(maindataset['tax_revenue_country_country'])
klib.dist_plot(maindataset['population_country'])
I like that these results came up, since they allow me to explain a couple of things.
Looking at these plots alone, we could conclude that billionaires appear in countries that are either very rich or very poor, or either very small or very large, and not in countries with medium population or revenue.
That conclusion would be drawn from incomplete data, and so is probably inaccurate. This dataset does not let us obtain the distribution of these variables across all countries, and many billionaires can come from the same country, which skews the data toward that country's values (we will see this when analyzing the categorical variables).
It is important not to jump to conclusions based on incomplete data. The value of our analysis is limited always by the availability of data, and we need to make that point clear to the stakeholders.
It is the responsibility of a good analyst to always describe the limits of the data and the assumptions made when writing their reports.
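The skew caused by repeated countries can be illustrated with a small sketch. The toy values below are invented; the only assumption carried over from the dataset is that country-level columns such as gdp_country repeat on every row for billionaires from the same country.

```python
import pandas as pd

# Toy data: three billionaires from country A, one from country B (invented values)
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "gdp_country": [100.0, 100.0, 100.0, 10.0],
})

# Per-billionaire view: country A's GDP is counted three times
per_person_mean = toy["gdp_country"].mean()

# Per-country view: keep one row per country before summarizing
per_country = toy.drop_duplicates(subset="country")
per_country_mean = per_country["gdp_country"].mean()

print(per_person_mean, per_country_mean)
```

The two means differ (77.5 versus 55.0) even though the underlying countries are the same. On the real data, one option would be to drop duplicate countries before plotting country-level variables, keeping in mind that this still only covers countries with at least one billionaire.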
Moving on to the description of the categorical variables:
klib.cat_plot(maindataset[['category','country','selfMade','gender']], top=5, bottom=5)
This map shows where billionaires are located, with each country colored by the number of billionaires it has (the code counts rows per country rather than summing wealth).
locationdf = maindataset.groupby(["longitude_country","latitude_country"])["personName"].count()
locationdf = locationdf.reset_index(drop=False)
fig, ax = plt.subplots(figsize=(20,10))
# Note: geo.datasets was removed in geopandas 1.0, so this call needs an older geopandas version
countries = geo.read_file(geo.datasets.get_path("naturalearth_lowres"))
countries.plot(color="lightgrey", ax=ax)
locationdf.plot(x="longitude_country", y="latitude_country", kind="scatter", c="personName", colormap="YlOrRd", title="Location of billionaires", ax=ax)
plt.show()
It is also interesting to see the same data as a percentage of the country's GDP.
Supposing the target is the billionaire's wealth, let's correlate the numerical features with it and see how they rank.
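One way such a share of GDP could be computed is sketched below on invented numbers. It assumes that finalWorth is expressed in millions of US dollars and gdp_country in US dollars; those units should be checked against the dataset documentation before relying on the result.

```python
import pandas as pd

# Toy data (invented). Assumed units: finalWorth in millions of USD, gdp_country in USD
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "finalWorth": [1000, 3000, 500],
    "gdp_country": [2e10, 2e10, 1e9],
})

# Total billionaire wealth per country, alongside that country's GDP
by_country = toy.groupby("country").agg(
    total_worth_musd=("finalWorth", "sum"),
    gdp_usd=("gdp_country", "first"),
)

# Convert millions to dollars before dividing by GDP, then express as a percentage
by_country["worth_pct_gdp"] = (
    by_country["total_worth_musd"] * 1e6 / by_country["gdp_usd"] * 100
)
print(by_country)
```

Note that wealth is a stock and GDP is a yearly flow, so the ratio is a rough scale comparison rather than a share of the economy.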
maindatasetC = maindataset[['finalWorth', 'category', 'age', 'country',
'selfMade', 'gender', 'cpi_country', 'cpi_change_country',
'gdp_country', 'gross_tertiary_education_enrollment',
'gross_primary_education_enrollment_country', 'life_expectancy_country',
'tax_revenue_country_country', 'total_tax_rate_country',
'population_country', 'latitude_country', 'longitude_country']]
klib.corr_plot(maindatasetC, target='finalWorth')
The fact that latitude ranks fifth suggests a difference between the global north and south. This difference is probably even larger in real life, since we are missing the data for countries with no billionaires.
Next, I make box plots for the categorical variables:
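The ranking itself can also be obtained as a plain table. The following is a rough sketch using pandas only, on invented numbers; it ranks features by the absolute value of their Pearson correlation with the target, which is an assumption about what "ranking" means here and not necessarily what the klib plot does internally.

```python
import pandas as pd

# Toy data with invented values
toy = pd.DataFrame({
    "finalWorth": [1.0, 2.0, 3.0, 4.0, 5.0],
    "age": [50, 55, 61, 64, 70],
    "latitude_country": [40, -10, 35, 5, 20],
})

# Correlation of every numeric column with the target, excluding the target itself
corr = toy.corr(numeric_only=True)["finalWorth"].drop("finalWorth")

# Rank by strength of association, ignoring the sign
ranking = corr.abs().sort_values(ascending=False)
print(ranking)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.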
plt.figure(figsize=(15,15))
ax = sns.boxplot(data=maindatasetC, x="finalWorth", y="category")
plt.figure(figsize=(15,15))
ax = sns.boxplot(data=maindatasetC, x="finalWorth", y="country")
This shows the countries and industries in which billionaires, where they exist, tend to be richer.
It does not mean that people in those industries and countries are richer in general, nor does it say anything about the number of billionaires. Those are different data points.
To find the predictors of a billionaire's wealth, I can use a random forest and then examine the weight of each variable in the model. First, though, we need to transform (encode) the columns so that the model can read them.
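To make the distinction between "how rich" and "how many" concrete, here is a small sketch on invented data that puts the median wealth and the number of billionaires per category side by side.

```python
import pandas as pd

# Toy data with invented values
toy = pd.DataFrame({
    "category": ["Tech", "Tech", "Tech", "Fashion"],
    "finalWorth": [2000, 3000, 4000, 9000],
})

# Median wealth and headcount per category are separate summaries
summary = toy.groupby("category")["finalWorth"].agg(["median", "count"])
summary = summary.sort_values("median", ascending=False)
print(summary)
```

In this toy case the category with the highest median has a single billionaire, while the category with the most billionaires has a lower median, which is exactly the pattern a box plot alone can hide.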
num_columns = ['age', 'cpi_country', 'cpi_change_country', 'gdp_country', 'gross_tertiary_education_enrollment', 'gross_primary_education_enrollment_country',
'life_expectancy_country', 'tax_revenue_country_country', 'total_tax_rate_country', 'population_country', 'latitude_country', 'longitude_country']
cat_columns = ['category', 'country', 'selfMade', 'gender']
cat_preprocessor = OneHotEncoder(handle_unknown="ignore")
num_preprocessor = StandardScaler()
preprocessor = ColumnTransformer(
[
("one-hot-encoder", cat_preprocessor, cat_columns),
("standard_scaler", num_preprocessor, num_columns),
])
train = maindatasetC.copy()
train = train.dropna()
X_train = train[['category', 'age', 'country',
'selfMade', 'gender', 'cpi_country', 'cpi_change_country',
'gdp_country', 'gross_tertiary_education_enrollment',
'gross_primary_education_enrollment_country', 'life_expectancy_country',
'tax_revenue_country_country', 'total_tax_rate_country',
'population_country', 'latitude_country', 'longitude_country']]
# Use a 1-D target; a single-column DataFrame makes scikit-learn emit a shape warning
Y_train = train['finalWorth']
X = pd.DataFrame.sparse.from_spmatrix(preprocessor.fit_transform(X_train))
catnames = preprocessor.transformers_[0][1].get_feature_names_out(cat_columns).tolist()
numnames = preprocessor.transformers_[1][1].get_feature_names_out(num_columns).tolist()
featnames = catnames + numnames
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X, Y_train)
Finally, plot the variable importance in the model:
imp = rf.feature_importances_
imp = pd.Series(imp, index=featnames)
std = pd.Series(np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0), index=featnames)
fig, ax = plt.subplots()
# Bars are horizontal, so the error bars and the value label belong on the x axis
imp.plot(kind='barh', xerr=std, ax=ax, figsize=(15,15))
ax.set_title("Feature importances using MDI")
ax.set_xlabel("Mean decrease in impurity")
fig.tight_layout()
plt.show()
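Because one-hot encoding spreads each categorical variable over many columns, it can help to sum the importances back to the original column. The sketch below does this on made-up importance values; source_column is a hypothetical helper introduced here, and the feature names are only assumed to follow the column_value pattern produced by OneHotEncoder.

```python
import pandas as pd

# Hypothetical importances, named the way the one-hot encoder names its outputs
imp = pd.Series({
    "category_Technology": 0.10,
    "category_Fashion": 0.05,
    "country_France": 0.20,
    "country_India": 0.05,
    "age": 0.40,
    "gdp_country": 0.20,
})
cat_columns = ["category", "country"]

def source_column(feature_name):
    # Map a one-hot feature back to the categorical column it came from;
    # numerical features keep their own name
    for col in cat_columns:
        if feature_name.startswith(col + "_"):
            return col
    return feature_name

# Group by the original column and add up the importances
grouped = imp.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

The prefix match would misassign a numerical column whose name happens to start with a categorical column name followed by an underscore, so the mapping should be checked against the actual column list. Impurity-based importances also tend to favor features with many distinct values, which is worth keeping in mind when reading the plot.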
Published via Towards AI