Descriptive Analysis

Last Updated on June 11, 2024 by Editorial Team

Author(s): Greg Postalian-Yrausquin

Originally published on Towards AI.

Descriptive analytics is a big part of working with data, whether it comes before more complicated tasks such as machine learning or stands on its own.

In this example I illustrate the use of deep descriptive analysis, later using a regression model to measure variable importance, on a dataset of characteristics of the countries of origin of billionaires.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import klib
import geopandas as geo
import dabl
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

maindataset = pd.read_csv("Billionaires_Statistics_Dataset.csv")
maindataset

The first step is some cleanup of the data: converting values stored as formatted text into proper numbers, plus some other formatting.

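# Strip "$", commas and whitespace from the GDP strings in one regex pass
# (regex=True is explicit because recent pandas versions treat the pattern as a literal by default)
maindataset["gdp_country"] = maindataset["gdp_country"].astype(str).str.replace(r"[\$,\s]", "", regex=True)
# Convert to numbers; values that cannot be parsed become NaN
maindataset["gdp_country"] = pd.to_numeric(maindataset["gdp_country"], errors='coerce')
# Treat the selfMade flag as a categorical string
maindataset["selfMade"] = maindataset["selfMade"].astype(str)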
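
To see what this cleanup does in isolation, here is a minimal sketch on a few made-up values. The sample strings are hypothetical and only mimic the formatting of the gdp_country column; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values formatted like the gdp_country column (assumption)
raw_gdp = pd.Series(["$21,427,700,000,000 ", "$2,715,518,274,227 ", None])

# Remove "$", commas and whitespace, then coerce to numbers.
# Anything that still cannot be parsed (here, the missing value) becomes NaN.
cleaned = pd.to_numeric(
    raw_gdp.astype(str).str.replace(r"[\$,\s]", "", regex=True),
    errors="coerce",
)
print(cleaned)
```

Using errors="coerce" means bad values surface as NaN instead of raising, so it is worth checking how many NaN values appear after the conversion.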

Now I am ready to display the distributions of the numerical variables using the klib package (available on PyPI).

klib.dist_plot(maindataset['finalWorth'])
klib.dist_plot(maindataset['age'])
klib.dist_plot(maindataset['cpi_country'])
klib.dist_plot(maindataset['cpi_change_country'])
klib.dist_plot(maindataset['gdp_country'])
klib.dist_plot(maindataset['gross_tertiary_education_enrollment'])
klib.dist_plot(maindataset['gross_primary_education_enrollment_country'])
klib.dist_plot(maindataset['life_expectancy_country'])
klib.dist_plot(maindataset['tax_revenue_country_country'])
klib.dist_plot(maindataset['population_country'])

I like that these results came up, since they allow me to explain a couple of things.

Looking at these distributions alone, we could conclude that billionaires appear in countries that are either very rich or very poor, or either very small or very big, and not in countries with medium population or revenue.

That conclusion would be drawn from incomplete data, and so would probably be inaccurate. This dataset does not contain the information needed to obtain the distribution of these variables across all countries. In addition, many billionaires can come from the same country, which skews the data toward that country's values (we will see this when analyzing the categorical variables).

It is important not to jump to conclusions based on incomplete data. The value of our analysis is always limited by the availability of data, and we need to make that point clear to the stakeholders.

It is the responsibility of a good analyst to always describe the limits of the data and the assumptions made when writing their reports.
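
The skew caused by repeated countries can be illustrated with a small sketch. The data below is invented for the example: one row per billionaire, with the country attribute repeated on every row, as in the real dataset.

```python
import pandas as pd

# Hypothetical rows: one per billionaire, country attributes repeated (assumption)
people = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "C"],
    "population_country": [300, 300, 300, 300, 50, 5],
})

# Per-billionaire view: country A is counted four times
per_person_mean = people["population_country"].mean()

# Per-country view: each country contributes exactly one value
per_country = people.drop_duplicates(subset="country")
per_country_mean = per_country["population_country"].mean()

print(per_person_mean, per_country_mean)
```

The two means differ only because country A appears four times in the per-billionaire view. Neither view includes countries with no billionaires, so the limitation described above still applies.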

Moving on to the description of the categorical variables:

klib.cat_plot(maindataset[['category','country','selfMade','gender']], top=5, bottom=5)

This map shows the location of billionaires, with each point colored by the number of billionaires in that country.

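# Count billionaires at each country coordinate pair
locationdf = maindataset.groupby(["longitude_country","latitude_country"])["personName"].count()
locationdf = locationdf.reset_index(drop=False)

fig, ax = plt.subplots(figsize=(20,10))
# Note: geo.datasets was removed in GeoPandas 1.0; on newer versions, read a Natural Earth file obtained separately
countries = geo.read_file(geo.datasets.get_path("naturalearth_lowres"))
countries.plot(color="lightgrey", ax=ax)
# Color each point by the number of billionaires at that location
locationdf.plot(x="longitude_country", y="latitude_country", kind="scatter", c="personName", colormap="YlOrRd", title="Location of billionaires", ax=ax)
plt.show()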

It is also interesting to see the billionaires' total wealth as a percentage of the country's GDP.
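
One way such a ratio could be computed is sketched below on made-up numbers. It assumes that finalWorth is expressed in millions of US dollars and gdp_country in US dollars; check the dataset documentation before relying on those units. The aggregated column names are hypothetical.

```python
import pandas as pd

# Hypothetical data; finalWorth assumed in millions of USD, gdp_country in USD
df = pd.DataFrame({
    "country": ["A", "A", "B"],
    "finalWorth": [1000, 3000, 2000],
    "gdp_country": [2.0e11, 2.0e11, 1.0e10],
})

# Total wealth per country, keeping the (repeated) GDP value once
by_country = df.groupby("country").agg(
    total_worth_musd=("finalWorth", "sum"),
    gdp_usd=("gdp_country", "first"),
)

# Convert millions of USD to USD, then express as a percentage of GDP
by_country["worth_pct_gdp"] = (
    by_country["total_worth_musd"] * 1e6 / by_country["gdp_usd"] * 100
)
print(by_country)
```

Net worth is a stock and GDP is a yearly flow, so this ratio is only a rough scale comparison and not a share of the economy.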

Supposing the target is the billionaire's wealth, let's correlate the numerical features with it and see how they rank:

maindatasetC = maindataset[['finalWorth', 'category', 'age', 'country',
'selfMade', 'gender', 'cpi_country', 'cpi_change_country',
'gdp_country', 'gross_tertiary_education_enrollment',
'gross_primary_education_enrollment_country', 'life_expectancy_country',
'tax_revenue_country_country', 'total_tax_rate_country',
'population_country', 'latitude_country', 'longitude_country']]

klib.corr_plot(maindatasetC, target='finalWorth')

The fact that latitude ranks fifth shows that there is a difference between the global north and south. Again, this difference is probably even larger in real life, since here we are missing the data for the countries with NO billionaires.
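
For readers who want the numbers behind a plot like this, a rough equivalent can be sketched with plain pandas. This is not how klib computes its plot internally. It is a minimal illustration on synthetic data, where the column names are borrowed from the dataset and the values are randomly generated.

```python
import numpy as np
import pandas as pd

# Synthetic data: finalWorth depends on age, latitude is pure noise (assumption)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "latitude_country": rng.uniform(-40, 60, n),
})
demo["finalWorth"] = 50 * demo["age"] + rng.normal(0, 200, n)

# Pearson correlation of each feature with the target, ranked by absolute value
corr_with_target = (
    demo.drop(columns="finalWorth")
    .corrwith(demo["finalWorth"])
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Ranking by absolute value keeps strong negative correlations near the top, which matters for a variable like latitude where the sign depends on the hemisphere.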

Next, I make box plots of wealth for the categorical variables:

plt.figure(figsize=(15,15))
ax = sns.boxplot(data=maindatasetC, x="finalWorth", y="category")
plt.figure(figsize=(15,15))
ax = sns.boxplot(data=maindatasetC, x="finalWorth", y="country")

This shows the countries and industries where the billionaires, if they exist, tend to be richer.

It does not mean that people in those industries and countries are richer in general, nor does it say anything about the number of billionaires. Those are different data points.
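
The distinction between "how rich" and "how many" can be made explicit by putting both numbers side by side. A minimal sketch on invented values:

```python
import pandas as pd

# Hypothetical data: country A has many billionaires, country B has a single very rich one
demo = pd.DataFrame({
    "country": ["A"] * 5 + ["B"],
    "finalWorth": [1000, 1200, 1100, 1500, 1300, 9000],
})

# Count and median wealth answer two different questions
summary = demo.groupby("country")["finalWorth"].agg(["count", "median"])
print(summary)
```

Country B looks richest in a box plot, but its median rests on one person, which is why the count belongs next to the median in any report.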

To find the predictors of a billionaire's wealth, I can fit a random forest and then examine the weight of each variable in the model. First, though, we need to transform (encode) the columns so that the model can read them.

num_columns = ['age', 'cpi_country', 'cpi_change_country', 'gdp_country', 'gross_tertiary_education_enrollment', 'gross_primary_education_enrollment_country',
'life_expectancy_country', 'tax_revenue_country_country', 'total_tax_rate_country', 'population_country', 'latitude_country', 'longitude_country']
cat_columns = ['category', 'country', 'selfMade', 'gender']
cat_preprocessor = OneHotEncoder(handle_unknown="ignore")
num_preprocessor = StandardScaler()
preprocessor = ColumnTransformer(
[
("one-hot-encoder", cat_preprocessor, cat_columns),
("standard_scaler", num_preprocessor, num_columns),
])

train = maindatasetC.copy()
train = train.dropna()

X_train = train[['category', 'age', 'country',
'selfMade', 'gender', 'cpi_country', 'cpi_change_country',
'gdp_country', 'gross_tertiary_education_enrollment',
'gross_primary_education_enrollment_country', 'life_expectancy_country',
'tax_revenue_country_country', 'total_tax_rate_country',
'population_country', 'latitude_country', 'longitude_country']]
# Use a 1-D Series as the target; a single-column DataFrame makes fit() emit a DataConversionWarning
Y_train = train['finalWorth']

X = pd.DataFrame.sparse.from_spmatrix(preprocessor.fit_transform(X_train))
catnames = preprocessor.transformers_[0][1].get_feature_names_out(cat_columns).tolist()
numnames = preprocessor.transformers_[1][1].get_feature_names_out(num_columns).tolist()
featnames = catnames + numnames

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X, Y_train)
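
Before reading importances off the model, it can help to check that the forest predicts reasonably on data it has not seen, because importances from a model that fits poorly say little. The sketch below shows one way to do that with train_test_split and make_pipeline. It runs on synthetic data with two made-up columns, and the split size and random_state values are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (assumption): target driven by age only
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "gender": rng.choice(["F", "M"], n),
})
y = 100 * demo["age"] + rng.normal(0, 100, n)

preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ("standard_scaler", StandardScaler(), ["age"]),
])

# Chaining preprocessing and model means the encoders are fit on training rows only
model = make_pipeline(
    preprocessor, RandomForestRegressor(n_estimators=100, random_state=0)
)

X_tr, X_te, y_tr, y_te = train_test_split(demo, y, test_size=0.25, random_state=0)
model.fit(X_tr, y_tr)

# R^2 on the held-out rows
r2_test = model.score(X_te, y_te)
print(round(r2_test, 3))
```

A low or negative held-out score on the real data would be a signal to treat the importance plot with caution.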

Next, I plot the variable importance in the model:

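# Mean decrease in impurity (MDI) for each encoded feature
imp = rf.feature_importances_
imp = pd.Series(imp, index=featnames)
# Spread of each importance across the individual trees
std = pd.Series(np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0), index=featnames)
fig, ax = plt.subplots()
# Horizontal bars put the importance on the x-axis, so the error bars and label go there too
imp.plot(kind='barh', xerr=std, ax=ax, figsize=(15,15))
ax.set_title("Feature importances using MDI")
ax.set_xlabel("Mean decrease in impurity")
fig.tight_layout()
plt.show()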
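
Because the categorical columns are one-hot encoded, their importance is spread over many bars (one per country, one per category, and so on). A possible follow-up is to sum the importances back to the original columns. The sketch below uses hypothetical importance values and a hypothetical helper, source_column, which relies on the "<column>_<value>" naming that OneHotEncoder produces.

```python
import pandas as pd

# Hypothetical importances keyed by encoded feature name (assumption)
imp = pd.Series({
    "country_A": 0.10,
    "country_B": 0.05,
    "gender_F": 0.02,
    "gender_M": 0.03,
    "age": 0.80,
})
cat_columns = ["country", "gender"]

def source_column(feature_name):
    """Map an encoded feature name back to the column it came from."""
    for col in cat_columns:
        if feature_name.startswith(col + "_"):
            return col
    # Numerical features keep their own name
    return feature_name

# Sum the importances of all one-hot columns that share a source column
grouped = imp.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

The prefix match would misfire if a numerical column name started with a categorical column name followed by an underscore, so it is worth checking the column names first.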
