
Descriptive Analysis

Last Updated on June 11, 2024 by Editorial Team

Author(s): Greg Postalian-Yrausquin

Originally published on Towards AI.

Descriptive analytics is a big part of working with data. It can precede more complicated tasks, such as machine learning, or stand on its own.

In this example I illustrate a deep descriptive analysis, and later use a regression model to measure variable importance, on a dataset of billionaires and the characteristics of their countries of origin.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import klib
import geopandas as geo
import dabl
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

maindataset = pd.read_csv("Billionaires_Statistics_Dataset.csv")
maindataset

The first step is to clean up the data: convert the numerical values stored as text into actual numbers, and apply some other formatting.

# Strip "$", commas and whitespace in one pass; regex=True is explicit because
# str.replace defaults to literal matching in pandas >= 2.0
maindataset["gdp_country"] = maindataset["gdp_country"].astype(str).str.replace(r'[$,\s]', '', regex=True)
maindataset["gdp_country"] = pd.to_numeric(maindataset["gdp_country"], errors='coerce')
maindataset["selfMade"] = maindataset["selfMade"].astype(str)
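
As a small self-contained illustration of this kind of cleanup, the sketch below applies the same idea to a few made-up values. The sample strings and the helper name `clean_currency` are hypothetical and not part of the original dataset or notebook.

```python
import pandas as pd

# Hypothetical sample values mimicking currency-formatted text
raw_gdp = pd.Series(["$21,427,700,000,000 ", "$2,715,518,274,227 ", None])

def clean_currency(series: pd.Series) -> pd.Series:
    """Strip '$', commas and whitespace, then convert to numbers (NaN if unparseable)."""
    stripped = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

clean_gdp = clean_currency(raw_gdp)
print(clean_gdp)
```

Using `errors="coerce"` means anything that still cannot be parsed becomes NaN instead of raising, so it is worth checking how many missing values the conversion produces.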

Now I am ready to display the distributions of the numerical variables, using the klib package (available on PyPI).

klib.dist_plot(maindataset['finalWorth'])
klib.dist_plot(maindataset['age'])
klib.dist_plot(maindataset['cpi_country'])
klib.dist_plot(maindataset['cpi_change_country'])
klib.dist_plot(maindataset['gdp_country'])
klib.dist_plot(maindataset['gross_tertiary_education_enrollment'])
klib.dist_plot(maindataset['gross_primary_education_enrollment_country'])
klib.dist_plot(maindataset['life_expectancy_country'])
klib.dist_plot(maindataset['tax_revenue_country_country'])
klib.dist_plot(maindataset['population_country'])

I like that these results came up, since they allow me to explain a couple of things.

Looking only at these plots, we could conclude that billionaires appear in countries that are either very rich or very poor, or either very small or very large, and not in countries with medium population or revenue.

That conclusion would be drawn from incomplete data, and so it is probably inaccurate. This dataset does not contain the information needed to obtain the distribution across all countries, and many billionaires can come from the same country, which skews the data toward that country's values (we will see this when analyzing the categorical variables).

It is important not to jump to conclusions based on incomplete data. The value of our analysis is always limited by the availability of data, and we need to make that point clear to the stakeholders.

It is the responsibility of a good analyst to always describe the limits of the data and the assumptions made when writing their reports.
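
One way to see the skew described above is to compare a per-billionaire summary with a per-country one. The sketch below uses made-up rows (the values are hypothetical); only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy data: three billionaires from country A, one from country B
toy = pd.DataFrame({
    "personName": ["p1", "p2", "p3", "p4"],
    "country": ["A", "A", "A", "B"],
    "population_country": [300.0, 300.0, 300.0, 10.0],
})

# Per-billionaire view: country A is counted three times
per_person_mean = toy["population_country"].mean()

# Per-country view: each country contributes a single row
per_country = toy.drop_duplicates(subset="country")
per_country_mean = per_country["population_country"].mean()

print(per_person_mean, per_country_mean)
```

Even the per-country view only covers countries that have at least one billionaire, so it does not remove the limitation discussed above.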

Moving on to the description of the categorical variables:

klib.cat_plot(maindataset[['category','country','selfMade','gender']], top=5, bottom=5)

This map shows where billionaires are located, colored by the number of billionaires in each country.

locationdf = maindataset.groupby(["longitude_country","latitude_country"])["personName"].count()
locationdf = locationdf.reset_index(drop=False)

fig, ax = plt.subplots(figsize=(20,10))
# Note: geo.datasets was removed in geopandas 1.0, so this line needs geopandas < 1.0
countries = geo.read_file(geo.datasets.get_path("naturalearth_lowres"))
countries.plot(color="lightgrey", ax=ax)
locationdf.plot(x="longitude_country", y="latitude_country", kind="scatter", c="personName", colormap="YlOrRd", title="Location of billionaires", ax=ax)
plt.show()

It is also interesting to look at the same data as a percentage of each country's GDP.
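
A minimal sketch of how such a percentage could be computed is shown below. It runs on made-up rows and assumes that `finalWorth` is expressed in millions of USD and that `gdp_country` has already been cleaned into plain USD; both are assumptions to check against the dataset's documentation.

```python
import pandas as pd

# Hypothetical toy rows; finalWorth is ASSUMED to be in millions of USD
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "finalWorth": [1000.0, 3000.0, 2000.0],
    "gdp_country": [2.0e11, 2.0e11, 1.0e10],  # cleaned GDP in USD
})

# Total billionaire wealth per country, next to that country's GDP
by_country = toy.groupby("country").agg(
    total_worth_musd=("finalWorth", "sum"),
    gdp_usd=("gdp_country", "first"),
)

# Convert millions of USD to USD, then express as a percentage of GDP
by_country["worth_pct_gdp"] = (
    by_country["total_worth_musd"] * 1e6 / by_country["gdp_usd"] * 100
)
print(by_country.sort_values("worth_pct_gdp", ascending=False))
```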

Supposing the target is the billionaire's wealth, let's correlate the numerical features with it and see how they rank.

maindatasetC = maindataset[['finalWorth', 'category', 'age', 'country',
'selfMade', 'gender', 'cpi_country', 'cpi_change_country',
'gdp_country', 'gross_tertiary_education_enrollment',
'gross_primary_education_enrollment_country', 'life_expectancy_country',
'tax_revenue_country_country', 'total_tax_rate_country',
'population_country', 'latitude_country', 'longitude_country']]

klib.corr_plot(maindatasetC, target='finalWorth')

The fact that latitude ranks fifth suggests that there is a difference between the global north and south. Again, this difference is probably even larger in real life, since the dataset is missing the countries with no billionaires.
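
The ranking read off the plot can also be obtained as plain numbers. The sketch below is a pandas approximation on made-up data, assuming Pearson correlation; it is not klib's own implementation, and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a numeric target and a few features
toy = pd.DataFrame({
    "finalWorth": [1.0, 2.0, 3.0, 4.0, 5.0],
    "age": [2.0, 4.0, 6.0, 8.0, 10.0],              # moves with the target
    "latitude_country": [5.0, 4.0, 3.0, 2.0, 1.0],  # moves against the target
    "cpi_country": [1.0, 3.0, 2.0, 3.0, 1.0],       # unrelated to the target
    "country": ["A", "B", "A", "B", "A"],           # non-numeric, ignored
})

# Keep numeric columns only, then correlate each feature with the target
numeric = toy.select_dtypes("number")
corr_to_target = numeric.drop(columns="finalWorth").corrwith(numeric["finalWorth"])

# Rank by absolute strength so strong negative correlations are not buried
ranked = corr_to_target.sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting by absolute value matters here: a feature such as latitude can rank high because of a strong negative relationship.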

Next, I make box plots for the categorical variables:

plt.figure(figsize=(15,15))
ax = sns.boxplot(data=maindatasetC, x="finalWorth", y="category")
plt.figure(figsize=(15,15))
ax = sns.boxplot(data=maindatasetC, x="finalWorth", y="country")

This shows the countries and industries in which billionaires, where they exist, tend to be richer.

It does not mean that people in those industries and countries are richer in general, nor does it say anything about the number of billionaires. Those are different data points.
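
To make the distinction concrete, the sketch below puts the number of billionaires and their median wealth side by side for each country. The rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: many moderately rich billionaires in A, one very rich in B
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "finalWorth": [1000.0, 1200.0, 1100.0, 9000.0],
})

# Count and median answer two different questions about the same groups
summary = toy.groupby("country")["finalWorth"].agg(["count", "median"])
print(summary)
```

Country B has the higher median but the fewest billionaires, which is exactly the situation a box plot alone does not reveal.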

To find the predictors of a billionaire's wealth, I can fit a random forest and then examine the weight of each variable in the model. First, though, the columns need to be transformed (encoded) so that the model can read them.

num_columns = ['age', 'cpi_country', 'cpi_change_country', 'gdp_country', 'gross_tertiary_education_enrollment', 'gross_primary_education_enrollment_country',
'life_expectancy_country', 'tax_revenue_country_country', 'total_tax_rate_country', 'population_country', 'latitude_country', 'longitude_country']
cat_columns = ['category', 'country', 'selfMade', 'gender']
cat_preprocessor = OneHotEncoder(handle_unknown="ignore")
num_preprocessor = StandardScaler()
preprocessor = ColumnTransformer(
[
("one-hot-encoder", cat_preprocessor, cat_columns),
("standard_scaler", num_preprocessor, num_columns),
])

train = maindatasetC.copy()
train = train.dropna()

X_train = train[['category', 'age', 'country',
'selfMade', 'gender', 'cpi_country', 'cpi_change_country',
'gdp_country', 'gross_tertiary_education_enrollment',
'gross_primary_education_enrollment_country', 'life_expectancy_country',
'tax_revenue_country_country', 'total_tax_rate_country',
'population_country', 'latitude_country', 'longitude_country']]
# Use a 1-D target; a one-column DataFrame triggers a DataConversionWarning in rf.fit
Y_train = train['finalWorth']

X = pd.DataFrame.sparse.from_spmatrix(preprocessor.fit_transform(X_train))
catnames = preprocessor.transformers_[0][1].get_feature_names_out(cat_columns).tolist()
numnames = preprocessor.transformers_[1][1].get_feature_names_out(num_columns).tolist()
featnames = catnames + numnames

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X, Y_train)

Finally, I plot the variable importance in the model:

imp = rf.feature_importances_
imp = pd.Series(imp, index=featnames)
std = pd.Series(np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0), index=featnames)
fig, ax = plt.subplots()
imp.plot(kind='barh', yerr=std, ax=ax, figsize=(15,15))
ax.set_title("Feature importances using MDI")
ax.set_xlabel("Mean decrease in impurity")  # horizontal bars: the values are on the x-axis
fig.tight_layout()
plt.show()
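
Because one-hot encoding splits each categorical variable into many columns, the plot can be hard to read. One possible way to summarize it, sketched below, is to sum the importances of the dummy columns back into their source variable. The importance values and the helper `source_column` are hypothetical, and the sketch assumes encoded names of the form `column_category`, as returned by `get_feature_names_out`.

```python
import pandas as pd

cat_columns = ["category", "country", "selfMade", "gender"]

# Hypothetical importances, indexed the same way as featnames
imp = pd.Series({
    "country_A": 0.10, "country_B": 0.05,
    "gender_F": 0.02, "gender_M": 0.03,
    "age": 0.50, "gdp_country": 0.30,
})

def source_column(feature: str) -> str:
    """Map an encoded feature name back to the column it came from."""
    for col in cat_columns:
        if feature.startswith(col + "_"):
            return col
    return feature  # numerical columns keep their own name

# Group the index labels by their source column and add up the importances
grouped = imp.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

Summed importances should still be read with care, since impurity-based importances tend to favor features with many distinct values.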
