Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!

Publication

Data Analysis Project with PandasStep-by-Step Guide (Ted Talks Data)
Latest

Data Analysis Project with PandasStep-by-Step Guide (Ted Talks Data)

Last Updated on March 1, 2023 by Editorial Team

Author(s): Fares Sayah

Originally published on Towards AI.

Data Analysis Project with Pandas — Step-by-Step Guide (Ted Talks Data)

Data Analysis Project Guide — Use Pandas power to get valuable information from your data

Photo by Lukas Blazek on UnsplashThe Pandas library is a powerful tool for multiple phases of the data science workflow, including data cleaning, visualization, and exploratory data analysis. However, the size and complexity of the Pandas library make it challenging to discover the best way to accomplish any given task.In this article, you’ll use pandas to answer questions about multiple real-world datasets. Through each exercise, you’ll learn important data science skills as well as “best practices” for using pandas. By the end of the tutorial, you’ll be more fluent at using pandas to correctly and efficiently answer your own data science questions.

Table of Contents:

Exploratory Data Analysis is all about answering a specific question. In this post, we will try to answer the following questions:

  1. Which talks provoke the most online discussion?
  2. What were the “best” events in TED history to attend?
  3. Which occupations deliver the funniest TED talks on average?

Read the Data

The first step is to get the data and load it to memory. You can download data from this link: TED Talks. We are using Pandas for data manipulation. Matplotlib, Seaborn, and hvPlot for Data Visualization.

After reading the data successfully, we need to check our data. Pandas have a lot of function that allows us to discover our data effectively and quickly. The goal here is to find out more about the data and become a subject matter expert on the dataset you are working with.

In general, we need to know the following questions:

  1. What kind of data do we have, and how do we treat different types?
  2. What’s missing from the data, and how do you deal with it?
  3. How can you add, change or remove features to get more out of your data?

The Shape of the Data

Use ‘.shape’ to see DataFrame shape: rows, columns

(2550, 17)

Pandas Data Types

To check the types of your data, you can use .dtypes and it will return a pandas series of columns associated with there dtype. Pandas have 7 types in general:

  • object, int64, float64, category, and datetime64 are going to be covered in this article.
  • bool: True/False values. Can be a NumPy datetime64[ns].
  • timedelta[ns]: Differences between two datetimes.
comments               int64
description object
duration int64
event object
film_date int64
languages int64
main_speaker object
name object
num_speaker int64
published_date int64
ratings object
related_talks object
speaker_occupation object
tags object
title object
url object
views int64
dtype: object

Information About the Data

.info() method prints information about a DataFrame including the index dtype and columns, non-null values, and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 comments 2550 non-null int64
1 description 2550 non-null object
2 duration 2550 non-null int64
3 event 2550 non-null object
4 film_date 2550 non-null int64
5 languages 2550 non-null int64
6 main_speaker 2550 non-null object
7 name 2550 non-null object
8 num_speaker 2550 non-null int64
9 published_date 2550 non-null int64
10 ratings 2550 non-null object
11 related_talks 2550 non-null object
12 speaker_occupation 2544 non-null object
13 tags 2550 non-null object
14 title 2550 non-null object
15 url 2550 non-null object
16 views 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB

Missing Values

Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. In Machine Learning, we need to handle missing values. There are many types of missing values:

  • Standard Missing Values: These are missing values that Pandas can detect.
  • Non-Standard Missing Values: Sometimes it might be the case where there are missing values that have different formats.
  • Unexpected Missing Values: For example, if our feature is expected to be a string, but there’s a numeric type, then technically, this is also a missing value.

It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.

  • Sometimes you’ll simply want to delete those rows. Other times you’ll replace them.
  • A very common way to replace missing values is using a median (for objects) or mean (for numeric values).

But those are weak approaches, some times we need domain knowledge about the data and statistical study to fill in the missing values.

comments              0
description 0
duration 0
event 0
film_date 0
languages 0
main_speaker 0
name 0
num_speaker 0
published_date 0
ratings 0
related_talks 0
speaker_occupation 6
tags 0
title 0
url 0
views 0
dtype: int64

comments              0
description 0
duration 0
event 0
film_date 0
languages 0
main_speaker 0
name 0
num_speaker 0
published_date 0
ratings 0
related_talks 0
speaker_occupation 0
tags 0
title 0
url 0
views 0
dtype: int64

Descriptive Statistics about the Data

.describe() generates descriptive statistics. Descriptive statistics summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Which talks provoke the most online discussion?

png

From the heatmap, the number of views correlates well with language and comments.

Limitations of this approach

  1. Sub comments (nested comments).
  2. How long has it been online?

To correct this behavior, one solution is to normalize comments by views.

Lessons:

  1. Consider the limitations and biases of your data when analyzing it
  2. Make your results understandable

Visualize the distribution of comments

A line plot is not appropriate here (use it to measure something over time)

Check how many observations we removed from the plot:

(32, 19)

After filtering the data we lose only a small amount of data. This process is called excluding outliers.

Lessons:

  1. Choose your plot type based on the question you are answering and the data type(s) you are working with
  2. Use Pandas one-liners to iterate through plots quickly
  3. Try modifying the plot defaults
  4. Creating plots involves decision-making

Plot the number of talks that took place each year

The event column does not always include the year

1280    TEDGlobal 2012
163 TED2005
1593 TEDGlobal 2013
2027 TEDWomen 2015
197 TED2008
1286 TEDxAustin
1115 TEDxBrussels
1381 TEDxCHUV
159 TED2007
193 TED2007
Name: event, dtype: object

We can’t rely on event Feature, because most of the events don't have a year:

0    1140825600
1 1140825600
2 1140739200
3 1140912000
4 1140566400
Name: film_date, dtype: int64

Results don’t look right:

0   1970-01-01 00:00:01.140825600
1 1970-01-01 00:00:01.140825600
2 1970-01-01 00:00:01.140739200
3 1970-01-01 00:00:01.140912000
4 1970-01-01 00:00:01.140566400
Name: film_date, dtype: datetime64[ns]

Now the results look right:

0   2006-02-25
1 2006-02-25
2 2006-02-24
3 2006-02-26
4 2006-02-22
Name: film_date, dtype: datetime64[ns]

The new column uses the DateTime datatype (this was an automatic conversion):

comments                       int64
description object
duration int64
event object
film_date int64
languages int64
main_speaker object
name object
num_speaker int64
published_date int64
ratings object
related_talks object
speaker_occupation object
tags object
title object
url object
views int64
comments_per_view float64
views_per_comment float64
film_datetime datetime64[ns]
dtype: object

DateTime columns have convenient attributes under the dt namespace:

0    2006
1 2006
2 2006
3 2006
4 2006
Name: film_datetime, dtype: int64

Count the number of talks each year using value_counts():

2013    270
2011 270
2010 267
2012 267
2016 246
2015 239
2014 237
2009 232
2007 114
2017 98
2008 84
2005 66
2006 50
2003 33
2004 33
2002 27
1998 6
2001 5
1984 1
1983 1
1991 1
1994 1
1990 1
1972 1
Name: film_datetime, dtype: int64

Points are plotted and connected in the order you give them to Pandas:

Need to sort the index before plotting:

Timestamp('2017-08-27 00:00:00')

Lessons:

  1. Read the documentation
  2. Use the DateTime data type for dates and times
  3. Check your work as you go
  4. Consider excluding data if it might not be relevant

What were the “best” events in TED history to attend?

355
274

Count the number of talks (great if you value variety, but they may not be great talks):

TED_2009    199
TED_2010 184
TED_2011 184
TED_2013 183
TED_2012 178
Name: event, dtype: int64

Use views as a proxy for “quality of talk”:

event
AORN Congress 149818.0
Arbejdsglaede Live 971594.0
BBC TV 521974.0
Bowery Poetry Club 676741.0
Business Innovation Factory 304086.0
Name: views, dtype: float64

Find the largest values, but we don’t know how many talks are being averaged:

event
TEDxNorrkoping 6569493.0
TEDxCreativeCoast 8444981.0
TEDxBloomington 9484259.5
TEDxHouston 16140250.5
TEDxPuget Sound 34309432.0
Name: views, dtype: float64

Show the number of talks along with the mean (events with the highest means had only 1 or 2 talks):

Calculate the total views per event:

Lessons:

  1. Think creatively about how you can use the data you have to answer your question
  2. Watch out for small sample sizes

Unpack the rating data

Previously, users could tag talks on the TED website (funny, inspiring, confusing, etc.)

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3 [{'id': 3, 'name': 'Courageous', 'count': 760}...
4 [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object

Two ways to examine the rating data for the first talk:

"[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"

str

Convert this into something useful using Python’s ast module (Abstract Syntax Tree):

list
[{'id': 7, 'name': 'Funny', 'count': 19645},
{'id': 1, 'name': 'Beautiful', 'count': 4573},
{'id': 9, 'name': 'Ingenious', 'count': 6073},
{'id': 3, 'name': 'Courageous', 'count': 3253},
{'id': 11, 'name': 'Longwinded', 'count': 387},
{'id': 2, 'name': 'Confusing', 'count': 242},
{'id': 8, 'name': 'Informative', 'count': 7346},
{'id': 22, 'name': 'Fascinating', 'count': 10581},
{'id': 21, 'name': 'Unconvincing', 'count': 300},
{'id': 24, 'name': 'Persuasive', 'count': 10704},
{'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
{'id': 25, 'name': 'OK', 'count': 1174},
{'id': 26, 'name': 'Obnoxious', 'count': 209},
{'id': 10, 'name': 'Inspiring', 'count': 24924}]
list

Define a function to convert an element in the rating Series from
string to list:

Test the function:

[{'id': 7, 'name': 'Funny', 'count': 19645},
{'id': 1, 'name': 'Beautiful', 'count': 4573},
{'id': 9, 'name': 'Ingenious', 'count': 6073},
{'id': 3, 'name': 'Courageous', 'count': 3253},
{'id': 11, 'name': 'Longwinded', 'count': 387},
{'id': 2, 'name': 'Confusing', 'count': 242},
{'id': 8, 'name': 'Informative', 'count': 7346},
{'id': 22, 'name': 'Fascinating', 'count': 10581},
{'id': 21, 'name': 'Unconvincing', 'count': 300},
{'id': 24, 'name': 'Persuasive', 'count': 10704},
{'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
{'id': 25, 'name': 'OK', 'count': 1174},
{'id': 26, 'name': 'Obnoxious', 'count': 209},
{'id': 10, 'name': 'Inspiring', 'count': 24924}]

Series apply method applies a function to every element in a Series and returns a Series:

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3 [{'id': 3, 'name': 'Courageous', 'count': 760}...
4 [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object

lambda is a shorter alternative:

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3 [{'id': 3, 'name': 'Courageous', 'count': 760}...
4 [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object

An even shorter alternative is to apply the function directly (without lambda):

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3 [{'id': 3, 'name': 'Courageous', 'count': 760}...
4 [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object

[{'id': 7, 'name': 'Funny', 'count': 19645},
{'id': 1, 'name': 'Beautiful', 'count': 4573},
{'id': 9, 'name': 'Ingenious', 'count': 6073},
{'id': 3, 'name': 'Courageous', 'count': 3253},
{'id': 11, 'name': 'Longwinded', 'count': 387},
{'id': 2, 'name': 'Confusing', 'count': 242},
{'id': 8, 'name': 'Informative', 'count': 7346},
{'id': 22, 'name': 'Fascinating', 'count': 10581},
{'id': 21, 'name': 'Unconvincing', 'count': 300},
{'id': 24, 'name': 'Persuasive', 'count': 10704},
{'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
{'id': 25, 'name': 'OK', 'count': 1174},
{'id': 26, 'name': 'Obnoxious', 'count': 209},
{'id': 10, 'name': 'Inspiring', 'count': 24924}]

list
dtype('O')
comments                       int64
description object
duration int64
event object
film_date int64
languages int64
main_speaker object
name object
num_speaker int64
published_date int64
ratings object
related_talks object
speaker_occupation object
tags object
title object
url object
views int64
comments_per_view float64
views_per_comment float64
film_datetime datetime64[ns]
ratings_list object
dtype: object

Lessons:

  1. Pay attention to data types in Pandas
  2. Use apply any time it is necessary

Count the total number of ratings received by each talk

Bonus exercises:

  • For each talk, calculate the percentage of ratings that were negative
  • For each talk, calculate the average number of ratings it received per day since it was published
[{'id': 7, 'name': 'Funny', 'count': 19645},
{'id': 1, 'name': 'Beautiful', 'count': 4573},
{'id': 9, 'name': 'Ingenious', 'count': 6073},
{'id': 3, 'name': 'Courageous', 'count': 3253},
{'id': 11, 'name': 'Longwinded', 'count': 387},
{'id': 2, 'name': 'Confusing', 'count': 242},
{'id': 8, 'name': 'Informative', 'count': 7346},
{'id': 22, 'name': 'Fascinating', 'count': 10581},
{'id': 21, 'name': 'Unconvincing', 'count': 300},
{'id': 24, 'name': 'Persuasive', 'count': 10704},
{'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
{'id': 25, 'name': 'OK', 'count': 1174},
{'id': 26, 'name': 'Obnoxious', 'count': 209},
{'id': 10, 'name': 'Inspiring', 'count': 24924}]

{'id': 7, 'name': 'Funny', 'count': 19645}

19645

93850

[{'id': 7, 'name': 'Funny', 'count': 544},
{'id': 3, 'name': 'Courageous', 'count': 139},
{'id': 2, 'name': 'Confusing', 'count': 62},
{'id': 1, 'name': 'Beautiful', 'count': 58},
{'id': 21, 'name': 'Unconvincing', 'count': 258},
{'id': 11, 'name': 'Longwinded', 'count': 113},
{'id': 8, 'name': 'Informative', 'count': 443},
{'id': 10, 'name': 'Inspiring', 'count': 413},
{'id': 22, 'name': 'Fascinating', 'count': 132},
{'id': 9, 'name': 'Ingenious', 'count': 56},
{'id': 24, 'name': 'Persuasive', 'count': 268},
{'id': 23, 'name': 'Jaw-dropping', 'count': 116},
{'id': 26, 'name': 'Obnoxious', 'count': 131},
{'id': 25, 'name': 'OK', 'count': 203}]

2936

0    93850
1 2936
2 2824
3 3728
4 25620
Name: ratings_list, dtype: int64

93850

0    93850
1 2936
2 2824
3 3728
4 25620
Name: ratings_list, dtype: int64

93850

0    93850
1 2936
2 2824
3 3728
4 25620
Name: ratings_list, dtype: int64

count     2550.000000
mean 2436.408235
std 4226.795631
min 68.000000
25% 870.750000
50% 1452.500000
75% 2506.750000
max 93850.000000
Name: num_ratings, dtype: float64

Lessons:

  1. Write your code in small chunks, and check your work as you go
  2. Lambda is best for simple functions

Which occupations deliver the funniest TED talks on average?

Bonus exercises:

  • for each talk, calculate the most frequent rating
  • for each talk, clean the occupation data so that there’s only one occupation per talk

Step 1: Count the number of funny ratings

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1 [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2 [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3 [{'id': 3, 'name': 'Courageous', 'count': 760}...
4 [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings_list, dtype: object

True    2550
Name: ratings, dtype: int64

[{'id': 3, 'name': 'Courageous', 'count': 760},
{'id': 1, 'name': 'Beautiful', 'count': 291},
{'id': 2, 'name': 'Confusing', 'count': 32},
{'id': 7, 'name': 'Funny', 'count': 59},
{'id': 9, 'name': 'Ingenious', 'count': 105},
{'id': 21, 'name': 'Unconvincing', 'count': 36},
{'id': 11, 'name': 'Longwinded', 'count': 53},
{'id': 8, 'name': 'Informative', 'count': 380},
{'id': 10, 'name': 'Inspiring', 'count': 1070},
{'id': 22, 'name': 'Fascinating', 'count': 132},
{'id': 24, 'name': 'Persuasive', 'count': 460},
{'id': 23, 'name': 'Jaw-dropping', 'count': 230},
{'id': 26, 'name': 'Obnoxious', 'count': 35},
{'id': 25, 'name': 'OK', 'count': 85}]

59

0    19645
1 544
2 964
3 59
4 1390
Name: funny_ratings, dtype: int64

0

Step 2: Calculate the percentage of funny ratings

1849                       Science humorist
337 Comedian
124 Performance poet, multimedia artist
315 Expert
1168 Social energy entrepreneur
1468 Ornithologist
595 Comedian, voice artist
1534 Cartoon editor
97 Satirist
2297 Actor, writer
568 Comedian
675 Data scientist
21 Humorist, web artist
194 Jugglers
2273 Comedian and writer
2114 Comedian and writer
173 Investor
747 Comedian
1398 Comedian
685 Actor, comedian, playwright
Name: speaker_occupation, dtype: object

2549               Game designer
1612 Biologist
612 Sculptor
998 Penguin expert
593 Engineer
284 Space activist
1041 Biomedical engineer
1618 Spinal cord researcher
2132 Computational geneticist
442 Sculptor
426 Author, thinker
458 Educator
2437 Environmental engineer
1491 Photojournalist
1893 Forensic anthropologist
783 Marine biologist
195 Kenyan MP
772 HIV/AIDS fighter
788 Building activist
936 Neuroengineer
Name: speaker_occupation, dtype: object

Step 3: Analyze the funny rate by occupation

speaker_occupation
Comedian 0.512457
Actor, writer 0.515152
Actor, comedian, playwright 0.558107
Jugglers 0.566828
Comedian and writer 0.602085
Name: funny_rate, dtype: float64

count       2550
unique 1458
top Writer
freq 51
Name: speaker_occupation, dtype: object

Step 4: Focus on occupations that are well-represented in the data

Writer                                51
Designer 34
Artist 34
Journalist 33
Entrepreneur 31
..
Neuroanatomist 1
Foreign policy strategist 1
Mythologist 1
Designer, illustrator, typographer 1
Computational geneticist 1
Name: speaker_occupation, Length: 1458, dtype: int64

pandas.core.series.Series

Index(['Writer', 'Designer', 'Artist', 'Journalist', 'Entrepreneur',
'Architect', 'Inventor', 'Psychologist', 'Photographer', 'Filmmaker', 'Educator', 'Economist', 'Author', 'Neuroscientist', 'Philosopher', 'Roboticist', 'Biologist', 'Physicist', 'Marine biologist', 'Musician', 'Technologist', 'Activist', 'Global health expert; data visionary', 'Astronomer', 'Poet', 'Oceanographer', 'Graphic designer', 'Philanthropist', 'Singer/songwriter', 'Behavioral economist',
'Historian', 'Social psychologist', 'Novelist', 'Futurist', 'Engineer',
'Computer scientist', 'Astrophysicist', 'Mathematician', 'Comedian',
'Photojournalist', 'Reporter', 'Evolutionary biologist',
'Techno-illusionist', 'Writer, activist', 'Legal activist',
'Social entrepreneur', 'Performance poet, multimedia artist',
'Singer-songwriter', 'Climate advocate', 'Producer', 'Paleontologist',
'Environmentalist, futurist', 'Science writer', 'Sound consultant',
'Investor and advocate for moral leadership', 'Game designer',
'Cartoonist', 'Tech visionary', 'Sculptor', 'Social Media Theorist',
'Surgeon', 'Data scientist', 'Physician', 'Researcher', 'Chemist',
'Musician, activist', 'Violinist', 'Chef'],
dtype='object')

Step 5: Re-analyze the funny rate by occupation (for top occupations only)

(792, 24)

Lessons:

  1. Check your assumptions about your data
  2. Check whether your results are reasonable
  3. Take advantage of the fact that pandas operations often output a DataFrame or a Series
  4. Watch out for small sample sizes
  5. Consider the impact of missing data
  6. Data scientists are hilarious

Links and Resources:


Data Analysis Project with Pandas — Step-by-Step Guide (Ted Talks Data) was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓