Data Analysis Project with PandasStep-by-Step Guide (Ted Talks Data)

Last Updated on March 1, 2023 by Editorial Team

Author(s): Fares Sayah

Originally published on Towards AI.

Data Analysis Project with Pandas — Step-by-Step Guide (Ted Talks Data)

Data Analysis Project Guide — Use Pandas power to get valuable information from your data

Photo by Lukas Blazek on UnsplashThe Pandas library is a powerful tool for multiple phases of the data science workflow, including data cleaning, visualization, and exploratory data analysis. However, the size and complexity of the Pandas library make it challenging to discover the best way to accomplish any given task.In this article, you’ll use pandas to answer questions about multiple real-world datasets. Through each exercise, you’ll learn important data science skills as well as “best practices” for using pandas. By the end of the tutorial, you’ll be more fluent at using pandas to correctly and efficiently answer your own data science questions.

Read the Data

The first step is to get the data and load it to memory. You can download data from this link: TED Talks. We are using Pandas for data manipulation. Matplotlib, Seaborn, and hvPlot for Data Visualization.

After reading the data successfully, we need to check our data. Pandas have a lot of function that allows us to discover our data effectively and quickly. The goal here is to find out more about the data and become a subject matter expert on the dataset you are working with.

In general, we need to know the following questions:

What kind of data do we have, and how do we treat different types?
What’s missing from the data, and how do you deal with it?
How can you add, change or remove features to get more out of your data?

The Shape of the Data

Use ‘.shape’ to see DataFrame shape: rows, columns

(2550, 17)

Pandas Data Types

To check the types of your data, you can use .dtypes and it will return a pandas series of columns associated with there dtype. Pandas have 7 types in general:

object, int64, float64, category, and datetime64 are going to be covered in this article.
bool: True/False values. Can be a NumPy datetime64[ns].
timedelta[ns]: Differences between two datetimes.

comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object

Information About the Data

.info() method prints information about a DataFrame including the index dtype and columns, non-null values, and memory usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   comments            2550 non-null   int64 
 1   description         2550 non-null   object
 2   duration            2550 non-null   int64 
 3   event               2550 non-null   object
 4   film_date           2550 non-null   int64 
 5   languages           2550 non-null   int64 
 6   main_speaker        2550 non-null   object
 7   name                2550 non-null   object
 8   num_speaker         2550 non-null   int64 
 9   published_date      2550 non-null   int64 
 10  ratings             2550 non-null   object
 11  related_talks       2550 non-null   object
 12  speaker_occupation  2544 non-null   object
 13  tags                2550 non-null   object
 14  title               2550 non-null   object
 15  url                 2550 non-null   object
 16  views               2550 non-null   int64 
dtypes: int64(7), object(10)
memory usage: 338.8+ KB

Missing Values

Missing data (or missing values) is defined as the data value that is not stored for a variable in the observation of interest. In Machine Learning, we need to handle missing values. There are many types of missing values:

Standard Missing Values: These are missing values that Pandas can detect.
Non-Standard Missing Values: Sometimes it might be the case where there are missing values that have different formats.
Unexpected Missing Values: For example, if our feature is expected to be a string, but there’s a numeric type, then technically, this is also a missing value.

It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.

Sometimes you’ll simply want to delete those rows. Other times you’ll replace them.
A very common way to replace missing values is using a median (for objects) or mean (for numeric values).

But those are weak approaches, some times we need domain knowledge about the data and statistical study to fill in the missing values.

comments              0
description           0
duration              0
event                 0
film_date             0
languages             0
main_speaker          0
name                  0
num_speaker           0
published_date        0
ratings               0
related_talks         0
speaker_occupation    6
tags                  0
title                 0
url                   0
views                 0
dtype: int64

comments              0
description           0
duration              0
event                 0
film_date             0
languages             0
main_speaker          0
name                  0
num_speaker           0
published_date        0
ratings               0
related_talks         0
speaker_occupation    0
tags                  0
title                 0
url                   0
views                 0
dtype: int64

Descriptive Statistics about the Data

.describe() generates descriptive statistics. Descriptive statistics summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Which talks provoke the most online discussion?

From the heatmap, the number of views correlates well with language and comments.

Limitations of this approach

Sub comments (nested comments).
How long has it been online?

To correct this behavior, one solution is to normalize comments by views.

Lessons:

Consider the limitations and biases of your data when analyzing it
Make your results understandable

Visualize the distribution of comments

A line plot is not appropriate here (use it to measure something over time)

Check how many observations we removed from the plot:

(32, 19)

After filtering the data we lose only a small amount of data. This process is called excluding outliers.

Lessons:

Choose your plot type based on the question you are answering and the data type(s) you are working with
Use Pandas one-liners to iterate through plots quickly
Try modifying the plot defaults
Creating plots involves decision-making

Plot the number of talks that took place each year

The event column does not always include the year

1280    TEDGlobal 2012
163            TED2005
1593    TEDGlobal 2013
2027     TEDWomen 2015
197            TED2008
1286        TEDxAustin
1115      TEDxBrussels
1381          TEDxCHUV
159            TED2007
193            TED2007
Name: event, dtype: object

We can’t rely on event Feature, because most of the events don't have a year:

0    1140825600
1    1140825600
2    1140739200
3    1140912000
4    1140566400
Name: film_date, dtype: int64

Results don’t look right:

0   1970-01-01 00:00:01.140825600
1   1970-01-01 00:00:01.140825600
2   1970-01-01 00:00:01.140739200
3   1970-01-01 00:00:01.140912000
4   1970-01-01 00:00:01.140566400
Name: film_date, dtype: datetime64[ns]

Now the results look right:

0   2006-02-25
1   2006-02-25
2   2006-02-24
3   2006-02-26
4   2006-02-22
Name: film_date, dtype: datetime64[ns]

The new column uses the DateTime datatype (this was an automatic conversion):

comments                       int64
description                   object
duration                       int64
event                         object
film_date                      int64
languages                      int64
main_speaker                  object
name                          object
num_speaker                    int64
published_date                 int64
ratings                       object
related_talks                 object
speaker_occupation            object
tags                          object
title                         object
url                           object
views                          int64
comments_per_view            float64
views_per_comment            float64
film_datetime         datetime64[ns]
dtype: object

DateTime columns have convenient attributes under the dt namespace:

0    2006
1    2006
2    2006
3    2006
4    2006
Name: film_datetime, dtype: int64

Count the number of talks each year using value_counts():

2013    270
2011    270
2010    267
2012    267
2016    246
2015    239
2014    237
2009    232
2007    114
2017     98
2008     84
2005     66
2006     50
2003     33
2004     33
2002     27
1998      6
2001      5
1984      1
1983      1
1991      1
1994      1
1990      1
1972      1
Name: film_datetime, dtype: int64

Points are plotted and connected in the order you give them to Pandas:

Need to sort the index before plotting:

Timestamp('2017-08-27 00:00:00')

Lessons:

Read the documentation
Use the DateTime data type for dates and times
Check your work as you go
Consider excluding data if it might not be relevant

What were the “best” events in TED history to attend?

355
274

Count the number of talks (great if you value variety, but they may not be great talks):

TED_2009    199
TED_2010    184
TED_2011    184
TED_2013    183
TED_2012    178
Name: event, dtype: int64

Use views as a proxy for “quality of talk”:

event
AORN Congress                  149818.0
Arbejdsglaede Live             971594.0
BBC TV                         521974.0
Bowery Poetry Club             676741.0
Business Innovation Factory    304086.0
Name: views, dtype: float64

Find the largest values, but we don’t know how many talks are being averaged:

event
TEDxNorrkoping        6569493.0
TEDxCreativeCoast     8444981.0
TEDxBloomington       9484259.5
TEDxHouston          16140250.5
TEDxPuget Sound      34309432.0
Name: views, dtype: float64

Show the number of talks along with the mean (events with the highest means had only 1 or 2 talks):

Calculate the total views per event:

Lessons:

Think creatively about how you can use the data you have to answer your question
Watch out for small sample sizes

Unpack the rating data

Previously, users could tag talks on the TED website (funny, inspiring, confusing, etc.)

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object

Two ways to examine the rating data for the first talk:

"[{'id': 7, 'name': 'Funny', 'count': 19645}, {'id': 1, 'name': 'Beautiful', 'count': 4573}, {'id': 9, 'name': 'Ingenious', 'count': 6073}, {'id': 3, 'name': 'Courageous', 'count': 3253}, {'id': 11, 'name': 'Longwinded', 'count': 387}, {'id': 2, 'name': 'Confusing', 'count': 242}, {'id': 8, 'name': 'Informative', 'count': 7346}, {'id': 22, 'name': 'Fascinating', 'count': 10581}, {'id': 21, 'name': 'Unconvincing', 'count': 300}, {'id': 24, 'name': 'Persuasive', 'count': 10704}, {'id': 23, 'name': 'Jaw-dropping', 'count': 4439}, {'id': 25, 'name': 'OK', 'count': 1174}, {'id': 26, 'name': 'Obnoxious', 'count': 209}, {'id': 10, 'name': 'Inspiring', 'count': 24924}]"

str

Convert this into something useful using Python’s ast module (Abstract Syntax Tree):

list

[{'id': 7, 'name': 'Funny', 'count': 19645},
 {'id': 1, 'name': 'Beautiful', 'count': 4573},
 {'id': 9, 'name': 'Ingenious', 'count': 6073},
 {'id': 3, 'name': 'Courageous', 'count': 3253},
 {'id': 11, 'name': 'Longwinded', 'count': 387},
 {'id': 2, 'name': 'Confusing', 'count': 242},
 {'id': 8, 'name': 'Informative', 'count': 7346},
 {'id': 22, 'name': 'Fascinating', 'count': 10581},
 {'id': 21, 'name': 'Unconvincing', 'count': 300},
 {'id': 24, 'name': 'Persuasive', 'count': 10704},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
 {'id': 25, 'name': 'OK', 'count': 1174},
 {'id': 26, 'name': 'Obnoxious', 'count': 209},
 {'id': 10, 'name': 'Inspiring', 'count': 24924}]

list

Define a function to convert an element in the rating Series from
string to list:

Test the function:

[{'id': 7, 'name': 'Funny', 'count': 19645},
 {'id': 1, 'name': 'Beautiful', 'count': 4573},
 {'id': 9, 'name': 'Ingenious', 'count': 6073},
 {'id': 3, 'name': 'Courageous', 'count': 3253},
 {'id': 11, 'name': 'Longwinded', 'count': 387},
 {'id': 2, 'name': 'Confusing', 'count': 242},
 {'id': 8, 'name': 'Informative', 'count': 7346},
 {'id': 22, 'name': 'Fascinating', 'count': 10581},
 {'id': 21, 'name': 'Unconvincing', 'count': 300},
 {'id': 24, 'name': 'Persuasive', 'count': 10704},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
 {'id': 25, 'name': 'OK', 'count': 1174},
 {'id': 26, 'name': 'Obnoxious', 'count': 209},
 {'id': 10, 'name': 'Inspiring', 'count': 24924}]

Series apply method applies a function to every element in a Series and returns a Series:

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object

lambda is a shorter alternative:

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object

An even shorter alternative is to apply the function directly (without lambda):

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings, dtype: object

[{'id': 7, 'name': 'Funny', 'count': 19645},
 {'id': 1, 'name': 'Beautiful', 'count': 4573},
 {'id': 9, 'name': 'Ingenious', 'count': 6073},
 {'id': 3, 'name': 'Courageous', 'count': 3253},
 {'id': 11, 'name': 'Longwinded', 'count': 387},
 {'id': 2, 'name': 'Confusing', 'count': 242},
 {'id': 8, 'name': 'Informative', 'count': 7346},
 {'id': 22, 'name': 'Fascinating', 'count': 10581},
 {'id': 21, 'name': 'Unconvincing', 'count': 300},
 {'id': 24, 'name': 'Persuasive', 'count': 10704},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
 {'id': 25, 'name': 'OK', 'count': 1174},
 {'id': 26, 'name': 'Obnoxious', 'count': 209},
 {'id': 10, 'name': 'Inspiring', 'count': 24924}]

list

dtype('O')

comments                       int64
description                   object
duration                       int64
event                         object
film_date                      int64
languages                      int64
main_speaker                  object
name                          object
num_speaker                    int64
published_date                 int64
ratings                       object
related_talks                 object
speaker_occupation            object
tags                          object
title                         object
url                           object
views                          int64
comments_per_view            float64
views_per_comment            float64
film_datetime         datetime64[ns]
ratings_list                  object
dtype: object

Lessons:

Pay attention to data types in Pandas
Use apply any time it is necessary

Count the total number of ratings received by each talk

Bonus exercises:

For each talk, calculate the percentage of ratings that were negative
For each talk, calculate the average number of ratings it received per day since it was published

[{'id': 7, 'name': 'Funny', 'count': 19645},
 {'id': 1, 'name': 'Beautiful', 'count': 4573},
 {'id': 9, 'name': 'Ingenious', 'count': 6073},
 {'id': 3, 'name': 'Courageous', 'count': 3253},
 {'id': 11, 'name': 'Longwinded', 'count': 387},
 {'id': 2, 'name': 'Confusing', 'count': 242},
 {'id': 8, 'name': 'Informative', 'count': 7346},
 {'id': 22, 'name': 'Fascinating', 'count': 10581},
 {'id': 21, 'name': 'Unconvincing', 'count': 300},
 {'id': 24, 'name': 'Persuasive', 'count': 10704},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 4439},
 {'id': 25, 'name': 'OK', 'count': 1174},
 {'id': 26, 'name': 'Obnoxious', 'count': 209},
 {'id': 10, 'name': 'Inspiring', 'count': 24924}]

{'id': 7, 'name': 'Funny', 'count': 19645}

[{'id': 7, 'name': 'Funny', 'count': 544},
 {'id': 3, 'name': 'Courageous', 'count': 139},
 {'id': 2, 'name': 'Confusing', 'count': 62},
 {'id': 1, 'name': 'Beautiful', 'count': 58},
 {'id': 21, 'name': 'Unconvincing', 'count': 258},
 {'id': 11, 'name': 'Longwinded', 'count': 113},
 {'id': 8, 'name': 'Informative', 'count': 443},
 {'id': 10, 'name': 'Inspiring', 'count': 413},
 {'id': 22, 'name': 'Fascinating', 'count': 132},
 {'id': 9, 'name': 'Ingenious', 'count': 56},
 {'id': 24, 'name': 'Persuasive', 'count': 268},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 116},
 {'id': 26, 'name': 'Obnoxious', 'count': 131},
 {'id': 25, 'name': 'OK', 'count': 203}]

0    93850
1     2936
2     2824
3     3728
4    25620
Name: ratings_list, dtype: int64

0    93850
1     2936
2     2824
3     3728
4    25620
Name: ratings_list, dtype: int64

0    93850
1     2936
2     2824
3     3728
4    25620
Name: ratings_list, dtype: int64

count     2550.000000
mean      2436.408235
std       4226.795631
min         68.000000
25%        870.750000
50%       1452.500000
75%       2506.750000
max      93850.000000
Name: num_ratings, dtype: float64

Lessons:

Write your code in small chunks, and check your work as you go
Lambda is best for simple functions

Which occupations deliver the funniest TED talks on average?

Bonus exercises:

for each talk, calculate the most frequent rating
for each talk, clean the occupation data so that there’s only one occupation per talk

Step 1: Count the number of funny ratings

0    [{'id': 7, 'name': 'Funny', 'count': 19645}, {...
1    [{'id': 7, 'name': 'Funny', 'count': 544}, {'i...
2    [{'id': 7, 'name': 'Funny', 'count': 964}, {'i...
3    [{'id': 3, 'name': 'Courageous', 'count': 760}...
4    [{'id': 9, 'name': 'Ingenious', 'count': 3202}...
Name: ratings_list, dtype: object

True    2550
Name: ratings, dtype: int64

[{'id': 3, 'name': 'Courageous', 'count': 760},
 {'id': 1, 'name': 'Beautiful', 'count': 291},
 {'id': 2, 'name': 'Confusing', 'count': 32},
 {'id': 7, 'name': 'Funny', 'count': 59},
 {'id': 9, 'name': 'Ingenious', 'count': 105},
 {'id': 21, 'name': 'Unconvincing', 'count': 36},
 {'id': 11, 'name': 'Longwinded', 'count': 53},
 {'id': 8, 'name': 'Informative', 'count': 380},
 {'id': 10, 'name': 'Inspiring', 'count': 1070},
 {'id': 22, 'name': 'Fascinating', 'count': 132},
 {'id': 24, 'name': 'Persuasive', 'count': 460},
 {'id': 23, 'name': 'Jaw-dropping', 'count': 230},
 {'id': 26, 'name': 'Obnoxious', 'count': 35},
 {'id': 25, 'name': 'OK', 'count': 85}]

0    19645
1      544
2      964
3       59
4     1390
Name: funny_ratings, dtype: int64

Step 2: Calculate the percentage of funny ratings

1849                       Science humorist
337                                Comedian
124     Performance poet, multimedia artist
315                                  Expert
1168             Social energy entrepreneur
1468                          Ornithologist
595                  Comedian, voice artist
1534                         Cartoon editor
97                                 Satirist
2297                          Actor, writer
568                                Comedian
675                          Data scientist
21                     Humorist, web artist
194                                Jugglers
2273                    Comedian and writer
2114                    Comedian and writer
173                                Investor
747                                Comedian
1398                               Comedian
685             Actor, comedian, playwright
Name: speaker_occupation, dtype: object

2549               Game designer
1612                   Biologist
612                     Sculptor
998               Penguin expert
593                     Engineer
284               Space activist
1041         Biomedical engineer
1618      Spinal cord researcher
2132    Computational geneticist
442                     Sculptor
426              Author, thinker
458                     Educator
2437      Environmental engineer
1491             Photojournalist
1893     Forensic anthropologist
783             Marine biologist
195                    Kenyan MP
772             HIV/AIDS fighter
788            Building activist
936                Neuroengineer
Name: speaker_occupation, dtype: object

Step 3: Analyze the funny rate by occupation

speaker_occupation
Comedian                       0.512457
Actor, writer                  0.515152
Actor, comedian, playwright    0.558107
Jugglers                       0.566828
Comedian and writer            0.602085
Name: funny_rate, dtype: float64

count       2550
unique      1458
top       Writer
freq          51
Name: speaker_occupation, dtype: object

Step 4: Focus on occupations that are well-represented in the data

Writer                                51
Designer                              34
Artist                                34
Journalist                            33
Entrepreneur                          31
                                      ..
Neuroanatomist                         1
Foreign policy strategist              1
Mythologist                            1
Designer, illustrator, typographer     1
Computational geneticist               1
Name: speaker_occupation, Length: 1458, dtype: int64

pandas.core.series.Series

Index(['Writer', 'Designer', 'Artist', 'Journalist', 'Entrepreneur',
       'Architect', 'Inventor', 'Psychologist', 'Photographer', 'Filmmaker', 'Educator', 'Economist', 'Author', 'Neuroscientist', 'Philosopher', 'Roboticist', 'Biologist', 'Physicist', 'Marine biologist', 'Musician', 'Technologist', 'Activist', 'Global health expert; data visionary', 'Astronomer', 'Poet', 'Oceanographer', 'Graphic designer', 'Philanthropist', 'Singer/songwriter', 'Behavioral economist',
       'Historian', 'Social psychologist', 'Novelist', 'Futurist', 'Engineer',
       'Computer scientist', 'Astrophysicist', 'Mathematician', 'Comedian',
       'Photojournalist', 'Reporter', 'Evolutionary biologist',
       'Techno-illusionist', 'Writer, activist', 'Legal activist',
       'Social entrepreneur', 'Performance poet, multimedia artist',
       'Singer-songwriter', 'Climate advocate', 'Producer', 'Paleontologist',
       'Environmentalist, futurist', 'Science writer', 'Sound consultant',
       'Investor and advocate for moral leadership', 'Game designer',
       'Cartoonist', 'Tech visionary', 'Sculptor', 'Social Media Theorist',
       'Surgeon', 'Data scientist', 'Physician', 'Researcher', 'Chemist',
       'Musician, activist', 'Violinist', 'Chef'],
      dtype='object')

Step 5: Re-analyze the funny rate by occupation (for top occupations only)

(792, 24)

Lessons:

Check your assumptions about your data
Check whether your results are reasonable
Take advantage of the fact that pandas operations often output a DataFrame or a Series
Watch out for small sample sizes
Consider the impact of missing data
Data scientists are hilarious

Links and Resources:

Link to data used in this tutorial: TED Talks
Link to Full Notebook
Part II: Comprehensive Data Analysis with Pandas — Part II
This blog follows the talk from PyCon 2019 of Kevin Markham. PyCon: Full Conference. Kevin Markham Youtube channel: Data School

Data Analysis Project with Pandas — Step-by-Step Guide (Ted Talks Data) was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

Data Analysis Project with PandasStep-by-Step Guide (Ted Talks Data)

Author(s): Fares Sayah

Data Analysis Project with Pandas — Step-by-Step Guide (Ted Talks Data)

Data Analysis Project Guide — Use Pandas power to get valuable information from your data

Table of Contents:

Read the Data

The Shape of the Data

Pandas Data Types

Information About the Data

Missing Values

Descriptive Statistics about the Data

Which talks provoke the most online discussion?

Limitations of this approach

Lessons:

Visualize the distribution of comments

Lessons:

Plot the number of talks that took place each year

Lessons:

What were the “best” events in TED history to attend?

Lessons:

Unpack the rating data

Lessons:

Count the total number of ratings received by each talk

Lessons:

Which occupations deliver the funniest TED talks on average?

Step 1: Count the number of funny ratings

Step 2: Calculate the percentage of funny ratings

Step 3: Analyze the funny rate by occupation

Step 4: Focus on occupations that are well-represented in the data

Step 5: Re-analyze the funny rate by occupation (for top occupations only)

Lessons:

Links and Resources:

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement

Subscribe to our AI newsletter!

🔥 Recommended Articles 🔥