Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Take our 85+ lesson From Beginner to Advanced LLM Developer Certification: From choosing a project to deploying a working product this is the most comprehensive and practical LLM course out there!

Publication

Mastering Data Analysis with Pandas: A Step-by-Step Guide using Stanford Open Policing Data
Latest

Mastering Data Analysis with Pandas: A Step-by-Step Guide using Stanford Open Policing Data

Last Updated on February 23, 2023 by Editorial Team

Author(s): Fares Sayah

Originally published on Towards AI.

Data Analysis Project Guideβ€Šβ€”β€ŠUse Pandas power to get valuable information from yourΒ data

Photo by path digital onΒ UnsplashThe Pandas library offers a wide range of capabilities for data science, including cleaning, visualization, and exploration. However, it is important to note that just because you are using this tool does not guarantee accurate results. Improper use of the library can lead to incorrect conclusions, missed data, or misleading calculations. This tutorial will guide you through common data science tasks, showing you both how to effectively use Pandas and how to avoid common mistakes. By the end, you will have the confidence and knowledge to use Pandas in a responsible and effective manner.

This dataset contains the data obtained from the number of traffic stops made by US police and what happened during these stops. The data ranges from 2005 to 2015. The dataset was obtained from Stanford Open Policing Project. The purpose of collecting this data was to monitor and enhance the interactions between law enforcement and the public in theΒ country.

Table of Contents:

In this project, we examine a number of factors that relate to police stopping a group of people from the Stanford region of the United States. In particular, we wish to understand the association between driver age and driver gender, as well as the time of day on police, stopΒ them.

A good data analysis project is all about asking questions. In this blog post, we are going to answer the following questions:

  1. Do men or women speed moreΒ often?
  2. Does gender affect who gets searched during aΒ stop?
  3. During a search, how often is the driverΒ frisked?
  4. Which year had the least number ofΒ stops?
  5. How does drug activity change by time ofΒ day?
  6. Do most stops occur atΒ night?

Read theΒ Data

The first step is to get the data and load it to memory. You can download data from this link: Stanford Open Policing Project. We are using Pandas for data manipulation. Matplotlib, Seaborn, and hvPlot for Data Visualization. A quick way to check your data is by usingΒ .head()Β method.

png

After reading the data successfully, we need to check our data. Pandas have a lot of function that allows us to discover our data effectively and quickly. The goal here is to find out more about the data and become a subject matter expert on the dataset you are workingΒ with.

In general, we need to know the following questions:

  1. What kind of data do we have, and how do we treat different types?
  2. What’s missing from the data, and how do you deal withΒ it?
  3. How can you add, change or remove features to get more out of yourΒ data?

Information About theΒ Data

.info() method prints information about a DataFrame including the index dtype and columns, non-null values, and memoryΒ usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91741 entries, 0 to 91740
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 stop_date 91741 non-null object
1 stop_time 91741 non-null object
2 county_name 0 non-null float64
3 driver_gender 86406 non-null object
4 driver_age_raw 86414 non-null float64
5 driver_age 86120 non-null float64
6 driver_race 86408 non-null object
7 violation_raw 86408 non-null object
8 violation 86408 non-null object
9 search_conducted 91741 non-null bool
10 search_type 3196 non-null object
11 stop_outcome 86408 non-null object
12 is_arrested 86408 non-null object
13 stop_duration 86408 non-null object
14 drugs_related_stop 91741 non-null bool
dtypes: bool(2), float64(3), object(10)
memory usage: 9.3+ MB

Descriptive Statistics about theΒ Data

.describe() generates descriptive statistics. Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaNΒ values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for moreΒ detail.

png

The Shape of theΒ Data

Use β€˜.shape’ to see DataFrame shape: rows,Β columns

(91741, 15)

Missing Values

Missing Data is a very big problem in real-life scenarios. In Pandas, missing data is represented by two values: NaN or None. Panas has several useful functions for detecting, removing, and replacing null values in Pandas DataFrame:Β .isna() orΒ .isnull() used to find NaN,Β .dropna() used to remove NaN, andΒ .fillna() to fill NaN with a specificΒ value.

stop_date                 0
stop_time 0
county_name 91741
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64

What does NaNΒ mean?

In computing, NaN, standing for not a number, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic.

Why might a value beΒ missing?

There are many causes of missing values, Missing data can occur because of nonresponse, Attrition, governments or private entities,Β …

Why mark it as NaN? Why not mark it as a 0 or an empty string or a string saying β€œUnknown”?

We mark missing values as NaN to make them distinguish from the original dtype of theΒ feature.

Remove the column that only contains missingΒ values.

stop_date           : ==============> 0.00%
stop_time : ==============> 0.00%
county_name : ==============> 100.00%
driver_gender : ==============> 5.82%
driver_age_raw : ==============> 5.81%
driver_age : ==============> 6.13%
driver_race : ==============> 5.81%
violation_raw : ==============> 5.81%
violation : ==============> 5.81%
search_conducted : ==============> 0.00%
search_type : ==============> 96.52%
stop_outcome : ==============> 5.81%
is_arrested : ==============> 5.81%
stop_duration : ==============> 5.81%
drugs_related_stop : ==============> 0.00%

Remove missing values: If all values are NaN, drop that row orΒ column.

(91741, 14)

county_name All the data is missing. We will drop thisΒ column.

stop_date                 0
stop_time 0
driver_gender 5335
driver_age_raw 5327
driver_age 5621
driver_race 5333
violation_raw 5333
violation 5333
search_conducted 0
search_type 88545
stop_outcome 5333
is_arrested 5333
stop_duration 5333
drugs_related_stop 0
dtype: int64

Do Men or Women speed moreΒ often?

To answer this question, we need to know the portion of men and women in our dataset. After that, we need to check the portion of men and women stopped for a speed violation.

M    62895
F 23511
Name: driver_gender, dtype: int64

M 0.727901
F 0.272099
Name: driver_gender, dtype: float64

Speeding               48463
Moving violation 16224
Equipment 11020
Other 4317
Registration/plates 3432
Seat belt 2952
Name: violation, dtype: int64

Speeding 0.560862
Moving violation 0.187760
Equipment 0.127534
Other 0.049961
Registration/plates 0.039719
Seat belt 0.034164
Name: violation, dtype: float64

1: Driver Gender Distributionβ€Šβ€”β€ŠMen 62895 (73%), Women 23511 (27%) | 2: Driver Gender vs Speed Violation Distributionβ€Šβ€”β€ŠMen 62895 (68%), Women 23511Β (32%)

In this dataset, we 62895 (73%) Men and 23511 (27%)women. So, in responding to this question, we must take into consideration of the non-equivalent distribution of the data or use fractions.

The speeding violation represents 56% of all violations in ourΒ dataset.

M    32979
F 15482
Name: driver_gender, dtype: int64

M 0.680527
F 0.319473
Name: driver_gender, dtype: float64

When a man is pulled over, How often is it for speeding?

Speeding               32979
Moving violation 13020
Equipment 8533
Other 3627
Registration/plates 2419
Seat belt 2317
Name: violation, dtype: int64
Speeding               0.524350
Moving violation 0.207012
Equipment 0.135671
Other 0.057668
Registration/plates 0.038461
Seat belt 0.036839
Name: violation, dtype: float64

From 62895 Men stopped by police, 32979 (~53%) are stopped because of speeding.

When a woman is pulled over, How often is it for speeding?

Speeding               15482
Moving violation 3204
Equipment 2487
Registration/plates 1013
Other 690
Seat belt 635
Name: violation, dtype: int64
Speeding               0.658500
Moving violation 0.136277
Equipment 0.105780
Registration/plates 0.043086
Other 0.029348
Seat belt 0.027009
Name: violation, dtype: float64

From 23511 women stopped by police, 15482 (~66%) are stopped because of speeding.

The rate of Men stopped for speed is 68% and the rate at women stopped at speed is 32% which is different from the fraction of men (73%) and women (27%) in our data. So we can say that women are more likely to be stopped for speed violations thanΒ men.

Does Gender affect who gets searched during aΒ stop?

To answer this question, we need to know the number of searches conducted during allΒ stops.

False    88545
True 3196
Name: search_conducted, dtype: int64
False    0.965163
True 0.034837
Name: search_conducted, dtype: float64

From all 88545 stoping cases, the data only 3196 (3%) are searched.

M    2725
F 471
Name: driver_gender, dtype: int64
M    0.852628
F 0.147372
Name: driver_gender, dtype: float64

Does this prove causation?

  • Causation is difficult to conclude, so focus on relationships
  • Include all relevant factors when studying a relationship

From the stopped cases 2725 (85%) are men and only 471 (15%)are women. This result means that men are more likely to be searched than women during aΒ stop.

Why is β€˜search_type’ missing soΒ often?

88545
False    88545
True 3196
Name: search_conducted, dtype: int64

NaN    88545
Name: search_type, dtype: int64

search_type is missing every time the police don't search. Now we know why the search type is missing. We can fill in the missing values in the search type by Not Searched using pandas functionΒ .fillna().

Notes:

  • Verify your assumptions about yourΒ data
  • pandas functions ignore missing values byΒ default

During a search, how often is the driverΒ frisked?

search_type is a text of search types separated by commas. So, we need to separate the search type and then count the appearance ofΒ Frisk.

Incident to Arrest                                          1219
Probable Cause 891
Inventory 220
Reasonable Suspicion 197
Protective Frisk 161
Incident to Arrest,Inventory 129
Incident to Arrest,Probable Cause 106
Probable Cause,Reasonable Suspicion 75
Incident to Arrest,Inventory,Probable Cause 34
Incident to Arrest,Protective Frisk 33
Probable Cause,Protective Frisk 33
Inventory,Probable Cause 22
Incident to Arrest,Reasonable Suspicion 13
Inventory,Protective Frisk 11
Incident to Arrest,Inventory,Protective Frisk 11
Protective Frisk,Reasonable Suspicion 11
Incident to Arrest,Probable Cause,Protective Frisk 10
Incident to Arrest,Probable Cause,Reasonable Suspicion 6
Incident to Arrest,Inventory,Reasonable Suspicion 4
Inventory,Reasonable Suspicion 4
Inventory,Probable Cause,Protective Frisk 2
Inventory,Probable Cause,Reasonable Suspicion 2
Incident to Arrest,Protective Frisk,Reasonable Suspicion 1
Probable Cause,Protective Frisk,Reasonable Suspicion 1
Name: search_type, dtype: int64

We will use the Counter function from collections library to count the number of each search_type.

274
0.08573216520650813

We have 5 search types (Inventory, Reasonable Suspicion, Probable Cause, Protective Frisk, Incident to Arrest). From all searches conducted 247 (8.57%) are Protective Frisk.

Which year had the least number ofΒ stops?

The stop_date and stop_time are object types, so we need to convert them to Pandas DateTime objects to easily manipulate them. Having our dates as datetime64 object will allow us to access a lot of date and time information through theΒ .dtΒ API.

object
object
0        2005-01-02
1 2005-01-18
2 2005-01-23
3 2005-02-20
4 2005-03-14
Name: stop_date, Length: 91741, dtype: object

stop_date             datetime64[ns]
stop_time object
driver_gender object
driver_age_raw float64
driver_age float64
driver_race object
violation_raw object
violation object
search_conducted bool
search_type object
stop_outcome object
is_arrested object
stop_duration object
drugs_related_stop bool
year int64
dtype: object

2012    10970
2006 10639
2007 9476
2014 9228
2008 8752
2015 8599
2011 8126
2013 7924
2009 7908
2010 7561
2005 2558
Name: year, dtype: int64

2012 and 2006 were the two years with the highest number of arrests by the police. 2005 was the year with the fewest arrests by theΒ police.

How does drug activity change by time ofΒ day?

To answer this question, we need to check drug_related_stop column in our dataset, and to track these activities over the day, we use the stop_time column.

False    90926
True 815
Name: drugs_related_stop, dtype: int64
False    0.991116
True 0.008884
Name: drugs_related_stop, dtype: float64

From all the records we have, 815 (0.88%) of the stops are drug-related.

Most drug-related stops are between 10 p.m. and 1 a.m., which is very logical because most drug addicts consume drugs at this time of the day, but we need a larger dataset to confirm this assumption.

Do most stops occur atΒ night?

From the plots, it seems that most stops are during day time, not at night. But this is very logical because traffic in the daytime is more than traffic during nighttime.

Find the bad data in the stop_duration column and fixΒ it

stop_duration Missing Values: 5333
stop_duration Unique Values: ['0-15 Min' '16-30 Min' '30+ Min' nan '2' '1']

By default, pandas.value_counts() ignore missing values. Pass dropna=False to make it count missingΒ values.

0-15 Min     69543
16-30 Min 13635
NaN 5333
30+ Min 3228
2 1
1 1
Name: stop_duration, dtype: int64

It seems that we have two additional categories 1 and 2 with only one record for each. We will consider them as missingΒ values.

0-15 Min     69543
16-30 Min 13635
NaN 5335
30+ Min 3228
Name: stop_duration, dtype: int64

What is the mean stop_duration for each violation_raw?

For the stop_duration we have three categories: β€˜0–15 Min’, β€˜16–30 Min’, and β€˜30+Β Min’.

stop_duration Unique Values: ['0-15 Min' '16-30 Min' '30+ Min' nan '2' '1']

violation_raw Number of Unique Values: 12
violation_raw Unique Values: [
'Speeding' 'Call for Service'
'Equipment/Inspection Violation'
'Other Traffic Violation'
nan
'Registration Violation',
'Special Detail/Directed Patrol'
'APB'
'Violation of City/Town Ordinance'
'Suspicious Person'
'Motorist Assist/Courtesy'
'Warrant'
'Seatbelt Violation'
]

pngpng

Let’s create a new column in which we replace the time intervals that we have with the mean so we can apply mathematical operations on them. we will map them as follow: β€˜0–15 Min’:8, β€˜16–30 Min’:23, β€˜30+Β Min’:45.

8.0     69543
23.0 13635
45.0 3228
Name: stop_minutes, dtype: int64

violation_raw
APB 20.987342
Call for Service 22.034669
Equipment/Inspection Violation 11.460345
Motorist Assist/Courtesy 16.916256
Other Traffic Violation 13.900265
Registration Violation 13.745629
Seatbelt Violation 9.741531
Special Detail/Directed Patrol 15.061100
Speeding 10.577690
Suspicious Person 18.750000
Violation of City/Town Ordinance 13.388626
Warrant 21.400000
Name: stop_minutes, dtype: float64

png

Compare the age distributions for each violation

png

Summary

Good data analysis requires both mastering the tools (Python and Pandas) and Creative thinking (Understanding the problem we are solving and asking the right questions). In this article, we discovered how to perform data analysis. Specifically, weΒ learned:

  • One of the first steps when exploring a new data set is making sure the data types are set correctly.
  • Causation is difficult to conclude, so focus on relationships. So, Include all relevant factors when studying a relationship.
  • Use the DateTime data type for dates andΒ times.
  • Use visualization tools to help you understand trends in yourΒ data.

Links and Resources:


Mastering Data Analysis with Pandas: A Step-by-Step Guide using Stanford Open Policing Data was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.

Published via Towards AI

Feedback ↓