Mastering Pandas: 21 Pandas Tips Every Data Scientist Should Know
Last Updated on March 21, 2023 by Editorial Team
Author(s): Fares Sayah
Originally published on Towards AI.
Maximize Your Pandas Skills: Essential Tips and Tricks for Mastering Data Manipulation
Whether youβre a beginner or an experienced data analyst, the 21 tips and tricks covered in this article will help you tackle any analytical task with ease. From writing clean code to avoiding reinventing the wheel, these tips will help you work smarter, not harder. Discover the hidden gems of Pandas and become a master user in no time. And if you have any additional tips to share, donβt hesitate to leave a comment.
Most of the tips are from
LinkedIn posts.
1: Print DataFrame in Markdown-friendly format
Markdown is a lightweight markup language that is used to create formatted text using a plain-text editor. Sometimes, you might want to include a table in a markdown, such as GitHub README or Latex.
If you want to print a DataFrame
in markdown format, use to_markdown()
function.
| | a | b | |---:|----:|----:| | 0 | 1 | 5 | | 1 | 2 | 6 | | 2 | 3 | 7 | | 3 | 4 | 8 |
+----+-----+-----+ | | a | b | +====+=====+=====+ | 0 | 1 | 5 | +----+-----+-----+ | 1 | 2 | 6 | +----+-----+-----+ | 2 | 3 | 7 | +----+-----+-----+ | 3 | 4 | 8 | +----+-----+-----+
2: Group rows into a list
It is common to use groupby
to get the statistics of rows in the same group, such as count, mean, median, etc. If you want to group rows into a list instead, use βlambda x: list(x)
β.
col1 col3 col2 a 1.5 [d, e] b 3.5 [f, g] c 3.0 [h]
3: DataFrame.explode()
When working with a DataFrame
, if you want to turn a string into a list and then split elements in a list into multiple rows, use the combination of str.split()
and explode()
.
a b 0 [1, 2] 11 1 [4, 5] 13
a b 0 1 11 0 2 11 1 4 13 1 5 13
4: DataFrame.copy()
Have you ever tried to make a copy of a DataFrame
user β=
β? You will not get a copy but a reference to the original DataFrame
. Thus, changing the new DataFrame
will also change the original DataFrame
.
A better way to make a copy is to use df.copy()
. Now, changing the copy will not affect the original DataFrame
.
col1 col2 0 7 4 1 8 5 2 9 6
col1 col2 0 1 4 1 2 5 2 3 6
5: Groupby().count vs Groupby( ).size
- If you want to get the count of elements in one column of a Pandas
DataFrame
, usegroupby
andcount
. - If you want to get the size of groups composed of 2 or more columns, use
groupby
andsize
instead.
col2 col1 a 1 b 2 c 2 d 1
col1 col2 a S 1 b M 1 S 1 c L 2 d L 1 dtype: int64
6: Correlation
If you want to compute the correlation between rows or columns of two DataFrame, use .corrwith()
.
a 0.94388 b 0.68313 dtype: float64
7: Cross-Tabulation
Cross-tabulation allows you to analyze the relationship between multiple variables. To turn a Pandas DataFrame
into a cross-tabulation, use pandas.crosstab()
.
person2 Ben Jone Patrick Smith Warren person1 Ben 0 0 1 1 0 Jone 0 0 0 0 1 Patrick 1 0 0 1 0 Smith 1 0 1 0 1 Warren 0 1 0 1 0
8: DataFrame.query()
It can be lengthy to filter columns of pandas DataFrame
using brackets. To shorten the filtering statements, use df.query()
instead.
fruit price 2 grape 6 3 grape 7
fruit price 2 grape 6 3 grape 7
9: Unpivot DataFrame
- If you want to unpivot a
DataFrame
from wide to long format, usepandas.melt()
. - For example, you can use
pandas.melt()
to turn multiple columns (βAldiβ, βWalmartβ, βCostcoβ) into values of one column (βstoreβ).
fruit Aldi Walmart Costco 0 apple 4 6 1 1 orange 5 7 2
fruit store value 0 apple Aldi 4 1 orange Aldi 5 2 apple Walmart 6 3 orange Walmart 7 4 apple Costco 1 5 orange Costco 2
10: Rename aggregated column
By default, aggregating a column returns the name of that column.
If you want to assign a new name to the aggregation, use name = (column, agg_method)
.
price size L 19.000 M 10.000 S 36.995
mean_price size L 19.000 M 10.000 S 36.995
11: Normalized Value Counts
- If you want to get the count of a value in a column, use
value_counts
. - However, if you want to get the percentage of a value in a column, add
normalize=True
tovalue_counts
.
S 4 M 2 L 1 XL 1 dtype: int64
S 0.500 M 0.250 L 0.125 XL 0.125 dtype: float64
12: df.transform() instead of df.count()
- To filter pandas
DataFrame
based on the occurrences of categories, you might attempt to usedf.groupby
anddf.count
. However, since the Series returned by the count method is shorter than the originalDataFrame
, you will get an error when filtering. - Instead of using
count
, usetransform
. This method will return the Series with the same length as the originalDataFrame
. Now you can filter without encountering any errors.
type value 0 A 5 1 A 3 2 O 2 4 O 7 5 A 3
13: Fill in Null Values
- If you want to fill null values in one
DataFrame
with non-null values at the same locations from anotherDataFrame
, usepandas.DataFrame.combine_first
. - In the code below, the values at the first row of
store1
are updated with the values at the first row ofstore2
.
orange apple 0 31.0 4.0 1 5.0 71.0 2 9.0 12.0
14: Value Counts Missing Values
By default, pandas.value_counts()
ignore missing values. Pass dropna=False
to make it count missing values.
15: Filter Columns in DataFrame
If you want to filter columns of a pandas DataFrame based on characters in their names, use DataFrame.filter. This can be handy if you create dummy variables and you want to select columns based on the prefix.
Temp Degree 0 Hot 35 1 Cold 3 2 Warm 15 3 Cold 2
Degree Temp_Cold Temp_Hot Temp_Warm 0 35 0 1 0 1 3 1 0 0 2 15 0 0 1 3 2 1 0 0
Temp_Cold Temp_Hot Temp_Warm 0 0 1 0 1 1 0 0 2 0 0 1 3 1 0 0
16: Convert Data Types Automatically
If you donβt know the dtypes of columns in your DataFrame, you can use convert_dtypes()
to quickly convert columns to the best possible types.
17: Assign new columns to a DataFrame
Use Dataframe.assign()
method to assign new columns to your DataFrame
, Dataframe.assign()
returns a new object (a copy) with the new columns added to the original ones. Existing columns that are re-assigned will be overwritten.
text text_len word_count weekend 0 saturday: weekend (not working day) 35 5 True 1 sunday: weekend (not working day) 33 5 True 2 monday: doctor appointment at 2:45pm. 37 5 False 3 tuesday: dentist appointment at 11:30 am. 41 6 False 4 wednesday: basketball game at 7:00pm 36 5 False 5 thursday: back home by 11:15 pm. 32 6 False 6 friday: take the train at 08:10 am. 35 7 False
18: Read HTML Tables
.read_html()
can be useful for quickly incorporating tables from various websites without figuring out how to scrape the siteβs HTML.
Year Office GOP DFL Others 0 2020 President 45.3% 52.4% 2.3% 1 2020 Senator 43.5% 48.8% 7.7% 2 2018 Governor 42.4% 53.9% 3.7% 3 2018 Senator 36.2% 60.3% 3.4% 4 2018 Senator 42.4% 53.0% 4.6%
Location July (Β°F) July (Β°C) January (Β°F) January (Β°C) 0 Minneapolis 83/64 28/18 23/7 β4/β13 1 Saint Paul 83/63 28/17 23/6 β5/β14 2 Rochester 82/63 28/17 23/3 β5/β16 3 Duluth 76/55 24/13 19/1 β7/β17 4 St. Cloud 81/58 27/14 18/β1 β7/β18
19: βnlargestβ and βnsmallest β
Use .nlargest()
and .nsmallest()
to sort columns in DataFrame
based on a specific column instead of using .sort_values()
Data Link: IMDB Rating
20: Create a Rank Column
Pandas DataFrame.rank()
method returns a rank of every respective index of a series passed. The rank is returned based on position after sorting.
In the following example, a new Rank column is created, which ranks the Student by their Marks.
Students Marks 0 John 80 1 Smith 56 2 Patrick 95 3 Bob 75 4 Jose 45
Students Marks Rank 0 John 80 2.0 1 Smith 56 4.0 2 Patrick 95 1.0 3 Bob 75 3.0 4 Jose 45 5.0
21: Color Values in DataFrame
Color styling adds more readability to the end user. Pandas have the style property that follows us to apply different styles to DataFrames.
Conclusion:
Pandas is a powerful, flexible, and easy-to-use open-source library for data analysis and manipulation, built on the foundation of Python programming. Becoming a master of Pandas can significantly enhance your data analysis skills, and knowing the best practices can save you considerable time and effort. With its fast performance and intuitive syntax, Pandas is the go-to choice for data scientists and analysts worldwide. By mastering Pandas, youβll be able to manipulate, analyze, and visualize data with ease, giving you a significant advantage in todayβs data-driven world.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI