Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!

Publication

Mastering Pandas: 21 Pandas Tips Every Data Scientist Should Know
Latest   Machine Learning

Mastering Pandas: 21 Pandas Tips Every Data Scientist Should Know

Last Updated on March 21, 2023 by Editorial Team

Author(s): Fares Sayah

Originally published on Towards AI.

Maximize Your Pandas Skills: Essential Tips and Tricks for Mastering Data Manipulation

Mastering Pandas: 21 Pandas Tips Every Data Scientist Should Know
Photo by Pascal Müller on Unsplash

Whether you’re a beginner or an experienced data analyst, the 21 tips and tricks covered in this article will help you tackle any analytical task with ease. From writing clean code to avoiding reinventing the wheel, these tips will help you work smarter, not harder. Discover the hidden gems of Pandas and become a master user in no time. And if you have any additional tips to share, don’t hesitate to leave a comment.

Most of the tips are from

LinkedIn posts.

1: Print DataFrame in Markdown-friendly format

Markdown is a lightweight markup language that is used to create formatted text using a plain-text editor. Sometimes, you might want to include a table in a markdown, such as GitHub README or Latex.

If you want to print a DataFrame in markdown format, use to_markdown() function.

| | a | b |
|---:|----:|----:|
| 0 | 1 | 5 |
| 1 | 2 | 6 |
| 2 | 3 | 7 |
| 3 | 4 | 8 |
+----+-----+-----+
| | a | b |
+====+=====+=====+
| 0 | 1 | 5 |
+----+-----+-----+
| 1 | 2 | 6 |
+----+-----+-----+
| 2 | 3 | 7 |
+----+-----+-----+
| 3 | 4 | 8 |
+----+-----+-----+

2: Group rows into a list

It is common to use groupby to get the statistics of rows in the same group, such as count, mean, median, etc. If you want to group rows into a list instead, use “lambda x: list(x)”.

 col1 col3
col2 
a 1.5 [d, e]
b 3.5 [f, g]
c 3.0 [h]

3: DataFrame.explode()

When working with a DataFrame, if you want to turn a string into a list and then split elements in a list into multiple rows, use the combination of str.split() and explode().

 a b
0 [1, 2] 11
1 [4, 5] 13
 a b
0 1 11
0 2 11
1 4 13
1 5 13

4: DataFrame.copy()

Have you ever tried to make a copy of a DataFrame user “=”? You will not get a copy but a reference to the original DataFrame. Thus, changing the new DataFrame will also change the original DataFrame.

A better way to make a copy is to use df.copy(). Now, changing the copy will not affect the original DataFrame.

 col1 col2
0 7 4
1 8 5
2 9 6
 col1 col2
0 1 4
1 2 5
2 3 6

5: Groupby().count vs Groupby( ).size

  • If you want to get the count of elements in one column of a Pandas DataFrame, use groupby and count.
  • If you want to get the size of groups composed of 2 or more columns, use groupby and size instead.
 col2
col1 
a 1
b 2
c 2
d 1
col1 col2
a S 1
b M 1
 S 1
c L 2
d L 1
dtype: int64

6: Correlation

If you want to compute the correlation between rows or columns of two DataFrame, use .corrwith().

a 0.94388
b 0.68313
dtype: float64

7: Cross-Tabulation

Cross-tabulation allows you to analyze the relationship between multiple variables. To turn a Pandas DataFrame into a cross-tabulation, use pandas.crosstab().

person2 Ben Jone Patrick Smith Warren
person1 
Ben 0 0 1 1 0
Jone 0 0 0 0 1
Patrick 1 0 0 1 0
Smith 1 0 1 0 1
Warren 0 1 0 1 0

8: DataFrame.query()

It can be lengthy to filter columns of pandas DataFrame using brackets. To shorten the filtering statements, use df.query() instead.

 fruit price
2 grape 6
3 grape 7
 fruit price
2 grape 6
3 grape 7

9: Unpivot DataFrame

  • If you want to unpivot a DataFrame from wide to long format, use pandas.melt().
  • For example, you can use pandas.melt() to turn multiple columns (“Aldi”, “Walmart”, “Costco”) into values of one column (“store”).
 fruit Aldi Walmart Costco
0 apple 4 6 1
1 orange 5 7 2
 fruit store value
0 apple Aldi 4
1 orange Aldi 5
2 apple Walmart 6
3 orange Walmart 7
4 apple Costco 1
5 orange Costco 2

10: Rename aggregated column

By default, aggregating a column returns the name of that column.

If you want to assign a new name to the aggregation, use name = (column, agg_method).

 price
size 
L 19.000
M 10.000
S 36.995
 mean_price
size 
L 19.000
M 10.000
S 36.995

11: Normalized Value Counts

  • If you want to get the count of a value in a column, use value_counts.
  • However, if you want to get the percentage of a value in a column, add normalize=True to value_counts.
S 4
M 2
L 1
XL 1
dtype: int64
S 0.500
M 0.250
L 0.125
XL 0.125
dtype: float64

12: df.transform() instead of df.count()

  • To filter pandas DataFrame based on the occurrences of categories, you might attempt to use df.groupby and df.count. However, since the Series returned by the count method is shorter than the original DataFrame, you will get an error when filtering.
  • Instead of using count, use transform. This method will return the Series with the same length as the original DataFrame. Now you can filter without encountering any errors.
 type value
0 A 5
1 A 3
2 O 2
4 O 7
5 A 3

13: Fill in Null Values

  • If you want to fill null values in one DataFrame with non-null values at the same locations from another DataFrame, use pandas.DataFrame.combine_first.
  • In the code below, the values at the first row of store1 are updated with the values at the first row of store2.
 orange apple
0 31.0 4.0
1 5.0 71.0
2 9.0 12.0

14: Value Counts Missing Values

By default, pandas.value_counts() ignore missing values. Pass dropna=False to make it count missing values.

15: Filter Columns in DataFrame

If you want to filter columns of a pandas DataFrame based on characters in their names, use DataFrame.filter. This can be handy if you create dummy variables and you want to select columns based on the prefix.

 Temp Degree
0 Hot 35
1 Cold 3
2 Warm 15
3 Cold 2
 Degree Temp_Cold Temp_Hot Temp_Warm
0 35 0 1 0
1 3 1 0 0
2 15 0 0 1
3 2 1 0 0
 Temp_Cold Temp_Hot Temp_Warm
0 0 1 0
1 1 0 0
2 0 0 1
3 1 0 0

16: Convert Data Types Automatically

If you don’t know the dtypes of columns in your DataFrame, you can use convert_dtypes() to quickly convert columns to the best possible types.

17: Assign new columns to a DataFrame

Use Dataframe.assign() method to assign new columns to your DataFrame, Dataframe.assign() returns a new object (a copy) with the new columns added to the original ones. Existing columns that are re-assigned will be overwritten.

 text text_len word_count weekend
0 saturday: weekend (not working day) 35 5 True
1 sunday: weekend (not working day) 33 5 True
2 monday: doctor appointment at 2:45pm. 37 5 False
3 tuesday: dentist appointment at 11:30 am. 41 6 False
4 wednesday: basketball game at 7:00pm 36 5 False
5 thursday: back home by 11:15 pm. 32 6 False
6 friday: take the train at 08:10 am. 35 7 False

18: Read HTML Tables

.read_html() can be useful for quickly incorporating tables from various websites without figuring out how to scrape the site’s HTML.

 Year Office GOP DFL Others
0 2020 President 45.3% 52.4% 2.3%
1 2020 Senator 43.5% 48.8% 7.7%
2 2018 Governor 42.4% 53.9% 3.7%
3 2018 Senator 36.2% 60.3% 3.4%
4 2018 Senator 42.4% 53.0% 4.6%
 Location July (°F) July (°C) January (°F) January (°C)
0 Minneapolis 83/64 28/18 23/7 −4/−13
1 Saint Paul 83/63 28/17 23/6 −5/−14
2 Rochester 82/63 28/17 23/3 −5/−16
3 Duluth 76/55 24/13 19/1 −7/−17
4 St. Cloud 81/58 27/14 18/−1 −7/−18

19: ‘nlargest’ and ‘nsmallest ‘

Use .nlargest() and .nsmallest() to sort columns in DataFrame based on a specific column instead of using .sort_values()

Data Link: IMDB Rating

20: Create a Rank Column

Pandas DataFrame.rank() method returns a rank of every respective index of a series passed. The rank is returned based on position after sorting.

In the following example, a new Rank column is created, which ranks the Student by their Marks.

 Students Marks
0 John 80
1 Smith 56
2 Patrick 95
3 Bob 75
4 Jose 45
 Students Marks Rank
0 John 80 2.0
1 Smith 56 4.0
2 Patrick 95 1.0
3 Bob 75 3.0
4 Jose 45 5.0

21: Color Values in DataFrame

Color styling adds more readability to the end user. Pandas have the style property that follows us to apply different styles to DataFrames.

Conclusion:

Pandas is a powerful, flexible, and easy-to-use open-source library for data analysis and manipulation, built on the foundation of Python programming. Becoming a master of Pandas can significantly enhance your data analysis skills, and knowing the best practices can save you considerable time and effort. With its fast performance and intuitive syntax, Pandas is the go-to choice for data scientists and analysts worldwide. By mastering Pandas, you’ll be able to manipulate, analyze, and visualize data with ease, giving you a significant advantage in today’s data-driven world.

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓