
Pandas vs Polars? Bid Adieu to Pandas and Switch To Polars!
Data Science   Latest   Machine Learning


Last Updated on April 22, 2024 by Editorial Team

Author(s): Saankhya Mondal

Originally published on Towards AI.

Dealing with tabular data? Pandas is the first Python library you’ll come across for tabular data preprocessing, and it’s one of the most popular libraries in Data Science. Pandas is the go-to framework for working with small or medium-sized CSV files. However, as the size of the dataset grows, Pandas’ performance starts deteriorating. Its main disadvantage is that it uses only a single thread (and hence only one CPU core) to perform operations, and it requires the entire data frame to fit in memory. As a result, Pandas is slow on large datasets.

Pic source — https://docs.pola.rs/

Enter Polars. Polars is designed for parallelism as it can utilize all the available cores in the CPU to perform data processing. Hence, it’s blazing fast. It optimizes queries to reduce unneeded work/memory allocations and can handle datasets much larger than the available RAM. Polars is written in Rust which gives it C/C++ performance and allows it to fully control the critical parts in a query engine.

One can seamlessly integrate Polars into their development code, as Polars is easy to use and has an API similar to Pandas and PySpark. It supports the most commonly used file types, like .csv and .parquet, and databases like MySQL and Postgres.

Let’s see how much faster Polars is. To compare the two libraries, I’ve created a dummy dataset containing 10 million rows. Each row represents a shot taken by a football player, the attributes of the shot, and whether or not it resulted in a goal. The dataset is 1.2 GB in size.

Dataset Creation

import pandas as pd
import numpy as np
import tqdm
import uuid

data = []
body_part = ["right_foot", "left_foot", "header", "other"]
position = ["attacker", "midfielder", "defender"]
for i in tqdm.tqdm(range(10_000_000)):
    data.append({
        "shot_id": str(uuid.uuid4()),
        "player_id": np.random.randint(500),
        "loc_x": np.random.randn(),
        "loc_y": np.random.randn(),
        "body_part": body_part[np.random.randint(4)],
        "position": position[np.random.randint(3)],
        "goal": np.random.binomial(1, 0.11),
        "xG": np.random.uniform(0, 1)
    })
data = pd.DataFrame(data)
data.to_csv("data.csv", index=False)
The dummy dataset

Pandas vs Polars — Read .csv file

The code to read a .csv file is shown in the snippet below.

import polars as pl

# Pandas
df_pd = pd.read_csv("data.csv")

# Polars
df_pl = pl.read_csv("data.csv")

The execution times are averaged over 10 runs. Polars is 8.93 times faster while reading a 1.2 GB .csv file containing 10 million rows!
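For reference, timings like these can be reproduced with a tiny harness along the lines of the one below (my own sketch; `avg_time` is a hypothetical helper, not part of either library):

```python
import time

def avg_time(fn, runs=10):
    """Call fn `runs` times and return the mean wall-clock time in seconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / runs

# Example usage (assumes data.csv exists and both libraries are imported):
# pandas_t = avg_time(lambda: pd.read_csv("data.csv"))
# polars_t = avg_time(lambda: pl.read_csv("data.csv"))
# print(f"speed-up: {pandas_t / polars_t:.2f}x")
```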

Pandas vs Polars — Perform Filter operation

Suppose we want to work only with the players who are attackers. We perform a filter operation.

# Pandas
df_pd_filter = df_pd[df_pd["position"] == "attacker"]

# Polars
df_pl_filter = df_pl.filter(pl.col("position") == "attacker")

Polars is 13.64 times faster than Pandas for the filter operation!

Pandas vs Polars — Perform Group By operation

Suppose we want to see the total number of goals scored by each player with each body part, with the results sorted in a clean order. The following code block does just that.

# Pandas
df_agg_pd = (
    df_pd
    .groupby(["player_id", "body_part"])
    .agg(
        total_goals=("goal", "sum"),
        total_xG=("xG", "sum")
    )
    .reset_index()
    .sort_values(
        by=["total_goals", "total_xG", "player_id", "body_part"],
        ascending=[False, False, True, True]
    )
)

# Polars
df_agg_pl = (
    df_pl
    .group_by(["player_id", "body_part"])
    .agg(
        pl.sum("goal").alias("total_goals"),
        pl.sum("xG").alias("total_xG")
    )
    .sort(
        by=["total_goals", "total_xG", "player_id", "body_part"],
        descending=[True, True, False, False]
    )
)

The execution times are averaged over 10 runs. Polars is 1.3 times faster for the group-by operation.

Pandas vs Polars — The entire pipeline

As a Data Scientist, I want my entire pipeline to execute quickly so that I can start training my favorite machine learning model on the dataset. Here comes the interesting part. Polars comes with a lazy API and a dataframe type called LazyFrame! With the lazy API, Polars doesn’t execute each operation eagerly, line by line; instead, it builds up the entire query and processes it from start to finish, which lets the query optimizer eliminate redundant work. Let’s compare three approaches.

# Pandas
df_pd = pd.read_csv("data.csv")
df_pd = df_pd[df_pd["position"] == "attacker"]
df_agg_pd = (
    df_pd
    .groupby(["player_id", "body_part"])
    .agg(
        total_goals=("goal", "sum"),
        total_xG=("xG", "sum")
    )
    .reset_index()
    .sort_values(
        by=["total_goals", "total_xG", "player_id", "body_part"],
        ascending=[False, False, True, True]
    )
)

# Polars
df_agg_pl = (
    pl.read_csv("data.csv")
    .filter(pl.col("position") == "attacker")
    .group_by(["player_id", "body_part"])
    .agg(
        pl.sum("goal").alias("total_goals"),
        pl.sum("xG").alias("total_xG")
    )
    .sort(
        by=["total_goals", "total_xG", "player_id", "body_part"],
        descending=[True, True, False, False]
    )
)

# Polars (Lazy)
df_agg_pl = (
    pl.scan_csv("data.csv")
    .filter(pl.col("position") == "attacker")
    .group_by(["player_id", "body_part"])
    .agg(
        pl.sum("goal").alias("total_goals"),
        pl.sum("xG").alias("total_xG")
    )
    .sort(
        by=["total_goals", "total_xG", "player_id", "body_part"],
        descending=[True, True, False, False]
    )
)
df_agg_pl = df_agg_pl.collect()

Note that the Polars lazy API uses scan_csv instead of read_csv to build a query computation graph. Data processing takes place only when we call collect(). The Polars lazy API is 13.24 times faster than Pandas, and over 1.5 times faster than eager Polars! Incredible, isn’t it?

Polars also supports interaction via SQL! The following code snippet shows a simple example.

ctx = pl.SQLContext(population=df_pl, eager_execution=True)
query = """
select player_id, body_part, sum(goal) as total_goals, sum(xG) as total_xG
from population
group by player_id, body_part
order by total_goals desc, total_xG desc, player_id, body_part
"""

print(ctx.execute(query))
SQL query output

Conclusion

Polars is blazing fast, thanks to its multi-threading capability! And you can take it up a notch: when you run your entire pipeline, right from reading your data file through all the data processing operations, lazy Polars can speed it up by a factor of 13!

For more info, refer to the documentation — https://docs.pola.rs/.

Code for the analysis and plots can be found here — https://github.com/sm823zw/pandas-vs-polars

Thank you for reading!


Published via Towards AI
