Beyond Pandas: The Modern Data Processing Toolkit for Data Engineering (Part 1)

Last Updated on May 8, 2025 by Editorial Team

Author(s): Gift Ojeabulu

Originally published on Towards AI.

Image by author

Outline

  • Introduction
  • The Data Size Decision Framework & Comprehensive Decision Flowchart
  • A decision flowchart based on team syntax preference and performance or integration requirements
  • Real-World Examples: log file analysis, e-commerce, and IoT
  • Conclusion

Introduction

Data processing gets harder as data grows. Even when they are no longer the best fit, many data scientists, engineers, and analysts keep reaching for familiar tools like Pandas.

This guide will explore a quick approach to choosing the right data processing tool, whether Pandas, Polars, DuckDB, or PySpark, based on your data size, performance needs, and workflow preferences.

The Data Size Decision Framework & Flowchart

Let us break down when to use each tool, based primarily on data size along with a few other criteria:

Image by the author: Comprehensive Tool Selection Guide for Data Workflows Based on Data Volume, Team Familiarity, and Infrastructure Needs.

Small Data (< 1GB)

If your dataset is under 1GB, Pandas is typically the best choice. It’s easy to use, widely adopted, and well-supported within the Python ecosystem. Unless you have very specific performance needs, Pandas will efficiently handle tasks like quick exploratory analysis and visualizations.

Use Pandas when:

  • Your dataset fits comfortably in memory.
  • You are doing a quick exploratory data analysis.
  • You need the massive ecosystem of Pandas-compatible libraries.
  • Your workflows involve lots of data visualization.
import pandas as pd
df = pd.read_csv("small_data.csv") # Under 1GB works fine

Pandas is still king for smaller datasets because of its rich ecosystem, extensive documentation, and widespread adoption. For these sizes, the performance gains from other tools may not justify the learning curve.
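
In practice, that quick exploratory pass usually looks something like the sketch below; the 'category' and 'value' columns are hypothetical stand-ins for whatever your data contains.

import pandas as pd
import matplotlib.pyplot as plt

# Load a small (< 1GB) CSV entirely into memory
df = pd.read_csv("small_data.csv")

# Quick exploratory pass: shape, dtypes, and summary statistics
print(df.shape)
print(df.dtypes)
print(df.describe())

# Simple aggregation and chart, assuming hypothetical 'category' and 'value' columns
df.groupby("category")["value"].mean().plot(kind="bar")
plt.show()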

Medium Data (1GB to 50GB)

When your data falls between 1GB and 50GB, you’ll need something faster and more efficient than Pandas. Your choice between Polars and DuckDB depends on your coding preference and workflow.

Use Polars when:

  • You need more speed than Pandas.
  • Memory efficiency is important.
  • You are working with complex data transformations.
  • You prefer a Python-centric workflow.
import polars as pl
df = pl.read_csv("medium_data.csv") # Fast and memory efficient
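
Much of Polars' memory efficiency comes from its lazy API, which plans the whole query before reading anything from disk. A minimal sketch, assuming a recent Polars release and hypothetical 'category' and 'amount' columns:

import polars as pl

# Build a lazy query plan; nothing is read from disk yet
lazy_df = pl.scan_csv("medium_data.csv")

# Filter, aggregate, and sort; Polars optimises the whole plan before executing it
result = (
    lazy_df
    .filter(pl.col("amount") > 0)
    .group_by("category")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .sort("total_amount", descending=True)
    .collect()  # only now is the file scanned and the result materialised
)
print(result)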

Use DuckDB when:

  • You prefer writing SQL queries.
  • You are performing complex aggregations or joins.
  • Your workflows are analytics-heavy.
  • You want to query data directly from files.
# Import the DuckDB library for high-performance analytics
import duckdb

# Execute a SQL query against the CSV file and store results in a pandas DataFrame
# - Queries the CSV file in place; DuckDB scans it directly instead of loading it into memory first
# - Filters for records where 'value' column exceeds 100 for downstream analysis
# - Returns results as pandas DataFrame for compatibility with visualization libraries
df = duckdb.query("SELECT * FROM 'medium_large_data.csv' WHERE value > 100").df()
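
Because DuckDB treats files as tables, joins and aggregations can also run directly across several files at once. A minimal sketch, with hypothetical orders.csv and customers.csv files and column names:

import duckdb

# Join two CSV files in place and aggregate, entirely inside DuckDB's engine
top_customers = duckdb.query("""
    SELECT c.customer_name, SUM(o.amount) AS total_spent
    FROM 'orders.csv' o
    JOIN 'customers.csv' c ON o.customer_id = c.customer_id
    GROUP BY c.customer_name
    ORDER BY total_spent DESC
    LIMIT 10
""").df()
print(top_customers)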

Big Data (Over 50GB)

When your data exceeds 50GB, PySpark becomes the go-to tool. It’s designed for distributed computing and can efficiently handle datasets that span multiple machines.

Use PySpark when:

  • Your data exceeds single-machine capacity.
  • Distributed processing is necessary.
  • You need fault tolerance.
  • Processing time is more important than setup complexity.
# Import SparkSession from pyspark.sql module, which is the entry point to Spark SQL functionality
from pyspark.sql import SparkSession

# Initialize a Spark session with meaningful application name for monitoring/logging purposes
# - SparkSession.builder allows fluent configuration of Spark settings
# - appName defines the name shown in Spark UI and logs for easier identification
# - getOrCreate() either creates a new session or returns an existing one to avoid redundancy
spark = SparkSession.builder.appName("BigDataAnalysis").getOrCreate()

# Load CSV data into a Spark DataFrame with automatic schema inference
# - Distributed reading of potentially large CSV file leveraging Spark's parallel processing
# - header=True treats first row as column names instead of data
# - inferSchema=True analyzes data to determine appropriate column types, avoiding default string types
# - Returns a distributed DataFrame that can scale across a cluster for subsequent operations
df = spark.read.csv("really_big_data.csv", header=True, inferSchema=True)

PySpark allows you to distribute processing across multiple machines, handling datasets from hundreds of gigabytes to petabytes.
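
Continuing from the DataFrame loaded above, a typical distributed job aggregates across the cluster and writes the result back out in a columnar format. A minimal sketch; the 'region' and 'amount' columns and the output path are hypothetical:

from pyspark.sql.functions import sum as spark_sum, count

# Aggregate across the cluster (hypothetical 'region' and 'amount' columns)
summary = (
    df.groupBy("region")
      .agg(
          spark_sum("amount").alias("total_amount"),
          count("*").alias("row_count"),
      )
)

# Persist the result as Parquet for downstream consumers (hypothetical output path)
summary.write.mode("overwrite").parquet("output/region_summary/")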

Image by author: Decision Flowchart for Selecting a Data Processing Tool Based on Data Size, Team Syntax Preference, and Performance or Integration Requirements 🟦 DuckDB 🟩 Polars 🟨 Pandas 🟥 PySpark

Additional factors to consider

While data size is the primary factor, several other considerations should influence your choice:

  • Need to run on multiple machines? → PySpark
  • Working with data scientists who know Pandas? → Polars (easiest transition)
  • Need the best performance on a single machine? → DuckDB or Polars
  • Need to integrate with existing SQL workflows? → DuckDB
  • Powering real-time dashboards? → DuckDB
  • Operating under memory constraints? → Polars or DuckDB
  • Preparing data for BI dashboards at scale? → PySpark or DuckDB

By systematically evaluating these factors, users can make more informed decisions about which data processing tool or combination of tools best fits their specific project requirements and team capabilities.
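
To make the flowchart concrete, the same logic can be sketched as a small helper function. This is purely illustrative: the thresholds mirror the ones used in this article, and the function name and arguments are made up for the example.

def suggest_tool(data_size_gb: float, prefers_sql: bool = False, needs_cluster: bool = False) -> str:
    """Illustrative rule of thumb mirroring the thresholds in this article."""
    if needs_cluster or data_size_gb > 50:
        return "PySpark"  # distributed processing beyond a single machine
    if data_size_gb >= 1:
        return "DuckDB" if prefers_sql else "Polars"  # medium data on a single machine
    return "Pandas"  # small data that fits comfortably in memory

print(suggest_tool(0.5))                   # Pandas
print(suggest_tool(20, prefers_sql=True))  # DuckDB
print(suggest_tool(200))                   # PySpark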

Real-World Examples

Example 1: Log File Analysis (10GB)

Processing server logs to extract error patterns:

  • Bad choice: Pandas (slow, memory issues).
  • Good choice: DuckDB (can directly query the log files).

Sample code:

import duckdb

error_counts = duckdb.query("""
    SELECT error_code, COUNT(*) AS count
    FROM 'server_logs.csv'
    GROUP BY error_code
    ORDER BY count DESC
""").df()

Example 2: E-commerce Data (30GB)

Analyzing customer purchase patterns:

  • Bad choice: Pandas (likely to exhaust memory on a single machine)
  • Good choice: Polars (for transformations) + DuckDB (for aggregations)

Sample code:

import polars as pl
import duckdb

# Load and transform with Polars
df = pl.scan_csv("transactions.csv")
df = df.filter(pl.col("purchase_date") > "2023-01-01")

# Materialise the filtered Polars frame and register it with DuckDB for complex aggregation
duckdb.register("transactions", df.collect())
customer_segments = duckdb.query("""
    SELECT customer_id,
           SUM(amount) AS total_spent,
           COUNT(*) AS num_transactions,
           AVG(amount) AS avg_transaction
    FROM transactions
    GROUP BY customer_id
    HAVING COUNT(*) > 5
""").df()

Example 3: Sensor Data (100GB+)

Processing IoT sensor data from multiple devices:

  • Bad choice: Any single-machine solution
  • Good choice: PySpark

Sample code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg

spark = SparkSession.builder.appName("SensorAnalysis").getOrCreate()

sensor_data = spark.read.parquet("s3://sensors/data/")

# Calculate hourly average temperature per sensor
# (withWatermark is only meaningful for streaming jobs; it is a no-op in a batch read like this one)
hourly_averages = (
    sensor_data
    .withWatermark("timestamp", "1 hour")
    .groupBy(
        window(sensor_data.timestamp, "1 hour"),
        sensor_data.sensor_id,
    )
    .agg(avg("temperature").alias("avg_temp"))
)
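
Nothing executes until an action or a write is triggered; a minimal follow-up, assuming a hypothetical output path:

# Trigger execution and persist the hourly averages as Parquet (hypothetical output path)
hourly_averages.write.mode("overwrite").parquet("s3://sensors/hourly_averages/")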

In Part 2 of this series, we will dive deeper into the medium data range (1–50GB) with a detailed comparison of DuckDB vs Polars, including performance benchmarks across common data operations. We will also explore exactly when SQL outperforms Python (and vice versa) for several data tasks.

Conclusion

As your data scales, so should your tools. While Pandas remains a solid choice for datasets under 1GB, larger volumes call for more specialized solutions. Polars shines for Python users handling mid-sized data, DuckDB is ideal for those who prefer SQL and need fast analytical queries, and PySpark is built for massive datasets that require distributed processing.

The best part? These tools aren't mutually exclusive; many modern data workflows combine them, using Polars for fast data wrangling, DuckDB for lightweight analytics, and PySpark for heavy-duty tasks. Ultimately, choosing the right tool isn't just about today's dataset; it is about ensuring your workflow can grow with your data tomorrow.

Connect with me on LinkedIn

Connect with me on Twitter


Published via Towards AI
