
Beyond Pandas: The Modern Data Processing Toolkit for Data Engineering (Part 1)
Last Updated on May 8, 2025 by Editorial Team
Author(s): Gift Ojeabulu
Originally published on Towards AI.
Outline
- Introduction
- The Data Size Decision Framework & Comprehensive Decision Flowchart
- A diagrammatic representation based on team syntax preference and performance or integration requirements
- Real-World Examples: Log File Analysis, E-commerce Data, and IoT Sensor Data
- Conclusion
Introduction
Data processing difficulties increase with the amount of data. Many data scientists, engineers, and analysts continue to use well-known technologies like Pandas even when they are no longer the best fit.
This guide will explore a quick approach to choosing the right data processing tool, whether Pandas, Polars, DuckDB, or PySpark, based on your data size, performance needs, and workflow preferences.
The Data Size Decision Framework & Flowchart
Let us break down when to use each tool, based primarily on data size along with a few other criteria:
Small Data (< 1GB)
If your dataset is under 1GB, Pandas is typically the best choice. It's easy to use, widely adopted, and well-supported within the Python ecosystem. Unless you have very specific performance needs, Pandas will efficiently handle tasks like quick exploratory analysis and visualizations.
Use Pandas when:
- Your dataset fits comfortably in memory.
- You are doing a quick exploratory data analysis.
- You need the massive ecosystem of Pandas-compatible libraries.
- Your workflows involve lots of data visualization.
import pandas as pd
df = pd.read_csv("small_data.csv") # Under 1GB works fine
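For a quick sense of what this looks like in practice, a minimal exploratory sketch might be something like the following (the file name and the "value" column are placeholders):
import pandas as pd
# Load a small CSV that is assumed to fit comfortably in memory
df = pd.read_csv("small_data.csv")
# Quick exploratory checks
print(df.head())
print(df.describe())
print(df.memory_usage(deep=True).sum() / 1e6, "MB in memory")
# Simple visualization via Pandas' built-in plotting (uses matplotlib)
df["value"].hist(bins=50)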
Pandas is still king for smaller datasets because of its rich ecosystem, extensive documentation, and widespread adoption. For these sizes, the performance gains from other tools may not justify the learning curve.
Medium Data (1GB to 50GB)
When your data falls between 1GB and 50GB, you'll need something faster and more efficient than Pandas. Your choice between Polars and DuckDB depends on your coding preference and workflow.
Use Polars when:
- You need more speed than Pandas.
- Memory efficiency is important.
- You are working with complex data transformations.
- You prefer a Python-centric workflow.
import polars as pl
df = pl.read_csv("medium_data.csv") # Fast and memory efficient
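If you want to squeeze out more memory efficiency, Polars' lazy API lets you build up a query plan and only materialize the result at the end. Here is a minimal sketch, with placeholder column names:
import polars as pl
# scan_csv builds a lazy query; nothing is read until .collect() is called,
# which lets Polars push filters down and skip unused columns
lazy_df = pl.scan_csv("medium_data.csv")
result = (
    lazy_df
    .filter(pl.col("amount") > 0)                       # placeholder column
    .group_by("category")                               # placeholder column
    .agg(pl.col("amount").sum().alias("total_amount"))
    .collect()                                          # executes the optimized plan
)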
Use DuckDB when:
- You prefer writing SQL queries.
- You are performing complex aggregations or joins.
- Your workflows are analytics-heavy.
- You want to query data directly from files.
# Import the DuckDB library for high-performance analytics
import duckdb
# Execute a SQL query against the CSV file and store results in a pandas DataFrame
# - Queries the CSV file directly, with no explicit load step into memory
# - Filters for records where 'value' column exceeds 100 for downstream analysis
# - Returns results as pandas DataFrame for compatibility with visualization libraries
df = duckdb.query("SELECT * FROM 'medium_large_data.csv' WHERE value > 100").df()
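DuckDB can also query Parquet files, including globs of many files, in exactly the same way. A small sketch, with a placeholder path and placeholder columns:
import duckdb
# Aggregate directly over a folder of Parquet files without loading them first
summary = duckdb.query("""
    SELECT category, AVG(value) AS avg_value
    FROM 'data/*.parquet'
    GROUP BY category
""").df()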
Big Data (Over 50GB)
When your data exceeds 50GB, PySpark becomes the go-to tool. It's designed for distributed computing and can efficiently handle datasets that span multiple machines.
Use PySpark when:
- Your data exceeds single-machine capacity.
- Distributed processing is necessary.
- You need fault tolerance.
- Processing time is more important than setup complexity.
# Import SparkSession from pyspark.sql module, which is the entry point to Spark SQL functionality
from pyspark.sql import SparkSession
# Initialize a Spark session with meaningful application name for monitoring/logging purposes
# - SparkSession.builder allows fluent configuration of Spark settings
# - appName defines the name shown in Spark UI and logs for easier identification
# - getOrCreate() either creates a new session or returns an existing one to avoid redundancy
spark = SparkSession.builder.appName("BigDataAnalysis").getOrCreate()
# Load CSV data into a Spark DataFrame with automatic schema inference
# - Distributed reading of potentially large CSV file leveraging Spark's parallel processing
# - header=True treats first row as column names instead of data
# - inferSchema=True analyzes data to determine appropriate column types, avoiding default string types
# - Returns a distributed DataFrame that can scale across a cluster for subsequent operations
df = spark.read.csv("really_big_data.csv", header=True, inferSchema=True)
PySpark allows you to distribute processing across multiple machines, handling datasets from hundreds of gigabytes to petabytes.
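Once the data is loaded, aggregations are written much like in Pandas but executed across the cluster. Continuing from the DataFrame above, a sketch with placeholder column names and output path might look like this:
from pyspark.sql import functions as F
# Grouped aggregation; Spark handles the shuffle across machines
summary = (
    df.groupBy("category")                       # placeholder column
      .agg(F.sum("value").alias("total_value"),  # placeholder column
           F.count("*").alias("row_count"))
)
# Persist the much smaller result as Parquet for downstream use
summary.write.mode("overwrite").parquet("output/summary_by_category")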
Additional Factors to Consider
While data size is the primary factor, several other considerations should influence your choice:
- Need to run on multiple machines? → PySpark
- Working with data scientists who know Pandas? → Polars (easiest transition)
- Need the best performance on a single machine? → DuckDB or Polars
- Need to integrate with existing SQL workflows? → DuckDB
- Powering real-time dashboards? → DuckDB
- Operating under memory constraints? → Polars or DuckDB
- Preparing data for BI dashboards at scale? → PySpark or DuckDB
By systematically evaluating these factors, users can make more informed decisions about which data processing tool or combination of tools best fits their specific project requirements and team capabilities.
Real-World Examples
Example 1: Log File Analysis (10GB)
Processing server logs to extract error patterns:
- Bad choice: Pandas (slow, memory issues).
- Good choice: DuckDB (can directly query the log files).
Sample code:
import duckdb
error_counts = duckdb.query("""
    SELECT error_code, COUNT(*) AS count
    FROM 'server_logs.csv'
    GROUP BY error_code
    ORDER BY count DESC
""").df()
Example 2: E-commerce Data (30GB)
Analyzing customer purchase patterns:
- Bad choice: Pandas (will crash)
- Good choice: Polars (for transformations) + DuckDB (for aggregations)
Sample code:
import polars as pl
import duckdb
# Load and transform with Polars
df = pl.scan_csv("transactions.csv")
df = df.filter(pl.col("purchase_date") > "2023-01-01")
# Convert to DuckDB for complex aggregation
duckdb.register("transactions", df.collect())
customer_segments = duckdb.query("""
    SELECT customer_id,
           SUM(amount) AS total_spent,
           COUNT(*) AS num_transactions,
           AVG(amount) AS avg_transaction
    FROM transactions
    GROUP BY customer_id
    HAVING COUNT(*) > 5
""").df()
Example 3: Sensor Data (100GB+)
Processing IoT sensor data from multiple devices:
- Bad choice: Any single-machine solution
- Good choice: PySpark
Sample code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg
spark = SparkSession.builder.appName("SensorAnalysis").getOrCreate()
sensor_data = spark.read.parquet("s3://sensors/data/")
# Calculate hourly average temperature per sensor using tumbling windows
# (withWatermark only takes effect in streaming queries; it is a no-op in this batch read)
hourly_averages = sensor_data \
    .withWatermark("timestamp", "1 hour") \
    .groupBy(
        window(sensor_data.timestamp, "1 hour"),
        sensor_data.sensor_id
    ) \
    .agg(avg("temperature").alias("avg_temp"))
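In practice you would then inspect and persist the aggregates; the output location below is a placeholder:
# Peek at a few rows, then write the (much smaller) aggregates back out
hourly_averages.show(5)
hourly_averages.write.mode("overwrite").parquet("s3://sensors/output/hourly_averages/")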
In Part 2 of this series, we will dive deeper into the medium data range (1–50GB) with a detailed comparison of DuckDB vs Polars, including performance benchmarks across common data operations. We will also explore exactly when SQL outperforms Python (and vice versa) for several data tasks.
Conclusion
As your data scales, so should your tools. While Pandas remains a solid choice for datasets under 1GB, larger volumes call for more specialized solutions. Polars shines for Python users handling mid-sized data, DuckDB is ideal for those who prefer SQL and need fast analytical queries, and PySpark is built for massive datasets that require distributed processing.
The best part? These tools aren't mutually exclusive; many modern data workflows combine them, using Polars for fast data wrangling, DuckDB for lightweight analytics, and PySpark for heavy-duty tasks. Ultimately, choosing the right tool isn't just about today's dataset; it is about ensuring your workflow can grow with your data tomorrow.
Connect with me on LinkedIn
Connect with me on Twitter
Published via Towards AI