Beyond Pandas: The Modern Data Processing Toolkit for Data Engineering (Part 1)

Last Updated on May 8, 2025 by Editorial Team

Author(s): Gift Ojeabulu

Originally published on Towards AI.

Image by author

Outline

  • Introduction
  • The Data Size Decision Framework & Comprehensive Decision Flowchart
  • A decision flowchart based on team syntax preference and performance or integration requirements
  • Real-World Examples: log file analysis, e-commerce, and IoT
  • Conclusion

Introduction

Data processing gets harder as data grows. Even when they are no longer the best fit, many data scientists, engineers, and analysts keep reaching for familiar tools like Pandas.

This guide will explore a quick approach to choosing the right data processing tool, whether Pandas, Polars, DuckDB, or PySpark, based on your data size, performance needs, and workflow preferences.

The Data Size Decision Framework & Flowchart

Let us break down when to use each tool, based primarily on data size along with a few other criteria:

Image by the author: Comprehensive Tool Selection Guide for Data Workflows Based on Data Volume, Team Familiarity, and Infrastructure Needs.

Small Data (< 1GB)

If your dataset is under 1GB, Pandas is typically the best choice. It’s easy to use, widely adopted, and well-supported within the Python ecosystem. Unless you have very specific performance needs, Pandas will efficiently handle tasks like quick exploratory analysis and visualizations.

Use Pandas when:

  • Your dataset fits comfortably in memory.
  • You are doing a quick exploratory data analysis.
  • You need the massive ecosystem of Pandas-compatible libraries.
  • Your workflows involve lots of data visualization.
import pandas as pd
df = pd.read_csv("small_data.csv") # Under 1GB works fine

Pandas is still king for smaller datasets because of its rich ecosystem, extensive documentation, and widespread adoption. For these sizes, the performance gains from other tools may not justify the learning curve.
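
In practice, that quick exploratory pass usually looks something like the sketch below; the 'category' and 'value' columns are hypothetical stand-ins for whatever your data contains.

import pandas as pd
import matplotlib.pyplot as plt

# Load a small (< 1GB) CSV entirely into memory
df = pd.read_csv("small_data.csv")

# Quick exploratory pass: shape, dtypes, and summary statistics
print(df.shape)
print(df.dtypes)
print(df.describe())

# Simple aggregation and chart, assuming hypothetical 'category' and 'value' columns
df.groupby("category")["value"].mean().plot(kind="bar")
plt.show()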

Medium Data (1GB to 50GB)

When your data falls between 1GB and 50GB, you’ll need something faster and more efficient than Pandas. Your choice between Polars and DuckDB depends on your coding preference and workflow.

Use Polars when:

  • You need more speed than Pandas.
  • Memory efficiency is important.
  • You are working with complex data transformations.
  • You prefer a Python-centric workflow.
import polars as pl
df = pl.read_csv("medium_data.csv") # Fast and memory efficient
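
Much of Polars' memory efficiency comes from its lazy API, which plans the whole query before reading anything from disk. A minimal sketch, assuming a recent Polars release and hypothetical 'category' and 'amount' columns:

import polars as pl

# Build a lazy query plan; nothing is read from disk yet
lazy_df = pl.scan_csv("medium_data.csv")

# Filter, aggregate, and sort; Polars optimises the whole plan before executing it
result = (
    lazy_df
    .filter(pl.col("amount") > 0)
    .group_by("category")
    .agg(pl.col("amount").sum().alias("total_amount"))
    .sort("total_amount", descending=True)
    .collect()  # only now is the file scanned and the result materialised
)
print(result)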

Use DuckDB when:

  • You prefer writing SQL queries.
  • You are performing complex aggregations or joins.
  • Your workflows are analytics-heavy.
  • You want to query data directly from files.
# Import the DuckDB library for high-performance analytics
import duckdb

# Execute a SQL query against the CSV file and store results in a pandas DataFrame
# - Queries the CSV file in place; DuckDB scans it directly instead of loading it into memory first
# - Filters for records where 'value' column exceeds 100 for downstream analysis
# - Returns results as pandas DataFrame for compatibility with visualization libraries
df = duckdb.query("SELECT * FROM 'medium_large_data.csv' WHERE value > 100").df()
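
Because DuckDB treats files as tables, joins and aggregations can also run directly across several files at once. A minimal sketch, with hypothetical orders.csv and customers.csv files and column names:

import duckdb

# Join two CSV files in place and aggregate, entirely inside DuckDB's engine
top_customers = duckdb.query("""
    SELECT c.customer_name, SUM(o.amount) AS total_spent
    FROM 'orders.csv' o
    JOIN 'customers.csv' c ON o.customer_id = c.customer_id
    GROUP BY c.customer_name
    ORDER BY total_spent DESC
    LIMIT 10
""").df()
print(top_customers)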

Big Data (Over 50GB)

When your data exceeds 50GB, PySpark becomes the go-to tool. It’s designed for distributed computing and can efficiently handle datasets that span multiple machines.

Use PySpark when:

  • Your data exceeds single-machine capacity.
  • Distributed processing is necessary.
  • You need fault tolerance.
  • Processing time is more important than setup complexity.
# Import SparkSession from pyspark.sql module, which is the entry point to Spark SQL functionality
from pyspark.sql import SparkSession

# Initialize a Spark session with meaningful application name for monitoring/logging purposes
# - SparkSession.builder allows fluent configuration of Spark settings
# - appName defines the name shown in Spark UI and logs for easier identification
# - getOrCreate() either creates a new session or returns an existing one to avoid redundancy
spark = SparkSession.builder.appName("BigDataAnalysis").getOrCreate()

# Load CSV data into a Spark DataFrame with automatic schema inference
# - Distributed reading of potentially large CSV file leveraging Spark's parallel processing
# - header=True treats first row as column names instead of data
# - inferSchema=True analyzes data to determine appropriate column types, avoiding default string types
# - Returns a distributed DataFrame that can scale across a cluster for subsequent operations
df = spark.read.csv("really_big_data.csv", header=True, inferSchema=True)

PySpark allows you to distribute processing across multiple machines, handling datasets from hundreds of gigabytes to petabytes.
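
Continuing from the DataFrame loaded above, a typical distributed job aggregates across the cluster and writes the result back out in a columnar format. A minimal sketch; the 'region' and 'amount' columns and the output path are hypothetical:

from pyspark.sql.functions import sum as spark_sum, count

# Aggregate across the cluster (hypothetical 'region' and 'amount' columns)
summary = (
    df.groupBy("region")
      .agg(
          spark_sum("amount").alias("total_amount"),
          count("*").alias("row_count"),
      )
)

# Persist the result as Parquet for downstream consumers (hypothetical output path)
summary.write.mode("overwrite").parquet("output/region_summary/")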

Image by author: Decision Flowchart for Selecting a Data Processing Tool Based on Data Size, Team Syntax Preference, and Performance or Integration Requirements 🟦 DuckDB 🟩 Polars 🟨 Pandas 🟥 PySpark

Additional factors to consider

While data size is the primary factor, several other considerations should influence your choice:

  • Need to run on multiple machines? → PySpark
  • Working with data scientists who know Pandas? → Polars (easiest transition)
  • Need the best performance on a single machine? → DuckDB or Polars
  • Need to integrate with existing SQL workflows? → DuckDB
  • Powering real-time dashboards? → DuckDB
  • Operating under memory constraints? → Polars or DuckDB
  • Preparing data for BI dashboards at scale? → PySpark or DuckDB

By systematically evaluating these factors, users can make more informed decisions about which data processing tool or combination of tools best fits their specific project requirements and team capabilities.
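
To make the flowchart concrete, the same logic can be sketched as a small helper function. This is purely illustrative: the thresholds mirror the ones used in this article, and the function name and arguments are made up for the example.

def suggest_tool(data_size_gb: float, prefers_sql: bool = False, needs_cluster: bool = False) -> str:
    """Illustrative rule of thumb mirroring the thresholds in this article."""
    if needs_cluster or data_size_gb > 50:
        return "PySpark"  # distributed processing beyond a single machine
    if data_size_gb >= 1:
        return "DuckDB" if prefers_sql else "Polars"  # medium data on a single machine
    return "Pandas"  # small data that fits comfortably in memory

print(suggest_tool(0.5))                   # Pandas
print(suggest_tool(20, prefers_sql=True))  # DuckDB
print(suggest_tool(200))                   # PySpark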

Real-World Examples

Example 1: Log File Analysis (10GB)

Processing server logs to extract error patterns:

  • Bad choice: Pandas (slow, memory issues).
  • Good choice: DuckDB (can directly query the log files).

Sample code:

import duckdb

error_counts = duckdb.query("""
    SELECT error_code, COUNT(*) AS count
    FROM 'server_logs.csv'
    GROUP BY error_code
    ORDER BY count DESC
""").df()

Example 2: E-commerce Data (30GB)

Analyzing customer purchase patterns:

  • Bad choice: Pandas (likely to exhaust memory on a single machine)
  • Good choice: Polars (for transformations) + DuckDB (for aggregations)

Sample code:

import polars as pl
import duckdb

# Load and transform with Polars
df = pl.scan_csv("transactions.csv")
df = df.filter(pl.col("purchase_date") > "2023-01-01")

# Materialise the filtered Polars frame and register it with DuckDB for complex aggregation
duckdb.register("transactions", df.collect())
customer_segments = duckdb.query("""
    SELECT customer_id,
           SUM(amount) AS total_spent,
           COUNT(*) AS num_transactions,
           AVG(amount) AS avg_transaction
    FROM transactions
    GROUP BY customer_id
    HAVING COUNT(*) > 5
""").df()

Example 3: Sensor Data (100GB+)

Processing IoT sensor data from multiple devices:

  • Bad choice: Any single-machine solution
  • Good choice: PySpark

Sample code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg

spark = SparkSession.builder.appName("SensorAnalysis").getOrCreate()

sensor_data = spark.read.parquet("s3://sensors/data/")

# Calculate hourly average temperature per sensor
# (withWatermark is only meaningful for streaming jobs; it is a no-op in a batch read like this one)
hourly_averages = (
    sensor_data
    .withWatermark("timestamp", "1 hour")
    .groupBy(
        window(sensor_data.timestamp, "1 hour"),
        sensor_data.sensor_id,
    )
    .agg(avg("temperature").alias("avg_temp"))
)
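
Nothing executes until an action or a write is triggered; a minimal follow-up, assuming a hypothetical output path:

# Trigger execution and persist the hourly averages as Parquet (hypothetical output path)
hourly_averages.write.mode("overwrite").parquet("s3://sensors/hourly_averages/")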

In Part 2 of this series, we will dive deeper into the medium data range (1–50GB) with a detailed comparison of DuckDB vs Polars, including performance benchmarks across common data operations. We will also explore exactly when SQL outperforms Python (and vice versa) for several data tasks.

Conclusion

As your data scales, so should your tools. While Pandas remains a solid choice for datasets under 1GB, larger volumes call for more specialized solutions. Polars shines for Python users handling mid-sized data, DuckDB is ideal for those who prefer SQL and need fast analytical queries, and PySpark is built for massive datasets that require distributed processing.

The best part? These tools aren't mutually exclusive; many modern data workflows combine them, using Polars for fast data wrangling, DuckDB for lightweight analytics, and PySpark for heavy-duty tasks. Ultimately, choosing the right tool isn't just about today's dataset; it is about ensuring your workflow can grow with your data tomorrow.

Connect with me on LinkedIn

Connect with me on Twitter


Published via Towards AI
