
What I Learned Today About Apache Spark Architecture

Last Updated on December 9, 2025 by Editorial Team

Author(s): Abinaya Subramaniam

Originally published on Towards AI.

Apache Spark often feels magical when we first start using it. We write a few lines of PySpark code, hit run, and suddenly terabytes of data are being processed in seconds. But behind this simplicity lies a powerful and beautifully engineered distributed system. Understanding Spark’s architecture is the key to writing efficient code, optimizing queries, and making the most of Databricks.

Image by Author

We will explore what Spark is actually doing behind the scenes, how it runs our code, how clusters are organized, why lazy evaluation matters, and what makes Spark so fast. By the end, Spark will feel less like a black box and more like a system we fully understand.

If this topic interests you, you will probably love my PySpark and Databricks introduction!

Understanding the Spark Execution Architecture

A Spark application does not run on a single machine. Instead, it runs on a cluster, a group of machines working together in parallel. To make this distributed execution possible, Spark follows a well-structured architecture centered on three key elements: the driver, the executors, and the cluster manager. Each plays a distinct role, and together they form the backbone of Spark’s distributed computation model.

The Spark Driver

Every Spark program begins with the driver. We can think of the driver as the brain of our entire application. It runs our main program, creates the SparkSession, analyzes the tasks our code needs to perform, and constructs a plan for how those tasks should be executed on the cluster. The driver keeps track of metadata, manages the overall workflow, and collects results once the work is complete.

When writing PySpark code inside Databricks, our notebook is continuously communicating with the driver. The notebook sends instructions, the driver interprets them, converts them into execution plans, and decides how to distribute the work across the cluster.

Executors

While the driver is the brain, the executors are the workforce. Each executor runs on a separate machine in the cluster. Executors are responsible for executing the tasks assigned by the driver, storing intermediate data in memory for caching, and returning results back to the driver.

The more executors our cluster has, the more tasks can be executed in parallel, leading to faster performance and the ability to handle larger datasets.

Each executor has its own slice of memory and CPU cores. This isolation is important because it means that one executor crashing does not necessarily bring down the entire application. Spark automatically handles such failures.
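As a rough analogy (plain Python, not Spark's actual scheduler), parallel task execution across isolated workers can be sketched with a thread pool, where each "executor" processes only its own partition of the data and the "driver" combines the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a distributed dataset: four partitions of numbers.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]

def run_task(partition):
    # Each "executor" works only on its own partition, in isolation.
    return sum(partition)

# Four workers play the role of four executors running tasks in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(run_task, partitions))

# The "driver" collects the partial results and combines them.
total = sum(partial_sums)
print(partial_sums, total)  # [6, 9, 30, 10] 55
```

The isolation is the point: if one worker fails, the others are unaffected, which mirrors why one crashing executor does not bring down a Spark application.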

Cluster Managers

Spark does not manage machines by itself. It relies on an external system called a cluster manager to allocate resources such as CPUs, RAM, and machines. In traditional Spark environments, this cluster manager might be Standalone, YARN, or Mesos.

Standalone mode is Spark’s built-in option, often used for testing or small deployments. YARN, commonly found in Hadoop-based enterprises, supports massive clusters and is used widely in production. Mesos is more flexible but is less common today.

Databricks simplifies this entire layer. You do not choose YARN, Standalone, or anything else. Databricks automatically provisions, tunes, scales, and manages the clusters behind the scenes, letting you focus solely on the code.

Image by Author

RDD vs DataFrame vs Dataset: What’s the Difference?

To truly understand Spark, it helps to understand how it represents data internally and how that representation has evolved. Spark began with RDDs (Resilient Distributed Datasets), low-level distributed collections of objects. RDDs offered immense control but required manual optimization; they were powerful but verbose and inefficient for complex analytical tasks.

Spark then introduced DataFrames, which provide a higher-level, table-like interface with columns and data types. DataFrames enable Spark to analyze our transformations and automatically optimize them through its Catalyst optimizer. This makes DataFrames far faster and significantly easier to work with than raw RDDs.

A third abstraction, Datasets, combines the type safety of RDDs with the optimizations of DataFrames, but only exists in Scala and Java. In Python, the modern Spark workflow is almost entirely centered around DataFrames and SQL.

Lazy Evaluation: Spark’s Secret Weapon

One of the reasons Apache Spark feels so fast and efficient is because it doesn’t rush to execute every line of code. Instead, Spark waits intentionally. This idea is called lazy evaluation, and it completely changes how our code runs.

When you write transformations like

df = df.filter(df.age > 18).groupBy("city").count()

Spark does not immediately filter, group, or count anything. Instead, Spark quietly builds a logical plan, a blueprint of what needs to be done.

Let’s say we run this:

df = spark.read.csv("sales.csv", header=True, inferSchema=True).select("city", "age", "amount")
df2 = df.filter(df.age > 18)
df3 = df2.groupBy("city").sum("amount")
df3.show()

We just wrote four lines, but Spark actually executes them only when .show() is called.

1. Spark Builds a Logical Plan (Unoptimized)

Think of this as Spark’s first draft.

Read CSV
→ Select columns city, age, amount
→ Filter rows where age > 18
→ Group by city
→ Sum(amount)
→ Show

This is a raw plan. No optimization yet.

2. Spark Uses Catalyst to Create an Optimized Logical Plan

Catalyst looks for improvements:

  • push filters down closer to the data source
  • remove unnecessary steps
  • prune unused columns
  • rearrange operations for efficiency

The optimized plan:

Read CSV (only city, age, amount — column pruning)
→ Filter age > 18 (pushed down)
→ Group by city
→ Sum(amount)
→ Show

3. Spark Builds a Physical Plan

Now Spark decides how to execute this across the cluster.

Stage 1:
- Read CSV in parallel
- Apply filter on executors
- Map city/amount pairs

Stage 2 (Shuffle):
- Move data so all rows of the same city end up together
- Perform aggregation

Stage 3:
- Display final result to driver (show)

This is the real plan that runs on executors.

Spark only performs computation when you call an action such as:

  • show()
  • collect()
  • count()
  • write()

This delayed execution allows Spark to:

  • Optimize the entire workflow
  • Remove unnecessary steps
  • Combine operations
  • Reduce shuffles and I/O

Lazy evaluation allows Spark to act like a smart planner:

“Before I do any work, let me figure out the quickest, cheapest, and most efficient way to get this done.”

This plan-before-executing approach is a major reason Spark can process terabytes of data so quickly.
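The feel of lazy evaluation can be reproduced in plain Python with generators (an analogy only; Spark's planner is far more sophisticated): each line builds a recipe, and nothing is computed until a terminal step demands a result.

```python
# Toy rows standing in for a DataFrame.
rows = [{"city": "NY", "age": 25, "amount": 100},
        {"city": "LA", "age": 15, "amount": 50},
        {"city": "NY", "age": 30, "amount": 200}]

# Like Spark transformations, generator expressions describe work lazily;
# nothing is filtered or projected when these two lines run.
adults = (r for r in rows if r["age"] > 18)
amounts = (r["amount"] for r in adults)

# Only this "action" forces the whole pipeline to execute, in one pass.
total = sum(amounts)
print(total)  # 300
```

As with Spark, deferring work lets the pipeline run in a single pass instead of materializing an intermediate collection after every step.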

DAG: How Spark Organizes Your Computation

When an action is triggered, Spark creates a DAG, a Directed Acyclic Graph. The DAG represents the flow of your transformations, step by step.

Here’s what happens:

  1. Our DataFrame code is parsed into a logical plan
  2. Spark optimizes it using Catalyst
  3. The optimized plan becomes a DAG of stages
  4. Each stage contains tasks
  5. Each task runs on an executor

If you think of your job as a recipe, the DAG is the list of step-by-step instructions, with ingredients, flow, and dependencies.

Understanding DAGs helps you debug, optimize performance, and understand why Spark behaves a certain way.
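The stage dependencies described above can be modeled as a tiny DAG in plain Python (a sketch with hypothetical stage names, not Spark's scheduler): a stage runs only once everything it depends on has finished.

```python
# Hypothetical stages and their dependencies. Acyclic by construction.
deps = {
    "read_csv": [],
    "filter": ["read_csv"],
    "shuffle_group": ["filter"],
    "sum": ["shuffle_group"],
    "show": ["sum"],
}

def run_order(deps):
    # Repeatedly launch every stage whose dependencies are all complete,
    # which yields a valid topological order of the DAG.
    done, order = set(), []
    while len(done) < len(deps):
        for stage, parents in deps.items():
            if stage not in done and all(p in done for p in parents):
                order.append(stage)
                done.add(stage)
    return order

print(run_order(deps))
# ['read_csv', 'filter', 'shuffle_group', 'sum', 'show']
```

In real Spark, independent branches of the DAG (stages with no dependency between them) can run in parallel; the acyclic property is what guarantees the schedule always terminates.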

Transformations vs Actions

Every Spark operation falls into one of two categories. Transformations such as filter, select, join, or groupBy do not execute immediately. They simply add steps to the execution plan. Actions, on the other hand, trigger actual execution. When an action is called, Spark constructs the DAG, schedules the tasks, sends work to executors, and performs the computation.

This separation between planning and execution is what makes Spark both flexible and fast, allowing it to optimize work right before execution.

Transformations

These return a new DataFrame and are lazy. Examples include:

  • filter
  • select
  • withColumn
  • join
  • groupBy

They simply build the plan.

Actions

These trigger execution. Examples include:

  • show
  • write
  • count
  • collect
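The transformation/action split can be mimicked in a few lines of plain Python (a toy model, not Spark internals): "transformations" only append to a plan, and the "action" replays the plan.

```python
class ToyFrame:
    """Toy model of lazy evaluation: transformations build a plan,
    and only an action replays it."""

    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []  # pending steps, each a function on row lists

    def filter(self, pred):
        # Transformation: record the step and return a NEW frame; no work yet.
        step = lambda rs: [r for r in rs if pred(r)]
        return ToyFrame(self.rows, self.plan + [step])

    def count(self):
        # Action: now replay the whole plan and compute a concrete result.
        rs = self.rows
        for step in self.plan:
            rs = step(rs)
        return len(rs)

df = ToyFrame([{"age": 15}, {"age": 25}, {"age": 40}])
adults = df.filter(lambda r: r["age"] > 18)  # builds the plan, runs nothing
print(len(adults.plan), adults.count())      # 1 2
```

Returning a new frame from each transformation also mirrors how Spark DataFrames are immutable: chaining builds new plans rather than mutating data in place.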

Understanding Shuffle: The Costliest Spark Operation

One of the most important concepts in Spark performance tuning is the shuffle. A shuffle happens when Spark needs to redistribute data across machines, for example during a join, a groupBy, a distinct, an orderBy, or a repartition. Imagine each executor is holding different chunks of data. A shuffle forces data to be moved around the cluster so that all the data belonging to the same key lands on the same executor.

Shuffles are expensive because they involve network transfer, disk I/O, data sorting, and high memory usage. Inefficient code often triggers multiple unnecessary shuffles, slowing everything down.

This usually occurs during:

  • groupBy
  • join
  • distinct
  • orderBy
  • repartition
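A shuffle can be sketched in plain Python as hash partitioning (a simplified model; the real shuffle also involves network transfer, sorting, and spilling to disk): every row with the same key must be routed to the same partition before the per-key aggregation can run.

```python
from collections import defaultdict

# Toy (city, amount) rows spread across the cluster before the shuffle.
rows = [("NY", 100), ("LA", 50), ("NY", 200), ("SF", 75), ("LA", 25)]
num_partitions = 2

# Shuffle step: route each row to a partition by hashing its key, so all
# rows for one city end up on the same "executor".
partitions = defaultdict(list)
for city, amount in rows:
    partitions[hash(city) % num_partitions].append((city, amount))

# Aggregation step: each partition can now sum its cities independently,
# with no further data movement.
totals = {}
for part in partitions.values():
    for city, amount in part:
        totals[city] = totals.get(city, 0) + amount

print(sorted(totals.items()))  # [('LA', 75), ('NY', 300), ('SF', 75)]
```

The routing step is what makes shuffles expensive: in a real cluster it means every executor may send data to every other executor over the network.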

Wrapping Up

Apache Spark’s architecture is a blend of thoughtful design and powerful engineering. Once we understand how drivers, executors, cluster managers, DAGs, and shuffles work together, we can write PySpark code that is not only correct but optimized and production ready.


Published via Towards AI


Note: Content contains the views of the contributing authors and not Towards AI.