
What I Learned Today About Apache Spark Architecture

Last Updated on December 9, 2025 by Editorial Team

Author(s): Abinaya Subramaniam

Originally published on Towards AI.

Apache Spark often feels magical when we first start using it. We write a few lines of PySpark code, hit run, and suddenly terabytes of data are being processed in seconds. But behind this simplicity lies a powerful and beautifully engineered distributed system. Understanding Spark’s architecture is the key to writing efficient code, optimizing queries, and making the most of Databricks.

Image by Author

We will explore what Spark is actually doing behind the scenes, how it runs our code, how clusters are organized, why lazy evaluation matters, and what makes Spark so fast. By the end, Spark will feel less like a black box and more like a system we fully understand.

If this topic interests you, you will probably love my PySpark and Databricks introduction!

Understanding the Spark Execution Architecture

A Spark application does not run on a single machine. Instead, it runs on a cluster, a group of machines working together in parallel. To make this distributed execution possible, Spark follows a well-structured architecture centered on three key elements: the driver, the executors, and the cluster manager. Each plays a distinct role, and together they form the backbone of Spark’s distributed computation model.

The Spark Driver

Every Spark program begins with the driver. We can think of the driver as the brain of our entire application. It runs our main program, creates the SparkSession, analyzes the tasks our code needs to perform, and constructs a plan for how those tasks should be executed on the cluster. The driver keeps track of metadata, manages the overall workflow, and collects results once the work is complete.

When writing PySpark code inside Databricks, our notebook is continuously communicating with the driver. The notebook sends instructions, the driver interprets them, converts them into execution plans, and decides how to distribute the work across the cluster.

Executors

While the driver is the brain, the executors are the workforce. Each executor runs on a separate machine in the cluster. Executors are responsible for executing the tasks assigned by the driver, storing intermediate data in memory for caching, and returning results back to the driver.

The more executors our cluster has, the more tasks can be executed in parallel, leading to faster performance and the ability to handle larger datasets.

Each executor has its own slice of memory and CPU cores. This isolation is important because it means that one executor crashing does not necessarily bring down the entire application. Spark automatically handles such failures.
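As a rough analogy (plain Python, not Spark's actual scheduler), parallel task execution across isolated workers can be sketched with a thread pool, where each "executor" processes only its own partition of the data and the "driver" combines the partial results:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a distributed dataset: four partitions of numbers.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]

def run_task(partition):
    # Each "executor" works only on its own partition, in isolation.
    return sum(partition)

# Four workers play the role of four executors running tasks in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(run_task, partitions))

# The "driver" collects the partial results and combines them.
total = sum(partial_sums)
print(partial_sums, total)  # [6, 9, 30, 10] 55
```

The isolation is the point: if one worker fails, the others are unaffected, which mirrors why one crashing executor does not bring down a Spark application.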

Cluster Managers

Spark does not manage machines by itself. It relies on an external system called a cluster manager to allocate resources such as CPUs, RAM, and machines. In traditional Spark environments, this cluster manager might be Standalone, YARN, or Mesos.

Standalone mode is Spark’s built-in option, often used for testing or small deployments. YARN, commonly found in Hadoop-based enterprises, supports massive clusters and is used widely in production. Mesos is more flexible but is less common today.

Databricks simplifies this entire layer. You do not choose YARN, Standalone, or anything else. Databricks automatically provisions, tunes, scales, and manages the clusters behind the scenes, letting you focus solely on the code.

Image by Author

RDD vs DataFrame vs Dataset: What’s the Difference?

To truly understand Spark, it helps to understand how it represents data internally and how that representation has evolved. Spark began with RDDs (Resilient Distributed Datasets), low-level distributed collections of objects. RDDs offered immense control but required manual optimization; they were powerful but verbose and inefficient for complex analytical tasks.

Spark then introduced DataFrames, which provide a higher-level, table-like interface with columns and data types. DataFrames enable Spark to analyze our transformations and automatically optimize them through its Catalyst optimizer. This makes DataFrames far faster and significantly easier to work with than raw RDDs.

A third abstraction, Datasets, combines the type safety of RDDs with the optimizations of DataFrames, but only exists in Scala and Java. In Python, the modern Spark workflow is almost entirely centered around DataFrames and SQL.

Lazy Evaluation: Spark’s Secret Weapon

One of the reasons Apache Spark feels so fast and efficient is because it doesn’t rush to execute every line of code. Instead, Spark waits intentionally. This idea is called lazy evaluation, and it completely changes how our code runs.

When you write transformations like

df = df.filter(df.age > 18).groupBy("city").count()

Spark does not immediately filter, group, or count anything. Instead, Spark quietly builds a logical plan, a blueprint of what needs to be done.

Let’s say we run this:

df = spark.read.csv("sales.csv", header=True, inferSchema=True).select("city", "age", "amount")
df2 = df.filter(df.age > 18)
df3 = df2.groupBy("city").sum("amount")
df3.show()

We just wrote four lines, but Spark actually executes them only when .show() is called.

1. Spark Builds a Logical Plan (Unoptimized)

Think of this as Spark’s first draft.

Read CSV
→ Select columns city, age, amount
→ Filter rows where age > 18
→ Group by city
→ Sum(amount)
→ Show

This is a raw plan. No optimization yet.

2. Spark Uses Catalyst to Create an Optimized Logical Plan

Catalyst looks for improvements:

  • push filters down closer to the data source
  • remove unnecessary steps
  • prune unused columns
  • rearrange operations for efficiency

The optimized plan:

Read CSV (only city, age, amount — column pruning)
→ Filter age > 18 (pushed down)
→ Group by city
→ Sum(amount)
→ Show

3. Spark Builds a Physical Plan

Now Spark decides how to execute this across the cluster.

Stage 1:
- Read CSV in parallel
- Apply filter on executors
- Map city/amount pairs

Stage 2 (Shuffle):
- Move data so all rows of the same city end up together
- Perform aggregation

Stage 3:
- Display final result to driver (show)

This is the real plan that runs on executors.

Spark only performs computation when you call an action such as:

  • show()
  • collect()
  • count()
  • write()

This delayed execution allows Spark to:

  • Optimize the entire workflow
  • Remove unnecessary steps
  • Combine operations
  • Reduce shuffles and I/O

Lazy evaluation allows Spark to act like a smart planner:

“Before I do any work, let me figure out the quickest, cheapest, and most efficient way to get this done.”

This plan-before-executing approach is a major reason Spark can process terabytes of data so quickly.
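The feel of lazy evaluation can be reproduced in plain Python with generators (an analogy only; Spark's planner is far more sophisticated): each line builds a recipe, and nothing is computed until a terminal step demands a result.

```python
# Toy rows standing in for a DataFrame.
rows = [{"city": "NY", "age": 25, "amount": 100},
        {"city": "LA", "age": 15, "amount": 50},
        {"city": "NY", "age": 30, "amount": 200}]

# Like Spark transformations, generator expressions describe work lazily;
# nothing is filtered or projected when these two lines run.
adults = (r for r in rows if r["age"] > 18)
amounts = (r["amount"] for r in adults)

# Only this "action" forces the whole pipeline to execute, in one pass.
total = sum(amounts)
print(total)  # 300
```

As with Spark, deferring work lets the pipeline run in a single pass instead of materializing an intermediate collection after every step.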

DAG: How Spark Organizes Your Computation

When an action is triggered, Spark creates a DAG, a Directed Acyclic Graph. The DAG represents the flow of your transformations, step by step.

Here’s what happens:

  1. Our DataFrame code is parsed into a logical plan
  2. Spark optimizes it using Catalyst
  3. The optimized plan becomes a DAG of stages
  4. Each stage contains tasks
  5. Each task runs on an executor

If you think of your job as a recipe, the DAG is the list of step-by-step instructions, with ingredients, flow, and dependencies.

Understanding DAGs helps you debug, optimize performance, and understand why Spark behaves a certain way.
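The stage dependencies described above can be modeled as a tiny DAG in plain Python (a sketch with hypothetical stage names, not Spark's scheduler): a stage runs only once everything it depends on has finished.

```python
# Hypothetical stages and their dependencies. Acyclic by construction.
deps = {
    "read_csv": [],
    "filter": ["read_csv"],
    "shuffle_group": ["filter"],
    "sum": ["shuffle_group"],
    "show": ["sum"],
}

def run_order(deps):
    # Repeatedly launch every stage whose dependencies are all complete,
    # which yields a valid topological order of the DAG.
    done, order = set(), []
    while len(done) < len(deps):
        for stage, parents in deps.items():
            if stage not in done and all(p in done for p in parents):
                order.append(stage)
                done.add(stage)
    return order

print(run_order(deps))
# ['read_csv', 'filter', 'shuffle_group', 'sum', 'show']
```

In real Spark, independent branches of the DAG (stages with no dependency between them) can run in parallel; the acyclic property is what guarantees the schedule always terminates.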

Transformations vs Actions

Every Spark operation falls into one of two categories. Transformations such as filter, select, join, or groupBy do not execute immediately. They simply add steps to the execution plan. Actions, on the other hand, trigger actual execution. When an action is called, Spark constructs the DAG, schedules the tasks, sends work to executors, and performs the computation.

This separation between planning and execution is what makes Spark both flexible and fast, allowing it to optimize work right before execution.

Transformations

These return a new DataFrame and are lazy. Examples include:

  • filter
  • select
  • withColumn
  • join
  • groupBy

They simply build the plan.

Actions

These trigger execution. Examples include:

  • show
  • write
  • count
  • collect
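The transformation/action split can be mimicked in a few lines of plain Python (a toy model, not Spark internals): "transformations" only append to a plan, and the "action" replays the plan.

```python
class ToyFrame:
    """Toy model of lazy evaluation: transformations build a plan,
    and only an action replays it."""

    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []  # pending steps, each a function on row lists

    def filter(self, pred):
        # Transformation: record the step and return a NEW frame; no work yet.
        step = lambda rs: [r for r in rs if pred(r)]
        return ToyFrame(self.rows, self.plan + [step])

    def count(self):
        # Action: now replay the whole plan and compute a concrete result.
        rs = self.rows
        for step in self.plan:
            rs = step(rs)
        return len(rs)

df = ToyFrame([{"age": 15}, {"age": 25}, {"age": 40}])
adults = df.filter(lambda r: r["age"] > 18)  # builds the plan, runs nothing
print(len(adults.plan), adults.count())      # 1 2
```

Returning a new frame from each transformation also mirrors how Spark DataFrames are immutable: chaining builds new plans rather than mutating data in place.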

Understanding Shuffle: The Costliest Spark Operation

One of the most important concepts in Spark performance tuning is the shuffle. A shuffle happens when Spark needs to redistribute data across machines, for example during a join, a groupBy, a distinct, an orderBy, or a repartition. Imagine each executor is holding different chunks of data. A shuffle forces data to be moved around the cluster so that all the data belonging to the same key lands on the same executor.

Shuffles are expensive because they involve network transfer, disk I/O, data sorting, and high memory usage. Inefficient code often triggers multiple unnecessary shuffles, slowing everything down.

This usually occurs during:

  • groupBy
  • join
  • distinct
  • orderBy
  • repartition
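A shuffle can be sketched in plain Python as hash partitioning (a simplified model; the real shuffle also involves network transfer, sorting, and spilling to disk): every row with the same key must be routed to the same partition before the per-key aggregation can run.

```python
from collections import defaultdict

# Toy (city, amount) rows spread across the cluster before the shuffle.
rows = [("NY", 100), ("LA", 50), ("NY", 200), ("SF", 75), ("LA", 25)]
num_partitions = 2

# Shuffle step: route each row to a partition by hashing its key, so all
# rows for one city end up on the same "executor".
partitions = defaultdict(list)
for city, amount in rows:
    partitions[hash(city) % num_partitions].append((city, amount))

# Aggregation step: each partition can now sum its cities independently,
# with no further data movement.
totals = {}
for part in partitions.values():
    for city, amount in part:
        totals[city] = totals.get(city, 0) + amount

print(sorted(totals.items()))  # [('LA', 75), ('NY', 300), ('SF', 75)]
```

The routing step is what makes shuffles expensive: in a real cluster it means every executor may send data to every other executor over the network.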

Wrapping Up

Apache Spark’s architecture is a blend of thoughtful design and powerful engineering. Once we understand how drivers, executors, cluster managers, DAGs, and shuffles work together, we can write PySpark code that is not only correct but optimized and production ready.


Published via Towards AI


Note: Content contains the views of the contributing authors and not Towards AI.