
Take a Dive Into Delta Lake

Author(s): Disha Verma

Originally published on Towards AI.

That’s Jerry — the frustrated Data Steward!

Remember the time we spoke about Data Warehouses, Data Lakes, and Data Lakehouses? Today, we will learn about Delta Lake, which belongs to the same data architecture family. A team at Databricks came up with the idea of a fast storage layer built on top of data lakes. Organizations already using data lakes loved the concept of "Delta Lake," which could handle massive data loads efficiently, often processing them in just a few minutes.

Disha

Hieeee!!!! I am here to learn about you today! Tell me some interesting facts that I can share with my friends here.

Additionally, I am really confused: why did organizations opt for you when they already had Data Lakes?

From Canva

Delta Lake

Hey Disha! Happy to meet you and talk about, well, me! Let's begin with my inception. And please don't mind Mr. Data Steward; he's been really overwhelmed with so many data terms and frameworks.

Why did Delta Lake come into being?

Michael Armbrust (a Databricks engineer) came up with the idea of creating me so that there could be efficient transactions over large volumes of data. Too technical? Let me simplify it for you.

Imagine, just as you did with the data lake, that you have a huge dump of files, CDs, images, and documents to store somewhere.

Now, the data lake was already handling these records, but Delta Lake added ACID compliance (explained below) on top of them. Second, it made processing the records up to 10 times faster than a plain data lake. And third, it let companies like Apple process 300 billion records every day.

Isn’t that great?!

Delta Lake — The Definitive Guide by Databricks

Let’s discuss some of the benefits of using me over Data Lake:

Benefit #1: ACID Compliance

Before we dive deeper, let me explain ACID (Atomicity, Consistency, Isolation, Durability) compliance in layperson's terms. This concept plays a vital role in understanding why I was invented:

A — Atomicity

Let’s say you’re running a 10K marathon. You’re halfway through when you injure your knee and have to stop.

To earn a medal, you must cross the finish line. But in this case — you don’t.

From Canva

That’s Atomicity. You either finish the race and get the reward, or you don’t.
No partial credit. No “almost there” badge.

In databases, it’s the same idea:
A transaction is either fully complete, or not done at all.
There’s no such thing as half a transaction.
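The all-or-nothing idea is easy to see with SQLite from Python's standard library. This is a toy sketch of atomicity in general, not Delta Lake itself: the accounts table and the "crash" are made up for illustration.

```python
import sqlite3

# Toy sketch: a two-step money transfer either fully commits
# or fully rolls back. (Hypothetical accounts, not Delta Lake.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # one atomic transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 "
                     "WHERE name = 'alice'")
        raise RuntimeError("system crash mid-transaction")
        conn.execute("UPDATE accounts SET balance = balance + 80 "
                     "WHERE name = 'bob'")  # never reached
except RuntimeError:
    pass  # the 'with conn' block rolled everything back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # alice still has 100: no half-finished transfer
```

No money left Alice's account, because the transaction never crossed the finish line.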

C — Consistency

Let's consider another example, this time from the healthcare domain. A patient has undergone surgery, and as per hospital rules, the following checklist should be completed before they can be discharged:

  • Final test reports
  • List of prescribed medicines
  • Billing

However, when the patient is discharged, their billing fails to complete due to a system glitch, leaving things in an incomplete state.

In the database world, Consistency means a transaction must never leave the data in an incomplete or invalid state: it takes the database from one valid state to another, with all the rules still satisfied.

I — Isolation

Whenever you visit a grocery store, customers line up one after another to have their items checked out. Now imagine that while the cashier is scanning your items, they suddenly start scanning items from the next customer's cart too. This would cause so much chaos!!

From Canva

Isolation ensures each customer's items are scanned individually and ONLY one customer is handled at a time. Similarly, in a database, when multiple transactions occur, each behaves as if it runs on its own: one transaction's half-finished work is never visible to another.
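The checkout analogy can be sketched with a lock that serializes each "customer". This is a toy illustration of isolation using Python threads, not how Delta Lake implements it internally; the carts and prices are made up.

```python
import threading

# Toy sketch: a shared register total, with a lock ensuring each
# "checkout" (a read-modify-write transaction) runs in isolation.
total = 0
lock = threading.Lock()

def checkout(items):
    global total
    with lock:  # only one customer's scan at a time
        current = total
        for price in items:
            current += price
        total = current

carts = [[5, 3, 2]] * 100  # 100 customers, $10 each
threads = [threading.Thread(target=checkout, args=(c,)) for c in carts]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total)  # 1000: no lost updates from interleaved checkouts
```

Without the lock, two threads could read the same `total` and overwrite each other's update, exactly the chaos of scanning two carts at once.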

D — Durability

It’s late at night and you’re binge-watching a show on Netflix. You’re tired, so you hit pause and head to bed. The next day, you open Netflix again — and the episode resumes exactly where you left off.

Even if you turned off the TV, lost power, or closed the app, your spot was saved.

This is Durability!!

It's the same principle in the database world: once a transaction is completed, it must be saved permanently, no matter what happens next, be it a power failure, system crash, or database outage.

ACID compliance guarantees that your data won’t just disappear. If it’s saved, it stays saved — just like your Netflix progress.
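Durability is also easy to demonstrate with SQLite (again a toy stand-in, not Delta Lake): commit, "power off" by closing the connection, reopen, and the data is still there. The show name and minute are made up.

```python
import os
import sqlite3
import tempfile

# Toy sketch: once committed, data survives closing the connection,
# just like your saved Netflix progress.
path = os.path.join(tempfile.mkdtemp(), "progress.db")

conn = sqlite3.connect(path)
conn.execute("CREATE TABLE progress (show TEXT, minute INTEGER)")
conn.execute("INSERT INTO progress VALUES ('my_show', 42)")
conn.commit()  # transaction completed: now permanent
conn.close()   # "power off"

conn = sqlite3.connect(path)  # "the next day"
saved = conn.execute(
    "SELECT minute FROM progress WHERE show = 'my_show'"
).fetchone()[0]
print(saved)  # 42: the spot was saved
```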

Benefit #2: Schema Enforcement

I hope you remember we talked briefly about schema-on-read (SoR) and schema-on-write (SoW) operations in the Data Lake blog. While Data Lake follows SoR, I allow SoW (similar to a Data Warehouse).

For reference, see the explanation of SoR and SoW in the Data Lake post (Blog #1).

That’s why schema enforcement and ACID compliance work so well together. If there’s no check on the structure of incoming data, incorrect or mismatched data can slip in — and that breaks the rules ACID is supposed to protect.

From Canva
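Schema-on-write can be sketched in a few lines: validate every record against a declared schema before it is allowed into the table. Delta Lake performs this check automatically on every write; the schema, records, and helper functions below are hypothetical, purely for illustration.

```python
# Toy sketch of schema enforcement (schema-on-write): a table that
# rejects records whose columns or types don't match its schema.
SCHEMA = {"id": int, "name": str, "amount": float}

def validate(record):
    if set(record) != set(SCHEMA):
        raise ValueError(f"column mismatch: {sorted(record)}")
    for col, expected in SCHEMA.items():
        if not isinstance(record[col], expected):
            raise TypeError(f"{col!r} must be {expected.__name__}")

table = []

def append(record):
    validate(record)   # enforce the schema BEFORE the write
    table.append(record)

append({"id": 1, "name": "Disha", "amount": 9.99})  # fits the schema

try:
    append({"id": "two", "name": "Jerry", "amount": 5.0})  # id is a str
except TypeError as e:
    print("rejected:", e)

print(len(table))  # 1: the bad record never slipped in
```

Because the mismatched record is rejected up front, the table never enters the inconsistent state that would break the ACID guarantees above.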

Benefit #3: Time travel

Have you ever visited the DMV to get your driver's license?

You may have encountered this situation: your license expires, you visit the DMV for a new one, and they discard your old license since it's not valid anymore. However, did you know that the DMV keeps track of all your licenses even though they've been discarded?

This is the other benefit I offer: Time Travel. Even if a table has been updated and its older versions replaced, I always keep a copy of those older versions (historical data).

This way you can travel back in time and look at the previous versions of your table.
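The idea can be sketched as a table that keeps every snapshot instead of overwriting it. This is a toy model with made-up `write`/`read` helpers; in Delta Lake itself you would read an older version through Spark with something like `spark.read.format("delta").option("versionAsOf", 0).load(path)`.

```python
# Toy sketch of time travel: every write creates a new immutable
# version, and older versions stay readable.
versions = []  # list of snapshots; index = version number

def write(rows):
    versions.append(list(rows))  # new version; old ones are kept

def read(version=None):
    if version is None:
        version = len(versions) - 1  # latest by default
    return versions[version]

write([{"license": "A-111", "status": "valid"}])          # version 0
write([{"license": "A-111", "status": "expired"},         # version 1
       {"license": "B-222", "status": "valid"}])

print(read())           # the latest version (v1)
print(read(version=0))  # travel back to v0: the "discarded" license
```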

To conclude, ACID compliance, schema enforcement, and time travel are a few of the important reasons why Delta Lake was created. There are many other benefits, but covering all of them in one blog isn't possible.

Disha:

So where do these benefits actually live?

Delta Lake:

All the powerful features we just discussed — ACID compliance, schema enforcement, and time travel — are made possible through a Delta Table, the building block of Delta Lake.

Delta Lake Table (or Delta table)

Delta Lake stores data in tables similar to database tables, called Delta Lake tables (or Delta tables), which are saved in the Delta Lake format.

Delta Lake format => ACID + Schema Enforcement + Time Travel (+ two more benefits that we will explore in an upcoming blog).

Every table on the Databricks platform is a Delta table by default. In addition to the various benefits described above, it uses the Parquet format (a highly efficient columnar format) under the hood to handle large amounts of data.

A Parquet file follows a column-based storage approach. To better understand this concept, check out the References section. Here's a sample Parquet file:

From https://data-mozart.com/parquet-file-format-everything-you-need-to-know/
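The row vs. column layout can be sketched in plain Python. This is only a toy model of what Parquet does on disk; the records (with amounts in cents) are made up, and real Parquet adds compression and encoding on top.

```python
# Toy sketch of columnar layout: the same records stored row-wise
# and column-wise. Amounts are in cents to keep the math exact.
rows = [
    {"id": 1, "name": "Disha", "amount": 999},
    {"id": 2, "name": "Jerry", "amount": 500},
    {"id": 3, "name": "Louie", "amount": 750},
]

# Columnar layout: all values of one column are stored together.
columns = {col: [r[col] for r in rows] for col in rows[0]}

# An analytics query like "sum of amount" touches ONE column rather
# than every full record: this is why columnar formats shine.
print(sum(columns["amount"]))  # 2249
```

Scanning a single contiguous column instead of every row is what makes aggregations over billions of records fast.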

Let us walk through the Delta Live Table analogy below; then the concept of the Delta Table will make better sense, too.

Delta Live Table

Disha:

This is all too confusing — Delta Table, Delta Live Table :(…

Delta Lake:

That is understandable! Even Mr. Jerry was agitated, more than he usually is, when he learnt about this term (Delta Live). Let me simplify this one for you!

In a restaurant…

Imagine you walk into a restaurant as the very first customer. As soon as you enter, the entire staff springs into action: someone brings you a glass of water, another takes your order, and the chef in the kitchen starts prepping your meal.

In this setup, the restaurant head is the Delta Live Table: she gets everyone to work and makes sure everything runs smoothly and on time. You are the data being handled. And the Delta Table mentioned above is the meal the chef prepares for you.

Delta Live Tables form an ETL pipeline that triggers as soon as there's a file ready for ingestion.

For people who have been working in the data field for a while: think of Delta Live Tables as ADF (Azure Data Factory). The only difference is that while ADF is Azure-native, Delta Live Tables are Spark-native and work only with Databricks.

Additionally, Delta Live Tables handle only data pipelines on Delta Lake, whereas ADF can be used for general-purpose pipelines.
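The declarative flavor of such a pipeline can be sketched in a few lines: each "table" is declared as a function of its upstream table, and a tiny registry wires them together. Real Delta Live Tables are declared with `@dlt.table` decorators on Databricks; the `table` decorator, table names, and records below are all made up for illustration.

```python
# Toy sketch of a declarative pipeline in the spirit of Delta Live
# Tables: downstream tables are defined in terms of upstream ones.
tables = {}

def table(fn):  # hypothetical stand-in for a @dlt.table decorator
    tables[fn.__name__] = fn
    return fn

@table
def raw_orders():
    # "bronze" layer: raw ingested records, including a bad one
    return [{"order_id": 1, "amount": 999},
            {"order_id": 2, "amount": -50}]

@table
def clean_orders():
    # "silver" layer: declared as a transformation of raw_orders
    return [r for r in raw_orders() if r["amount"] > 0]

result = tables["clean_orders"]()
print(result)  # only the valid order survives the pipeline
```

The appeal of the declarative style is that you state *what* each table contains, and the framework figures out the dependency graph and when to refresh it.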

There are a few more interesting topics, like batch vs. stream processing, Auto Loader, data lineage, data swamps, and more. I'll be covering those in the next blog soon. Stay tuned!

References

  1. Interesting read: "Delta Lake vs Data Lake – What's the Difference?" (delta.io)
  2. Understanding Parquet in detail: "Parquet file format – everything you need to know!" (data-mozart.com)
  3. Difference between Delta Table and Delta Live Table: "Delta Live Table 101 – Streamline Your Data Pipeline (2025)" (chaosgenius.io)


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.