
Deletion Vectors in Delta Tables: Speeding Up Operations in Databricks

Last Updated on December 21, 2023 by Editorial Team

Author(s): Muttineni Sai Rohith

Originally published on Towards AI.

Traditionally, Delta Lake supported only the copy-on-write paradigm, in which an entire underlying data file is rewritten whenever any of its rows change. For example, deleting a single row from a file forces a rewrite of the whole Parquet file. When data is scattered across many files and updated frequently, this paradigm becomes inefficient. To address this, Databricks recently released a feature named Deletion Vectors. In this article, we will look at what deletion vectors are, how to enable them, and how they are used.

Deletion vectors are a storage optimization feature that can be enabled on Delta Lake tables. With deletion vectors enabled, the table uses a new paradigm called 'merge-on-read'. Delete and update operations use deletion vectors to mark existing rows as removed or changed without rewriting the Parquet file. Subsequent reads resolve the current table state by applying the deletions recorded in the deletion vectors to the most recent table version.
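The merge-on-read idea can be sketched in a few lines of Python. This is a toy model, not Delta internals: a data file is an immutable list of rows, a deletion vector is a set of row positions, and a read applies the vector as a filter instead of rewriting the file.

```python
# Toy model of merge-on-read (illustrative names, not the Delta implementation).

def read_with_deletion_vector(data_file, deletion_vector):
    """Return the live rows: every row whose position is not in the vector."""
    return [row for pos, row in enumerate(data_file) if pos not in deletion_vector]

data_file = ["alice", "bob", "carol", "dave"]  # stands in for an immutable Parquet file
deletion_vector = set()

# DELETE FROM t WHERE name = 'bob' -> soft delete: only the vector changes
deletion_vector.add(data_file.index("bob"))

print(read_with_deletion_vector(data_file, deletion_vector))  # ['alice', 'carol', 'dave']
# the data file itself was never rewritten
```

The key point is that the write path only appended a tiny marker; the cost of resolving it was deferred to read time.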

Enabling Deletion Vectors

It is advised to use Databricks Runtime 14.1 or above to write tables with deletion vectors and leverage all the optimizations. Although Runtime 12.1 supports reads from tables with deletion vectors, 14.1+ also supports writes and the related optimizations. Row-level concurrency additionally requires Runtime 14.2 or above.

Deletion vectors can be enabled with the following SQL:

CREATE TABLE <table-name> [options] TBLPROPERTIES ('delta.enableDeletionVectors' = true);

ALTER TABLE <table-name> SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

How does deletion happen in Delta tables using deletion vectors?

Whenever we delete or update a record in a Delta table, the writer marks the positions of the changed rows separately from the data files themselves; this operation is called a 'soft delete'. The positions of the deleted rows are encoded in a highly compressed bitmap format, RoaringBitmap, and can be compacted into the data files later.
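To make the bitmap encoding concrete, here is a minimal sketch using a plain Python integer as the bitmap, with one bit per row position. Real Delta uses the RoaringBitmap format, which compresses far better; the integer is only a stand-in for the idea.

```python
# Sketch: record soft-deleted row positions as set bits in a bitmap.

def mark_deleted(bitmap: int, position: int) -> int:
    """Set the bit for a row position that a DELETE/UPDATE touched."""
    return bitmap | (1 << position)

def is_deleted(bitmap: int, position: int) -> bool:
    """Check whether a row position is soft-deleted."""
    return bool((bitmap >> position) & 1)

bitmap = 0
for pos in (2, 5, 7):            # positions changed by a delete/update
    bitmap = mark_deleted(bitmap, pos)

print(bin(bitmap))                                      # 0b10100100
print([p for p in range(8) if is_deleted(bitmap, p)])   # [2, 5, 7]
```

Three deleted rows cost three bits of bookkeeping, versus rewriting the whole file under copy-on-write.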

These changes are applied physically when data files are rewritten, as triggered by one of the following events:

  • The OPTIMIZE command is run on the table.
  • Auto-compaction triggers a rewrite of a data file with a deletion vector.
  • REORG TABLE ... APPLY (PURGE) is run against the table.

Events related to file compaction do not have strict guarantees for resolving changes recorded in deletion vectors, and some changes recorded in deletion vectors might not be applied if target data files would not otherwise be candidates for file compaction. REORG TABLE ... APPLY (PURGE) rewrites all data files containing records with modifications recorded using deletion vectors.

REORG TABLE —

Reorganize a Delta Lake table by rewriting files to remove soft-deleted data.

REORG TABLE table_name [WHERE predicate] APPLY (PURGE)

Example:

REORG TABLE events APPLY (PURGE);

REORG TABLE events WHERE date >= '2022-01-01' APPLY (PURGE);

Note:

  • REORG TABLE only rewrites files that contain soft-deleted data.
  • REORG TABLE is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect.
  • After running REORG TABLE, the soft-deleted data may still exist in the old files. We can run VACUUM to delete the old files physically.
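The PURGE-then-VACUUM flow described above can be sketched as follows. This is a toy simulation with illustrative names, not the Delta implementation: PURGE rewrites only the files that carry a deletion vector, dropping the soft-deleted rows, and VACUUM is modeled as the list of superseded old files that still need physical removal.

```python
# Toy sketch of REORG TABLE ... APPLY (PURGE), with the old files left for VACUUM.

def purge(files):
    """files: {name: (rows, deletion_vector)} -> (new_files, obsolete_names)."""
    new_files, obsolete = {}, []
    for name, (rows, dv) in files.items():
        if not dv:                       # no soft deletes: file is left untouched
            new_files[name] = (rows, set())
            continue
        live = [row for pos, row in enumerate(rows) if pos not in dv]
        new_files[name + ".rewritten"] = (live, set())
        obsolete.append(name)            # old file stays on disk until VACUUM
    return new_files, obsolete

files = {
    "part-0": (["a", "b", "c"], {1}),    # row 1 is soft-deleted
    "part-1": (["d", "e"], set()),       # clean file, no deletion vector
}
after, to_vacuum = purge(files)
print(after)        # part-0 rewritten without "b"; part-1 untouched
print(to_vacuum)    # ['part-0'] -> physically removed by VACUUM
```

Note how the clean file is skipped entirely, matching the statement that REORG TABLE only rewrites files containing soft-deleted data.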

Note: Databricks leverages deletion vectors to power predictive I/O for updates on Photon-enabled compute. See the official documentation, "Use predictive I/O to accelerate updates", to learn more about predictive I/O.

When should we use deletion vectors?

Given the speedup deletion vectors bring to delete and update operations, they might seem like an obvious choice, but there are some tradeoffs to consider before using them:

  1. Write Frequency & Latency SLA: Use Deletion Vectors when the write frequency is high, or when low write latency is required, particularly for small data changes that cause large write amplification in the traditional Copy-on-Write mode. In low write frequency scenarios with flexible latency requirements, Deletion Vectors might not be as advantageous.
  2. Read Frequency & Latency SLA: Exercise caution with Deletion Vectors in high-read scenarios, as the extra execution time spent processing the Deletion Vector files on every read can add up.
  3. Data Layout & Change Distribution: Deletion Vectors excel when data changes are spread across many files, making the write amplification that comes with the traditional copy-on-write method extraordinarily expensive. Pay particular attention to the matching predicates for data changes. The Levi library provides helper methods for easy access to the underlying file statistics.
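The write-amplification tradeoff in point 1 can be illustrated with a back-of-the-envelope calculation. The numbers below are toy assumptions (a 128 MB file, roughly 8 bytes to record one deleted position), chosen only to show the orders of magnitude involved.

```python
# Bytes written per small delete: copy-on-write vs deletion vectors (toy numbers).

FILE_SIZE_BYTES = 128 * 1024 * 1024   # assumed size of one Parquet file
DV_ENTRY_BYTES = 8                    # assumed rough cost to record one position

def copy_on_write_cost(rows_deleted, files_touched):
    # every touched file is rewritten in full, however few rows changed
    return files_touched * FILE_SIZE_BYTES

def deletion_vector_cost(rows_deleted, files_touched):
    # only the changed positions are recorded; data files stay in place
    return rows_deleted * DV_ENTRY_BYTES

# 10 rows deleted, scattered across 10 different files
print(copy_on_write_cost(10, 10))     # 1342177280 bytes (~1.3 GB rewritten)
print(deletion_vector_cost(10, 10))   # 80 bytes of deletion-vector entries
```

The gap is exactly the "changes spread across many files" case in point 3; when all changes land in one small file, the copy-on-write cost shrinks and the advantage narrows.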

Limitations:

Databricks suggests not enabling deletion vectors for streaming tables when using either Databricks SQL or Delta Live Tables.

In Databricks Runtime 12.1 and greater, the following limitations exist:

  • Delta Sharing is not supported on tables with deletion vectors enabled.
  • You cannot generate a manifest file for a table with deletion vectors present. Run REORG TABLE ... APPLY (PURGE) and ensure no concurrent write operations are running in order to generate a manifest.
  • You cannot incrementally generate manifest files for a table with deletion vectors enabled.

Conclusion:

Writing this article was a learning exercise for me; I am not inventing anything here, just drawing on the Databricks and Delta Lake documentation pages. They are excellent, and I am sharing them here for reference.

Happy Learning…

Muttineni Sai Rohith Signing off…

