From OLTP to Data Lakehouse

Last Updated on June 9, 2022 by Editorial Team

Author(s): Guilherme Banhudo


A layman’s overview of the Data ecosystem in a few paragraphs

Photo by Natan Dumlao on Unsplash

The Data ecosystem has grown exponentially in recent years, with emerging technologies and increasing vendor hype clouding the way for newcomers. This post is my brief and simple attempt to explain the journey from OLTP to the Data Lakehouse in a few short paragraphs.

Bear with me; hopefully, the journey will be clear in five minutes.

Taking a top-down approach, we can generically think of two broad categories of databases: SQL and NoSQL. In simple and generic terms, the former are relational databases whereas the latter are non-relational. The former are table-based, whereas the latter offer multiple storage models (key-value, document, graph, and wide-column). The former hinge heavily on the concept of multi-row Transactions (you can think of a Transaction as a group of tasks that either succeeds as a whole or fails as a whole), while the latter are more suitable for unstructured data (although lately this line has become increasingly blurry).

Relational Database Management Systems (RDBMS) implement ACID compliance, which, among other things, governs how Transactions behave in SQL-based databases. NoSQL databases, in turn, follow the BASE model (I wrote a long article on the subject).
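To make the Transaction idea concrete, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table and the transfer scenario are invented purely for illustration): the two balance updates either both become visible or neither does.

```python
import sqlite3

# In-memory database purely for illustration; the schema is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # Multi-row Transaction: move 30 from account 1 to account 2.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()    # both updates succeed as a whole...
except sqlite3.Error:
    conn.rollback()  # ...or both are undone as a whole

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
```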

Most backend systems rely on Online Transaction Processing (OLTP): they handle many concurrent Transactions, the record reads, writes, updates, and deletes (CRUD operations), and typically do not keep historical data. For data teams, however, every change matters, so we need systems that can handle extremely large numbers of records, holding every single business-relevant change that occurred in the operational databases. But all of that data now needs to be stored and queried efficiently. For this purpose, RDBMS have been extended to support complex queries over large amounts of historical data, aggregated from OLTP databases and other sources, resulting in Online Analytical Processing (OLAP). Typically, this vast data is stored in the form of a Data Warehouse, a data structure optimized for analysis that often reaches Petabytes of scale.
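As a toy illustration of the difference in access patterns (the orders table below is invented for the example), an OLTP-style query touches a single current record, whereas an OLAP-style query scans and aggregates the full history.

```python
import sqlite3

# Invented example schema: an order-history table such as a Data Warehouse might hold.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date  TEXT,
        amount      REAL
    )
""")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, "2022-01-05", 20.0), (2, 11, "2022-01-17", 35.5),
     (3, 10, "2022-02-02", 12.0), (4, 12, "2022-02-20", 80.0)],
)
conn.commit()

# OLTP-style access: read or update one current record.
print(conn.execute("SELECT amount FROM orders WHERE order_id = 3").fetchone())

# OLAP-style access: aggregate over the entire history.
print(conn.execute("""
    SELECT substr(order_date, 1, 7) AS month, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY month
    ORDER BY month
""").fetchall())
```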

While OLAP works for many situations, like all RDBMS it is vertically scalable (note: Distributed SQL does tackle this point, but its adoption is still small at the moment). NoSQL, in turn, is horizontally scalable, meaning lower data storage costs as your data grows, a highly desirable characteristic that remained out of reach for OLAP for a long time because the ACID compliance that NoSQL lacks is incredibly useful. What if we could combine RDBMS ACID compliance and OLAP's massive historical data analysis capabilities with NoSQL's horizontal scalability? This is where the Data Lakehouse shines.

Data Warehouse, Data Lake, and Data Lakehouses, image by Databricks

Data Lakehouses leverage the concept of Data Lakes (file systems supporting vast amounts of structured and unstructured data at low cost), Data Warehouses, and decoupled processing. In other words, Data Lakehouses do not require you to keep a server constantly online as RDBMS do, and they use Data Lakes for storage. Why is this important?

  • If your data does not need to be accessed every single second, you can save on the cost of keeping a server constantly online, with the tradeoff of some latency when retrieving your data
  • In fact, since processing is independent of storage, you can actually use more powerful clusters to process your data, which run, and are paid for, only for the duration of the workload. In other words, you get more juice for a fraction of the cost. In addition, you can even configure per-workload cluster characteristics, further optimizing your costs
  • RDBMS are expensive to grow, relying on expensive, tightly coupled hardware. A Data Lakehouse, in contrast, relies on files that, with some magic, gain an ACID compliance layer on top, along with some other nifty optimizations and features (see Delta.io). And we all know file storage is cheap, especially when you can optimize your cost around file access frequency (infrequent-access storage tiers are significantly cheaper)
  • You can build your Data Warehouse directly on Delta, leveraging the aforementioned advantages
  • Delta contains a set of very powerful features that greatly facilitate the Data Engineer's job: ACID transactions, scalable metadata handling, time travel, unified batch and streaming, schema evolution and enforcement, audit history, DML operations, and more; a short sketch of a few of these follows this list. All of this in an open-source project, so no vendor lock-in
  • You can hold data in multiple formats and curation levels, serving multiple use cases: from Data Analysis/BI to Data Science/ML
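As a minimal sketch of what this looks like in practice, assuming a local PySpark installation with the delta-spark package (the path and sample data are made up), each write below is an ACID transaction over plain files in the Data Lake, and older versions remain queryable via time travel:

```python
# Minimal local sketch of a few Delta features (ACID writes, time travel).
# Assumes pyspark and the delta-spark package are installed; path and data are made up.
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/demo/events"  # hypothetical Data Lake location

# Each write is an ACID transaction over plain files.
spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"]) \
    .write.format("delta").mode("overwrite").save(path)
spark.createDataFrame([(3, "click")], ["id", "event"]) \
    .write.format("delta").mode("append").save(path)

# Time travel: query the table as it was at its first version.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
# The current version includes the appended row.
spark.read.format("delta").load(path).show()
```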

Are there no drawbacks? Like all technologies, none is perfect, and the choice must be made according to the context and requirements. Lakehouses are no different and come with their own set of drawbacks:

  • The technology is fairly recent (Delta Lake 1.0 was certified by the community in May 2021), meaning several features may still change and some highly desirable features may be missing for the time being
  • The technology is very different from traditional RDBMS, especially in terms of optimization, which represents a larger risk and makes it harder to find experienced professionals
  • They are hard to optimize, relying on several different techniques, some of which must be implemented from the start of the project (Bloom Filters, for instance; a toy sketch of the idea follows this list)
  • As with all new technologies, documentation, examples, and best practices are still scarce
  • Finally, it is important to reflect on the BI/reporting layer, since some BI tools (Power BI, for instance) issue a query for each visual; hence complex dashboards, especially those users interact with a lot, may drive high costs
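To show why Bloom Filters have to be planned up front, here is a toy Python sketch of the idea (not Delta's actual implementation): the bit-array size and hash count, and therefore the false-positive rate, are fixed at creation time.

```python
import hashlib

class BloomFilter:
    """Toy Bloom Filter: capacity and accuracy are fixed at creation time,
    which is why such indexes are best planned from the start of a project.
    Illustrative only; not how Delta Lake implements them."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: str):
        # Derive several bit positions from independent hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("customer-42")
print(bf.might_contain("customer-42"))   # True
print(bf.might_contain("customer-999"))  # almost certainly False
```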

Let me know if the article helps shed some light on the ecosystem, albeit in simple terms, or if you feel something is not quite accurate!

