Understanding Data Lineage: From Source to Destination
Last Updated on December 11, 2023 by Editorial Team
Author(s): Muttineni Sai Rohith
Originally published on Towards AI.
I went to a restaurant yesterday, "Anthera." After eating my fourth or fifth piece of pepper chicken, which, by the way, was delicious, I started to marvel at our ability to digest and savor it. We use our mouth to taste, grind, mince, and swallow the food; the body then transforms and digests it, and the kidneys filter out the waste. It's such a well-defined process, with each part having an important function. It makes me think that everything in the universe is made by design rather than by chance.
Just as our body has a well-defined process, a data project needs a clear flow, and that's where Data Lineage comes in. Design and architecture play a big role in data projects. Having worked on a live data streaming project, I have seen that even a 30-second latency reduction can generate millions in profit for the firm. Proper Data Lineage (DL) makes this possible by helping us understand and design the flow from start to end. Let's get started by learning more about Data Lineage.
What is Data Lineage?
Data Lineage is the process of understanding, recording, and visualizing data as it flows from start to end. It aims to show the complete data flow, from the originating data sources to consumption. This includes all the transformations the data underwent along the way and how it is stored. Data Lineage aids in maintaining data quality, reliability, and consistency.
- Data Quality: Data Lineage helps identify and rectify any inconsistencies, errors, or inaccuracies that may arise during the data flow, assuring the quality of data.
- Reliability: It minimizes the risks and disruptions from ongoing process changes and helps reduce ripple effects caused by the data transformations.
- Consistency: By exposing the upstream and downstream movement of data for each table, Data Lineage provides a clear map of how data moves through a system over time, enhancing consistency.
Further, it helps optimize processes by identifying bottlenecks, redundancies, and inefficient paths. Taken together, these benefits support well-informed decision-making in the development stages.
Data Lineage Techniques and Examples
While Data Lineage can be done simply through proper documentation or visual flow diagrams, some more advanced techniques can also be applied:
- Pattern-Based Lineage: Lineage is inferred from recognizable patterns in the data transformations or input data streams. For example, identifying the data transformations involved in demographic-related operations.
- Data Tagging: Applying metadata tags to datasets, columns, or individual data elements to categorize and track them throughout the data lifecycle. For example, customer data can be tagged as "PII" (Personally Identifiable Information) or "Sensitive" to ensure proper handling and compliance.
- Rule-Based Lineage: Defining rules that explicitly capture how the data can be transformed or used. For example, a rule to standardize customers' phone numbers before they are used throughout the system.
- Data Dependency Graphs: Representing the data lineage as a graph, with nodes as data entities and edges as transformations or relations between them.
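To make the tagging and dependency-graph ideas above concrete, here is a minimal sketch in plain Python. The dataset names, the edge list, and the helper functions `descendants` and `propagate_tags` are all hypothetical and only serve as illustration; a real lineage tool would collect this metadata automatically.

```python
from collections import defaultdict, deque

# Hypothetical dataset names. Each edge points from a source to a dataset derived from it.
edges = [
    ("crm.customers", "staging.customers_clean"),
    ("web.orders", "staging.orders_clean"),
    ("staging.customers_clean", "mart.customer_orders"),
    ("staging.orders_clean", "mart.customer_orders"),
    ("mart.customer_orders", "report.revenue_dashboard"),
]

# Tags applied by hand at the source (data tagging).
tags = {"crm.customers": {"PII"}}

# Adjacency list: dataset -> datasets derived directly from it.
downstream = defaultdict(list)
for src, dst in edges:
    downstream[src].append(dst)


def descendants(node):
    """Return every dataset reachable from `node` (breadth-first search)."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in downstream[current]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def propagate_tags(source_tags):
    """Copy each source tag to everything derived from that source."""
    result = defaultdict(set)
    for node, node_tags in source_tags.items():
        result[node] |= node_tags
        for child in descendants(node):
            result[child] |= node_tags
    return dict(result)


all_tags = propagate_tags(tags)
for dataset in sorted(all_tags):
    print(dataset, sorted(all_tags[dataset]))
```

In this sketch, anything built from the tagged customer table inherits the "PII" tag, while datasets that only depend on orders stay untagged. That is the kind of question a dependency graph is meant to answer.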
How to get started with Data Lineage?
The General Data Protection Regulation (GDPR), which took effect in May 2018, has encouraged organizations to focus on data lineage. Given its many advantages and the range of techniques available, here is a heads-up on how to start with data lineage:
- Identify Data Elements: Talk to business users and understand the process end to end to identify the data elements critical to business functions.
- Tracking Origin: Track the origin of the data or process and understand it clearly.
- Note Sources and Links: Create a visual flow to represent the data sources and the elements that link them.
- Create a Map: Build maps for each system and a master map of the whole picture.
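The last two steps can be sketched in a few lines of Python. This is only an illustration under assumed names: `crm_map`, `warehouse_map`, `build_master_map`, and `trace_origins` are hypothetical, and each per-system map is assumed to be a dictionary from a dataset to its direct sources.

```python
# Hypothetical per-system maps: each key is a dataset, each value lists its direct sources.
crm_map = {
    "crm.contacts_clean": ["crm.contacts_raw"],
}
warehouse_map = {
    "dw.customer_dim": ["crm.contacts_clean", "billing.accounts"],
    "dw.churn_report": ["dw.customer_dim"],
}


def build_master_map(*system_maps):
    """Merge per-system maps into one map of dataset -> sorted direct sources."""
    master = {}
    for system_map in system_maps:
        for dataset, sources in system_map.items():
            master.setdefault(dataset, set()).update(sources)
    return {dataset: sorted(sources) for dataset, sources in master.items()}


def trace_origins(master, dataset):
    """Walk upstream until reaching datasets that have no recorded source."""
    origins, stack, seen = set(), [dataset], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        sources = master.get(current, [])
        if not sources:
            # Nothing upstream is recorded, so treat this dataset as an origin.
            origins.add(current)
        stack.extend(sources)
    return origins


master_map = build_master_map(crm_map, warehouse_map)
print(sorted(trace_origins(master_map, "dw.churn_report")))
```

Keeping one small map per system and merging them on demand keeps each map easy to maintain, while the master map still answers the "where did this come from?" question across system boundaries.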
Today, there are comprehensive data quality solutions on the market that include data lineage. These tools can sort and organize your data, saving time and money and improving margins.
Data Lineage vs. Data Provenance vs. Data Governance
While Data Lineage is a record of how data is transformed and migrated from origin to end, Data Provenance is the documentation of the origin, access, ownership, modification, and history of a data element. This documentation helps prove data integrity and accuracy. Data Governance, in turn, covers the full set of policies and processes for managing data quality, privacy, and compliance.
Data lineage and provenance ensure that you always know where data was sourced and how it migrated and transformed on its way to the destination. Meanwhile, governance ensures that data collection and storage are standardized and follow a predefined set of processes and best practices.
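One way to picture the difference is as three small pieces of Python: a lineage record, a provenance record, and a governance rule applied on top of both. This is a sketch under stated assumptions; the class names, field names, dataset names, and the policy itself ("every dataset produced by a lineage step must have a provenance record with an owner") are hypothetical examples, not a standard.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    """Lineage: how data moved and changed between two points."""
    source: str
    target: str
    transformation: str


@dataclass
class ProvenanceRecord:
    """Provenance: origin, ownership, and history of one data element."""
    element: str
    origin: str
    owner: str
    history: list = field(default_factory=list)


def governance_violations(steps, records):
    """Governance: a policy checked against the lineage and provenance records.

    Assumed policy: every dataset produced by a lineage step must have
    a provenance record with a non-empty owner.
    """
    owned = {record.element for record in records if record.owner}
    return sorted({step.target for step in steps if step.target not in owned})


steps = [
    LineageStep("crm.customers", "staging.customers_clean", "standardize phone numbers"),
    LineageStep("staging.customers_clean", "mart.customer_orders", "join with orders"),
]
records = [
    ProvenanceRecord(
        element="staging.customers_clean",
        origin="crm.customers",
        owner="data-platform-team",
        history=["created by nightly job"],
    ),
]

print(governance_violations(steps, records))
```

Here the lineage steps describe the path, the provenance record describes one element on that path, and the governance check flags `mart.customer_orders` because nobody is recorded as its owner.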
Ending Note: With organizations going through employee shifts, layoffs, and reorgs, Data Lineage can act like a supervisor explaining the project's story. Moreover, when Data Lineage, Provenance, and Governance are used together correctly, they go a long way toward ensuring data reliability and consistency.
Make sure to implement this in your work if you haven't already...
Happy Learning...
Muttineni Sai Rohith signing off.
Published via Towards AI