
Diving Into Data Pipelines — Foundations of Data Engineering
An Introduction to Data Pipelines. How, What, When, and Why?
Author(s): Saniya Parveez, Roberto Iriondo
This article is an extensive introduction to data pipelines, with step-by-step explanations and code, as part of the foundations of data engineering. Data pipelines are used extensively in data science and machine learning; they are crucial in machine learning workflows for integrating data from multiple streams into the business intelligence that drives competitive analysis and advantage.
What is a Data Pipeline?
A data pipeline is a set of processes that moves and transforms data from multiple sources to a destination where new value can be derived. In its simplest form, a pipeline may only extract data from different sources, such as REST APIs, databases, feeds, or live streams, and load it into a destination such as a SQL table in a data warehouse. Data pipelines are the foundation of analytics, reporting, and machine learning capabilities.
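To make this concrete, here is a minimal sketch of the extract-and-load pattern in Python. The endpoint URL, the id and amount fields, and the orders table are illustrative assumptions, and SQLite stands in for a real warehouse:

```python
import sqlite3

import requests

API_URL = "https://api.example.com/orders"  # hypothetical REST endpoint


def extract(url: str) -> list:
    """Extract: pull raw JSON records from a REST API."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()  # assumed to be a list of {"id": ..., "amount": ...}


def load(records: list, db_path: str = "warehouse.db") -> None:
    """Load: write the extracted records into a SQL table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO orders (id, amount) VALUES (:id, :amount)",
            records,
        )


if __name__ == "__main__":
    load(extract(API_URL))
```

Even at this size, the two halves are kept as separate functions so the extraction source or the storage destination can be swapped out without touching the other side.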
Data pipelines are constructed from multiple steps, such as data extraction, data preprocessing, data validation, and data storage. They can be developed using a variety of programming languages and tools, as the sketch below illustrates.
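The sketch below shows those steps as small Python functions chained into a single run. The inline CSV source, the user and amount fields, and the "no negative amounts" validation rule are all illustrative assumptions, and the storage step simply prints instead of writing to a warehouse:

```python
import csv
import io
from typing import Iterable


def extract(raw_csv: str) -> list:
    """Extraction: read raw records from a CSV source (an inline string here)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def preprocess(records: Iterable) -> list:
    """Preprocessing: trim whitespace, standardize casing, and fix types."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in records
    ]


def validate(records: list) -> list:
    """Validation: drop records that violate a basic business rule."""
    return [r for r in records if r["amount"] >= 0]


def store(records: list) -> None:
    """Storage: persist to the destination (printed here for brevity)."""
    for r in records:
        print(r)


if __name__ == "__main__":
    raw = "user,amount\n Alice ,10.5\nBob,-3\ncarol,7\n"
    store(validate(preprocess(extract(raw))))
```

In a production pipeline, each of these stages would typically be a separate task managed by an orchestrator, but the shape of the flow is the same.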

Well-built data pipelines do more than just extract data from sources and load it into manageable database tables or flat files for analysts to use. They perform several steps on raw data, including cleaning, structuring, normalizing, combining, and aggregating it. A data pipeline also requires other activities, such as monitoring, maintenance, enhancement, and support of the underlying infrastructure.
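The following sketch shows that transformation side using pandas. The two small DataFrames stand in for raw extracts from two hypothetical sources, and the column names and region breakdown are assumptions made purely for illustration:

```python
import pandas as pd

# Hypothetical raw extracts from two sources.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "customer": [" Alice", "BOB", "alice "],
        "amount": ["10.5", "3.0", "7.25"],
    }
)
customers = pd.DataFrame({"customer": ["alice", "bob"], "region": ["EU", "US"]})

# Cleaning and normalization: standardize the join key and fix types.
orders["customer"] = orders["customer"].str.strip().str.lower()
orders["amount"] = orders["amount"].astype(float)

# Combining: join the two sources on the shared key.
combined = orders.merge(customers, on="customer", how="left")

# Aggregation: total spend per region, ready for analysts.
summary = combined.groupby("region", as_index=False)["amount"].sum()
print(summary)
```

The steps map directly onto the list above: clean and normalize first so that the combine step joins reliably, then aggregate into the shape analysts actually query.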