Author(s): Kunal Ajay Kulkarni

Data Engineering

What is Data Engineering, who are Data Engineers, what role do they play in Data science, and how to become one?

Data Engineering: A brief introduction!

Photo by CHUTTERSNAP on Unsplash

From helping Facebook tag you in photos to helping Netflix and Spotify recommend your favorite movies and songs, the field of data science has evolved rapidly and generated a lot of hype. Data scientist has become one of the most sought-after and in-demand jobs; Harvard Business Review famously called it the sexiest job of the 21st century. A skilled data scientist can add immense value to a business by harnessing the power of data. But what exactly is data engineering? Who is a data engineer, and what role do they play in data science? In this article, we will learn about the bits and bytes of data engineering.

What is Data Engineering?

Data is all around us and is growing exponentially day by day. This has given rise to a new (although not so new) field, data engineering, a sub-discipline of data science that focuses entirely on the collection, transportation, transformation, and storage of vast amounts of data. Perhaps you have seen some big data job postings online and are intrigued by the prospect of handling petabyte-scale data. Maybe you’ve never even heard of data engineering but are interested in knowing how application developers handle the vast amounts of data necessary for most applications today. No matter which category you fall into, this introductory article is for you. You’ll get an overview of this field, including what data engineering is and what kind of work it involves.

We know that most companies store their valuable data in various formats across their databases. To understand what data engineering is, we first need to focus on the “engineering” part. Engineers design, build, and implement complex systems that make our lives easier. Data engineers, therefore, design, build, and implement systems and tools that transform raw data into a more refined and usable format that data scientists and other users in the organization can apply to different purposes. These systems, commonly known as data pipelines, collect, save, validate, and transform data from various sources and store it in a single database, typically known as a data warehouse.

Data engineering is the part of data science that focuses on the practical application and harvesting of data. It is just as important as data science itself. Whatever your level of interest in learning it, it helps to know exactly what data engineering is about. Maxime Beauchemin, the original author of Airflow, characterized data engineering in his blog post The Rise of the Data Engineer:

The data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale.

Photo by Boitumelo Phetla on Unsplash

Who is a Data Engineer and what role do they play in Data Science?

In general, data science is a very broad field that covers multiple roles, spanning everything from collecting, cleaning, and processing data to analyzing it and deploying predictive models or machine learning algorithms. In many companies, the person doing data engineering work may not even hold a specific title for it. Whatever the title, a data engineer transforms raw data into formats that are useful for analysis.

Like data scientists, data engineers write code. But unlike data scientists, they build tools, infrastructure, frameworks, and services. In that sense, data engineering is much closer to software engineering than it is to data science.

The data collected by data engineers can be used for various data-driven work such as testing, training, and developing machine learning models, performing exploratory data analysis (EDA), designing system architecture, and designing databases. This data can be obtained in several ways, and the specific tools, techniques, and skills required to obtain it vary widely across organizations and desired outcomes. However, a common pattern is the data pipeline: a system made up of many independent programs that each perform some operation on the collected data. Data pipelines are often distributed across multiple servers:

Source

Depending on where the data comes from, the pipeline processes it either as a real-time stream or in batches at a regular cadence. The data engineer is responsible for these pipelines. Data engineering teams design, build, maintain, and extend them, and often own the infrastructure that supports them as well. They may also be responsible for ingesting the incoming data from various sources and, more often, for deciding how that data is stored.
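
As a rough illustration of batch processing through independent stages, the sketch below splits incoming records into fixed-size batches and passes each batch through two small stage functions. The stage names (`validate`, `enrich`), the record fields, and the batch size are all hypothetical and chosen only for this example; a production pipeline would typically run such stages as separate programs on separate machines.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(records: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` records."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch


def validate(batch: list[dict]) -> list[dict]:
    """Stage 1 (hypothetical): keep only records that carry a user id."""
    return [r for r in batch if r.get("user_id") is not None]


def enrich(batch: list[dict]) -> list[dict]:
    """Stage 2 (hypothetical): add a derived field to each record."""
    return [{**r, "is_large": r["amount"] >= 100} for r in batch]


def run_pipeline(records: Iterable[dict], batch_size: int = 2) -> list[dict]:
    """Push each batch through every stage in order and collect the output."""
    output = []
    for batch in batched(records, batch_size):
        for stage in (validate, enrich):
            batch = stage(batch)
        output.extend(batch)
    return output


events = [
    {"user_id": 1, "amount": 250},
    {"user_id": None, "amount": 40},
    {"user_id": 2, "amount": 15},
]
result = run_pipeline(events)
```

Because each stage only takes a batch and returns a batch, stages can be added, removed, or reordered without touching the others.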

Many data engineering teams are also responsible for building efficient data platforms. In many companies, it is not enough to have a single data pipeline that saves incoming data to a SQL database. Large companies collect vast amounts of data every day and have multiple teams that need different kinds of data for different purposes.

Photo by Martin Shreder on Unsplash

Responsibilities of a Data Engineer –

The data used by data scientists or other teams for analysis must be cleaned and made accessible to everyone in the organization who needs it. These requirements are explained in the excellent article The AI Hierarchy of Needs by Monica Rogati. As a data engineer, you are responsible for addressing your customers’ data needs, and you’ll use a variety of methods to meet each of them.

Data Engineering mostly belongs to the 2nd and 3rd levels of the hierarchy. Source

To perform various operations with data, you must first ensure that the system has a continuous flow of data. This data can come from a variety of sources —

  1. Tweets, likes, comments, videos, and images, etc.
  2. Sensors, industrial equipment, medical devices, games, satellites, CCTVs, etc.
  3. Invoices, payment orders, receipts, live streams, etc.

Data engineers are often responsible for designing systems that collect this data from one or many sources, transform it, and then store it for their users. These systems are called ETL pipelines; ETL stands for Extract, Transform, and Load. Remember that ETL is a very broad concept that involves more than these three steps. The ETL process is technically challenging and requires active participation from data engineers, data scientists, developers, analysts, software engineers (SWEs), and others.
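
To make the three steps concrete, here is a deliberately tiny ETL sketch. It assumes the source is a CSV string and uses a plain dictionary as a stand-in for the target store; the column names and the functions `extract`, `transform`, and `load` are invented for illustration and do not come from any particular ETL tool.

```python
import csv
import io

# Hypothetical raw input, as it might arrive from a source system.
RAW_CSV = """order_id,amount,currency
1001,19.99,usd
1002,5.00,USD
"""


def extract(text: str) -> list[dict]:
    """Extract: parse raw CSV text into one dict per row (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list[dict]) -> list[dict]:
    """Transform: cast types and normalize the currency code."""
    return [
        {
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "currency": row["currency"].upper(),
        }
        for row in rows
    ]


def load(rows: list[dict], warehouse: dict) -> None:
    """Load: write rows into the target store, keyed by order id."""
    for row in rows:
        warehouse[row["order_id"]] = row


warehouse: dict = {}
load(transform(extract(RAW_CSV)), warehouse)
```

Keeping the three steps as separate functions means each one can be tested, monitored, and replaced on its own.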

Source

The first step in ETL is Extraction. In this step, data is extracted from various sources in a variety of formats. This step is often time-consuming. The data engineer is responsible for pulling the data into the data pipeline, but the job doesn’t stop there. They also have to ensure that the pipeline is robust enough to stay up through unexpected events such as corrupted data, servers going offline, bugs, and viruses. Keeping the system running 24/7 is very important, especially when collecting live or time-sensitive data.
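
One way to picture that robustness is an extraction step that retries when a source is briefly unavailable and sets corrupted records aside instead of crashing. The sketch below assumes newline-delimited JSON as the input format; `fetch_with_retry`, `parse_records`, and the simulated `flaky_source` are hypothetical helpers written for this example only.

```python
import json
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[], str], attempts: int = 3, delay: float = 0.0) -> str:
    """Call `fetch`, retrying on ConnectionError up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries: surface the failure
            time.sleep(delay)
    raise ValueError("attempts must be at least 1")


def parse_records(payload: str) -> tuple[list, list[str]]:
    """Parse newline-delimited JSON, separating good rows from corrupted ones."""
    good, rejected = [], []
    for line in payload.splitlines():
        if not line.strip():
            continue
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            rejected.append(line)  # keep for later inspection instead of crashing
    return good, rejected


# Simulated source (hypothetical) that is offline on the first call.
calls = {"count": 0}


def flaky_source() -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("source temporarily offline")
    # The second line is deliberately truncated to mimic corrupted data.
    return '{"sensor": "a1", "temp": 21.5}\n{"sensor": "a2", "temp": \n{"sensor": "a3", "temp": 19.0}'


records, rejected = parse_records(fetch_with_retry(flaky_source))
```

Collecting rejected lines rather than discarding them lets the team investigate bad data later without stopping the pipeline.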

The second step of the ETL process is Transformation. Before the extracted data can be delivered to systems throughout the organization for further analysis, it has to be cleaned and processed with different tools and technologies and converted to a single, standard, usable format. This includes tasks such as filtering, cleaning, joining, creating, splitting, and deleting data.
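
A minimal sketch of three of those tasks (cleaning, filtering, and joining) might look like the following. The order and customer fields are made up for the example, and plain Python is used here only to keep the sketch self-contained; real pipelines often do this work in SQL or a dataframe library.

```python
def clean(order: dict) -> dict:
    """Clean: trim whitespace, normalize casing, and cast string values to numbers."""
    return {
        "order_id": int(order["order_id"]),
        "customer_id": order["customer_id"].strip().upper(),
        "amount": round(float(order["amount"]), 2),
    }


def transform(orders: list[dict], customers: dict[str, str]) -> list[dict]:
    cleaned = [clean(o) for o in orders]
    # Filter: drop refunds and zero-value orders.
    valid = [o for o in cleaned if o["amount"] > 0]
    # Join: attach the customer name, looked up by customer id.
    return [
        {**o, "customer_name": customers.get(o["customer_id"], "unknown")}
        for o in valid
    ]


# Hypothetical raw input with inconsistent formatting.
raw_orders = [
    {"order_id": "1", "customer_id": " c10 ", "amount": "42.50"},
    {"order_id": "2", "customer_id": "C11", "amount": "-5"},
    {"order_id": "3", "customer_id": "c12", "amount": "10"},
]
customers = {"C10": "Asha", "C11": "Ravi"}
standardized = transform(raw_orders, customers)
```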

The third and final step of the ETL process is Loading. In this last step, data is loaded into the target data warehouse. Data engineers may update and change this data frequently after the initial load.
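
Because loaded data may be updated later, a load step is often written so that running it again replaces existing rows instead of duplicating them. The sketch below uses an in-memory SQLite database purely as a stand-in for a data warehouse; the `orders` table and its columns are assumptions made for this example.

```python
import sqlite3


def load(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Insert rows, replacing any existing row with the same order_id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
    )
    with conn:  # commits on success, rolls back if an error is raised
        conn.executemany(
            "INSERT OR REPLACE INTO orders (order_id, amount) "
            "VALUES (:order_id, :amount)",
            rows,
        )


conn = sqlite3.connect(":memory:")
load(conn, [{"order_id": 1, "amount": 42.5}, {"order_id": 2, "amount": 10.0}])
load(conn, [{"order_id": 2, "amount": 12.0}])  # a later run updates order 2
stored = conn.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
```

After both runs the table still holds two rows, with order 2 carrying its updated amount.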

Photo by Taylor Vick on Unsplash

How Much Do Data Engineers Make?

According to Payscale, the average salary for a data engineer in India is ₹835,135 per year, with a reported salary range of ₹470,000 to ₹1,847,931 depending on skills, experience, and location.

Source

Skills required for Data Engineering –

A data engineer needs to have the following skills –

  1. Programming languages — SQL, Python, R, Java, Julia, MATLAB, etc.
  2. Relational and non-relational databases — MySQL, PostgreSQL, MS SQL Server, MongoDB, DynamoDB, etc.
  3. Cloud skills — AWS, GCP, Azure, etc.
  4. Distributed Systems — Apache Kafka, Hadoop, Spark, etc.
  5. Machine learning algorithms
  6. ETL Tools

Soft skills –

  1. Presentation Skills
  2. Business Acumen
  3. Communication
  4. Collaboration

Photo by Daniel Schludi on Unsplash

How to become a data engineer?

There is no well-defined path to becoming a data engineer. Here are some courses you can take if you want to become one –

  1. Become a data engineer — Udacity Nanodegree
  2. Data Engineering with Google Cloud — Professional certificate by Coursera
  3. Data Engineering Foundations Specialization — IBM by Coursera
  4. Big Data Specialization — Coursera
  5. Microsoft Certified: Azure Data Engineer Associate
  6. AWS Certified Big Data — Specialty
  7. Introduction to Data Engineering by Datacamp
  8. Data Engineer track by Dataquest

Conclusion —

With this, we conclude our introduction to Data Engineering. Now you can decide whether you want to dive deeper into this really exciting and highly rewarding field. Does this excite you? Are you interested in exploring it more deeply? Let me know in the comments below!

Thanks for reading!!


What is Data Engineering, who are Data Engineers, what role do they play in Data science, and how… was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
