Azure Data Engineer Portfolio Project Series For Beginners (Part-I)
Author(s): Kamireddy Mahendra
Originally published on Towards AI.
Hello, and welcome to the Azure Data Engineer Project Series.
Before building any data architecture or data pipelines on a cloud platform, we need to know the basic terms the platform uses and how it works.
The Azure cloud platform lets us create whatever resources we need inside a resource group. Here is the step-by-step process we follow to build any data architecture.
1. Creating Resource Group
First, we need to create a resource group, which we then use to hold any number of resources. This way, we can ensure that a particular resource group is accessed only by a particular group of people. And if we no longer need the resources in a resource group, we can delete them all in a single click by deleting the resource group, rather than deleting them one by one.
In real-world projects, multiple resource groups such as Dev, UAT, and Prod are created to build the architecture, test it, and deploy it, to ensure it is ready for consumption-level use. This is done by multiple groups of people collaborating on the project.
Here, I'm not creating any other resource groups; I'm building the data architecture and pipelines as a single person without any collaboration. We will see that in the next projects. So here I'll use only a single resource group to hold all resources.
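For readers who prefer code over the portal, here is a minimal sketch of the same step using the Azure SDK for Python (azure-identity and azure-mgmt-resource). The region and the environment variable holding the subscription ID are assumptions; the resource group name Dev matches the one used in this walkthrough.

```python
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Authenticate and point the client at your subscription.
client = ResourceManagementClient(
    DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
)

# Create (or update) the "Dev" resource group; the region is an assumption.
rg = client.resource_groups.create_or_update("Dev", {"location": "eastus"})
print(rg.name, rg.location)
```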
2. Resource Creation as per the Project Requirements
After creating resource groups, we need to create resources that we are going to use to build our data pipelines.
Here, we will build a data pipeline from ADLS to Azure SQL DB.
So we need to create a Storage Account resource (to act as ADLS), an Azure Data Factory (ADF), and an Azure SQL DB.
Let's get started!
To start creating resources, click on the plus symbol on the main page, as I have shown in the above image. After clicking on it, you will be navigated to the Create a resource page, as shown in the below image.
If you observe in the above image, there are numerous services that Azure Cloud provides us to perform different types of tasks. As you can see on the left side of the above image, there are many services like AI + Machine Learning, Analytics, Compute, Containers, Databases, DevOps, Integration, Networking, Security, Storage, and many more categories of resources.
You can select whichever resource category you want to create a resource in. Here we need to create ADLS, which is a storage service, so we select the Storage category.
After clicking on the Storage category, you can see the resources that come under it. There are many storage services, and here we will create a storage account, so click the Create button under Storage account in the list of services. It will open the Create a storage account page, as shown in the below image.
Now you can see the different tabs, from Basics to Review + Create, under each of which there are a few details we need to fill in.
If you have multiple subscriptions, you can select any one of them; if you have only one, it is selected by default. Here I'm using a free subscription, so it shows Free Trial by default.
After that, if we already have a resource group, we can select it from the drop-down; if not, we have to create one by clicking Create new. Clicking it opens a pop-up asking for a resource group name. Enter a name and click OK, and a resource group with that name will be created. Here I'm using Dev.
After that, we need to enter a storage account name, which must be globally unique and can contain only lowercase letters and numbers, with a length of 3 to 24 characters. Then we select the region, performance, and redundancy. I have chosen my own region, Standard performance, and LRS redundancy, which is enough for noncritical scenarios and keeps costs low.
After entering all the details, click the Next button. It moves to the next tab, which you can see at the top after the Basics tab.
On the Advanced tab, we need to tick the Enable hierarchical namespace checkbox so this storage account works as ADLS; if we don't, it will work as regular Blob Storage. You can change any other settings you want; I'm choosing the Cool access tier for rarely accessed data.
Then click Next. You need not change anything on the remaining tabs and can keep them as they are; I'm not changing anything, but you can adjust them as per your requirements. After completing all the tabs, we can review all the details and then click the Create button to create the storage account. It deploys in about a minute. Then click the Go to resource button.
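As a reference, here is a rough sketch of the same storage account settings using azure-mgmt-storage. The account name and region are placeholders; the hierarchical namespace flag and Cool access tier mirror the choices made above.

```python
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(
    DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
)

# Account name must be globally unique: 3-24 lowercase letters and digits only.
poller = client.storage_accounts.begin_create(
    "Dev", "mydatalake001",
    StorageAccountCreateParameters(
        location="eastus",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),   # locally redundant storage
        is_hns_enabled=True,            # hierarchical namespace -> ADLS Gen2
        access_tier="Cool",             # cool tier for rarely accessed data
    ),
)
account = poller.result()
print(account.name, account.provisioning_state)
```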
It opens a page that takes us into the storage account we created. Now you can see the Data storage options; choose whichever fits your requirement. I'm using Containers to store the data, as containers support large amounts of data and can be used for data lakes and big data analytics.
After clicking on Containers, it opens a page for creating containers. Click on + Container to create one. A pop-up appears on the right side, as shown below. Enter the name you want for the container and click the Create button.
That creates the container. If you want any directories inside it, just click on the container we created, shown below. To upload a file directly into the container, click the Upload button and upload your file; to create a directory, click Add directory, enter a name, and click Save, and it will be saved as a directory in the container.
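Here is a small sketch of the same container, directory, and upload steps using the azure-storage-file-datalake package. The container name, directory, and file name are placeholders, and it assumes your identity has data-plane access to the account (for example, the Storage Blob Data Contributor role).

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Point at the ADLS Gen2 (dfs) endpoint of the storage account created above.
service = DataLakeServiceClient(
    account_url="https://mydatalake001.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Create the container (a "file system" in ADLS Gen2 terms) and a directory in it.
fs = service.create_file_system(file_system="salesdata")
fs.create_directory("raw")

# Upload a local file into that directory.
with open("sales.csv", "rb") as data:
    fs.get_file_client("raw/sales.csv").upload_data(data, overwrite=True)
```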
If you click on the home button and then on the resource group, you can see the resources that we created. Now you can upload the file that you want to transfer from ADLS.
After that, we need to create an Azure SQL DB.
Again, go back to the home page and start creating resources. It opens the window I showed at the beginning. Now we need to select the Databases category, select Azure SQL, and start creating it.
Then click the Create button, and it opens a window with three deployment options, where you can select whichever fits your requirement. Here I'm using SQL databases.
You can start filling in all the details. If a resource group already exists, we can use it, or you can create another one. Then enter the database name, and if a server already exists, we can use it; if not, we have to create a server in which our database will be located.
You can give the server any name, but it has to be globally unique; then select the location and the authentication method. Here I'm using SQL authentication. If you select this, you have to enter an admin login and password. Then you can select any other features as per your requirements, and finally review and create.
It starts deploying and will be deployed in a couple of minutes. After that, go to the Set server firewall option and give your network permission by allowing your IP address: you can enter your IP address in the box, or allow your current client IP address when prompted while entering the database through the Query editor option on the left side.
Now you can see that we have entered the database we created. While creating the database, I skipped a few options; one of them asks whether to use sample data, and ticking that checkbox will load that sample data into the database.
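For completeness, here is a hedged sketch of creating the server, the database, and the firewall rule with azure-mgmt-sql. The server name, database name, credentials, and IP address are all placeholders.

```python
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import Server, Database, FirewallRule

sql_client = SqlManagementClient(
    DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
)

# Logical SQL server with SQL authentication; the server name must be globally unique.
sql_client.servers.begin_create_or_update(
    "Dev", "my-sales-sqlserver",
    Server(
        location="eastus",
        administrator_login="sqladmin",
        administrator_login_password="<your-strong-password>",
    ),
).result()

# The database that will hold the sales table.
sql_client.databases.begin_create_or_update(
    "Dev", "my-sales-sqlserver", "salesdb", Database(location="eastus")
).result()

# Firewall rule equivalent to allowing your client IP in the portal.
sql_client.firewall_rules.create_or_update(
    "Dev", "my-sales-sqlserver", "AllowMyIP",
    FirewallRule(start_ip_address="203.0.113.10", end_ip_address="203.0.113.10"),
)
```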
Next, we need to create a schema for the data that we want to get from the source. Here I've used sales data, so I need to create a table schema matching the source data table; alternatively, you can use the auto-create table option while building the connection in ADF.
You can see that, after executing the sales retrieval query, it does not return any data, since we have only just created the table schema.
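Here is a small sketch of creating the table schema and running the retrieval query with pyodbc. The column definitions are an assumption about the sales data, so adjust them to your own source file; the server, database, and credentials are the placeholders used above.

```python
import pyodbc

# Connect to the Azure SQL database created earlier.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=my-sales-sqlserver.database.windows.net;"
    "DATABASE=salesdb;UID=sqladmin;PWD=<your-strong-password>;Encrypt=yes"
)
cursor = conn.cursor()

# Create a table matching the source sales file; these columns are an assumption.
cursor.execute("""
    CREATE TABLE dbo.sales (
        order_id   INT,
        order_date DATE,
        product    NVARCHAR(100),
        quantity   INT,
        amount     DECIMAL(10, 2)
    )
""")
conn.commit()

# The table was only just created, so this returns no rows yet.
cursor.execute("SELECT * FROM dbo.sales")
print(cursor.fetchall())   # -> []
conn.close()
```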
Now it's time to create a very important resource: an integration service that works as the cloud ETL and orchestration tool in Azure, i.e., Azure Data Factory.
Following a similar process to the one I explained before, you can start creating the Data Factory resource in the resource group that we created earlier.
After entering the basic details, you can leave everything else as default, or tick the checkboxes as per your requirements. After creation, click the Go to resource button. It navigates to the ADF launch page, as shown below.
Just click the Launch studio button to start working. When you click it, another window opens as the home page of Azure Data Factory.
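If you prefer to create the factory programmatically, here is a minimal sketch using azure-mgmt-datafactory; the factory name and region are placeholders.

```python
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
)

# Create the Data Factory in the same "Dev" resource group;
# the factory name must be globally unique.
factory = adf_client.factories.create_or_update(
    "Dev", "my-sales-adf", Factory(location="eastus")
)
print(factory.name, factory.provisioning_state)
```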
Under the Home icon, you can see four options. Those are
i. Author
This option provides the main features for creating data pipelines, datasets, data flows, Power Query, and change data capture.
ii. Monitor
This option helps us monitor all pipeline runs, trigger runs, integration runtimes, data flow debug sessions, and change data capture.
iii. Manage
This option allows us to manage features such as connections, factory settings, security, and source control.
iv. Learning Center
With the help of this Learning center, we can learn how Data Factory works, with a detailed explanation of each category I mentioned above.
So far, we have created ADLS as the source, Azure SQL DB as the destination, and ADF as the ETL tool.
3. Let's get started building a data pipeline!
I will break down building this data pipeline into a few steps.
Step 1.
- Click on the Author button, go to Pipelines, click the three dots next to it, and select New pipeline. It opens a pipeline workspace, which you can see in the below image.
In this workspace, we first change the pipeline name in the Properties tab on the right side. The new name is reflected on the left and above the Activities options.
Now it's time to know what a data pipeline is. A data pipeline is a logical grouping of activities that together perform a task.
If you observe the above image, there are a lot of activity groups available, and we can select any activity as per our requirements. In this data pipeline, I'll just copy the data from ADLS to Azure SQL DB. Later we will build more complex data pipelines.
So here I'll use only the Copy data activity, which I select from the Move & transform group and drag into the workspace in the middle. You can change the activity name to whatever you wish.
After that, we have to configure the source and sink details.
Now it's time to learn about datasets and linked services. If you click on Source or Sink, it asks you to select the source or sink dataset; if one is not available, we have to create datasets that point to our source and sink data separately.
Linked Service
A linked service is just like a connection string: it defines the connection information that Data Factory needs to connect to an external resource.
Data Sets
A dataset represents the structure of the data within the data store; it points to or references the data we want to use.
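To make these two concepts concrete, here is a rough sketch of defining the same linked services and datasets through the azure-mgmt-datafactory package instead of the Studio UI. Every name here (factory, storage account, connection string, container, file, table) is a placeholder, authentication details for ADLS are omitted, and exact model and property names can vary slightly between SDK versions, so treat this as illustrative rather than exact.

```python
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureBlobFSLinkedService, AzureSqlDatabaseLinkedService,
    DatasetResource, DelimitedTextDataset, AzureBlobFSLocation,
    AzureSqlTableDataset, LinkedServiceReference,
)

adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
)
rg, factory = "Dev", "my-sales-adf"

# Linked service for ADLS Gen2 (in practice you also supply an account key,
# service principal, or managed identity).
adf_client.linked_services.create_or_update(
    rg, factory, "AdlsLinkedService",
    LinkedServiceResource(properties=AzureBlobFSLinkedService(
        url="https://mydatalake001.dfs.core.windows.net"
    )),
)

# Linked service for the Azure SQL database (connection string is a placeholder).
adf_client.linked_services.create_or_update(
    rg, factory, "SqlLinkedService",
    LinkedServiceResource(properties=AzureSqlDatabaseLinkedService(
        connection_string="Server=tcp:my-sales-sqlserver.database.windows.net;"
                          "Database=salesdb;User ID=sqladmin;Password=<password>"
    )),
)

# Source dataset: the CSV file sitting in the ADLS container.
adf_client.datasets.create_or_update(
    rg, factory, "SalesCsvDataset",
    DatasetResource(properties=DelimitedTextDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="AdlsLinkedService"
        ),
        location=AzureBlobFSLocation(
            file_system="salesdata", folder_path="raw", file_name="sales.csv"
        ),
        column_delimiter=",",
        first_row_as_header=True,
    )),
)

# Sink dataset: the dbo.sales table in Azure SQL DB.
adf_client.datasets.create_or_update(
    rg, factory, "SalesSqlDataset",
    DatasetResource(properties=AzureSqlTableDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="SqlLinkedService"
        ),
        table_name="dbo.sales",
    )),
)
```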
In the process of creating datasets, we need to select the source data location. Many source systems are available; here we are using ADLS, so we select Azure Data Lake Storage Gen2 and then the file format in which we stored our data. Then it asks for a linked service.
If a linked service has not been created before, we have to create a new one that points to our data stored in ADLS.
Here I'm showing only the sink details in the image; the source connection is a bit easier.
After selecting all the necessary connection details, check the connection by clicking Test connection. If it returns a green mark, you have established the right connection between your source and sink.
Then press the Create button to create the linked service. It then asks for the data table that is in the Azure SQL DB. Then check whether the pipeline works by clicking the Debug button.
The run shows as Queued immediately after pressing Debug; within a minute or so, it returns the activity status as either success or failure with an error.
You can see in the below image that, in the next few seconds, it returned a success message, so we know the pipeline works well without any errors. Now click Publish all to save your pipeline; only then can you trigger it. Now go to Add trigger and select Trigger now. I'll use other trigger options in our next projects.
After clicking Trigger now, the pipeline starts running. You can see its status in the Monitor option, in the image shown below: the pipeline status shows In progress, and in the next few seconds it returns a success message.
You can see in the below image that our pipeline ran successfully and the data has been transferred from ADLS to Azure SQL DB.
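As a reference, here is a hedged sketch of the SDK equivalent of this pipeline: one Copy activity from the delimited-text dataset to the Azure SQL dataset, followed by a Trigger now style run and a status check. The names reused here come from the earlier sketches and are placeholders.

```python
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference,
    DelimitedTextSource, AzureSqlSink,
)

adf_client = DataFactoryManagementClient(
    DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
)
rg, factory = "Dev", "my-sales-adf"

# A pipeline with a single Copy activity: read the delimited-text source
# dataset and write it into the Azure SQL sink dataset.
copy_activity = CopyActivity(
    name="CopySalesToSql",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SalesCsvDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SalesSqlDataset")],
    source=DelimitedTextSource(),
    sink=AzureSqlSink(),
)
adf_client.pipelines.create_or_update(
    rg, factory, "CopySalesPipeline", PipelineResource(activities=[copy_activity])
)

# The SDK equivalent of "Trigger now": start a run and check its status,
# which moves from Queued to InProgress to Succeeded (or Failed).
run = adf_client.pipelines.create_run(rg, factory, "CopySalesPipeline")
status = adf_client.pipeline_runs.get(rg, factory, run.run_id).status
print(status)
```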
Now we will check whether our data from ADLS was transferred to Azure SQL by writing SQL queries. As you can see in the below images, previously the query did not return anything, and now it returns data.
Finally, we have built a simple data pipeline from ADLS Gen2 to Azure SQL DB. In the next projects, I won't go into this much detail, as it would take too long to write; this level of detail is for those who have just started learning to build pipelines in the Azure cloud.
In the next projects, I will add many transformations, build data flows, and automate the pipeline triggers. I'll share all of them as part of this Azure Data Engineer portfolio project series.
Would you like to add any points? Feel free to comment down below.
If you have enjoyed this article, press the Follow button for more updates.
Published via Towards AI