AWS Redshift ETL using Pandas API
Last Updated on January 28, 2021 by Editorial Team
Author(s): Vivek Chaudhary
Cloud Computing
The objective of this blog is to perform a simple ETL exercise with the AWS Redshift database. Oracle Database tables are used as the source dataset; we perform simple transformations on the dataset using Pandas methods and write the result into an AWS Redshift table.
1. Import prerequisites and connect to the source Oracle database:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("oracle://scott:scott@oracle", echo=False)
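The connection string above hard-codes the demo SCOTT credentials. As a minimal sketch, assuming the credentials are supplied through environment variables (the names ORA_USER, ORA_PASSWORD and ORA_DSN are hypothetical, as is the helper build_db_url), the same kind of URL can be built with the user and password URL-escaped so that characters such as @ or / do not break parsing:

```python
import os
from urllib.parse import quote_plus


def build_db_url(dialect, user, password, host):
    """Return a SQLAlchemy-style URL string with user and password URL-escaped."""
    return f"{dialect}://{quote_plus(user)}:{quote_plus(password)}@{host}"


# Hypothetical environment variable names; fall back to the demo values used above.
user = os.environ.get("ORA_USER", "scott")
password = os.environ.get("ORA_PASSWORD", "scott")
dsn = os.environ.get("ORA_DSN", "oracle")

oracle_url = build_db_url("oracle", user, password, dsn)
# The resulting string can then be passed to create_engine(oracle_url, echo=False).
```

The same helper would work for the Redshift URL later in the post by passing "postgresql+psycopg2" as the dialect and the endpoint, port and database name as the host part.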
2. Extract Datasets from Oracle Database:
#Employee Dataset
emp_df=pd.read_sql_query("select * from emp",engine)
emp_df.head(10)
#Department Dataset
dept_df=pd.read_sql_query("select * from dept",engine)
dept_df.head(10)
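For readers without an Oracle instance at hand, the extract step can be tried out locally. The sketch below is an assumption-laden stand-in, not the setup used in this blog: it swaps Oracle for an in-memory SQLite database with a couple of illustrative rows, and it names the columns in the query instead of using select *, which is one way to pull only what the target needs.

```python
import sqlite3

import pandas as pd

# Stand-in for the Oracle source: an in-memory SQLite database (assumption).
source = sqlite3.connect(":memory:")
source.executescript(
    """
    create table dept (deptno integer, dname text, loc text);
    create table emp (empno integer, ename text, job text,
                      sal integer, comm real, deptno integer);
    insert into dept values (10, 'ACCOUNTING', 'NEW YORK');
    insert into dept values (20, 'RESEARCH', 'DALLAS');
    insert into emp values (7369, 'SMITH', 'CLERK', 800, null, 20);
    insert into emp values (7782, 'CLARK', 'MANAGER', 2450, null, 10);
    """
)

# Name the columns explicitly instead of "select *".
emp_df = pd.read_sql_query(
    "select empno, ename, sal, comm, deptno from emp", source
)
dept_df = pd.read_sql_query("select deptno, dname from dept", source)

print(emp_df.shape, dept_df.shape)
```

pd.read_sql_query accepts either a SQLAlchemy engine, as in the Oracle code above, or a sqlite3 connection, as here.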
3. Transform Dataset
Create the AWS Redshift target table using the script below:
create table emp (
empno integer,
ename varchar(20),
sal integer,
comm float,
deptno integer,
dname varchar(20)
);
Join the EMP and DEPT datasets:
joined_df=pd.merge(emp_df,dept_df,left_on="deptno",right_on="deptno",how="inner")
joined_df.head(10)
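Since both frames share the key name, the join can also be written with on="deptno". The sketch below uses small hand-made frames (illustrative values, not the real tables) and adds validate="many_to_one", an optional pd.merge argument that raises an error if a department number appears more than once in dept_df and would silently duplicate employee rows:

```python
import pandas as pd

# Illustrative sample rows only.
emp_df = pd.DataFrame(
    {
        "empno": [7369, 7782, 7839],
        "ename": ["SMITH", "CLARK", "KING"],
        "deptno": [20, 10, 10],
    }
)
dept_df = pd.DataFrame(
    {
        "deptno": [10, 20, 30],
        "dname": ["ACCOUNTING", "RESEARCH", "SALES"],
    }
)

# Many employees map to one department; fail loudly if dept_df has duplicate keys.
joined_df = pd.merge(
    emp_df, dept_df, on="deptno", how="inner", validate="many_to_one"
)
print(joined_df)
```

With an inner join, department 30 has no employees in this sample, so it does not appear in the result.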
Drop the columns that are not present in the target:
joined_df.drop(columns=['job','mgr','hiredate','loc'],inplace=True)
joined_df.head(10)
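An alternative to dropping columns by name is to select exactly the columns of the target table, in the target's order. This is only a sketch: TARGET_COLUMNS mirrors the create table script above, while the helper align_to_target and the sample row are made up for illustration.

```python
import pandas as pd

# Column list taken from the Redshift "emp" DDL above.
TARGET_COLUMNS = ["empno", "ename", "sal", "comm", "deptno", "dname"]


def align_to_target(df, target_columns):
    """Return a copy of df restricted to target_columns, in that order."""
    missing = [col for col in target_columns if col not in df.columns]
    if missing:
        raise KeyError(f"columns missing from DataFrame: {missing}")
    return df[target_columns].copy()


# One illustrative joined row carrying the extra source columns.
joined_df = pd.DataFrame(
    {
        "empno": [7369],
        "ename": ["SMITH"],
        "job": ["CLERK"],
        "mgr": [7902],
        "hiredate": ["1980-12-17"],
        "sal": [800],
        "comm": [None],
        "deptno": [20],
        "dname": ["RESEARCH"],
        "loc": ["DALLAS"],
    }
)

load_df = align_to_target(joined_df, TARGET_COLUMNS)
print(list(load_df.columns))
```

Selecting by the target's column list means a new column added to the source later will not slip into the load by accident, and a missing column is reported before any insert is attempted.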
4. Create Redshift connection and insert data
#create connection object
conn=create_engine("postgresql+psycopg2://<dbuser>:<dbpassword>@<cluster_endpoint_URL>:5439/<dbname>")
joined_df.to_sql("emp", conn, index=False, if_exists="append")
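The load step can be rehearsed without a cluster. The sketch below is a stand-in under stated assumptions: an in-memory SQLite database plays the role of Redshift, the table is pre-created with the same columns as the script above, and the rows are illustrative. It also passes chunksize, an optional to_sql argument that writes the rows in batches.

```python
import sqlite3

import pandas as pd

# Stand-in for the Redshift target (assumption): in-memory SQLite.
target = sqlite3.connect(":memory:")
target.execute(
    """
    create table emp (
        empno integer, ename varchar(20), sal integer,
        comm float, deptno integer, dname varchar(20)
    )
    """
)

# Illustrative rows shaped like the transformed dataset.
joined_df = pd.DataFrame(
    {
        "empno": [7369, 7499],
        "ename": ["SMITH", "ALLEN"],
        "sal": [800, 1600],
        "comm": [None, 300.0],
        "deptno": [20, 30],
        "dname": ["RESEARCH", "SALES"],
    }
)

# Append into the existing table, writing in batches of 1000 rows.
joined_df.to_sql("emp", target, index=False, if_exists="append", chunksize=1000)

loaded = target.execute("select count(*) from emp").fetchone()[0]
print(f"rows in target: {loaded}")
```

Against a real Redshift cluster, plain to_sql inserts can be slow for large frames; for bigger volumes, loading through the COPY command from S3 is generally the suggested route, which is outside the scope of this simple exercise.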
Verify the data in the Redshift table.
The "emp" table can be queried from the AWS console; we can also set up SQL Workbench on a local system to query Redshift tables.
DML operation is successful.
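The check can also be done from Python by comparing what was loaded with what the table now holds. This is a rough sketch rather than the verification used above: SQLite again stands in for Redshift, the helper summarize is hypothetical, and a row count plus a sum of sal is only a light sanity check.

```python
import sqlite3

import pandas as pd


def summarize(conn, table):
    """Return (row count, sum of sal) for the given table."""
    result = pd.read_sql_query(
        f"select count(*) as n, sum(sal) as total_sal from {table}", conn
    )
    return int(result["n"].iloc[0]), int(result["total_sal"].iloc[0])


# Stand-in target with two illustrative rows.
target = sqlite3.connect(":memory:")
loaded_df = pd.DataFrame({"empno": [7369, 7782], "sal": [800, 2450]})
loaded_df.to_sql("emp", target, index=False)

expected = (len(loaded_df), int(loaded_df["sal"].sum()))
actual = summarize(target, "emp")
print("match" if actual == expected else f"mismatch: {actual} != {expected}")
```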
5. Connectivity issue I faced
OperationalError: (psycopg2.OperationalError) could not connect to server: Connection timed out (0x0000274C/10060) Is the server running on host "redshift_cluster_name.unique_here.region.redshift.amazonaws.com" (<IP address>) and accepting TCP/IP connections on port 5439?
Issue Description
The issue was that the inbound rule in the Security Group specified a security group as the source. Changing it to a CIDR that included my IP address fixed the issue.
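Whether a given CIDR range actually covers your IP address can be checked with Python's standard ipaddress module. The addresses below are documentation-range placeholders, not the ones from my setup, and ip_in_cidr is a helper name made up for this sketch:

```python
import ipaddress


def ip_in_cidr(ip, cidr):
    """Return True if the IP address falls inside the CIDR range."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)


# Placeholder addresses from the reserved documentation ranges.
print(ip_in_cidr("203.0.113.25", "203.0.113.0/24"))  # True
print(ip_in_cidr("198.51.100.7", "203.0.113.0/24"))  # False
print(ip_in_cidr("203.0.113.25", "203.0.113.25/32"))  # True
```

A /32 range matches a single address, which is the narrowest rule that still lets one machine in.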
How to Fix?
Go to Cluster Properties → Network Security
Go to VPC Security Group → Inbound rules → Edit inbound rules, add both inbound rules, including one whose source is a CIDR range that covers your IP address, then click Save Rules.
And we are ready to go. Without the second rule, you may face connectivity issues with the AWS Redshift database, so follow the steps above to avoid or resolve the issue.
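To confirm that the port is reachable after changing the rules, a plain TCP check is usually enough before involving psycopg2. The sketch below uses only the standard library; can_connect is a hypothetical helper, and the demo runs against a throwaway local listener so that it works anywhere. For the real check, pass your cluster endpoint and port 5439 instead.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures.
        return False


# Demo against a local listener on a free port chosen by the OS.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

print(can_connect("127.0.0.1", demo_port, timeout=2.0))
listener.close()

# Real usage (placeholder endpoint, fill in your own):
# can_connect("<cluster_endpoint_URL>", 5439)
```

If this returns False for the cluster endpoint while the cluster is running, the security group rules are the first thing to look at.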
Thanks to all for reading my blog. Do share your views or feedback.
AWS Redshift ETL using Pandas API was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published via Towards AI