Code Migration: SAS to Python
Last Updated on July 20, 2023 by Editorial Team
Author(s): Vivek Chaudhary
Originally published on Towards AI.
Programming
The objective of this article is to build a fundamental understanding of SAS programming and rewrite the feature in the Python programming language. As a part of the demo, I have to build a tiny ETL process in SAS and redesigned the same in Python.
- Create SAS library and Import Datasets
--SAS library creation
libname mylib '<path>';
--import emp dataset
DATA mylib.emp; --> Data step
infile '/folders/myfolders/emp.csv'
dlm=','
FIRSTOBS=2 DSD;
input EMPNO ENAME $ SAL DEPTNO COMM;
run;
DATA mylib.dept; --> Data step
infile '/folders/myfolders/dept.csv'
dlm=','
FIRSTOBS=2 DSD;
input DEPTNO DNAME $ LOC $;
run;
2. SAS Programming structure explanation
SAS programming involves three basic steps:- Data Step, Proc Step, and Output Step.
DATA Step involves loading dataset into SAS memory (RAM), variable identification (columns are referred to as variables in SAS). In short, this step is a combination of parsing and data loading. Data or records are referred to as observations in SAS.
PROC Step involves the data analysis part, basically to invoke SAS built-in procedure to analyze the data.
OUTPUT Step: prints or stores the result of the proc step.
3. SAS ETL
The next step is to perform join on the above emp and dept datasets loaded into SAS memory. But SAS demands data be sorted before joining happens.
Sort the EMP dataset as the DEPT dataset is already sorted on join key βDEPTNO.β
proc sort data=mylib.emp; --> Proc step
by DEPTNO;
The above step is the proc step.
Next is to perform merge operation on EMP & DEPT dataset.
DATA mylib.emp_dept; --> equi join
merge mylib.dept(in=X) mylib.emp(in=Y);
by DEPTNO;
if X and Y;
run;
The default join strategy of SAS is Full Outer Join, which doesnβt require mention of any condition. In our case, the join is Equi join.
Letβs break up the join and try to understand what is what?
Merge: in SAS, a merge is a keyword for join operation.
IN: in=val, represents that keep only matching values.
By: mentions join key.
Reorder the SAS variables (columns) as the business demands the output to be in a specific format.
#reorder SAS variables
DATA mylib.emp_dept;
retain EMPNO ENAME DEPTNO DNAME SAL COMM LOC;
set mylib.emp_dept;
run;
Note: Nulls in SAS is denoted by β.β
The next step is to write the table data to a CSV file.
#write data to csv file
proc export data=mylib.emp_dept
outfile='/folders/myfolders/emp_dept.csv'
dbms= csv
replace; --> to overwrite existing file, truncate load
run;
We have successfully completed the ETL process using SAS programming. The next step is to replicate or redesign the same code using Python programming.
4. Python replica of SAS program
Import prerequisite library and dataset.
import pandas as pd
emp_df= pd.read_csv(rβD:<path>\emp.csvβ)
emp_df.head(10)
dept_df= pd.read_csv('D:<path>\dept.csvβ)
dept_df.head(5)
#merge the dataset to perform inner join
final_df=pd.merge(emp_df,dept_df[[βDEPTNOβ,βDNAMEβ,βLOCβ]],on=βDEPTNOβ,how=βinnerβ)
final_df.head(10)
The next step is to write the merged dataset to a CSV file.
#write pandas df to csv file
final_df.to_csv(βD:<path>\emp_dept.csvβ,index=False)
We have successfully written output in CSV format and replicated the SAS application code in python as well.
To summarize, first, we have designed a simple ETL application using SAS programming and then redesigned the same application using Python programming.
Thank you for supporting the content.
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI