PySpark AWS S3 Read Write Operations
Author(s): Vivek Chaudhary
The objective of this article is to build an understanding of basic read and write operations on Amazon Simple Storage Service (S3). To be more specific, we will perform read and write operations on AWS S3 using the Apache Spark Python API, PySpark.
1. Setting up Spark session on Spark Standalone cluster
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
Set the Spark properties and connect to the Spark session:
#spark configuration
conf = SparkConf().set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true'). \
    set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true'). \
    setAppName('pyspark_aws').setMaster('local[*]')
sc = SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
print('modules imported')
Set the Spark Hadoop properties for all worker nodes as below:
accessKeyId = 'xxxxxxxxxx'
secretAccessKey = 'xxxxxxxxxxxxxxx'
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-us-east-2.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
spark = SparkSession(sc)
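As an aside, here is a minimal alternative sketch (not from the original post): the same S3A settings can be supplied as spark.hadoop.*-prefixed keys while building the SparkSession, instead of mutating the Hadoop configuration afterwards.
# Alternative sketch: pass the S3A settings at session build time.
spark = SparkSession.builder \
    .appName('pyspark_aws') \
    .master('local[*]') \
    .config('spark.hadoop.fs.s3a.access.key', accessKeyId) \
    .config('spark.hadoop.fs.s3a.secret.key', secretAccessKey) \
    .config('spark.hadoop.fs.s3a.endpoint', 's3-us-east-2.amazonaws.com') \
    .getOrCreate()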
s3a to write: Currently, there are three URI schemes for reading and writing files on S3: s3, s3n, and s3a. In this post we deal with s3a only, as it is the fastest. Please note that s3 will not be available in future releases.
v4 authentication: AWS S3 supports two versions of authentication, v2 and v4. For more details, consult the following link: Authenticating Requests (AWS Signature Version 4) in the Amazon Simple Storage Service documentation.
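For illustration only (these paths are not executed anywhere in this post), the choice of connector is simply the URI prefix used when addressing an object; everything below uses the s3a form:
path_s3  = 's3://pysparkcsvs3/pysparks3/emp_csv/emp.csv'   # legacy s3 connector
path_s3n = 's3n://pysparkcsvs3/pysparks3/emp_csv/emp.csv'  # legacy s3n connector
path_s3a = 's3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv'  # s3a, used in this post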
2. Read the dataset present on the local system
emp_df = spark.read.csv(r'D:\python_coding\GitLearn\python_ETL\emp.dat', header=True, inferSchema=True)
emp_df.show(5)
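Before pushing the data to S3, it is worth a quick sanity check that the schema and row count look right (an optional step, not in the original post):
emp_df.printSchema()   # verify the inferred column names and data types
print(emp_df.count())  # number of records read from the local emp.dat file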
3. PySpark DataFrame to AWS S3 Storage
emp_df.write.format('csv').option('header','true').save('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv',mode='overwrite')
Verify the dataset in the S3 bucket as below:
We have successfully written the Spark DataFrame to the AWS S3 bucket 'pysparkcsvs3'.
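CSV is convenient for small datasets, but a columnar format is usually a better fit for analytics. As a sketch only (the Parquet prefix below is an assumption, not part of the original pipeline), the same DataFrame could be written as Parquet:
# Sketch: write the same DataFrame as Parquet to a hypothetical prefix.
emp_df.write.mode('overwrite').parquet('s3a://pysparkcsvs3/pysparks3/emp_parquet/')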
4. Read Data from AWS S3 into a PySpark DataFrame
s3_df = spark.read.csv('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv/', header=True, inferSchema=True)
s3_df.show(5)
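When the file layout is known in advance, passing an explicit schema avoids the extra pass over the data that inferSchema triggers. A minimal sketch follows; the column names and types are assumptions about the emp dataset, not taken from the original post:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Hypothetical schema for the emp dataset; adjust to the actual columns.
emp_schema = StructType([
    StructField('empno', IntegerType(), True),
    StructField('ename', StringType(), True),
    StructField('sal', IntegerType(), True)
])
s3_df = spark.read.csv('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv/', header=True, schema=emp_schema)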
We have successfully written data to and retrieved it from AWS S3 storage with the help of PySpark.
5. Issue I faced
While writing the PySpark DataFrame to S3, the process failed multiple times, throwing an error caused by the missing hadoop.dll native library on Windows.
Solution: Download the hadoop.dll file from https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin and place it under the C:\Windows\System32 directory.
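Alternatively (an assumption based on the usual winutils setup on Windows, not something verified in the original post), the Hadoop native binaries can live in a dedicated folder exposed through HADOOP_HOME before Spark starts:
import os
# Hypothetical location; place winutils.exe and hadoop.dll under C:\hadoop\bin
os.environ['HADOOP_HOME'] = r'C:\hadoop'
os.environ['PATH'] = os.environ['HADOOP_HOME'] + r'\bin;' + os.environ['PATH']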
That's all for this blog. Thanks to all for reading. Do share your views and feedback; they matter a lot.