PySpark AWS S3 Read Write Operations
Author(s): Vivek Chaudhary
The objective of this article is to build an understanding of basic read and write operations on Amazon Simple Storage Service (S3). To be more specific, we will perform read and write operations on AWS S3 using the Apache Spark Python API, PySpark.
1. Setting up Spark session on Spark Standalone cluster
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
Set the Spark properties and connect to the Spark session:
#spark configuration
conf = SparkConf().set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true'). \
    set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true'). \
    setAppName('pyspark_aws').setMaster('local[*]')
sc = SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
print('modules imported')
Set the Spark Hadoop properties for all worker nodes as below:
accessKeyId = 'xxxxxxxxxx'
secretAccessKey = 'xxxxxxxxxxxxxxx'
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-us-east-2.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
spark = SparkSession(sc)
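As an aside, here is a minimal alternative sketch (not from the original post): the same S3A settings can be supplied as spark.hadoop.*-prefixed keys while building the SparkSession, instead of mutating the Hadoop configuration afterwards.
# Alternative sketch: pass the S3A settings at session build time.
spark = SparkSession.builder \
    .appName('pyspark_aws') \
    .master('local[*]') \
    .config('spark.hadoop.fs.s3a.access.key', accessKeyId) \
    .config('spark.hadoop.fs.s3a.secret.key', secretAccessKey) \
    .config('spark.hadoop.fs.s3a.endpoint', 's3-us-east-2.amazonaws.com') \
    .getOrCreate()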
s3a to write: Currently, there are three URI schemes for reading and writing files on S3: s3, s3n, and s3a. In this post we deal with s3a only, as it is the fastest. Please note that s3 will not be available in future releases.
v4 authentication: AWS S3 supports two versions of authentication, v2 and v4. For more details, consult the following link: Authenticating Requests (AWS Signature Version 4) in the Amazon Simple Storage Service documentation.
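For illustration only (these paths are not executed anywhere in this post), the choice of connector is simply the URI prefix used when addressing an object; everything below uses the s3a form:
path_s3  = 's3://pysparkcsvs3/pysparks3/emp_csv/emp.csv'   # legacy s3 connector
path_s3n = 's3n://pysparkcsvs3/pysparks3/emp_csv/emp.csv'  # legacy s3n connector
path_s3a = 's3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv'  # s3a, used in this post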
2. Read the dataset present on the local system
emp_df = spark.read.csv(r'D:\python_coding\GitLearn\python_ETL\emp.dat', header=True, inferSchema=True)
emp_df.show(5)
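Before pushing the data to S3, it is worth a quick sanity check that the schema and row count look right (an optional step, not in the original post):
emp_df.printSchema()   # verify the inferred column names and data types
print(emp_df.count())  # number of records read from the local emp.dat file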
3. PySpark DataFrame to AWS S3 Storage
emp_df.write.format('csv').option('header','true').save('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv',mode='overwrite')
Verify the dataset in the S3 bucket as below:
We have successfully written the Spark DataFrame to the AWS S3 bucket 'pysparkcsvs3'.
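CSV is convenient for small datasets, but a columnar format is usually a better fit for analytics. As a sketch only (the Parquet prefix below is an assumption, not part of the original pipeline), the same DataFrame could be written as Parquet:
# Sketch: write the same DataFrame as Parquet to a hypothetical prefix.
emp_df.write.mode('overwrite').parquet('s3a://pysparkcsvs3/pysparks3/emp_parquet/')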
4. Read Data from AWS S3 into a PySpark DataFrame
s3_df = spark.read.csv('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv/', header=True, inferSchema=True)
s3_df.show(5)
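When the file layout is known in advance, passing an explicit schema avoids the extra pass over the data that inferSchema triggers. A minimal sketch follows; the column names and types are assumptions about the emp dataset, not taken from the original post:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Hypothetical schema for the emp dataset; adjust to the actual columns.
emp_schema = StructType([
    StructField('empno', IntegerType(), True),
    StructField('ename', StringType(), True),
    StructField('sal', IntegerType(), True)
])
s3_df = spark.read.csv('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv/', header=True, schema=emp_schema)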
We have successfully written data to and retrieved it from AWS S3 storage with the help of PySpark.
5. Issue I faced
While writing the PySpark DataFrame to S3, the process failed multiple times, throwing an error caused by the missing hadoop.dll native library on Windows.
Solution: Download the hadoop.dll file from https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin and place it under the C:\Windows\System32 directory.
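Alternatively (an assumption based on the usual winutils setup on Windows, not something verified in the original post), the Hadoop native binaries can live in a dedicated folder exposed through HADOOP_HOME before Spark starts:
import os
# Hypothetical location; place winutils.exe and hadoop.dll under C:\hadoop\bin
os.environ['HADOOP_HOME'] = r'C:\hadoop'
os.environ['PATH'] = os.environ['HADOOP_HOME'] + r'\bin;' + os.environ['PATH']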
That's all for this blog. Thanks to all for reading. Do share your views and feedback; they matter a lot.