
PySpark AWS S3 Read Write Operations
Last Updated on February 2, 2021 by Editorial Team

Author(s): Vivek Chaudhary


The objective of this article is to build an understanding of basic read and write operations on Amazon Simple Storage Service (S3). To be more specific, we will perform read and write operations on AWS S3 using the Apache Spark Python API, PySpark.

1. Setting up the Spark session on a Spark standalone cluster

import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os

# Load the AWS SDK and hadoop-aws packages when PySpark starts
# (note: it is '--packages', with no space after the dashes)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
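As an aside, the same dependencies can also be declared through the SparkSession builder instead of PYSPARK_SUBMIT_ARGS. This is a minimal sketch using the same package coordinates, not the approach taken in the rest of this post:

from pyspark.sql import SparkSession

# Alternative setup: declare the AWS jars via spark.jars.packages
spark_alt = (
    SparkSession.builder
    .appName('pyspark_aws')
    .master('local[*]')
    .config('spark.jars.packages',
            'com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3')
    .getOrCreate()
)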

Set the Spark properties and connect to the Spark session:

# Spark configuration: enable V4 signing on both driver and executors
conf = SparkConf() \
    .set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true') \
    .set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true') \
    .setAppName('pyspark_aws') \
    .setMaster('local[*]')
sc = SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
print('modules imported')

Set Spark Hadoop properties for all worker nodes as below:

# AWS credentials (placeholders; avoid hard-coding real keys)
accessKeyId = 'xxxxxxxxxx'
secretAccessKey = 'xxxxxxxxxxxxxxx'

# Hadoop configuration applied to all worker nodes
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-us-east-2.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

spark = SparkSession(sc)

s3a to write: Currently, there are three schemes for reading and writing files on S3: s3, s3n, and s3a. In this post we use s3a only, as it is the fastest of the three. Please note that the legacy s3 scheme will not be available in future releases.
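For reference, here is the same object from this post addressed under each scheme; only the URI prefix differs:

# Same bucket/key under each connector scheme:
legacy_path = 's3://pysparkcsvs3/pysparks3/emp_csv/emp.csv'   # original block-based connector, being phased out
native_path = 's3n://pysparkcsvs3/pysparks3/emp_csv/emp.csv'  # older native connector
modern_path = 's3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv'  # current connector, used throughout this post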

v4 authentication: AWS S3 supports two versions of request signing, v2 and v4. For more details, see Authenticating Requests (AWS Signature Version 4) in the Amazon Simple Storage Service documentation.
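The enableV4 Java options and system property set earlier are what switch the SDK to v4 signing. V4 signing also requires a region-specific endpoint, so for a region other than us-east-2 you would point fs.s3a.endpoint at that region's endpoint (eu-west-1 is shown here purely as an example):

# Example only: point s3a at a different region's endpoint
hadoopConf.set('fs.s3a.endpoint', 's3.eu-west-1.amazonaws.com')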

2. Read the dataset present on the local system

# Read the local CSV (a raw string keeps Windows backslashes literal)
emp_df = spark.read.csv(r'D:\python_coding\GitLearn\python_ETL\emp.dat', header=True, inferSchema=True)
emp_df.show(5)
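A quick sanity check on the loaded DataFrame; printSchema and count are standard DataFrame methods:

emp_df.printSchema()           # confirm the types that inferSchema picked
print(emp_df.count(), 'rows')  # total rows read from the local file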

3. PySpark DataFrame to AWS S3 Storage

# Write the DataFrame to S3 as CSV, overwriting any existing output
emp_df.write.format('csv').option('header', 'true').save('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv', mode='overwrite')
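Equivalently, the write can be expressed with the chained writer API; this sketch does the same thing as the call above:

# Same write, using mode() and the csv() shortcut
emp_df.write \
    .mode('overwrite') \
    .option('header', 'true') \
    .csv('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv')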

Verify the dataset in the S3 bucket. We have successfully written the Spark DataFrame to the AWS S3 bucket "pysparkcsvs3".

4. Read Data from AWS S3 into PySpark Dataframe

# Read the CSV back from S3 (Spark wrote it as a directory of part files)
s3_df = spark.read.csv('s3a://pysparkcsvs3/pysparks3/emp_csv/emp.csv/', header=True, inferSchema=True)
s3_df.show(5)
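To confirm the round trip, one can compare the S3 copy against the local original (a minimal check, assuming both DataFrames are still in scope):

# Row counts should match if the write and read both succeeded
assert s3_df.count() == emp_df.count()
s3_df.printSchema()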

We have successfully written and retrieved the data to and from AWS S3 storage with the help of PySpark.

5. Issue I faced

While writing the PySpark DataFrame to S3, the process failed multiple times, throwing a Hadoop native-library error on Windows.

Solution: Download the hadoop.dll file from https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin and place it under the C:\Windows\System32 directory.
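If the error persists, it may help to verify the Windows/Hadoop setup from Python. This is a hypothetical check, assuming hadoop.dll was copied to System32 as described above:

import os

# Hypothetical sanity check for the Windows Hadoop native libraries
print('HADOOP_HOME =', os.environ.get('HADOOP_HOME'))
print('hadoop.dll present:', os.path.exists(r'C:\Windows\System32\hadoop.dll'))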

That's all for this blog. Thanks for reading! Do share your views and feedback; they matter a lot.


PySpark AWS S3 Read Write Operations was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
