PySpark For Beginners
Last Updated on September 22, 2022 by Editorial Team
Author(s): Muttineni Sai Rohith
PySpark is the Python API for Apache Spark. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes).
We will start with the theory: why we need PySpark, the background of Apache Spark, its features and cluster manager types, and PySpark modules and packages.
Apache Spark is an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications. Spark itself is written in Scala, but to broaden its adoption a Python API, PySpark, was released for using Spark with Python. PySpark is used heavily in the machine learning and data science community; Spark runs operations on billions and trillions of records on distributed clusters, up to 100 times faster than traditional single-node Python applications.
PySpark Architecture
Apache Spark works in a master-slave architecture, where the master is called the "Driver" and the slaves are called "Workers". The Spark Driver creates a SparkContext that serves as the entry point to the application; all operations run on the worker nodes, and resources are managed by the cluster manager.
Cluster Manager Types
Spark currently supports several cluster managers; besides these, we can also run Spark locally on our desktop/system (a short sketch follows the list below):
- Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
- Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated)
- Hadoop YARN: the resource manager in Hadoop 2 and 3, and the most widely used cluster manager.
- Kubernetes: an open-source system for automating deployment, scaling, and management of containerized applications.
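As a rough sketch of how the cluster manager is chosen in code, the master URL passed when building a session selects it (the cluster URLs shown in the comments are placeholders and assume a running cluster):
from pyspark.sql import SparkSession
# Local mode: use all cores on this machine, no cluster manager required
spark = SparkSession.builder.master("local[*]").appName("ClusterDemo").getOrCreate()
# Other illustrative master URLs:
#   "spark://host:7077"        -> Standalone cluster manager
#   "yarn"                     -> Hadoop YARN (reads config from HADOOP_CONF_DIR)
#   "k8s://https://host:port"  -> Kubernetes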
PySpark conf, PySpark context, and PySpark session:
PySpark conf: SparkConf holds the configuration for a Spark application. To start any Spark application on a local machine or a cluster, we need to set some configuration parameters, and this is done using SparkConf (a short sketch follows the feature list below).
Features of PySpark conf:
- set(key, value): set a configuration property.
- setMaster(value): set the master URL to connect to.
- setAppName(value): set the application name.
- get(key, defaultValue=None): get the configured value for a key, or return a default otherwise.
- setSparkHome(value): set the path where Spark is installed on worker nodes.
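A minimal sketch of how these SparkConf methods fit together (the application name, master URL, and memory setting below are just examples):
from pyspark import SparkConf
conf = (SparkConf()
        .setAppName("ConfDemo")        # application name shown in the UI
        .setMaster("local[2]")         # master URL: local mode with 2 cores
        .set("spark.executor.memory", "1g"))
# get() returns the configured value, or the supplied default if the key is unset
print(conf.get("spark.executor.memory", "512m"))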
PySpark context: SparkContext is the entry point to any Spark functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext gets initiated there. The driver program then runs the operations inside executors on the worker nodes.
The Spark driver program creates and uses the SparkContext to connect to the cluster manager, submit PySpark jobs, and know which resource manager to communicate with. It is the heart of a PySpark application.
We can create only one SparkContext per JVM. In order to create another, you first need to stop the existing one using the stop() method. In the PySpark shell, a SparkContext is already available by default as "sc", so creating another one there without stopping it first will give an error.
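A small sketch of that rule, showing the active context being reused and then stopped before a new one is created (the names are illustrative):
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("ContextDemo").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)   # reuses the active SparkContext if one exists
print(sc.appName)
sc.stop()                             # stop the active context first...
sc = SparkContext(conf=conf)          # ...then a new one can be created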
PySpark session: Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames. Prior to 2.0, SparkContext was the entry point. SparkSession is a combined class for all the different contexts we had before the 2.0 release (SQLContext, HiveContext, etc.), so since 2.0 it can be used in place of SQLContext, HiveContext, and the other contexts defined prior to 2.0.
Although SparkContext used to be the entry point prior to 2.0, it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. SparkSession internally creates a SparkConf and a SparkContext from the configuration provided to it.
You can create as many SparkSession objects as you want in a PySpark application, using SparkSession.builder.getOrCreate() or newSession() on an existing session. Multiple session objects are useful when you want to keep PySpark tables (relational entities) logically separated.
Creating a SparkSession
from pyspark.sql import SparkSession
# getOrCreate() returns the active SparkSession if one exists, otherwise it builds a new one
spark = SparkSession.builder.appName("Practice").getOrCreate()
spark
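As a small follow-up sketch, a second, logically separate session can be derived from the first; both share one underlying SparkContext, but temporary views registered in one are not visible in the other:
spark2 = spark.newSession()
print(spark is spark2)                            # False: two distinct session objects
print(spark.sparkContext is spark2.sparkContext)  # True: one shared SparkContext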
PySpark Modules and Packages
PySpark RDD (Resilient Distributed Dataset):
"Resilient Distributed Datasets (RDD) is a distributed memory abstraction that helps a programmer to perform in-memory computations on a large cluster." One of the important advantages of RDDs is fault tolerance, which means that if any failure occurs, they recover automatically. An RDD is immutable once it is created, i.e., it cannot be changed after creation.
An RDD divides its data into smaller parts called partitions. The benefit of dividing data into smaller chunks is that if one executor node fails, another node can still process the data: Spark tracks how each partition was computed (its lineage) and can recompute lost partitions on other nodes, so it recovers quickly from failures. By spreading partitions across multiple nodes, RDDs make it possible to run functional computations against a dataset very quickly.
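A minimal sketch of creating an RDD and applying a transformation and an action to it (the numbers are made-up sample data):
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5], numSlices=2)  # 2 partitions
squared = rdd.map(lambda x: x * x)   # transformation: evaluated lazily
print(squared.collect())             # action: [1, 4, 9, 16, 25]
print(rdd.getNumPartitions())        # 2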
PySpark For Beginners | Part-4: PySpark RDD
PySpark DataFrame:
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
Due to parallel execution on all cores of multiple machines, PySpark runs operations faster than pandas: a pandas DataFrame runs operations on a single node, whereas a PySpark DataFrame runs on multiple machines.
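A small sketch of building a DataFrame from in-memory data (the rows and column names are invented for illustration):
data = [("Alice", 25), ("Bob", 30), ("Cathy", 28)]
df = spark.createDataFrame(data, schema=["name", "age"])
df.printSchema()                 # name: string, age: long
df.filter(df.age > 26).show()    # rows with age greater than 26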
PySpark For Beginners | Part-2: PySpark DataFrame
PySpark SQL:
PySpark SQL is a Spark module that integrates relational processing with Spark's functional programming API. It lets us extract data with SQL queries, using the same syntax as standard SQL.
In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on a Spark DataFrame, using select, where, group by, join, union, etc.
PySpark SQL is easy to use, and it goes beyond the limitations of traditional relational data processing. Spark also supports the Hive Query Language, but the Hive database has its limitations, and Spark SQL was developed to remove those drawbacks.
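A brief sketch: register a DataFrame as a temporary view and query it with plain SQL (the view, column names, and rows are invented for illustration):
people = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
people.createOrReplaceTempView("people")   # expose the DataFrame to SQL
spark.sql("SELECT name, age FROM people WHERE age > 26 ORDER BY age").show()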
PySpark MLlib:
Apache Spark offers a machine learning API called MLlib, and PySpark exposes this machine learning API in Python as well.
It supports different kinds of algorithms, organized in the packages mentioned below (a sketch using the DataFrame-based API follows the list):
- mllib.clustering
- mllib.classification
- mllib.fpm
- mllib.linalg
- mllib.recommendation
- mllib.regression
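As a rough sketch of a classification workflow, here is the newer DataFrame-based pyspark.ml API rather than the older RDD-based pyspark.mllib packages listed above (the toy training data is invented):
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
# tiny labeled dataset: a features vector and a label per row
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0),
     (Vectors.dense([2.0, 1.3]), 1.0),
     (Vectors.dense([0.0, 1.2]), 0.0)],
    ["features", "label"])
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)                       # train the classifier
model.transform(train).select("features", "label", "prediction").show()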
PySpark MLlib | Classification using PySpark ML
PySpark GraphFrames:
PySpark GraphFrames support graphs on DataFrames. Spark's built-in GraphX library runs on RDDs and loses the DataFrame capabilities, which is the gap GraphFrames fills.
GraphFrames is a package for Apache Spark that provides DataFrame-based graphs. It provides high-level APIs in Scala, Java, and Python, and it aims to offer both the functionality of GraphX and extended functionality that takes advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
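A hedged sketch, assuming the external graphframes package has been installed (for example via the Spark packages mechanism); the vertices and edges below are invented:
from graphframes import GraphFrame
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
g.inDegrees.show()              # in-degree of each vertex
g.find("(x)-[e]->(y)").show()   # a simple motif query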
PySpark Streaming:
PySpark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads. It is used to process real-time data from sources like file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Amazon Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.
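A minimal Structured Streaming sketch that counts words arriving on a TCP socket (the host and port are placeholders and assume something like netcat is writing lines to that socket):
from pyspark.sql.functions import explode, split
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()
query = (counts.writeStream
         .outputMode("complete")   # emit the full updated counts table each trigger
         .format("console")
         .start())
query.awaitTermination()           # block until the stream is stopped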
This concludes the theoretical part of PySpark.
Go through the content links provided above to understand and implement PySpark.
Happy Coding…