
PySpark For Beginners

Last Updated on September 22, 2022 by Editorial Team

Author(s): Muttineni Sai Rohith


PySpark is the Python API for Apache Spark. Using PySpark, we can run applications in parallel on a distributed cluster (multiple nodes).

Source: Databricks

We will start with the theory: why we need PySpark, the background of Apache Spark, its features and cluster manager types, and PySpark modules and packages.

Apache Spark is an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications. Spark is written in Scala, but for wider industry adoption a Python API, PySpark, was released so Spark can be used from Python. PySpark is used heavily in the machine learning and data science community; Spark runs operations on billions and trillions of records on distributed clusters up to 100 times faster than traditional Python applications.

PySpark Features and Advantages

PySpark Architecture

Apache Spark works in a master-slave architecture where the master is called the "Driver" and the slaves are called "Workers". The Spark Driver creates a SparkContext, which serves as the entry point to the application; all operations run on the worker nodes, and resources are managed by the cluster manager.

Source: Spark.Apache.org

Cluster Manager Types

Spark currently supports several cluster managers; besides these, we can also run Spark locally on our desktop/system:

  • Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
  • Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated)
  • Hadoop YARN: the resource manager in Hadoop 2 and 3, and the most widely used cluster manager.
  • Kubernetes: an open-source system for automating deployment, scaling, and management of containerized applications.

PySpark conf, PySpark context, and PySpark session:

PySpark conf: SparkConf provides the configuration for any Spark application. To start a Spark application on a local machine or a cluster, we need to set some configurations and parameters, which can be done with SparkConf (a short sketch follows the list below).

Features of PySpark conf:

  • set(key, value): set a configuration property.
  • setMaster(value): set the master URL to connect to.
  • setAppName(value): set the application name.
  • get(key, defaultValue=None): get the configured value for a key, or return a default otherwise.
  • setSparkHome(value): set the path where Spark is installed on worker nodes.
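
Here is a minimal sketch of how these methods fit together for a local run; the master URL, application name, and the spark.executor.memory setting are illustrative values, not requirements:

from pyspark import SparkConf, SparkContext

# Build a configuration for a local Spark application.
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("ConfSketch")
        .set("spark.executor.memory", "2g"))

# Start a SparkContext with that configuration and read a value back.
sc = SparkContext(conf=conf)
print(sc.getConf().get("spark.app.name"))   # ConfSketch
sc.stop()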

PySpark context: SparkContext is the entry point to any Spark functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext gets initiated there. The driver program then runs the operations inside the executors on worker nodes.

The Spark driver program creates and uses SparkContext to connect to the cluster manager, submit PySpark jobs, and know which resource manager to communicate with. It is the heart of a PySpark application.

We can create only one SparkContext per JVM; to create another, the existing one must first be stopped with the stop() method. In the PySpark shell, a SparkContext is available by default as the variable sc, and creating another one yourself will raise an error.
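
A minimal sketch of creating and using a SparkContext directly (the master URL and application name are illustrative):

from pyspark import SparkContext

# Only one SparkContext can be active per JVM.
sc = SparkContext(master="local[*]", appName="ContextSketch")

rdd = sc.parallelize(range(10))
print(rdd.count())   # 10

sc.stop()   # must be called before another SparkContext can be created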

PySpark session: Since Spark 2.0, SparkSession has been the entry point to PySpark for working with RDDs and DataFrames. Prior to 2.0, SparkContext was the entry point. SparkSession is a combined class for all the different contexts that existed before the 2.0 release (SQLContext, HiveContext, etc.), so since 2.0 it can be used in place of SQLContext, HiveContext, and the other pre-2.0 contexts.

Although SparkContext was the entry point prior to 2.0, it has not been completely replaced by SparkSession; many features of SparkContext are still available and used in Spark 2.0 and later. SparkSession internally creates a SparkConf and a SparkContext from the configuration provided to it.

We can create as many SparkSession objects as we want in a PySpark application, using SparkSession.builder.getOrCreate() or newSession() on an existing session. Multiple session objects are needed when you want to keep PySpark tables (relational entities) logically separated.

Creating a SparkSession

from pyspark.sql import SparkSession

# getOrCreate() returns the active SparkSession if one exists, otherwise builds a new one.
spark = SparkSession.builder.appName("Practice").getOrCreate()
spark  # in a notebook or shell, this displays the session details
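
A separate, logically isolated session can be created from an existing one with newSession(); a minimal sketch continuing from the spark object created above (the printed checks are just for illustration):

# newSession() shares the underlying SparkContext but keeps its own
# temporary views and SQL configuration.
spark2 = spark.newSession()
print(spark is spark2)                             # False
print(spark.sparkContext is spark2.sparkContext)   # True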

PySpark Modules and Packages


PySpark RDD (Resilient Distributed Dataset):

"Resilient Distributed Datasets (RDD) is a distributed memory abstraction that helps a programmer to perform in-memory computations on a large cluster." One of the important advantages of RDDs is fault tolerance: if any failure occurs, they recover automatically. An RDD is immutable once created, i.e., it cannot be changed after it is created.

An RDD divides data into smaller parts based on a key. The benefit of dividing data into smaller chunks is that if one executor node fails, another node can still process the data, so Spark recovers quickly from failures because the same data chunks are replicated across multiple executor nodes. RDDs perform functional computations against the dataset very quickly by making use of multiple nodes.
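
A minimal RDD sketch; the data and lambda functions are just for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDSketch").getOrCreate()
sc = spark.sparkContext

# Distribute a local Python list across the cluster as an RDD.
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations such as map() are lazy; actions such as collect() and reduce() trigger execution.
squares = rdd.map(lambda x: x * x)
print(squares.collect())                 # [1, 4, 9, 16, 25]
print(rdd.reduce(lambda a, b: a + b))    # 15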

PySpark For Beginners | Part-4: PySpark RDD

PySpark DataFrame:

DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.

Due to parallel execution on all cores on multiple machines, PySpark runs operations faster than pandas. In other words, pandas DataFrames run operations on a single node, whereas PySpark runs on multiple machines.
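A minimal DataFrame sketch; the data and column names are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

# Build a small DataFrame from local data.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45), ("Carol", 29)],
                           ["name", "age"])

df.printSchema()
df.filter(df.age > 30).show()

# DataFrames can also be loaded from files, e.g. (hypothetical path):
# df = spark.read.csv("people.csv", header=True, inferSchema=True)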

PySpark For Beginners | Part-2: PySpark DataFrame

PySpark SQL:

PySpark SQL is a module in Spark that integrates relational processing with Spark's functional programming API. We can extract data with SQL queries written in the same way as standard SQL.

In other words, Spark SQL brings native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on Spark DataFrames. In a later part of this tutorial, you will learn in detail about using SQL select, where, group by, join, union, etc.

PySpark SQL is easy to use and lets you go beyond the limitations of traditional relational data processing. Spark also supports the Hive Query Language, but the Hive database has its limitations; Spark SQL was developed to remove those drawbacks.
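
A minimal Spark SQL sketch; the view name and data are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLSketch").getOrCreate()

people = spark.createDataFrame([("Alice", 34), ("Bob", 45), ("Carol", 29)],
                               ["name", "age"])

# Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")

spark.sql("SELECT name, age FROM people WHERE age > 30 ORDER BY age").show()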

PySpark MLlib:

Apache Spark offers a Machine Learning API called MLlib. PySpark has this machine learning API in Python as well.

It supports different kinds of algorithms, listed below (a minimal example follows the list):

  • mllib.clustering
  • mllib.classification
  • mllib.fpm
  • mllib.linalg
  • mllib.recommendation
  • mllib.regression
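
As a small illustration, here is a sketch using the newer DataFrame-based pyspark.ml API (the training data is made up; the RDD-based pyspark.mllib modules listed above expose similar algorithms):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLSketch").getOrCreate()

# A tiny, made-up training set: (label, feature vector).
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.1])),
     (1.0, Vectors.dense([2.0, 1.0])),
     (0.0, Vectors.dense([0.1, 1.2])),
     (1.0, Vectors.dense([1.9, 0.8]))],
    ["label", "features"])

# Fit a logistic regression classifier and inspect its predictions.
model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()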

PySpark MLlib | Classification using PySpark ML

PySpark GraphFrames:

PySpark GraphFrames were introduced with the Spark 3.0 release to support graphs on DataFrames. Prior to 3.0, Spark had the GraphX library, which runs on RDDs and loses all DataFrame capabilities.

GraphFrames is a package for Apache Spark that provides DataFrame-based graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.
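
A minimal GraphFrames sketch, assuming the graphframes package has been added to the Spark application (it ships separately, e.g. via the --packages option); the vertices and edges are made up:

from pyspark.sql import SparkSession
from graphframes import GraphFrame   # separate Spark package, not bundled with PySpark

spark = SparkSession.builder.appName("GraphFramesSketch").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                          # in-degree of each vertex
g.find("(x)-[]->(y); (y)-[]->(z)").show()   # simple motif query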

PySpark Streaming:

PySpark Streaming is a scalable, high-throughput, fault-tolerant stream processing system that supports both batch and streaming workloads. It is used to process real-time data from sources such as file system folders, TCP sockets, S3, Kafka, Flume, Twitter, and Amazon Kinesis, to name a few. The processed data can be pushed to databases, Kafka, live dashboards, etc.
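
A minimal Structured Streaming sketch that counts words arriving on a local TCP socket; the host and port are assumptions (feed it with, for example, nc -lk 9999):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

# Read a text stream from a local TCP socket.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Classic streaming word count over the incoming lines.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Write the running counts to the console until the query is stopped.
(counts.writeStream
 .outputMode("complete")
 .format("console")
 .start()
 .awaitTermination())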

Spark Streaming

This concludes the theoretical part of PySpark.

Go through the content links provided above to understand and implement PySpark.

Happy Coding…


PySpark For Beginners was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.


Published via Towards AI
