Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Take the GenAI Test: 25 Questions, 6 Topics. Free from Activeloop & Towards AI

Publication

Big-Data Pipelines with SparkML
Data Analysis   Data Science   Machine Learning

Big-Data Pipelines with SparkML

Last Updated on January 6, 2023 by Editorial Team

Author(s): Lawrence Alaso Krukrubo

Data Analysis, Data Science, MachineΒ Learning

Creating Apache Spark ML Pipelines for Big-DataΒ Analysis

Photo by Rodion Kutsaev onΒ Unsplash

Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a singleΒ step.

Many Data Scientists hack together models without Pipelines… (Kaggle)

Kaggle also says Pipelines have some important benefits, suchΒ as:

  1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a Pipeline, you won’t need to manually keep track of your training and validation data at eachΒ step.
  2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
  3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale, but Pipelines canΒ help.
  4. More Options for Model Validation: We can easily apply cross-validation and other techniques to our Pipelines.

So, a Pipeline is a convenient process of designing our data preprocessing and Machine Learning flow. There are certain steps that we must do before the actual ML begins. These steps are called data-preprocessing and/or feature engineering.

Some Pipeline stepsΒ include:

  • Converting categorical values to nominal & numerical values
  • Normalizing the range of values per dimension
  • One-Hot encoding categorical valuesΒ and…
  • Modeling… where we train our ML algorithm.

The overall idea of Pipelines is that we can fuse our complete data processing flow into one single pipeline, and that single pipeline we can further use downstream.

Some PipelineΒ Methods:

Image from IBM scalable-machine-learning-for-big-data-with-apache-spark

Pipeline as a Machine Learning Algorithm has the following methods…

  • Fit: Fit basically starts theΒ training
  • Score: Score gives back the predicted value.
  • Evaluate: Evaluates the model performance on the validation data.

One extra advantage of Pipelines is that we can cross-validate, meaning, try out many, many parameters using the very same Pipeline. And this really accelerates the optimization of the algorithm.

So, in summary, pipelines are facilitating our day to day work in machine learning as we can draw from pre-defined data processing steps, we make sure everything is aligned, and we can switch and swap our algorithms asΒ needed.

While there are plenty of materials covering Pipelines for machine learning, today we shall focus on Pipelines for machine learning on Big-data, using ApacheΒ SparkML.

1. Intro:

For this exercise, we shall use the HMP dataset. It’s basically accelerometer recordings from an accelerometer sensor attached to the human body. The data records sensors from humans when performing activities like brush-teeth, comb-hair, eat-soup, and soΒ on.

So, we shall preprocess this dataset for Machine Learning tasks. First, manually, then we shall build a SparkML Pipeline to preprocess the dataset automatically for us. This Pipeline can be applied to future datasets.

I use Colab for my data exploration. If you need help starting Pyspark in Colab, see thisΒ link.

So first, we set up Pyspark inΒ Colab…

2. Data Extraction

Note that the data is in a parquet file format. Parquet uses compression and column store, which maps the data layout to the Apache Spark Tungsten memoryΒ layout.

So we can see there are different folders in this HMP_dataset. Folders representing different activities such as Brush_teeth, Drink_glass, Getup_bed, Pour_water, Use_telephone.

Looking at the text files in the Brush_teeth folder, for example, we can see the accelerometer data represented in three numeric columns we may call X, Y,Β Z.

!head HMP_Dataset/Brush_teeth/Accelerometer-2011-04-11-13-28-18-brush_teeth-f1.txt
>>
22 49 35
22 49 35
22 52 35
22 52 35
21 52 34

Let’s recursively traverse through these folders in HMP_Dataset and create an Apache spark DataFrame from these text files. Then we just union all DataFrames <df.union(df2)> into one overall DataFrame containing all theΒ data.

First, Let’s define the schema of the data frameΒ below.

Now let’s traverse through the data using the OSΒ library.

Let’s remove non-action folders from the file list. These are typically folders without an underscore in theirΒ names.

Okay, we have all the folders containing the data in one array. Now we can iterate over thisΒ array.

So from the Git gist above, first, we define an empty DataFrame and import tqdm for progress-bars. Next, we import the lit library that helps us write String literal columns to an Apache Spark DataFrame.

At this point, all we do is go through each file using the OS library, add the three numeric columns to the X, Y, Z schema we defined earlier, add these to the DataFrame and add two String Literal columns, one for the class of the accelerometer reading and the other for the source file for theΒ reading.

Let’s see the schema of the DataFrame…

df.printSchema()
>>
root
|-- x: integer (nullable = true)
|-- y: integer (nullable = true)
|-- z: integer (nullable = true)
|-- class: string (nullable = false)
|-- source: string (nullable = false)

Let’s see the first 10 rows of the DataFrame. This takes a little while to run, cos as we know, DataFrames in Apache Spark are alwaysΒ lazy…

3. Data Transformation

Now we need to transform the data and create an integer representation of the class column as ML algorithms cannot cope with a String. So we will transform the class into a number of integers using the StringIndexer module.

The StringIndexer is an estimator having both fit and transform methods. So we create a StringIndexer object (indexer), pass the β€˜class’ column as inputCol, and β€˜classIndex’ as outputCol. Then we fit the DataFrame to the indexer, and transform the DataFrame. This creates a brand new DataFrame (indexed), which we can see above, containing the classIndex additional column.

4. One-Hot Encoding:

With the class index column, we can now do one-hot-encoding inΒ Pyspark…

Unlike the StringIndexer, OneHotEncoder is a pure transformer, having only the transform method. It uses the same syntax of β€˜inputCol’ and β€˜outputCol’ we saw in StringIndexer. We pass β€˜classIndex’ and β€˜categoryVec’ values, respectively. OneHotEncoder also creates a brand new DataFrame (encoded), with a β€˜category_vec’ column added to the previous DataFrame(indexed).

One more thing to note is that calling encoded.show(10, False) in the Git gist above, ensures that 10 rows are displayed and False ensures that each column element is fully expanded, cos normally SparkML compresses columnΒ cells.

Finally, OneHotEncoder in SparkML doesn't return several columns containing only zeros and one at a point where a value existed, as we all know… Rather it returns a sparse-vector as seen in the categoryVec column. Thus, for the β€˜Drink_glass’ class above, SparkML returns a sparse-vector that basically says there are 13 elements, and at position 2, the class value exists(1.0).

5. VectorAssembler:

The next thing we need to do is to transform our numerical columns X, Y, Z into vectors because sparkML can only work on vector objects. So let’s import Vectors and VectorAssembler libraries

The VectorAssembler object gets initialized with the same syntax we used in StringIndexer and OneHotEncoder. We pass list [β€˜x’, β€˜y’, β€˜z’] to inputCols, and we specify outputCol = β€˜features’. This is also a pure transformer like OneHotEncoder. So we transform the DataFrame from the last step (encoded) into a new DataFrame (features_vectorized) with the features columnΒ added.

The consistency of SparkML syntax is quite impressive, it reduces the learning curve for Big-Data Enthusiasts…

5. Normalizing TheΒ Dataset:

So, the next step is to normalize the data set. This makes the range of values in the data set for all numerical columns to be between 0 and 1 or -1 and 1. The idea is to have all features data within the same range, so no one overshadows theΒ other.

We must have gotten used to this by now, StringIndexer, OneHotEncoder, VectorAssembler, and Normalizer all have consistent syntax. Looking at the Normalizer object, it contains the parameter p=1.0. Note that the default norm value for Pyspark Normalizer isΒ p=2.0.

p=1.0 means the features are normalized based on the Manhattan Distance. Manhattan distance or taxi-cab distance between two points is the sum of the absolute difference between corresponding coordinates of these two points in the feature vectors. So these features are normalized based on thisΒ metric.

While…

p=2.0 means the features are normalized based on the Euclidean Distance. Euclidean distance between two points is exactly the same as computing the magnitude of the vector connecting these twoΒ points.

Note that these are the same methods used by clustering algorithms like KNN and K-means, forΒ example.

This is basic to intermediate Linear Algebra, My advice to budding Data Scientists is… Quit chasing each fancy, shiny Algorithm you hear of, rather, spend time to build a solid foundation for Data Science, based on Linear-Algebra, Statistics, Probability and Calculus…

6. Creating The Pipeline:

The Pipeline constructor below takes an array of Pipeline stages we pass to it. Here we pass the 4 stages above in the right sequence, one afterΒ another.

And that's it! Creating Pipelines in Apache SparkML is as straight as aΒ ruler…

We define the steps or stages and pass them in a logical sequence to the Pipeline constructor.

Now let’s fit the Pipeline object to our original dataΒ frame…

data_model = pipeline.fit(df)

Finally, let’s transform our data frame using the PipelineΒ Object.

pipelined_data = data_model.transform(df)

Let’s see the first tenΒ rows…

So we see that exactly the same DataFrame as created before from the individual stages has been created using the Pipeline function. Now we can fit and transform our data in one go. This is a really handy function.

So at this point, we simply drop the other columns that we don’tΒ need…

We use a list-comprehension to select the cols we need (categoryVec and features_norm) and simply create a new DataFrame with theseΒ columns.

So finally, we have our categoryVec column, which is the target variable, and our features_norm column, which is the feature set for the Machine Learning algorithm we’ve been preparing toΒ train…

Summary…

Photo by Luca Bravo onΒ Unsplash

We have seen how to create Apache spark ML Pipelines from our data set. Go out there and use this knowledge to build more robust data and Machine-Learning solutions.

The complete Notebook can be found here onΒ Github.

Credit goes to The IBM Advanced Data Science Team at Coursera…

Cheers!!

About Me:

Lawrence is a Data Specialist at Tech Layer, passionate about fair and explainable AI and Data Science. I believe that sharing knowledge and experiences is the best way to learn. I hold both the Data Science Professional and Advanced Data Science Professional certifications from IBM and the IBM Data Science Explainability badge, as well as the Artificial Intelligence Nanodegree from Udacity. I have conducted several projects using ML and DL libraries. I love to code up my functions as much as possible. Finally, I never stop learning and exploring, and yes, I have written several highly recommended articles.

Feel free to find meΒ on:-

Github

Linkedin

Twitter


Big-Data Pipelines with SparkML was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Feedback ↓