
Small -> Big -> Massive — VM to BM to Serverless Spark-based Data Science

Last Updated on July 20, 2023 by Editorial Team

Author(s): Deepak Sekar

Originally published on Towards AI.

Cloud Computing

We have heard about big data platforms supporting ML workloads with distributed computing. But do you always need a big data platform for your data science workloads?

How about the flexibility to build small, scale big and span to every byte of your data?

What does it mean?

Let us understand this with a step-by-step flow.

Where is the data stored?

Mostly in one of the following two places:

  1. Data Lakes
  2. Data Warehouses

Where is the Data Engineering done?

  1. On-prem servers
  2. Virtual machines in the cloud
  3. Bare metal machines in the cloud
  4. On-prem big data clusters
  5. Cloud-based big data clusters

Where is Data Science done?

  1. On-prem servers
  2. Virtual machines in the cloud (with or without GPUs)
  3. Bare metal machines in the cloud (with or without GPUs)
  4. On-prem big data clusters
  5. Cloud-based big data clusters
  6. On-prem data science environment
  7. Cloud-based data science environment

So there are cases where a small environment meets the need, and others where big clusters are required.

Open-source methods to analyze/ manipulate data

Small Data -> Moderately big Data -> Big Data

is achieved using

Pandas -> Dask -> PySpark
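The appeal of this scaling path is that the API barely changes as the data grows. A minimal sketch below shows the same aggregation in all three libraries; only the pandas version is executed here, and the data and column names are illustrative, not from the article.

```python
import pandas as pd

# A tiny, illustrative dataset
df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "sales": [100, 150, 200, 250],
})

# pandas: eager, single-machine, in-memory
result = df.groupby("region")["sales"].sum()
print(result.to_dict())  # {'east': 300, 'west': 400}

# Dask: nearly the same API, lazy and partitioned for moderately big data
#   import dask.dataframe as dd
#   ddf = dd.from_pandas(df, npartitions=2)
#   ddf.groupby("region")["sales"].sum().compute()

# PySpark: distributed across a cluster for big data
#   from pyspark.sql import SparkSession, functions as F
#   spark = SparkSession.builder.getOrCreate()
#   sdf = spark.createDataFrame(df)
#   sdf.groupBy("region").agg(F.sum("sales")).show()
```

The commented Dask and PySpark versions show how little the code changes when the compute scales.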

Why can’t the same logic be possible for scaling compute for Data Engineering/ Data Science workloads?

VM -> BM -> Spark/ Serverless Spark

How difficult is it to build and maintain a Spark cluster? Take a guess. A serverless/managed Spark cluster is the answer.

Oracle Cloud Infrastructure (OCI) Data Science and Data Flow provide the flexibility to move from VM -> BM -> Serverless Spark (OCI Data Flow).

How is this achieved?

  1. You can select the compute shape when you start
  2. Model/analyze data using the VM/BM and access the data from the data lake (OCI Object Storage), Oracle databases, or external data sources
  3. Hand off to a serverless Spark cluster (OCI Data Flow) from within OCI Data Science to run a Spark application
  4. Access the results/model back in the Data Science environment
  5. Deploy the model using OCI Functions and API Gateway (if required)
[Figure: hand-off from OCI Data Science to OCI Data Flow]

OCI Data Flow supports two types of templates:

  1. The standard_pyspark template, which is for standard PySpark jobs
  2. The sparksql template, which is for Spark SQL jobs
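To make the standard_pyspark case concrete, here is a minimal sketch of the kind of script such a job runs: read from Object Storage, aggregate, and write the results back. The bucket and namespace in the oci:// paths, and the column names, are placeholders rather than values from the article.

```python
# example.py - an illustrative standard_pyspark job script
from pyspark.sql import SparkSession, functions as F


def main():
    spark = SparkSession.builder.appName("sample-dataflow-app").getOrCreate()

    # Read input from the data lake (OCI Object Storage); path is a placeholder
    df = spark.read.csv(
        "oci://<bucket>@<namespace>/input.csv", header=True, inferSchema=True
    )

    # A simple aggregation as a stand-in for real processing
    result = df.groupBy("region").agg(F.sum("sales").alias("total_sales"))

    # Write results back to Object Storage so they can be accessed after the run
    result.write.mode("overwrite").parquet("oci://<bucket>@<namespace>/output/")

    spark.stop()


if __name__ == "__main__":
    main()
```

This script would be saved to the pyspark_file_path that the hand-off code below uploads to Object Storage.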

Hand-off to OCI Data Flow from OCI Data Science

from ads.dataflow.dataflow import DataFlow
import uuid

data_flow = DataFlow()
pyspark_file_path = f"/home/datascience/dataflow/example-{str(uuid.uuid4())[-6:]}.py"

# Object Storage bucket
display_name = "<>"
bucket_name = "<>"

# Create app config (you can choose driver and executor size & number)
app_config = data_flow.prepare_app(display_name, bucket_name, pyspark_file_path)
app = data_flow.create_app(app_config)
app.oci_link

run_display_name = "sample new run"
log_bucket_name = "dataflow-log"

# Create run config and run the Spark application
run_config = app.prepare_run(run_display_name, log_bucket_name)
run = app.run(run_config)
run.status

# Run - clickable link
run.oci_link

If the PySpark script writes its output back to Object Storage or a database, you can access it there after the run completes.
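For example, if the job wrote Parquet results to Object Storage, they could be pulled back into the notebook session with pandas via the ocifs/fsspec integration. This is a sketch assuming ocifs is installed and a default OCI config file; the bucket, namespace, and output path are placeholders.

```python
import pandas as pd

# Read the Parquet output the Spark job wrote to Object Storage
# (placeholder path; requires the ocifs package and an OCI config file)
results = pd.read_parquet(
    "oci://<bucket>@<namespace>/output/",
    storage_options={"config": "~/.oci/config"},
)
print(results.head())
```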

Welcome to the world of data science done right!

Please don’t forget to clap if you liked this article 🙂

The views expressed are those of the author and not necessarily those of Oracle. You can find me on Medium as Deepak Sekar.

Resources:

  1. "5 Different Ways to Build ML Models!" (medium.com)
  2. "What? How? Why? — In the World of Data Science!" (medium.com)


Published via Towards AI
