
Last Updated on January 7, 2023 by Editorial Team

Author(s): Abid Ali Awan

A New Way of Building Machine Learning Pipelines

Designing your first machine learning pipeline with a few lines of code using Orchest. You will learn to preprocess the data, train a machine learning model, and evaluate the results.

Image by Author | Elements by Vecteezy

In this article, we will go through all the steps required to build an ML (machine learning) pipeline. We will use the Kaggle binary classification dataset COVID-19/SARS B-cell Epitope to analyze, preprocess, train, and evaluate our model. We won't go deep into the code or how these models work, as you can find a detailed explanation in my previous project published on Deepnote.

We are going to use the SARS-CoV and B-cell datasets to classify peptides into two categories: antibodies with inducing properties are labeled as positive (1), and antibodies without inducing properties are labeled as negative (0). To learn more about the dataset, read the research paper.

Orchest

Orchest is a tool for building data pipelines without DAGs or frameworks. The environment is simple to navigate, and you can code in Python, R, and Julia using the data scientist's favorite tool: the Jupyter notebook.

A pipeline in Orchest is made up of units called steps. These steps are executable files that run within an isolated environment, and the connections between them define how the data flows. We can visualize our progress by monitoring each step, or we can schedule the pipeline to run and receive a full report on the dashboard.

ML Pipeline by Author

Additional Service

Orchest also comes with additional services, such as visualizing your performance metrics on TensorBoard or writing code in VS Code; these services are integrated seamlessly within the same environment.

Image by Author

Scheduling Pipeline

Just like Airflow, you can schedule your pipeline to run at a specific minute and hour of the day. This process doesn't require you to write code or even monitor the pipeline.

Overview (orchest.readthedocs.io)

Installation

Installation of the local server is easy for Linux users; on Windows, you can get a similar experience by installing a few additional applications.

Windows

Make sure you have everything mentioned below installed:

  • Docker Engine latest version: run docker version to check.
  • Docker must be configured to use WSL 2.
  • Ubuntu 20.04 LTS for Windows.
  • Run the script below inside the Ubuntu environment.

Linux

For Linux, you just need the latest Docker Engine; then run the script below to download and install all dependencies.

git clone https://github.com/orchest/orchest.git && cd orchest
./orchest install

# Verify the installation.
./orchest version --ext

# Start Orchest.
./orchest start

First Project

It's time to start our local server. Type the commands below within the Ubuntu environment, since we are running a Linux virtual environment on Windows. Make sure the Docker Engine is working properly.

After successfully running the script, you will receive a local web address that you can copy and paste into your browser.

cd orchest
./orchest start
Start local server | Image by Author

On the landing page, you will see this amazing user interface. Finally, it's time for us to create a new project by clicking on the Create Project button.

Creating project | Image by Author

A project can contain many pipelines, so now it's time to create our Vaccine ML pipeline. Creating a pipeline adds a vaccine.orchest file to your directory, which contains metadata about every step.

Creating pipeline | Image by Author

For the code, we will reuse our previous project and focus on building an effective pipeline.

Epitope prediction used in vaccine development

ML Pipeline

Machine learning pipelines are independently executable workflows that run the multiple tasks involved in preparing data and training models (Azure Machine Learning). The figure below shows the generic pipeline used in every machine learning project. The arrows represent data flowing from one isolated task to another, completing the machine learning life cycle.

Image Credit: Microsoft

Creating Steps

Let's save the theory for later and learn by practice.

First, you need to create a step by clicking on the New Step button. If we don't have a Python file or .ipynb file in our directory, we can create a new file in one simple step, as shown below.

First Step | Image by Author

Voilà, we have successfully created our first step. Now we need to create a few more and connect the nodes.

Load Data Step | Image by Author

We have added the EDA (Exploratory Data Analysis) and Preprocessing steps, then joined them with the Load Data step so that each step has access to the extracted data.

We will go into more detail on how we are going to code these steps later.

Creating additional Steps | Image by Author

To code the steps, click on the Edit in JupyterLab button, which will take us to the Jupyter notebook where you can start coding.

Edit in Jupyter Notebook | Image by Author

To run all the steps, select them with the mouse and then click the blue Run Selected Steps button at the bottom left. This will run all steps one after another.

We can click on each step and check the logs, or we can just go directly to the notebook to check the progress.

orchest/orchest (github.com)

Output

After connecting nodes, let's make some changes in our code.

  • Import orchest.
  • Load the data using pandas read_csv.
  • Concatenate the bcell and sars dataframes.
  • Use orchest.output to output the data for the next step (using a named output).

orchest.output takes one or more variables and creates a data flow that we can use in the following steps. In our case, bcell, covid, sars, and bcell_sars are stored in the data flow variable called 'data'.

import orchest
import pandas as pd

# Load the data into DataFrames.
INPUT_DIR = "Data"
bcell = pd.read_csv(f"{INPUT_DIR}/input_bcell.csv")
covid = pd.read_csv(f"{INPUT_DIR}/input_covid.csv")
sars = pd.read_csv(f"{INPUT_DIR}/input_sars.csv")
bcell_sars = pd.concat([bcell, sars], axis=0, ignore_index=True)

# Output the Vaccine data for the next steps.
print("Outputting converted Vaccine data…")
orchest.output((bcell, covid, sars, bcell_sars), name="data")
print(bcell_sars.shape)
print("Success!")

Outputting converted Vaccine data…
(14907, 14)
Success!

Input

Now we are going to look at the input step, which takes all four variables and uses them to explore and analyze the data. I have added only a few Jupyter notebook cells to demonstrate the data flow between nodes. To see the complete analysis, check EDA | Deepnote.

Let's import the required libraries, including Orchest.

Use the orchest.get_inputs function to create an object, then use the name ('data') of the data pipeline variable to extract the variables from the previous step.
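
As a minimal sketch (assuming the Load Data step stored its output under the name 'data', as in the code above):

import orchest

# get_inputs() returns a dict keyed by the names given to orchest.output().
data = orchest.get_inputs()

# Unpack the four DataFrames stored under the name "data".
bcell, covid, sars, bcell_sars = data["data"]

print(bcell_sars.shape)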

As we can see, we have successfully loaded the data from the previous task.

We use PCA from sklearn to reduce the dimensionality to 2 and a scatter plot to visualize the target distribution.
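
A rough sketch of that cell (not the exact notebook code; it assumes the label column is named target, as in the Kaggle dataset):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Keep the numeric feature columns and drop the label before projecting.
features = bcell_sars.drop(columns=["target"]).select_dtypes("number")
components = PCA(n_components=2).fit_transform(features)

# Color each point by its class to visualize how the two labels are distributed.
plt.scatter(components[:, 0], components[:, 1],
            c=bcell_sars["target"], cmap="coolwarm", s=5, alpha=0.6)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Target distribution after PCA")
plt.show()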

Input and Output

Let's use both the input and output functions to extract the training data and then use it to train our Random Forest classifier. After training, we will export our results for evaluation.

Input the train/test split from the Preprocessing step, using 'training_data' as the name of the data flow.

We use 400 estimators and fit our training dataset. As we can see, our AUC score is quite good for a model without hyperparameter tuning.

Let's output both our model and predictions for evaluation.
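
A minimal sketch of this training step (assuming the Preprocessing step output the split as (X_train, X_test, y_train, y_test) under the name 'training_data'; the output name 'rf_results' is illustrative):

import orchest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Pull the train/test split produced by the Preprocessing step.
X_train, X_test, y_train, y_test = orchest.get_inputs()["training_data"]

# 400 estimators, no hyperparameter tuning.
model = RandomForestClassifier(n_estimators=400, random_state=0)
model.fit(X_train, y_train)

# Score the held-out set with AUC.
pred_proba = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, pred_proba))

# Hand the fitted model and predictions to the evaluation step.
orchest.output((model, pred_proba), name="rf_results")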

Final Pipeline

  1. Loading the data
  2. Exploratory data analysis
  3. Using Alteryx EvalML and Microsoft FLAML to preprocess, train multiple models, and then evaluate results (a sketch follows below)
  4. Processing data for training
  5. Training Naive Bayes, Random Forest, CatBoost, and LightGBM models
  6. Evaluating results
  7. Ensembling
  8. Comparing accuracy scores
Image by Author
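
For step 3, here is a hedged sketch of how the two AutoML libraries can be called with their standard APIs, reusing the X_train and y_train split from above; the exact calls in the project may differ:

from evalml.automl import AutoMLSearch
from flaml import AutoML

# EvalML: search over candidate pipelines for a binary classification problem.
evalml_search = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type="binary")
evalml_search.search()
print(evalml_search.rankings.head())

# FLAML: lightweight AutoML with a fixed time budget, optimizing ROC AUC.
flaml_automl = AutoML()
flaml_automl.fit(X_train=X_train, y_train=y_train,
                 task="classification", time_budget=60, metric="roc_auc")
print(flaml_automl.best_estimator)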

Result

AutoML results

The final results of each model and the ensemble, with AUC and accuracy scores.

Project

You can find this project on GitHub if you want to explore each step. You can also use my GitHub repo and load it into the Orchest local server, and it will run from the get-go, with similar results.

GitHub – kingabzpro/Covid19-Vaccine-ML-Pipeline: Designing your first machine learning pipeline with a few lines of code and simple drag and drop using Orchest. In this project, we train a binary classification model to predict epitopes, which are used for vaccine development.

Orchest Cloud

Orchest Cloud is in closed beta, but you can apply for it and get access within a few weeks. To be honest, the cloud service was hassle-free, as it was easy to load the GitHub project directly, and it ran smoothly.

Cloud Import by Author

Conclusion

Yannick Perrenet and Rick Lamers are simply amazing guys who have helped me throughout the learning process. If I had any issues or discovered a bug, they were quick to respond and propose an alternative solution while updating the current workflow. At first, I faced a lot of issues with installation and loading certain libraries, but these problems were quickly solved by the Orchest Slack community. Overall, my experience of using Orchest was quite amazing, and I think this will be the future of data science.

Cloud IDEs need to adopt ML pipelines to remain competitive.

We have developed a complete machine learning pipeline, from data ingestion to model training and evaluation. We have also learned how simple it is to create steps and join nodes with a single click. I hope my article has made your life simpler; for a beginner, it's a gold mine, as you can learn, train, and deploy a machine learning model from a single platform.

You can follow me on LinkedIn and Polywork, where I post amazing articles on data science and machine learning.


Machine Learning was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
