Large Language Model Training Pipeline For NLP Text Classification
Author(s): Lu Zhenna
Originally published on Towards AI.
Summary
I want to share a project that got me an interview. It will benefit aspiring data scientists who are less experienced than me, especially those who need an LLM project for their portfolio. This project trains a few baseline models and a RoBERTa model to perform a binary classification task on short texts. It is structured into a training pipeline and an inference pipeline. The trained models are subsequently packaged as a Flask app that lets users key in text inputs in a browser and get predictions promptly. In short, users only interact with a web page. Sounds cool, right? Let's get our hands dirty!
Target Audience
- Aspiring data scientists who need hands-on experience with MLOps and LLM training but are too intimidated to start.
- Data scientists who want to move beyond Jupyter Notebooks to executable Python scripts.
- Data professionals who want to learn some basic `docker` skills.
- Data scientists who want to try `PyTorch` without a GPU.
Outline
1. Repo Structure
2. Training Pipeline
3. Inference Pipeline
4. Containerize the Trained Models and Pipelines
5. Containerize the Web App and Serve the Model
This article aims to teach MLOps rather than machine learning. However, if you need more specific guidance that is not covered here, feel free to raise your questions in the comment section; I might dedicate a new article to addressing them.
Problem Statement
My task was to develop an end-to-end training pipeline that can perform a binary text classification task. I was specifically told to train a Large Language Model, even though smaller models could also get the job done. (I know it's overkill, but such is life, and so is the job market.) Subsequently, the trained model should be served as a Flask application that lets users enter a text message and get a prediction.
1. Repo Structure
Before we delve into the details, feel free to clone my repo by running `git clone https://github.com/Zhenna/LLM_phishing_detection.git` in the terminal. You need a dataset for binary classification. Let's take a spam email dataset for now and put it in the `raw_data/` directory. Make sure your repo has everything shown below.
```
LLM_PHISHING_DETECTION
├── src                     # model pipeline
│   ├── __init__.py
│   ├── get_data.py
│   ├── infer.py
│   ├── main.py
│   ├── preprocess.py
│   ├── train.py
│   └── utils_and_constants.py
├── raw_data (confidential)
│   └── {data_name}.csv (confidential)
├── templates
│   ├── index.html
│   └── predict.html
├── .dockerignore
├── Dockerfile              # containerize web app
├── Dockerfile.python       # containerize python code
├── app.py                  # inference endpoint
├── requirements.txt
└── README.md
```
The `src` directory contains the Python scripts to preprocess data, train a model, and make an inference. The `raw_data` directory contains the csv file with the text and the corresponding binary labels. The `templates` directory contains the html files for users to type a text message and receive the prediction. In the root directory, there are docker files that containerize the code and web app, the `app.py` that calls the Flask app, and the `requirements.txt` with all dependencies.
Since DevOps is not the main focus here, I won't elaborate too much on what `docker` does and why we need it. You can still run the pipeline without touching `docker` or the Flask app. In that case, please create a new Python virtual environment and install all dependencies before we start by running `pip install -r requirements.txt`.
If you encounter the `ModuleNotFoundError: No module named 'src'` error message while running the code, please run `export PYTHONPATH=$PYTHONPATH:LLM_phishing_detection` in the terminal to add the repo to Python's module search path.
Once you have completed this step, you are halfway done. Congratulations!
2. Training Pipeline
Let's do it together, and I will explain more along the way.
The command line arguments are defined in `src/main.py`. You may view the code in my GitHub repo. The argument parser collects the optional and required arguments from the user. If you are not familiar with `argparse`, please read the official documentation here.
Since you will likely be using a different training set, with a different filename and different dataframe columns, you have to pass these in as parameters. If you are using the spam email dataset from Kaggle, you may enter the arguments as shown in the commands below. More specifically, you need to indicate the dataset name, the column header for the text, and the column header for the binary labels.
You may have noticed from the argument parser that the training pipeline offers two choices: `choices=["LLM", "baseline"]`. The baseline models for this exercise include the following: Naive Bayes, Logistic Regression, KNN, SVM, and XGBoost. (Guess which model will produce the best accuracy?) On the other hand, the LLM option only includes `RoBERTa`. You could include more; however, because my machine does not have a GPU, I will stick to one LLM, `RoBERTa`.
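For reference, here is a minimal sketch of what the argument parser in `src/main.py` might look like. The short flag names are inferred from the commands used throughout this article, and the long option names are my own assumption; check the repo for the authoritative definitions.

```python
# Sketch of the CLI in src/main.py (flag names inferred from the commands
# in this article; long option names are assumptions, not the repo's code).
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Train models or run inference.")
    parser.add_argument("-t", "--task", required=True,
                        choices=["train", "infer"],
                        help="Run the training or the inference pipeline.")
    parser.add_argument("-mt", "--model_type", required=True,
                        choices=["LLM", "baseline"],
                        help="Fine-tune RoBERTa or fit the baseline models.")
    parser.add_argument("-c", "--csv", required=True,
                        help="Dataset filename inside raw_data/.")
    parser.add_argument("-l", "--label_column", required=True,
                        help="Column header holding the binary labels.")
    parser.add_argument("-n", "--text_column", required=True,
                        help="Column header holding the text.")
    parser.add_argument("-i", "--input_text",
                        help="Text message to classify (inference only).")
    return parser.parse_args()
```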
Shall we start with the baseline models? Type `python src/main.py -t train -mt baseline -c emails.csv -l spam -n text` in the terminal.
It should be fast. The evaluation metrics of all baseline models will appear. For this spam email dataset, `Logistic Regression` has the best performance. For my original binary training data with balanced labels, `XGBoost` was the best model. Simple models are actually pretty powerful, right?
Why did I include `f1`, `precision`, and `recall` besides `accuracy`? It's because the binary training dataset might be imbalanced.
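If you are curious how the baseline branch works, here is a hedged sketch of the idea behind it, assuming a TF-IDF vectorizer; the actual `src/train.py` may differ in preprocessing and model settings.

```python
# Hedged sketch of baseline training/evaluation (the repo's src/train.py
# may differ): vectorize text with TF-IDF, then fit and score each model.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

df = pd.read_csv("raw_data/emails.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["spam"], test_size=0.2, random_state=42)

# Turn raw text into sparse TF-IDF features.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": LinearSVC(),
    "XGBoost": XGBClassifier(),
}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f} "
          f"f1={f1_score(y_test, preds):.3f} "
          f"precision={precision_score(y_test, preds):.3f} "
          f"recall={recall_score(y_test, preds):.3f}")
```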
After training, a copy of each trained model is saved, so that during inference we can load these models for direct prediction.
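Persisting the fitted objects could look like this; the paths and the `joblib` serializer are my own assumptions, not necessarily what the repo uses.

```python
# Hedged sketch: persist the fitted vectorizer and a trained model from the
# previous snippet so the inference pipeline can reload them later.
import joblib

joblib.dump(vectorizer, "outputs/model/tfidf_vectorizer.joblib")
joblib.dump(models["Logistic Regression"], "outputs/model/logistic_regression.joblib")

# At inference time:
vectorizer = joblib.load("outputs/model/tfidf_vectorizer.joblib")
clf = joblib.load("outputs/model/logistic_regression.joblib")
prediction = clf.predict(vectorizer.transform(["Congratulations! You won a price!"]))
```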
Next, let's train an LLM. To be more precise, we will fine-tune a pre-trained `RoBERTa` model.
Once the training starts, a progress bar with the estimated time to finish will appear. My local machine does not have a GPU, so fine-tuning `RoBERTa` takes approximately 18 hours.
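For orientation, here is a hedged sketch of what the fine-tuning step boils down to, using the Hugging Face `Trainer`; the hyperparameters, column handling, and checkpoint paths are my assumptions, so defer to the repo for the real version.

```python
# Hedged sketch of RoBERTa fine-tuning (the repo's src/train.py may differ
# in hyperparameters, preprocessing, and evaluation details).
import numpy as np
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("raw_data/emails.csv").rename(columns={"spam": "labels"})
dataset = Dataset.from_pandas(df[["text", "labels"]]).train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)
dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

# The f1 metric reported during evaluation, via the evaluate library.
f1_metric = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return f1_metric.compute(predictions=np.argmax(logits, axis=-1),
                             references=labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs/checkpoints",
                           num_train_epochs=2,
                           per_device_train_batch_size=8),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("outputs/model/roberta-trained")
tokenizer.save_pretrained("outputs/model/roberta-trained")
```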
During training, if you encounter the error message `FileNotFoundError: Couldn't find a module script at /Users/…/LLM_phishing_detection/f1/f1.py. Module 'f1' doesn't exist on the Hugging Face Hub either.`, it means we have to install the `evaluate` library directly from source. Try running `git clone https://github.com/huggingface/evaluate.git` in the terminal.
When training completes, you will see the evaluation metrics of the LLM. With the spam email dataset, `RoBERTa` outperformed all the baseline models and achieved an `f1` score of `0.99`.
Most important of all, make sure the `RoBERTa` model has been saved after training to the local directory `outputs/model/roberta-trained`.
After training the models for a painfully long time, we can finally use them for prediction.
3. Inference Pipeline
Let's test it out!
We will use the best-performing baseline model to predict whether the sample message, `"Congratulations! You won a price!"`, is spam. Please enter the command `python src/main.py -t infer -mt baseline -c emails.csv -l spam -n text -i 'Congratulations! You won a price!'`.
Uh-oh, a moment of awkwardness! The prediction is negative: the message `Congratulations! You won a price!` passed as non-spam.
Next, let's try again using the LLM. Enter the command `python src/main.py -t infer -mt LLM -c emails.csv -l spam -n text -i 'Congratulations! You won a price!'`.
The LLM seems more reasonable and returned a positive prediction. This is the most spammy message I can come up with. LOL.
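Under the hood, the LLM branch of the inference step can be as simple as loading the saved model with a `transformers` pipeline; this is a hedged sketch, and the label post-processing in the repo's `src/infer.py` may differ.

```python
# Hedged sketch of LLM inference: load the fine-tuned model saved earlier
# and classify a single message (see src/infer.py for the real version).
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="outputs/model/roberta-trained",
                      tokenizer="outputs/model/roberta-trained")
print(classifier("Congratulations! You won a price!"))
# e.g. [{'label': 'LABEL_1', 'score': 0.99}] -> positive (spam)
```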
Later, we shall deploy the LLM to users.
If ML model deployment does not interest you, you may stop here. Otherwise, please continue reading.
4. Containerize the Trained Models and Pipelines
Usually, data scientists will tweak the model parameters a few times or even modify the model architecture to enhance performance. Let's skip this step and go straight to packaging all the trained models in a docker container. The end goal is to serve the models through a web app later.
Why should we "containerize" it? Because we want to replicate the running environment on the customer's machine. We don't want customers to hit a lot of `ModuleNotFoundError` exceptions when they are using our product. In addition, discrepancies in the Python version and in every dependency's version might also break the code.
The `Dockerfile.python` is what we need to containerize the trained models and pipeline.
```dockerfile
# Lightweight Python base image
FROM python:3.11-slim
# Working directory inside the container
WORKDIR /LLM_PHISHING_DETECTION/
# Install dependencies first so this layer is cached between builds
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
# Copy the source code and trained models
COPY . .
ENV PYTHONPATH="/LLM_PHISHING_DETECTION"
# Run the pipeline CLI by default; arguments are appended at `docker run`
ENTRYPOINT [ "python3", "src/main.py" ]
```
Please make sure you have installed Docker Desktop locally. (Here is the official documentation on how to install Docker Desktop. Let me know if a separate article on docker would benefit you. :))
To build a docker image named `<docker-image-name>` from the present working directory, run `docker build . -f Dockerfile.python -t <docker-image-name>`.
On a side note, you may make use of the `.dockerignore` file to hide certain files from the docker image. In fact, the only things that are useful to us are the trained LLM model and the inference pipeline code. However, for demonstration purposes, we will containerize everything so that we can use docker to train models too; a sample `.dockerignore` is sketched below.
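If you did want to slim the image down, entries like these could go in `.dockerignore`; they are illustrative, not what the repo necessarily uses.

```
# Illustrative .dockerignore entries (adjust to your needs)
.git/
__pycache__/
*.ipynb
raw_data/
```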
As a sanity check, please enter `docker images` and check whether the image has been built successfully.
As shown above, the docker image named `my-model` is ready for use. To run it, we use `docker ...` commands in place of the `python ...` commands.
Please control the urge to re-train the LLM. Let's use the command `docker run my-model -t infer -mt LLM -c emails.csv -l spam -n text -i 'I enjoy learning MLOps!'` to make an inference.
This message is categorized as non-spam. So far so good.
Isn't it amazing? You have learnt so much, from training models to deploying them. One more step before we wrap it up. Hang in there!
5. Containerize the Web App and Serve the Model
To data scientists, model performance matters the most. To users/customers, how they interact with the deployed model matters a lot too. My point is that we shouldn't expect customers to run any commands or write code. We have to build a front-end user interface, for example, a web page.
To make a simple web app, we will use only two endpoints. First, `index.html` takes the text input from users. Second, `predict.html` returns the prediction to users.
Both templates can be found under the `templates/` directory. You need to amend the page names and headers accordingly. A sketch of the `app.py` routes that serve these templates follows below.
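Here is a minimal, hedged sketch of what the two routes in `app.py` might look like; the form field name `message` and the template variable `prediction` are my assumptions, so align them with your html files.

```python
# Hedged sketch of the two Flask routes in app.py (field and variable
# names are assumptions; align them with index.html and predict.html).
from flask import Flask, render_template, request
from transformers import pipeline

app = Flask(__name__)
classifier = pipeline("text-classification",
                      model="outputs/model/roberta-trained",
                      tokenizer="outputs/model/roberta-trained")

@app.route("/")
def index():
    # Render the form where users type a text message.
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted message and return the model's prediction.
    message = request.form.get("message", "")
    result = classifier(message)[0]
    return render_template("predict.html", prediction=result["label"])
```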
Next, we need a complete docker image with the web app inside. Let's use the `Dockerfile` instead of `Dockerfile.python` this time.
```dockerfile
FROM python:3.11-slim
# Expose the Flask port
EXPOSE 5000/tcp
WORKDIR /LLM_PHISHING_DETECTION/
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
COPY . .
# Start the Flask server, reachable from outside the container
CMD ["flask", "run", "--host", "0.0.0.0" ]
```
You may notice that the `Dockerfile` for the web app has a few more parameters, like the port number `5000` and the host `0.0.0.0`. Make sure they are consistent with what `app.py` uses when running the Flask web app.
```python
if __name__ == '__main__':
    # app.run(debug=True)
    # Bind to all interfaces on port 5000, matching EXPOSE in the Dockerfile
    app.run(host="0.0.0.0", port=5000)
```
To build the docker image, run `docker build . -t my-app`.
Then, run `docker images` to check that it has been built. The new docker image named `my-app` should appear. Let's try it out!
To start the Flask server, run `docker run -p 5000:5000 my-app` in the terminal.
Next, copy and paste `http://127.0.0.1:5000` into your browser.
The first Flask endpoint routes to `index.html` for users to enter a text message.
By clicking the "Predict" button, it routes to `predict.html` with a prediction label. The prediction is positive again.
We have pulled off a pretty intense project! Good job, everybody!
Link to my repo: https://github.com/Zhenna/LLM_phishing_detection