Large Language Model Training Pipeline For NLP Text Classification
Author(s): Lu Zhenna
Originally published on Towards AI.
Summary
I want to share a project that got me an interview. It will benefit aspiring data scientists who are less experienced than me, especially those who need an LLM project for their portfolio. This project trains a few baseline models and a RoBERTa model to perform a binary classification task on short texts. It is structured into a training pipeline and an inference pipeline. The trained models are subsequently packaged as a Flask app that lets users key in text inputs in a browser and get predictions promptly. In short, users only interact with a web page. Sounds cool, right? Let's get our hands dirty!
Target Audience
- Aspiring data scientists who need hands-on experience with MLOps and LLM training but are too intimidated to start.
- Data scientists who want to move beyond Jupyter Notebooks to executable Python scripts.
- Data professionals who want to learn some basic `docker` skills.
- Data scientists who want to try `PyTorch` without a GPU.
Outline
1. Repo Structure
2. Training Pipeline
3. Inference Pipeline
4. Containerize the Trained Models and Pipelines
5. Containerize the Web App and Serve the Model
This article aims to teach MLOps rather than machine learning. However, if you need more specific guidance that is not covered here, feel free to raise your questions in the comment section; I might dedicate a new article to addressing them.
Problem Statement
My task was to develop an end-to-end training pipeline that can perform a binary text classification task. I was specifically told to train a Large Language Model, even though smaller models could also get the job done. (I know it's overkill, but such is life, and so is the job market.) Subsequently, the trained model should be served as a Flask application that lets users enter a text message and get a prediction.
1. Repo Structure
Before we delve into the details, feel free to clone my repo by running `git clone https://github.com/Zhenna/LLM_phishing_detection.git` in the terminal. You need a dataset for binary classification. Let's take a spam email dataset for now and put it in the `raw_data/` directory. Make sure your repo has everything shown below.
```
LLM_PHISHING_DETECTION
├── src                     # model pipeline
│   ├── __init__.py
│   ├── get_data.py
│   ├── infer.py
│   ├── main.py
│   ├── preprocess.py
│   ├── train.py
│   └── utils_and_constants.py
├── raw_data (confidential)
│   └── {data_name}.csv (confidential)
├── templates
│   ├── index.html
│   └── predict.html
├── .dockerignore
├── Dockerfile              # containerize web app
├── Dockerfile.python       # containerize python code
├── app.py                  # inference endpoint
├── requirements.txt
└── README.md
```
The `src` directory contains the Python scripts to preprocess data, train a model, and make an inference. The `raw_data` directory contains the csv file with the text and the corresponding binary labels. The `templates` directory contains the html files for users to type a text message and receive the prediction. In the root directory, there are docker files that containerize the code and web app, the `app.py` that calls the Flask app, and the `requirements.txt` with all dependencies.
Since DevOps is not the main focus here, I won't elaborate too much on what `docker` does and why we need it. You can still run the pipeline without touching `docker` or the Flask app. In that case, please create a new Python virtual environment and install all dependencies before we start by running `pip install -r requirements.txt`.
If you encounter the `ModuleNotFoundError: No module named 'src'` error message while running the code, please run `export PYTHONPATH=$PYTHONPATH:LLM_phishing_detection` in the terminal to add the repo to Python's module search path.
Once you have completed this step, you are halfway done. Congratulations!
2. Training Pipeline
Let's do it together, and I will explain more along the way.
The command line arguments are defined in `src/main.py`. You may view the code in my GitHub repo. The argument parser collects the optional and required arguments from the user. If you are not familiar with `argparse`, please read the official documentation here.
Since you will likely be using a different training set, with a different filename and different dataframe columns, you have to pass these in as parameters. If you are using the spam email dataset from Kaggle, you may enter the arguments as shown in the commands below. More specifically, you need to indicate the dataset name, the column header for the text, and the column header for the binary labels.
You may have noticed from the argument parser that the training pipeline offers two choices: `choices=["LLM", "baseline"]`. The baseline models for this exercise include the following: Naive Bayes, Logistic Regression, KNN, SVM, and XGBoost. (Guess which model will produce the best accuracy?) On the other hand, the LLM option only includes `RoBERTa`. You could include more; however, because my machine does not have a GPU, I will stick to one LLM, `RoBERTa`.
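For reference, here is a minimal sketch of what the argument parser in `src/main.py` might look like. The short flag names are inferred from the commands used throughout this article, and the long option names are my own assumption; check the repo for the authoritative definitions.

```python
# Sketch of the CLI in src/main.py (flag names inferred from the commands
# in this article; long option names are assumptions, not the repo's code).
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Train models or run inference.")
    parser.add_argument("-t", "--task", required=True,
                        choices=["train", "infer"],
                        help="Run the training or the inference pipeline.")
    parser.add_argument("-mt", "--model_type", required=True,
                        choices=["LLM", "baseline"],
                        help="Fine-tune RoBERTa or fit the baseline models.")
    parser.add_argument("-c", "--csv", required=True,
                        help="Dataset filename inside raw_data/.")
    parser.add_argument("-l", "--label_column", required=True,
                        help="Column header holding the binary labels.")
    parser.add_argument("-n", "--text_column", required=True,
                        help="Column header holding the text.")
    parser.add_argument("-i", "--input_text",
                        help="Text message to classify (inference only).")
    return parser.parse_args()
```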
Shall we start with the baseline models? Type `python src/main.py -t train -mt baseline -c emails.csv -l spam -n text` in the terminal.
It should be fast. The evaluation metrics of all baseline models will appear. For this spam email dataset, `Logistic Regression` has the best performance. For my original binary training data with balanced labels, `XGBoost` was the best model. Simple models are actually pretty powerful, right?
Why did I include `f1`, `precision`, and `recall` besides `accuracy`? It's because the binary training dataset might be imbalanced.
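If you are curious how the baseline branch works, here is a hedged sketch of the idea behind it, assuming a TF-IDF vectorizer; the actual `src/train.py` may differ in preprocessing and model settings.

```python
# Hedged sketch of baseline training/evaluation (the repo's src/train.py
# may differ): vectorize text with TF-IDF, then fit and score each model.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

df = pd.read_csv("raw_data/emails.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["spam"], test_size=0.2, random_state=42)

# Turn raw text into sparse TF-IDF features.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": LinearSVC(),
    "XGBoost": XGBClassifier(),
}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f} "
          f"f1={f1_score(y_test, preds):.3f} "
          f"precision={precision_score(y_test, preds):.3f} "
          f"recall={recall_score(y_test, preds):.3f}")
```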
After training, a copy of each trained model is saved, so that during inference we can load these models for direct prediction.
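Persisting the fitted objects could look like this; the paths and the `joblib` serializer are my own assumptions, not necessarily what the repo uses.

```python
# Hedged sketch: persist the fitted vectorizer and a trained model from the
# previous snippet so the inference pipeline can reload them later.
import joblib

joblib.dump(vectorizer, "outputs/model/tfidf_vectorizer.joblib")
joblib.dump(models["Logistic Regression"], "outputs/model/logistic_regression.joblib")

# At inference time:
vectorizer = joblib.load("outputs/model/tfidf_vectorizer.joblib")
clf = joblib.load("outputs/model/logistic_regression.joblib")
prediction = clf.predict(vectorizer.transform(["Congratulations! You won a price!"]))
```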
Next, let's train an LLM. To be more precise, we will fine-tune a pre-trained `RoBERTa` model.
Once the training starts, a progress bar with the estimated time to finish will appear. My local machine does not have a GPU, so fine-tuning `RoBERTa` takes approximately 18 hours.
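For orientation, here is a hedged sketch of what the fine-tuning step boils down to, using the Hugging Face `Trainer`; the hyperparameters, column handling, and checkpoint paths are my assumptions, so defer to the repo for the real version.

```python
# Hedged sketch of RoBERTa fine-tuning (the repo's src/train.py may differ
# in hyperparameters, preprocessing, and evaluation details).
import numpy as np
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("raw_data/emails.csv").rename(columns={"spam": "labels"})
dataset = Dataset.from_pandas(df[["text", "labels"]]).train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)
dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

# The f1 metric reported during evaluation, via the evaluate library.
f1_metric = evaluate.load("f1")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return f1_metric.compute(predictions=np.argmax(logits, axis=-1),
                             references=labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="outputs/checkpoints",
                           num_train_epochs=2,
                           per_device_train_batch_size=8),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("outputs/model/roberta-trained")
tokenizer.save_pretrained("outputs/model/roberta-trained")
```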
During training, if you encounter the error message `FileNotFoundError: Couldn't find a module script at /Users/…/LLM_phishing_detection/f1/f1.py. Module 'f1' doesn't exist on the Hugging Face Hub either.`, it means we have to install the `evaluate` library directly from source. Try running `git clone https://github.com/huggingface/evaluate.git` in the terminal.
When training completes, you will see the evaluation metrics of the LLM. With the spam email dataset, `RoBERTa` outperformed all the baseline models and achieved an `f1` score of `0.99`.
Most important of all, make sure the `RoBERTa` model has been saved after training to the local directory `outputs/model/roberta-trained`.
After training the models for a painfully long time, we can finally use them for prediction.
3. Inference Pipeline
Let's test it out!
We will use the best-performing baseline model to predict whether the sample message, `"Congratulations! You won a price!"`, is spam. Please enter the command `python src/main.py -t infer -mt baseline -c emails.csv -l spam -n text -i 'Congratulations! You won a price!'`.
Uh-oh, a moment of awkwardness! The prediction is negative: the message `Congratulations! You won a price!` passed as non-spam.
Next, let's try again using the LLM. Enter the command `python src/main.py -t infer -mt LLM -c emails.csv -l spam -n text -i 'Congratulations! You won a price!'`.
The LLM seems more reasonable and returned a positive prediction. This is the most spammy message I can come up with. LOL.
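Under the hood, the LLM branch of the inference step can be as simple as loading the saved model with a `transformers` pipeline; this is a hedged sketch, and the label post-processing in the repo's `src/infer.py` may differ.

```python
# Hedged sketch of LLM inference: load the fine-tuned model saved earlier
# and classify a single message (see src/infer.py for the real version).
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="outputs/model/roberta-trained",
                      tokenizer="outputs/model/roberta-trained")
print(classifier("Congratulations! You won a price!"))
# e.g. [{'label': 'LABEL_1', 'score': 0.99}] -> positive (spam)
```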
Later, we shall deploy the LLM to users.
If ML model deployment does not interest you, you may stop here. Otherwise, please continue reading.
4. Containerize the Trained Models and Pipelines
Usually, data scientists will tweak the model parameters a few times or even modify the model architecture to enhance performance. Let's skip this step and go straight to packaging all the trained models in a docker container. The end goal is to serve the models through a web app later.
Why should we "containerize" it? Because we want to replicate the running environment on the customer's machine. We don't want customers to hit a lot of `ModuleNotFoundError` exceptions when they are using our product. In addition, discrepancies in the Python version and in every dependency's version might also break the code.
The `Dockerfile.python` is what we need to containerize the trained models and pipeline.
```dockerfile
# Lightweight Python base image
FROM python:3.11-slim
# Working directory inside the container
WORKDIR /LLM_PHISHING_DETECTION/
# Install dependencies first so this layer is cached between builds
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
# Copy the source code and trained models
COPY . .
ENV PYTHONPATH="/LLM_PHISHING_DETECTION"
# Run the pipeline CLI by default; arguments are appended at `docker run`
ENTRYPOINT [ "python3", "src/main.py" ]
```
Please make sure you have installed Docker Desktop locally. (Here is the official documentation on how to install Docker Desktop. Let me know if a separate article on docker would benefit you. :))
To build a docker image named `<docker-image-name>` from the present working directory, run `docker build . -f Dockerfile.python -t <docker-image-name>`.
On a side note, you may make use of the `.dockerignore` file to hide certain files from the docker image. In fact, the only things that are useful to us are the trained LLM model and the inference pipeline code. However, for demonstration purposes, we will containerize everything so that we can use docker to train models too; a sample `.dockerignore` is sketched below.
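If you did want to slim the image down, entries like these could go in `.dockerignore`; they are illustrative, not what the repo necessarily uses.

```
# Illustrative .dockerignore entries (adjust to your needs)
.git/
__pycache__/
*.ipynb
raw_data/
```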
As a sanity check, please enter `docker images` and check whether the image has been built successfully.
As shown above, the docker image named `my-model` is ready for use. To run it, we use `docker ...` commands in place of the `python ...` commands.
Please control the urge to re-train the LLM. Let's use the command `docker run my-model -t infer -mt LLM -c emails.csv -l spam -n text -i 'I enjoy learning MLOps!'` to make an inference.
This message is categorized as non-spam. So far so good.
Isn't it amazing? You have learnt so much, from training models to deploying them. One more step before we wrap it up. Hang in there!
5. Containerize the Web App and Serve the Model
To data scientists, model performance matters the most. To users/customers, how they interact with the deployed model matters a lot too. My point is that we shouldn't expect customers to run any commands or write code. We have to build a front-end user interface, for example, a web page.
To make a simple web app, we will use only two endpoints. First, `index.html` takes the text input from users. Second, `predict.html` returns the prediction to users.
Both templates can be found under the `templates/` directory. You need to amend the page names and headers accordingly. A sketch of the `app.py` routes that serve these templates follows below.
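Here is a minimal, hedged sketch of what the two routes in `app.py` might look like; the form field name `message` and the template variable `prediction` are my assumptions, so align them with your html files.

```python
# Hedged sketch of the two Flask routes in app.py (field and variable
# names are assumptions; align them with index.html and predict.html).
from flask import Flask, render_template, request
from transformers import pipeline

app = Flask(__name__)
classifier = pipeline("text-classification",
                      model="outputs/model/roberta-trained",
                      tokenizer="outputs/model/roberta-trained")

@app.route("/")
def index():
    # Render the form where users type a text message.
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Read the submitted message and return the model's prediction.
    message = request.form.get("message", "")
    result = classifier(message)[0]
    return render_template("predict.html", prediction=result["label"])
```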
Next, we need a complete docker image with the web app inside. Let's use the `Dockerfile` instead of `Dockerfile.python` this time.
```dockerfile
FROM python:3.11-slim
# Expose the Flask port
EXPOSE 5000/tcp
WORKDIR /LLM_PHISHING_DETECTION/
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
COPY . .
# Start the Flask server, reachable from outside the container
CMD ["flask", "run", "--host", "0.0.0.0" ]
```
You may notice that the `Dockerfile` for the web app has a few more parameters, like the port number `5000` and the host `0.0.0.0`. Make sure they are consistent with what `app.py` uses when running the Flask web app.
```python
if __name__ == '__main__':
    # app.run(debug=True)
    # Bind to all interfaces on port 5000, matching EXPOSE in the Dockerfile
    app.run(host="0.0.0.0", port=5000)
```
To build the docker image, run `docker build . -t my-app`.
Then, run `docker images` to check that it has been built. The new docker image named `my-app` should appear. Let's try it out!
To start the Flask server, run `docker run -p 5000:5000 my-app` in the terminal.
Next, copy and paste `http://127.0.0.1:5000` into your browser.
The first Flask endpoint routes to `index.html` for users to enter a text message.
By clicking the "Predict" button, it routes to `predict.html` with a prediction label. The prediction is positive again.
We have pulled off a pretty intense project! Good job, everybody!
Link to my repo: https://github.com/Zhenna/LLM_phishing_detection