Large Language Model Training Pipeline For NLP Text Classification
Author(s): Lu Zhenna
Originally published on Towards AI.
Summary
I want to share a project that got me an interview. It should benefit aspiring data scientists, especially those who need an LLM project for their portfolio. This project trains a few baseline models and a RoBERTa model to perform a binary classification task on short texts. It is structured into a training pipeline and an inference pipeline. The trained models are then packaged as a Flask app that lets users type a text input in a browser and get a prediction promptly. In short, users only interact with a web page. Sounds cool, right? Let's get our hands dirty!
Target Audience
- Aspiring data scientists who need hands-on experience with MLOps and LLM training but are too intimidated to start.
- Data scientists who want to move beyond Jupyter Notebooks to executable Python scripts.
- Data professionals who want to learn some basic docker skills.
- Data scientists who want to try PyTorch without a GPU.
Outline
- Repo Structure
- Training Pipeline
- Inference Pipeline
- Containerize the Trained Models and Pipelines
- Containerize the Web App and Serve the Model
This article aims to teach MLOps rather than machine learning. However, if you need more specific guidance that is not covered in this article, feel free to raise your questions in the comment section; I might dedicate a new article to answering them.
Problem Statement:
My task was to develop an end-to-end training pipeline that performs a binary text classification task. I was specifically told to train a Large Language Model, even though smaller models could also get the job done. (I know it's overkill, but such is life, and so is the job market.) Subsequently, the trained model should be served as a Flask application that lets users enter a text message and get a prediction.
1. Repo Structure
Before we delve into the details, feel free to clone my repo by running git clone https://github.com/Zhenna/LLM_phishing_detection.git in the terminal. You need a dataset for binary classification. Let’s take a spam email dataset for now and put it in the raw_data/ directory. Make sure your repo has everything shown below.
LLM_PHISHING_DETECTION
├── src # model pipeline
│ ├── __init__.py
│ ├── get_data.py
│ ├── infer.py
│ ├── main.py
│ ├── preprocess.py
│ ├── train.py
│ └── utils_and_constants.py
├── raw_data (confidential)
│ └── {data_name}.csv (confidential)
├── templates
│ ├── index.html
│ └── predict.html
├── .dockerignore
├── Dockerfile # containerize web app
├── Dockerfile.python # containerize python code
├── app.py # inference endpoint
├── requirements.txt
└── README.md
The src directory contains the Python scripts to preprocess data, train a model, and make an inference. The raw_data directory contains the CSV file with texts and their corresponding binary labels. The templates directory contains the HTML files where users type a text message and receive the prediction. In the root directory, there are the Dockerfiles that containerize the code and the web app, the app.py that launches the Flask app, and the requirements.txt with all dependencies.
Since DevOps is not the main focus here, I won't elaborate too much on what Docker does and why we need it. You can still run the pipeline without touching Docker or the Flask app. In that case, please create a new Python virtual environment and install all dependencies by running pip install -r requirements.txt before we start.
If you encounter the ModuleNotFoundError: No module named 'src' error message while running the code, please run export PYTHONPATH=$PYTHONPATH:LLM_phishing_detection in the terminal to add the repo directory to the Python module search path.
Once you have completed this step, you are halfway done. Congratulations!
2. Training Pipeline
Let’s do it together and I will explain more along the way.
The command-line arguments are defined in src/main.py. You may view the code in my GitHub repo. The argument parser collects the optional and required arguments from the user. If you are not familiar with argparse, please read the official documentation.

Since you will likely be using a different training set, with a different filename and different dataframe columns, these details have to be passed in as parameters. If you are using the spam email dataset from Kaggle, you may enter the arguments as shown in the commands below. More specifically, you need to indicate the dataset name, the column header for the text, and the column header for the binary labels.
You may have noticed from the argument parser that the training pipeline offers two model types: choices=["LLM", "baseline"]. The baseline models for this exercise include Naive Bayes, Logistic Regression, KNN, SVM, and XGBoost. (Guess which model will produce the best accuracy?) On the LLM side, there is only RoBERTa. You could include more, but because my machine does not have a GPU, I will stick to one LLM, RoBERTa.
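For reference, here is a minimal sketch of what such a parser could look like. The flag names match the commands used later in this article, but the exact option names, defaults, and help text in the actual repo may differ.

import argparse

def parse_args():
    # Collect the command-line arguments for the training and inference pipelines.
    parser = argparse.ArgumentParser(
        description="Train or run inference on a binary text classifier."
    )
    parser.add_argument("-t", "--task", choices=["train", "infer"], required=True,
                        help="Run the training or the inference pipeline.")
    parser.add_argument("-mt", "--model_type", choices=["LLM", "baseline"], required=True,
                        help="Fine-tune RoBERTa or fit the baseline models.")
    parser.add_argument("-c", "--csv_name", required=True,
                        help="Name of the CSV file inside raw_data/.")
    parser.add_argument("-l", "--label_column", required=True,
                        help="Column header for the binary labels.")
    parser.add_argument("-n", "--text_column", required=True,
                        help="Column header for the text.")
    parser.add_argument("-i", "--input_text",
                        help="Message to classify (inference only).")
    return parser.parse_args()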

Shall we start with the baseline models? Type python src/main.py -t train -mt baseline -c emails.csv -l spam -n text in the terminal.

It should be fast. The evaluation metrics of all baseline models will appear. For this spam email dataset, Logistic Regression has the best performance. For my original binary training data with balanced labels, XGBoost was the best model. Simple models are actually pretty powerful, right?

Why did I include F1, precision, and recall besides accuracy? Because the binary training dataset might be imbalanced, and accuracy alone can look deceptively high on imbalanced data.
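To make the baseline stage concrete, here is a minimal, self-contained sketch of that train-and-evaluate loop. It assumes TF-IDF features, 0/1 labels, and the emails.csv column names used in the command above; the repo's actual preprocessing may differ.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

df = pd.read_csv("raw_data/emails.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["spam"], test_size=0.2, random_state=42, stratify=df["spam"]
)

# Turn the raw text into TF-IDF features shared by all baselines.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

baselines = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": LinearSVC(),
    "XGBoost": XGBClassifier(),
}

for name, model in baselines.items():
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, preds, average="binary"
    )
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f} "
          f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")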

After training, a copy of each trained model is saved, so that during inference we can load these models for direct prediction.
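Here is a tiny sketch of that save-and-reload step, assuming joblib for persistence and hypothetical file names (the repo may store the models differently):

import os
import joblib

# Persist the fitted vectorizer and the best model for the inference pipeline.
os.makedirs("outputs/model", exist_ok=True)
joblib.dump(vectorizer, "outputs/model/tfidf_vectorizer.joblib")
joblib.dump(model, "outputs/model/best_baseline.joblib")

# Later, during inference:
vectorizer = joblib.load("outputs/model/tfidf_vectorizer.joblib")
model = joblib.load("outputs/model/best_baseline.joblib")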

Next, let's train an LLM. To be more precise, we will fine-tune a pre-trained RoBERTa model.
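For orientation, here is a condensed sketch of that fine-tuning step using the Hugging Face transformers Trainer. The roberta-base checkpoint, two epochs, batch size 8, and max length 128 are illustrative choices, not necessarily what the repo uses.

import numpy as np
import pandas as pd
import evaluate
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Rename the label column to "labels", which the Trainer expects.
df = pd.read_csv("raw_data/emails.csv").rename(columns={"spam": "labels"})
dataset = Dataset.from_pandas(df[["text", "labels"]]).train_test_split(test_size=0.2)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return f1_metric.compute(predictions=preds, references=labels)

training_args = TrainingArguments(
    output_dir="outputs/checkpoints",
    num_train_epochs=2,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())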

Once the training starts, a progress bar with the estimated time to completion will appear. My local machine does not have a GPU, so fine-tuning RoBERTa takes approximately 18 hours.

During training, if you encounter this error message: FileNotFoundError: Couldn't find a module script at /Users/…/LLM_phishing_detection/f1/f1.py. Module 'f1' doesn't exist on the Hugging Face Hub either., it means the evaluate library has to be installed directly from source. Try running git clone https://github.com/huggingface/evaluate.git in the terminal.
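If cloning the repo does not resolve it, one workaround (my suggestion, not necessarily what the repo does) is to avoid the Hugging Face Hub lookup entirely and compute F1 with scikit-learn inside compute_metrics:

import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    # No call to evaluate.load("f1"), so nothing is fetched from the Hub.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}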
When training completes, you will see the evaluation metrics of the LLM. With the spam email dataset, RoBERTa outperformed all the baseline models and achieved an F1 score of 0.99.

Most important of all, make sure the RoBERTa model has been saved to the local directory outputs/model/roberta-trained after training.
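Assuming the Trainer and tokenizer objects from the sketch above, saving boils down to two calls:

# Save the fine-tuned weights plus the tokenizer so that inference
# can reload everything from a single directory.
trainer.save_model("outputs/model/roberta-trained")
tokenizer.save_pretrained("outputs/model/roberta-trained")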

After training the models for a painfully long time, we can finally use them for prediction.
3. Inference Pipeline
Let’s test it out!
We will use the best-performing baseline model to predict whether the sample message, "Congratulations! You won a price!", is spam. Please enter the command python src/main.py -t infer -mt baseline -c emails.csv -l spam -n text -i 'Congratulations! You won a price!'.

Uh-oh, a moment of awkwardness! The prediction is negative: the message Congratulations! You won a price! passed as non-spam.

Next, let's try again using the LLM. Enter the command python src/main.py -t infer -mt LLM -c emails.csv -l spam -n text -i 'Congratulations! You won a price!'.

The LLM seems more reasonable and returned a positive prediction. This is the most spammy message I could come up with. LOL.
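Under the hood, the LLM branch of the inference step presumably does something like the following sketch, which uses the transformers pipeline helper to reload the saved model (the repo may load it differently):

from transformers import pipeline

# Reload the fine-tuned RoBERTa saved under outputs/model/roberta-trained.
classifier = pipeline("text-classification", model="outputs/model/roberta-trained")

result = classifier("Congratulations! You won a price!")[0]
# result is a dict like {"label": ..., "score": ...}; map the label back
# to spam / non-spam according to how the labels were encoded in training.
print(result)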

Later, we shall deploy the LLM to users.
If ML model deployment does not interest you, you may stop here. Otherwise, please continue reading.
4. Containerize the Trained Models and Pipelines
Usually, data scientists will tweak the model parameters a few times, or even modify the model architecture, to enhance performance. Let's skip this step and go straight to packaging all the trained models in a Docker container. The end goal is to serve the models through a web app later.
Why should we “containerize” it? Because we want to replicate the running environment on the customer's machine. We don't want customers to hit a pile of ModuleNotFoundError messages when they use our product. In addition, a discrepancy in the Python version or in any dependency's version might also break the code.
The Dockerfile.python is what we need to containerize the trained models and pipeline.
# A slim Python base image keeps the final image small
FROM python:3.11-slim
WORKDIR /LLM_PHISHING_DETECTION/
# Copy and install dependencies first so Docker can cache this layer
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
COPY . .
# Same fix as the PYTHONPATH export from earlier
ENV PYTHONPATH="/LLM_PHISHING_DETECTION"
# Arguments passed to `docker run` are appended after this entrypoint
ENTRYPOINT [ "python3", "src/main.py" ]
Please make sure you have installed Docker Desktop locally. (Here is the official documentation on how to install Docker Desktop. Let me know if a separate article on Docker would benefit you. :))
To build a Docker image named <docker-image-name> from the present working directory, run docker build . -f Dockerfile.python -t <docker-image-name>.

On a side note, you may use the .dockerignore file to exclude certain files from the Docker image. In fact, the only things we really need are the trained LLM and the inference pipeline code. However, for demonstration purposes, we will containerize everything so that we can use Docker to train models too.
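For example, if you did decide to ship only the inference artifacts, a .dockerignore along these lines (illustrative; adjust to your own repo) would keep the raw data and training checkpoints out of the image:

# .dockerignore (illustrative)
.git/
__pycache__/
*.ipynb
raw_data/
outputs/checkpoints/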
As a sanity check, please enter docker images and verify that the image has been built successfully.

As shown above, the Docker image named my-model is ready for use. To run it, we shall use docker ... commands in place of the python ... commands.

Please control the urge to re-train the LLM. Let's use the command docker run my-model -t infer -mt LLM -c emails.csv -l spam -n text -i 'I enjoy learning MLOps!' to make an inference.

This message is categorized as non-spam. So far so good.

Isn't it amazing? You have learnt so much, from training models to deploying them. One more step before we wrap up. Hang in there!
5. Containerize the Web App and Serve the Model
To data scientists, model performance matters the most. To users/customers, how they interact with the deployed model matters a lot too. My point is that we shouldn’t expect the customers to run any commands or write code. We have to build a front-end user interface, for example, a web page.
To keep the web app simple, we will only use two endpoints. The first renders index.html, which takes the text input from users. The second renders predict.html, which returns the prediction to users.

Both templates can be found in the templates/ directory. You can amend the page names and headers as needed.
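For reference, a minimal app.py wiring the two templates together might look like the sketch below. The route names, the message form field, and the prediction template variable are my assumptions, so align them with the actual HTML in templates/.

from flask import Flask, render_template, request
from transformers import pipeline

app = Flask(__name__)

# Load the fine-tuned model once at startup rather than once per request.
classifier = pipeline("text-classification", model="outputs/model/roberta-trained")

@app.route("/")
def index():
    # Landing page with the text input form.
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    message = request.form["message"]  # form field name is an assumption
    result = classifier(message)[0]
    return render_template("predict.html", prediction=result["label"])

The __main__ block shown at the end of this section completes the file.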
Next, we need a complete Docker image with the web app inside. This time, let's use the Dockerfile instead of Dockerfile.python.
FROM python:3.11-slim
# The Flask server listens on port 5000 inside the container
EXPOSE 5000/tcp
WORKDIR /LLM_PHISHING_DETECTION/
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
COPY . .
# Bind to 0.0.0.0 so the app is reachable from outside the container
CMD ["flask", "run", "--host", "0.0.0.0" ]
You may notice that the Dockerfile for the web app has a few more settings, such as the port number 5000 and the host 0.0.0.0. Make sure they are consistent with what app.py uses when launching the Flask web app.
if __name__ == '__main__':
    # app.run(debug=True)
    app.run(host="0.0.0.0", port=5000)
To build the Docker image, run docker build . -t my-app.

Then, run docker images to check that it has been built. The new Docker image named my-app appears. Let's try it out!

To start the Flask server, run docker run -p 5000:5000 my-app in the terminal.

Next, copy and paste http://127.0.0.1:5000 into your browser.

The first Flask endpoint routes to the index.html for users to enter a text message.

Clicking the “Predict” button routes to predict.html with a prediction label. The prediction is positive again.

We have pulled off a pretty intense project! Good job, everybody!
Link to my repo: https://github.com/Zhenna/LLM_phishing_detection
Follow me on LinkedIn | 👏🏽 for my story | Follow me on Medium
Published via Towards AI