Use of Pretrained BERT to Predict the Rating of Reviews
Author(s): Greg Postalian-Yrausquin
Originally published on Towards AI.

BERT is a state-of-the-art model designed by Google to process text data and convert it into vectors (https://en.wikipedia.org/wiki/BERT_(language_model)). These can then be analyzed by other models (classification, clustering, etc.) to produce different analyses.
What makes BERT special, apart from its good results, is that it is trained on billions of records and that Hugging Face already provides a good battery of pretrained models we can use for different ML tasks.
That being said, pretrained BERT works best when the text is free of typos and written in standard, day-to-day language.
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import re
import warnings

# text cleanup
from io import StringIO
from html.parser import HTMLParser
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# modeling and evaluation
import tensorflow as tf
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.utils import resample

# plotting
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
Let’s take a quick look at the dataset
maindataset = pd.read_csv("Restaurant_reviews.csv")
maindataset

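For example, a couple of quick checks on size and missing values; Review and Rating are the two columns used throughout the rest of the article:
print(maindataset.shape)
print(maindataset[['Review', 'Rating']].isna().sum())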
It is clear that a quick cleanup is needed. Since I am going to use a pretrained BERT model, I will not remove stopwords or common-use words, and I will not stem.
I use a set of functions that I keep ready for NLP tasks to strip out "junk":
class MLStripper(HTMLParser):
    # collects the text content of an HTML fragment, dropping the tags
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def preprepare(eingang):
    # strip HTML tags, then lowercase the stripped text
    ausgang = strip_tags(eingang)
    ausgang = ausgang.lower()
    ausgang = ausgang.replace(u'\xa0', u' ')
    ausgang = ausgang.replace('\n', ' ').replace('\r', ' ')
    # keep letters only; punctuation, digits and symbols become spaces
    ausgang = re.sub('[^a-zA-Z]', ' ', ausgang)
    # collapse runs of spaces
    ausgang = re.sub(' +', ' ', ausgang)
    return ausgang.strip()
maindataset["NLPtext"] = maindataset["Review"]
maindataset["NLPtext"] = maindataset["NLPtext"].str.lower()
maindataset["NLPtext"] = maindataset["NLPtext"].apply(lambda x: preprepare(str(x)))
There is an extensive list of pretrained models for BERT. In our case we are looking to do classification/regression, and we are working with uncased data (Analysis = analysis), so I went for a powerful general-use model trained on over 1 billion sentence pairs.
You can see the list of available pretrained models on the Sentence Transformers documentation page: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
bertmodel = SentenceTransformer('all-mpnet-base-v2')

And that's it, as simple as that. I produce the embeddings for the reviews with the downloaded model:
reviews_embedding = bertmodel.encode(maindataset["NLPtext"])
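A quick way to confirm what the encoder returned; all-mpnet-base-v2 produces 768-dimensional sentence vectors, so the shape should be (number of reviews, 768):
print(reviews_embedding.shape)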
Final preparation of the training set: normalize the embeddings and show the distribution of the ratings.
emb = pd.DataFrame(reviews_embedding)
emb.index = maindataset.index

def properscaler(simio):
    # z-score each embedding dimension, keeping the DataFrame structure
    scaler = StandardScaler()
    resultsWordstrans = scaler.fit_transform(simio)
    resultsWordstrans = pd.DataFrame(resultsWordstrans)
    resultsWordstrans.index = simio.index
    resultsWordstrans.columns = simio.columns
    return resultsWordstrans

emb = properscaler(emb)
emb['rating'] = pd.to_numeric(maindataset['Rating'], errors='coerce')
emb = emb.dropna()
sns.displot(emb['rating'])

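The same distribution in table form, in case the plot is hard to read:
print(emb['rating'].value_counts().sort_index())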
I go ahead and split the set into train and test:
subset, finaleval = train_test_split(emb, train_size=0.7)
x_subset = subset.drop(columns=["rating"]).to_numpy()
y_subset = subset['rating'].to_numpy()
x_finaleval = finaleval.drop(columns=["rating"]).to_numpy()
y_finaleval = finaleval[['rating']].to_numpy()
Using Keras, I prepared a simple neural network for regression, which means no activation function on the output layer. After several runs, this is the best configuration I found in terms of activation functions and number of units.
#initialize
neur = tf.keras.models.Sequential()
#layers
neur.add(tf.keras.layers.Dense(units=150, activation='relu'))
neur.add(tf.keras.layers.Dense(units=250, activation='sigmoid'))
neur.add(tf.keras.layers.Dense(units=700, activation='tanh'))
#output layer / no activation for output of regression
neur.add(tf.keras.layers.Dense(units=1, activation=None))
#using mse for regression. Simple and clear
neur.compile(loss='mse', optimizer='adam', metrics=['mse'])
#train
neur.fit(x_subset, y_subset, batch_size=5000, epochs=1000)

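A variation worth trying here, though not part of the runs above: hold out part of the training data and stop when the validation loss stalls. The patience value below is an arbitrary illustration.
# sketch: early stopping with a 20% validation split (patience chosen arbitrarily)
early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=50,
                                         restore_best_weights=True)
neur.fit(x_subset, y_subset, batch_size=5000, epochs=1000,
         validation_split=0.2, callbacks=[early])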
Predict on the test data
test_out = neur.predict(x_finaleval)
This step might not be necessary, but I do it to be sure that the predictions stay between 1 and 5, as in the original rating scale.
# keep one column of the test set only to preserve the original index
output = finaleval[[0]]
scal = MinMaxScaler(feature_range=(1,5))
output['predicted'] = scal.fit_transform(test_out)
output['actual'] = y_finaleval
output = output.drop(columns=[0])
output = pd.merge(output, maindataset[['Review']], left_index=True, right_index=True)
output = output.sort_values(['predicted'], ascending=False)
pd.options.display.max_colwidth = 150
output

The results look OK at a high level. Now I will examine the regression statistics:
print("R2: ", r2_score(output['actual'], output['predicted']))
print("RMSE: ", np.sqrt(mean_squared_error(output['actual'], output['predicted'])))
print("MAE: ", mean_absolute_error(output['actual'], output['predicted']))

These are actually not very good. But we can still save face here; this is where knowing the use case matters.
Since the main goal is to extract the bad reviews, I will mark those under 2.5 on the scale (1 and 2 in the original) as bad reviews and group the rest together.
output["RangePredicted"] = np.where(output['predicted']<=2.5,"1.Bad","2.Other")
output["RangeActual"] = np.where(output['actual']<=2.5,"1.Bad","2.Other")
ConfusionMatrixDisplay.from_predictions(y_true=output['RangeActual'] ,y_pred=output['RangePredicted'] , cmap='PuBu')

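The classification_report imported at the top complements the confusion matrix with per-class precision and recall:
print(classification_report(output['RangeActual'], output['RangePredicted']))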
And the model performs very well at splitting good from bad reviews. This type of situation often appears in multiclass classification. The solution is in many cases to split the dataset differently to reduce the number of classes and, if required, run a second training and inference pass on the previously classified subsets to drill down into the original classes.
In this example, that second pass is not necessary. At this point, the bad reviews can be: 1) further classified using a clustering algorithm, or 2) handed to the customer service department so they can run their own analysis and explain what can be improved.
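For the first option, a minimal sketch of what that clustering could look like; KMeans and the choice of 5 clusters are illustrative assumptions, not part of the analysis above:
from sklearn.cluster import KMeans

# cluster the embeddings of the reviews predicted as bad (5 clusters is arbitrary)
bad_idx = output[output['RangePredicted'] == "1.Bad"].index
bad_emb = emb.loc[bad_idx].drop(columns=['rating'])
output.loc[bad_idx, 'cluster'] = KMeans(n_clusters=5, n_init=10).fit_predict(bad_emb)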