Use of Pretrained BERT to Predict the Rating of Reviews
Author(s): Greg Postalian-Yrausquin
Originally published on Towards AI.
BERT is a state-of-the-art model designed by Google to process text data and convert it into vectors (https://en.wikipedia.org/wiki/BERT_(language_model)). These can then be analyzed by other models (classification, clustering, etc.) to produce different analyses.
What makes BERT special, apart from its good results, is that it is trained on billions of records and that Hugging Face already provides a good battery of pretrained models we can use for different ML tasks.
That being said, pretrained BERT is a good tool to use when the language is free of typos and is "standard, day-to-day" language.
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
from io import StringIO
from html.parser import HTMLParser
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.utils import resample
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import warnings
import tensorflow as tf
from sentence_transformers import SentenceTransformer
Let's take a quick look at the dataset:
maindataset = pd.read_csv("Restaurant_reviews.csv")
maindataset
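Before cleaning, it is worth a couple of quick sanity checks. This assumes the file has the Review and Rating columns used throughout the rest of the article:

# quick sanity checks on the raw file
print(maindataset.shape)
print(maindataset["Rating"].value_counts())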
It is clear that a quick clean-up is needed. Since I am going to use a pretrained BERT model, I will not remove stopwords or common-use words, nor apply stemming.
I use a set of functions that I keep ready for NLP work to remove "junk":
class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
def preprepare(eingang):
    # strip HTML tags, then lowercase the stripped text
    ausgang = strip_tags(eingang)
    ausgang = ausgang.lower()
    ausgang = ausgang.replace(u'\xa0', u' ')
    ausgang = ausgang.replace('\n', ' ').replace('\r', ' ')
    # replace anything that is not a letter with a space, then collapse repeated spaces
    ausgang = re.sub('[^a-zA-Z]', ' ', ausgang)
    ausgang = re.sub(' +', ' ', ausgang)
    return ausgang
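A quick check of the cleaning function on a made-up review (the example string is hypothetical):

# HTML tags, punctuation, and digits are stripped, and the case is lowered
print(preprepare('<b>Great food!!!</b> Service was slow... 3/10'))
# prints: great food service was slow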
maindataset["NLPtext"] = maindataset["Review"]
maindataset["NLPtext"] = maindataset["NLPtext"].str.lower()
maindataset["NLPtext"] = maindataset["NLPtext"].apply(lambda x: preprepare(str(x)))
There is an extensive list of pretrained models for BERT. In our case we are looking to do classification/regression, and we are working with uncased data ("Analysis" = "analysis"). I went for a powerful general-use model trained on over 1 billion pairs.
You can see the list of available pretrained models on the main page of the Sentence Transformers package: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
bertmodel = SentenceTransformer('all-mpnet-base-v2')
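If encoding speed matters more than quality, the same pretrained list includes lighter options; for example, this alternative (not the one used in this article) produces 384-dimensional embeddings instead of 768:

# optional: a smaller and faster model from the same pretrained list
fast_model = SentenceTransformer('all-MiniLM-L6-v2')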
And that's it, as simple as that. I produce the embeddings for the reviews based on this downloaded model:
reviews_embedding = bertmodel.encode(maindataset["NLPtext"])
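Each review is now a fixed-length vector; for all-mpnet-base-v2 the embedding dimension is 768:

# one 768-dimensional vector per review
print(reviews_embedding.shape)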
Final prep of the training set: normalize the embeddings and show the distribution of the target.
emb = pd.DataFrame(reviews_embedding)
emb.index = maindataset.index
def properscaler(simio):
scaler = StandardScaler()
resultsWordstrans = scaler.fit_transform(simio)
resultsWordstrans = pd.DataFrame(resultsWordstrans)
resultsWordstrans.index = simio.index
resultsWordstrans.columns = simio.columns
return resultsWordstrans
emb = properscaler(emb)
emb['rating'] = pd.to_numeric(maindataset['Rating'], errors='coerce')
emb = emb.dropna()
sns.displot(emb['rating'])
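If you run this as a script rather than in a notebook, the plot needs an explicit call to render:

plt.show()  # renders the displot outside notebook environments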
I go ahead and split the set into train and test:
subset, finaleval = train_test_split(emb, train_size=0.7)
x_subset = subset.drop(columns=["rating"]).to_numpy()
y_subset = subset['rating'].to_numpy()
x_finaleval = finaleval.drop(columns=["rating"]).to_numpy()
y_finaleval = finaleval[['rating']].to_numpy()
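A quick sanity check on the shapes; the feature columns are the 768 scaled embedding dimensions:

# features and targets should line up row for row
print(x_subset.shape, y_subset.shape)
print(x_finaleval.shape, y_finaleval.shape)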
Using Keras, I prepared a simple neural network for regression; this means no activation function on the output layer. After several runs, this is the best configuration I found in terms of activation functions and numbers of units.
#initialize
neur = tf.keras.models.Sequential()
#layers
neur.add(tf.keras.layers.Dense(units=150, activation='relu'))
neur.add(tf.keras.layers.Dense(units=250, activation='sigmoid'))
neur.add(tf.keras.layers.Dense(units=700, activation='tanh'))
#output layer / no activation for output of regression
neur.add(tf.keras.layers.Dense(units=1, activation=None))
#using mse for regression. Simple and clear
neur.compile(loss='mse', optimizer='adam', metrics=['mse'])
#train
neur.fit(x_subset, y_subset, batch_size=5000, epochs=1000)
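One thousand epochs on a small dataset can easily overfit. A possible refinement, not used in the original run, is to hold out a validation split and stop early:

# optional sketch: stop training once the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=25, restore_best_weights=True)
neur.fit(x_subset, y_subset, validation_split=0.2,
         batch_size=5000, epochs=1000, callbacks=[early_stop])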
Predict on the test data:
test_out = neur.predict(x_finaleval)
This step might not be necessary, but I do it to make sure that the predictions are between 1 and 5, as the ratings are in the original set:
output = finaleval[[0]]
scal = MinMaxScaler(feature_range=(1,5))
output['predicted'] = scal.fit_transform(test_out)
output['actual'] = y_finaleval
output = output.drop(columns=[0])
output = pd.merge(output, maindataset[['Review']], left_index=True, right_index=True)
output = output.sort_values(['predicted'], ascending=False)
pd.options.display.max_colwidth = 150
output
The results look OK at a high level. Now I will examine the statistics of the regression:
print("R2: ", r2_score(output['actual'], output['predicted']))
print("RootMeanSqError: ", np.sqrt(mean_squared_error(output['actual'], output['predicted'])))
print("MeanAbsError: ", mean_absolute_error(output['actual'], output['predicted']))
These are actually not very good. But we can still save face here; this is where knowing the use case is important.
Since the main goal is extracting the bad reviews, I will proceed to mark those under 2.5 on the scale (1 and 2 in the original) as bad reviews and set the rest apart.
output["RangePredicted"] = np.where(output['predicted']<=2.5,"1.Bad","2.Other")
output["RangeActual"] = np.where(output['actual']<=2.5,"1.Bad","2.Other")
ConfusionMatrixDisplay.from_predictions(y_true=output['RangeActual'] ,y_pred=output['RangePredicted'] , cmap='PuBu')
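Since classification_report is already imported, it pairs naturally with the confusion matrix, printing precision and recall for the two buckets:

# precision/recall for the 'Bad' vs 'Other' split
print(classification_report(output['RangeActual'], output['RangePredicted']))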
And the model performs very well at splitting good from bad reviews. This type of issue sometimes appears in multiclass classification. The solution in many cases is to split the dataset differently to reduce the number of classes and, if required, do a second training and inference pass on the previously classified subsets to "drill down" into the original classes.
In this example, that is not necessary. At this point, the bad reviews can be: 1) further classified using a clustering algorithm, or 2) given to the customer service department so that they can run their analysis and explain what can be improved.