Use of Pretrained BERT to Predict the Rating of Reviews
Author(s): Greg Postalian-Yrausquin
Originally published on Towards AI.

BERT is a state-of-the-art model designed by Google to process text data and convert it into vectors (https://en.wikipedia.org/wiki/BERT_(language_model)). These can then be analyzed by other models (classification, clustering, etc.) to produce different analyses.
What makes BERT special, apart from its good results, is that it is trained on billions of records and that Hugging Face already provides a good battery of pretrained models we can use for different ML tasks.
That being said, pretrained BERT works best when the text is free of typos and written in standard, day-to-day language.
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import re
import warnings

# text cleanup
from io import StringIO
from html.parser import HTMLParser
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

# modeling and evaluation
import tensorflow as tf
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.utils import resample

# plotting
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
Let’s take a quick look at the dataset
maindataset = pd.read_csv("Restaurant_reviews.csv")
maindataset

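For example, a couple of quick checks on size and missing values; Review and Rating are the two columns used throughout the rest of the article:
print(maindataset.shape)
print(maindataset[['Review', 'Rating']].isna().sum())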
It is clear that a quick cleanup is needed. Since I am going to use a pretrained BERT model, I will not remove stopwords or common-use words, and I will not stem.
I use a set of functions that I keep ready for NLP tasks to strip out "junk":
class MLStripper(HTMLParser):
    # collects the text content of an HTML fragment, dropping the tags
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def preprepare(eingang):
    # strip HTML tags, then lowercase the stripped text
    ausgang = strip_tags(eingang)
    ausgang = ausgang.lower()
    ausgang = ausgang.replace(u'\xa0', u' ')
    ausgang = ausgang.replace('\n', ' ').replace('\r', ' ')
    # keep letters only; punctuation, digits and symbols become spaces
    ausgang = re.sub('[^a-zA-Z]', ' ', ausgang)
    # collapse runs of spaces
    ausgang = re.sub(' +', ' ', ausgang)
    return ausgang.strip()
maindataset["NLPtext"] = maindataset["Review"]
maindataset["NLPtext"] = maindataset["NLPtext"].str.lower()
maindataset["NLPtext"] = maindataset["NLPtext"].apply(lambda x: preprepare(str(x)))
There is an extensive list of pretrained models for BERT. In our case we are looking to do classification/regression, and we are working with uncased data (Analysis = analysis), so I went for a powerful general-use model trained on over 1 billion sentence pairs.
You can see the list of available pretrained models on the Sentence Transformers documentation page: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
bertmodel = SentenceTransformer('all-mpnet-base-v2')

And that's it, as simple as that. I produce the embeddings for the reviews with the downloaded model:
reviews_embedding = bertmodel.encode(maindataset["NLPtext"])
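A quick way to confirm what the encoder returned; all-mpnet-base-v2 produces 768-dimensional sentence vectors, so the shape should be (number of reviews, 768):
print(reviews_embedding.shape)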
Final preparation of the training set: normalize the embeddings and show the distribution of the ratings.
emb = pd.DataFrame(reviews_embedding)
emb.index = maindataset.index

def properscaler(simio):
    # z-score each embedding dimension, keeping the DataFrame structure
    scaler = StandardScaler()
    resultsWordstrans = scaler.fit_transform(simio)
    resultsWordstrans = pd.DataFrame(resultsWordstrans)
    resultsWordstrans.index = simio.index
    resultsWordstrans.columns = simio.columns
    return resultsWordstrans

emb = properscaler(emb)
emb['rating'] = pd.to_numeric(maindataset['Rating'], errors='coerce')
emb = emb.dropna()
sns.displot(emb['rating'])

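The same distribution in table form, in case the plot is hard to read:
print(emb['rating'].value_counts().sort_index())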
I go ahead and split the set into train and test:
subset, finaleval = train_test_split(emb, train_size=0.7)
x_subset = subset.drop(columns=["rating"]).to_numpy()
y_subset = subset['rating'].to_numpy()
x_finaleval = finaleval.drop(columns=["rating"]).to_numpy()
y_finaleval = finaleval[['rating']].to_numpy()
Using Keras, I prepared a simple neural network for regression, which means no activation function on the output layer. After several runs, this is the best configuration I found in terms of activation functions and number of units.
#initialize
neur = tf.keras.models.Sequential()
#layers
neur.add(tf.keras.layers.Dense(units=150, activation='relu'))
neur.add(tf.keras.layers.Dense(units=250, activation='sigmoid'))
neur.add(tf.keras.layers.Dense(units=700, activation='tanh'))
#output layer / no activation for output of regression
neur.add(tf.keras.layers.Dense(units=1, activation=None))
#using mse for regression. Simple and clear
neur.compile(loss='mse', optimizer='adam', metrics=['mse'])
#train
neur.fit(x_subset, y_subset, batch_size=5000, epochs=1000)

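A variation worth trying here, though not part of the runs above: hold out part of the training data and stop when the validation loss stalls. The patience value below is an arbitrary illustration.
# sketch: early stopping with a 20% validation split (patience chosen arbitrarily)
early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=50,
                                         restore_best_weights=True)
neur.fit(x_subset, y_subset, batch_size=5000, epochs=1000,
         validation_split=0.2, callbacks=[early])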
Predict on the test data
test_out = neur.predict(x_finaleval)
This step might not be necessary, but I do it to be sure that the predictions stay between 1 and 5, as in the original rating scale.
# keep one column of the test set only to preserve the original index
output = finaleval[[0]]
scal = MinMaxScaler(feature_range=(1,5))
output['predicted'] = scal.fit_transform(test_out)
output['actual'] = y_finaleval
output = output.drop(columns=[0])
output = pd.merge(output, maindataset[['Review']], left_index=True, right_index=True)
output = output.sort_values(['predicted'], ascending=False)
pd.options.display.max_colwidth = 150
output

The results look OK at a high level. Now I will examine the regression statistics:
print("R2: ", r2_score(output['actual'], output['predicted']))
print("RMSE: ", np.sqrt(mean_squared_error(output['actual'], output['predicted'])))
print("MAE: ", mean_absolute_error(output['actual'], output['predicted']))

These are actually not very good. But we can still save face here; this is where knowing the use case matters.
Since the main goal is to extract the bad reviews, I will mark those under 2.5 on the scale (1 and 2 in the original) as bad reviews and group the rest together.
output["RangePredicted"] = np.where(output['predicted']<=2.5,"1.Bad","2.Other")
output["RangeActual"] = np.where(output['actual']<=2.5,"1.Bad","2.Other")
ConfusionMatrixDisplay.from_predictions(y_true=output['RangeActual'] ,y_pred=output['RangePredicted'] , cmap='PuBu')

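The classification_report imported at the top complements the confusion matrix with per-class precision and recall:
print(classification_report(output['RangeActual'], output['RangePredicted']))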
And the model performs very well at splitting good from bad reviews. This type of situation often appears in multiclass classification. The solution is in many cases to split the dataset differently to reduce the number of classes and, if required, run a second training and inference pass on the previously classified subsets to drill down into the original classes.
In this example, that second pass is not necessary. At this point, the bad reviews can be: 1) further classified using a clustering algorithm, or 2) handed to the customer service department so they can run their own analysis and explain what can be improved.
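For the first option, a minimal sketch of what that clustering could look like; KMeans and the choice of 5 clusters are illustrative assumptions, not part of the analysis above:
from sklearn.cluster import KMeans

# cluster the embeddings of the reviews predicted as bad (5 clusters is arbitrary)
bad_idx = output[output['RangePredicted'] == "1.Bad"].index
bad_emb = emb.loc[bad_idx].drop(columns=['rating'])
output.loc[bad_idx, 'cluster'] = KMeans(n_clusters=5, n_init=10).fit_predict(bad_emb)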