
Use of Pretrained BERT to Predict the Rating of Reviews

Last Updated on June 3, 2024 by Editorial Team

Author(s): Greg Postalian-Yrausquin

Originally published on Towards AI.

BERT is a state-of-the-art model designed by Google to process text data and convert it into vectors (https://en.wikipedia.org/wiki/BERT_(language_model)). These vectors can then be analyzed by other models (classification, clustering, etc.) to produce different analyses.

What makes BERT special, apart from its good results, is that it is trained on billions of records and that Hugging Face already provides a good battery of pre-trained models we can use for different ML tasks.

That being said, pretrained BERT works best when the text is free of typos and written in standard, day-to-day language.

import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
from io import StringIO
from html.parser import HTMLParser
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.utils import resample
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import warnings
import tensorflow as tf
from sentence_transformers import SentenceTransformer

Let's take a quick look at the dataset:

maindataset = pd.read_csv("Restaurant_reviews.csv")
maindataset
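The columns used later in this article are Review (the text) and Rating (a 1 to 5 score). A quick sanity check of the shape and the rating distribution, assuming those column names, might look like this:

# quick sanity check of the columns used later (Review and Rating)
print(maindataset.shape)
print(maindataset[["Review", "Rating"]].head())
print(maindataset["Rating"].value_counts(dropna=False))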

It is clear that a quick clean-up is needed. Since I am going to use a pretrained BERT model, I will not remove stopwords or common-use words, nor apply stemming.

I use a set of functions that I keep ready for NLP work to remove "junk" from the text:

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()
    def handle_data(self, d):
        self.text.write(d)
    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def preprepare(eingang):
    # strip HTML tags, lowercase, and normalize whitespace
    ausgang = strip_tags(eingang)
    ausgang = ausgang.lower()
    ausgang = ausgang.replace(u'\xa0', u' ')
    ausgang = re.sub(r'^\s*$', ' ', str(ausgang))
    ausgang = ausgang.replace('\n', ' ').replace('\r', ' ')
    # replace punctuation and special characters with spaces
    for ch in ['|', 'ï', '»', '¿', '"', "'", '?', '!', ',', ';', '.',
               '(', ')', '{', '}', '[', ']', '~', '@', '#', '$', '%',
               '^', '&', '*', '<', '>', '/', '\\', '`', '+', '=', '_',
               '-', ':']:
        ausgang = ausgang.replace(ch, ' ')
    # keep letters only and collapse repeated spaces
    ausgang = re.sub('[^a-zA-Z]', ' ', ausgang)
    ausgang = re.sub(' +', ' ', ausgang)
    ausgang = re.sub(r'\s([?.!"](?:\s|$))', r'\1', ausgang)
    return ausgang

maindataset["NLPtext"] = maindataset["Review"]
maindataset["NLPtext"] = maindataset["NLPtext"].str.lower()
maindataset["NLPtext"] = maindataset["NLPtext"].apply(lambda x: preprepare(str(x)))

There is an extensive list of pretrained models for BERT. In our case we are looking to do classification/regression, and we are working with uncased data ("Analysis" is treated the same as "analysis"). I went for a powerful, general-purpose model trained on over 1 billion training pairs.

You can see the list of available pretrained models on the main page of the Sentence Transformers package: https://www.sbert.net/docs/sentence_transformer/pretrained_models.html

bertmodel = SentenceTransformer('all-mpnet-base-v2')

And that's it, as simple as that. I produce the embeddings for the reviews using this downloaded model:

reviews_embedding = bertmodel.encode(maindataset["NLPtext"])
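Each review is now a fixed-length dense vector; for all-mpnet-base-v2 the embedding dimension is 768, so the resulting array should have shape (number of reviews, 768):

# each review becomes a fixed-length dense vector (768 dimensions for all-mpnet-base-v2)
print(reviews_embedding.shape)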

Final preparation of the training set: normalize the embeddings and show the distribution of ratings.

emb = pd.DataFrame(reviews_embedding)
emb.index = maindataset.index

def properscaler(simio):
scaler = StandardScaler()
resultsWordstrans = scaler.fit_transform(simio)
resultsWordstrans = pd.DataFrame(resultsWordstrans)
resultsWordstrans.index = simio.index
resultsWordstrans.columns = simio.columns
return resultsWordstrans

emb = properscaler(emb)

emb['rating'] = pd.to_numeric(maindataset['Rating'], errors='coerce')
emb = emb.dropna()
sns.displot(emb['rating'])

I go ahead and split the set into train and test subsets:

outp = train_test_split(emb, train_size=0.7)
finaleval=outp[1]
subset=outp[0]

x_subset = subset.drop(columns=["rating"]).to_numpy()
y_subset = subset['rating'].to_numpy()
x_finaleval = finaleval.drop(columns=["rating"]).to_numpy()
y_finaleval = finaleval[['rating']].to_numpy()

Using Keras, I prepared a simple neural network for regression, which means no activation function on the output layer. After several runs, this is the best configuration I found in terms of activation functions and number of units per layer.

#initialize
neur = tf.keras.models.Sequential()
#layers
neur.add(tf.keras.layers.Dense(units=150, activation='relu'))
neur.add(tf.keras.layers.Dense(units=250, activation='sigmoid'))
neur.add(tf.keras.layers.Dense(units=700, activation='tanh'))

#output layer / no activation for output of regression
neur.add(tf.keras.layers.Dense(units=1, activation=None))

#using mse for regression. Simple and clear
neur.compile(loss='mse', optimizer='adam', metrics=['mse'])

#train
neur.fit(x_subset, y_subset, batch_size=5000, epochs=1000)
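As a side note, one way to compare configurations like this is to hold out part of the training data and stop when the validation error stops improving. A minimal sketch using Keras' built-in validation split and EarlyStopping (not part of the original run) would be:

# optional: monitor a validation split and stop early when it stops improving
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20,
                                              restore_best_weights=True)
neur.fit(x_subset, y_subset, batch_size=5000, epochs=1000,
         validation_split=0.2, callbacks=[early_stop])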

Predict on the test data

test_out = neur.predict(x_finaleval)

This step might not be strictly necessary, but I do it to make sure that the predictions fall between 1 and 5, as the ratings do in the original set:

# keep the test-set index by starting from one embedding column, then drop it
output = outp[1][[0]]
scal = MinMaxScaler(feature_range=(1,5))
output['predicted'] = scal.fit_transform(test_out)
output['actual'] = y_finaleval
output = output.drop(columns=[0])
# attach the original review text and sort by predicted rating
output = pd.merge(output, maindataset[['Review']], left_index=True, right_index=True)
output = output.sort_values(['predicted'], ascending=False)
pd.options.display.max_colwidth = 150
output
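Note that MinMaxScaler rescales the test predictions relative to their own minimum and maximum, which shifts individual values. A simpler alternative (my own suggestion, not what the original run used) is to clip the raw predictions to the 1 to 5 range:

# alternative to rescaling: clip raw predictions to the valid 1-5 range
output['predicted_clipped'] = np.clip(test_out.flatten(), 1, 5)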

The results look OK at a high level. Now I will examine the regression statistics:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print("R2: ", r2_score(output['actual'], output['predicted']))
print("RootMeanSqError: ", np.sqrt(mean_squared_error(output['actual'], output['predicted'])))
print("MeanAbsError: ", mean_absolute_error(output['actual'], output['predicted']))

These are actually not very good. But we can still save face here; this is where knowing the use case becomes important.

Since the main goal is extracting the bad reviews, I will mark those under 2.5 on the scale (1 and 2 in the original) as bad reviews and group the rest together.

output["RangePredicted"] = np.where(output['predicted']<=2.5,"1.Bad","2.Other")
output["RangeActual"] = np.where(output['actual']<=2.5,"1.Bad","2.Other")

ConfusionMatrixDisplay.from_predictions(y_true=output['RangeActual'] ,y_pred=output['RangePredicted'] , cmap='PuBu')
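Since classification_report was already imported, precision and recall per class can also be printed to complement the confusion matrix:

# precision/recall per class complements the confusion matrix
print(classification_report(output['RangeActual'], output['RangePredicted']))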

And the model performs very well at splitting good from bad reviews. This type of issue sometimes appears in multiclass classification. The solution is, in many cases, to split the dataset differently to reduce the number of classes and, if required, to run a second training and inference pass on the previously classified subsets to drill down into the original classes.

In this example, this is not necessary. At this point, the bad reviews can be: 1) further classified using a clustering algorithm, or 2) given to the customer service department so that they can run their own analysis and explain what can be improved.
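For the first option, a minimal sketch (assuming the embeddings and predictions computed above) could cluster only the reviews flagged as bad, using their BERT embeddings; the number of clusters here is an arbitrary choice for illustration:

# sketch: cluster only the reviews predicted as bad, using their BERT embeddings
from sklearn.cluster import KMeans

bad_idx = output[output["RangePredicted"] == "1.Bad"].index
bad_embeddings = emb.loc[bad_idx].drop(columns=["rating"])
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10)  # 5 clusters is arbitrary
bad_clusters = kmeans.fit_predict(bad_embeddings)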


Published via Towards AI
