Simple Text Classification Using Fasttext

Last Updated on November 5, 2023 by Editorial Team

Author(s): Sergei Issaev

Originally published on Towards AI.

Introduction

Natural language processing is being applied to a rapidly growing range of business use cases. One of the simplest AI automations that can transform a business is text classification, since in many cases text data is still classified manually in a time-consuming process.

Text classification involves developing an AI capable of assigning labels to an input text. These AI models can be trained as a supervised learning problem, adjusting their weights based on previously seen examples with corresponding labels.

To perform text classification in Python, we can use fasttext. Fasttext is an open-source and lightweight Python library capable of quickly and easily creating text classification models.

The dataset I will use for this demo is available here. This dataset consists of coronavirus-related tweets and their associated sentiments. There are five possible classes: extremely negative, negative, neutral, positive, and extremely positive. We will use fasttext to build an AI model to classify tweets using this dataset.

One thing to note is that there is a gray area between each of these labels: namely, where is the “line” that separates positive from extremely positive? Or “negative” from “extremely negative”? There are definitely some tweets that fall in the middle of these two classes, and the ground truth label might be subjective with regard to which side of the line the label ended up on. As a result, it is unlikely that a validation accuracy of 100% will ever be attained.

If you would like to try out the code yourself, the Kaggle notebook is available here.

Processing

Below are some of the key preprocessing functions used for preparing data for input into the fasttext model. Note that these steps do not clean or modify the original text in any way — I will focus solely on setting up the data in the proper format for input to the ML model.

  1. process_data subselects the columns of interest (text and associated label are all we need). For the label, we add the string “__label__”, since fasttext expects the labels to have this prefix.
  2. split_data splits the data into a train set, a validation set, and a test set.
  3. save_data_as_txt creates the txt files necessary for fasttext. Fasttext expects a txt file containing text and associated labels, which is what this function creates.
import csv
from typing import Tuple

import pandas as pd


def process_data(df_train: pd.DataFrame, df_test: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # Keep only the label and text columns; .copy() avoids pandas' SettingWithCopyWarning.
    df_train = df_train[["Sentiment", "OriginalTweet"]].copy()
    df_test = df_test[["Sentiment", "OriginalTweet"]].copy()
    # "Extremely Negative" becomes "__label__Extremely_Negative".
    df_train["Sentiment"] = df_train["Sentiment"].apply(lambda x: "__label__" + "_".join(x.split()))
    df_test["Sentiment"] = df_test["Sentiment"].apply(lambda x: "__label__" + "_".join(x.split()))
    return df_train, df_test


def split_data(df_train: pd.DataFrame, df_test: pd.DataFrame, train_fraction: float) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # The first train_fraction of the rows is kept for training; the rest becomes the validation set.
    split_point = round(df_train.shape[0] * train_fraction)
    df_train, df_val = df_train.iloc[:split_point], df_train.iloc[split_point:]
    return df_train, df_val, df_test


def save_data_as_txt(df_train: pd.DataFrame, df_validation: pd.DataFrame, df_test: pd.DataFrame) -> None:
    # Writes to the module-level train_filepath, validation_filepath, and test_filepath.
    # Each row becomes one "<label> <text>" line with no header, index, or quoting.
    for df, filepath in ((df_train, train_filepath), (df_validation, validation_filepath), (df_test, test_filepath)):
        df.to_csv(filepath, index=False, sep=" ", header=None, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
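
To make the target format concrete, here is a minimal sketch of how a single line of the txt file is put together. The helper name to_fasttext_line is my own (it is not part of fasttext or of the functions above), and the example tweet is made up.

```python
def to_fasttext_line(label: str, text: str) -> str:
    """Build one training line of the form '__label__<Label> <text>'."""
    # A label must be a single token, so spaces inside it become underscores.
    label_token = "__label__" + "_".join(label.split())
    # fasttext reads one example per line, so newlines and repeated spaces are collapsed.
    clean_text = " ".join(text.split())
    return f"{label_token} {clean_text}"


line = to_fasttext_line("Extremely Positive", "Stocked up on groceries,\nfeeling great!")
print(line)  # __label__Extremely_Positive Stocked up on groceries, feeling great!
```

Every line in df_train.txt, df_val.txt, and df_test.txt should follow this shape: the prefixed label first, then the text.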

Now, one simply sets some paths and calls the above functions, and we have all the data ready for fasttext to create a model!

df_train = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv", encoding="latin1")
df_test = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_test.csv", encoding="latin1")
train_filepath = "df_train.txt"
validation_filepath = "df_val.txt"
test_filepath = "df_test.txt"

df_train, df_test = process_data(df_train=df_train, df_test=df_test)
df_train, df_validation, df_test = split_data(df_train=df_train, df_test=df_test, train_fraction=0.9)
save_data_as_txt(df_train=df_train, df_validation=df_validation, df_test=df_test)
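
One caveat: split_data takes the first 90% of the rows as they appear in the CSV. I have not checked whether this dataset is sorted in any way, but if it is, the validation set could end up looking different from the training set. Below is a rough sketch of a shuffled split on plain Python lists; shuffled_split is a hypothetical helper, not something used in the notebook.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_split(rows: Sequence[str], train_fraction: float, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle rows with a fixed seed, then split them into train and validation parts."""
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    split_point = round(len(shuffled) * train_fraction)
    return shuffled[:split_point], shuffled[split_point:]


rows = [f"__label__Neutral tweet number {i}" for i in range(10)]
train_rows, val_rows = shuffled_split(rows, train_fraction=0.9, seed=42)
print(len(train_rows), len(val_rows))  # 9 1
```

With a DataFrame, calling df_train.sample(frac=1, random_state=42) before split_data should have the same effect.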

Fit Model

Actually, all the hard work is already out of the way. To fit the model, simply call train_supervised on the training file:

import fasttext

model = fasttext.train_supervised(input=str(train_filepath))

And voila! We have a trained model.
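
The trained model can label new text with model.predict, which returns the predicted labels along with their probabilities. The labels come back with the same “__label__” prefix we added during processing, so a small helper can turn them back into readable class names. This is only a sketch of the string handling: readable_label is a name I made up, and the input below is hard-coded instead of being a real prediction.

```python
LABEL_PREFIX = "__label__"


def readable_label(fasttext_label: str) -> str:
    """Turn '__label__Extremely_Negative' back into 'Extremely Negative'."""
    if fasttext_label.startswith(LABEL_PREFIX):
        fasttext_label = fasttext_label[len(LABEL_PREFIX):]
    # Undo the underscore joining that process_data applied to multi-word labels.
    return fasttext_label.replace("_", " ")


print(readable_label("__label__Extremely_Negative"))  # Extremely Negative
```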

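Although our model is trained, we don’t know how well it performs. Let’s write a function that evaluates it on the train, validation, and test files. For each file, model.test returns the number of examples, the precision at 1, and the recall at 1; since every tweet has exactly one label, precision at 1 is the same as accuracy: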

def obtain_accuracies(model):
    # Each result is a tuple: (number of examples, precision at 1, recall at 1).
    train_results = model.test(path=train_filepath)
    validation_results = model.test(path=validation_filepath)
    test_results = model.test(path=test_filepath)
    return train_results, validation_results, test_results

Calling this function with

train_results, validation_results, test_results = obtain_accuracies(model=model)
print(train_results, validation_results, test_results)

reveals a baseline training accuracy of 71.0%, a validation accuracy of 54.1%, and a test accuracy of 48.4% (the second value in each printed tuple). Not too shabby for a baseline model!

Also, as mentioned in the introduction, if the AI classifies an “extremely negative” text as “negative”, this counts as a mistake when calculating the accuracy. Many of the errors made by the model are “off-by-one”, meaning the predicted label was adjacent to the ground truth label.
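
To put a number on the off-by-one effect, one could score predictions twice: once strictly, and once counting a miss by a single class as correct. The sketch below assumes the five label strings produced by process_data (with this exact capitalization) and uses made-up predictions; exact_and_adjacent_accuracy is a hypothetical helper, not part of fasttext.

```python
from typing import Sequence, Tuple

# Classes ordered from most negative to most positive.
ORDERED_LABELS = [
    "__label__Extremely_Negative",
    "__label__Negative",
    "__label__Neutral",
    "__label__Positive",
    "__label__Extremely_Positive",
]
RANK = {label: position for position, label in enumerate(ORDERED_LABELS)}


def exact_and_adjacent_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float]:
    """Return (exact accuracy, accuracy when a miss by one class also counts)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    # Distance between classes on the ordered sentiment scale (0 means an exact match).
    distances = [abs(RANK[t] - RANK[p]) for t, p in zip(y_true, y_pred)]
    exact = sum(d == 0 for d in distances) / len(distances)
    within_one = sum(d <= 1 for d in distances) / len(distances)
    return exact, within_one


y_true = ["__label__Extremely_Negative", "__label__Neutral", "__label__Positive", "__label__Negative"]
y_pred = ["__label__Negative", "__label__Neutral", "__label__Extremely_Negative", "__label__Negative"]
print(exact_and_adjacent_accuracy(y_true, y_pred))  # (0.5, 0.75)
```

A large gap between the two numbers would support the idea that most mistakes land on a neighbouring class.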

Hyperparameter Tuning

So far, we have a baseline model that has shown decent results. How can we do better? Luckily, fasttext.train_supervised accepts several hyperparameters that we can adjust to see whether the model’s performance improves.

This is where our validation set comes in handy — we will try out several different sets of hyperparameters and then evaluate the resulting models on the validation set. The set of parameters that results in the highest accuracy on the validation set is the one we will want to use for our final model.

Which sets of hyperparameters should we try? Although we could set the hyperparameters manually and try to make improvements ourselves (or use a hyperparameter tuning library such as Optuna), I am too lazy for that and prefer to just use random search.

import random
from typing import Any, Dict

import numpy as np


def create_training_params(baseline: bool = False) -> Dict[str, Any]:
    # An empty dict means fasttext falls back to its default hyperparameters.
    if baseline:
        return {}
    epoch = random.randint(2, 120)
    wordNgrams = random.randint(1, 6)
    lr = np.random.uniform(0.02, 0.5)
    dim = random.randint(50, 200)
    # Character n-gram lengths: sample maxn at or above minn so the range is well-formed.
    minn = random.randint(0, 5)
    maxn = random.randint(minn, 5)
    minCount = random.randint(1, 5)
    ws = random.randint(2, 10)
    lrUpdateRate = random.randint(50, 200)
    bucket = random.randint(200000, 20000000)

    return {
        "epoch": epoch,
        "wordNgrams": wordNgrams,
        "lr": lr,
        "dim": dim,
        "minn": minn,
        "maxn": maxn,
        "minCount": minCount,
        "lrUpdateRate": lrUpdateRate,
        "ws": ws,
        "bucket": bucket,
        "thread": 12,  # number of CPU threads; adjust to your machine
    }

This function will return a set of randomly selected parameters.

We can try out a single set of random parameters by simply running:

model = fasttext.train_supervised(input=str(train_filepath), **create_training_params())

Let’s add a couple of extra lines to create a more complete hyperparameter tuning pipeline. Below, we declare the number of iterations to search, initialize best_accuracy with the baseline model’s validation accuracy, and initialize best_params with an empty dict (the default params).

Then, we run a loop where each iteration randomly generates parameters, and trains and evaluates the model. If the new accuracy is greater than the previous record, we overwrite the best accuracy and save the best_params for future reference.

iterations = 10
# Index 1 of the results tuple is precision at 1, which equals accuracy here.
best_accuracy, best_params = validation_results[1], {}
for it in range(iterations):
    params = create_training_params()
    model = fasttext.train_supervised(input=str(train_filepath), **params)
    train_results, validation_results, test_results = obtain_accuracies(model=model)
    if validation_results[1] > best_accuracy:
        best_accuracy = validation_results[1]
        best_params = params
    print(f"Best accuracy so far: {best_accuracy}")
    print(f"Best params: {best_params}")

By doing so, we were able to obtain a final accuracy of 56.1%!
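
Since the parameters are drawn from Python’s global random state, every run of this search will try different combinations. If you want repeatable searches, one option is to pass a seeded generator around. Here is a minimal sketch under that assumption: random_search, sample_toy_params, and toy_score are all names I made up, and the toy scoring function stands in for “train a fasttext model and return its validation accuracy”.

```python
import random
from typing import Any, Callable, Dict, Tuple


def random_search(
    sample_params: Callable[[random.Random], Dict[str, Any]],
    evaluate: Callable[[Dict[str, Any]], float],
    iterations: int,
    seed: int = 0,
) -> Tuple[float, Dict[str, Any]]:
    """Try `iterations` random parameter sets and return the best (score, params)."""
    rng = random.Random(seed)  # a seeded generator makes the search repeatable
    best_score, best_params = float("-inf"), {}
    for _ in range(iterations):
        params = sample_params(rng)
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def sample_toy_params(rng: random.Random) -> Dict[str, Any]:
    # Same ranges as two of the parameters in create_training_params.
    return {"lr": rng.uniform(0.02, 0.5), "epoch": rng.randint(2, 120)}


def toy_score(params: Dict[str, Any]) -> float:
    # Stand-in objective that pretends lr = 0.1 is ideal; higher is better.
    return -abs(params["lr"] - 0.1)


best_score, best_params = random_search(sample_toy_params, toy_score, iterations=50, seed=1)
print(best_score, best_params)
```

Running it twice with the same seed gives the same winner, which makes it easier to compare changes to the search ranges.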

Conclusion

Digging around the other notebooks using this dataset, I can see a Naive Bayes solution with an accuracy of 70% and a BERT solution with an accuracy of 88%. Clearly, these are much better than what we obtained with fasttext. However, those implementations applied a significant amount of text preprocessing to the dataset, something we did not do.

If the text data is not clean, the fasttext model may latch onto garbage data when looking for patterns to label the text. A further step to improve the performance would be to apply preprocessing methods to the dataset. Furthermore, there are more data columns available in the raw dataframe that may be useful as model input (we only used the text and label columns). I will leave these next steps up to you, since the goal of this article is to provide a generic overview of the fasttext library.
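
As a starting point, here is a rough sketch of what a cleaning step for tweets might look like. It is my own guess at sensible defaults (lowercasing, dropping URLs, @mentions, and punctuation), not the preprocessing used in the other notebooks, and clean_tweet is a hypothetical helper.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
NON_WORD_PATTERN = re.compile(r"[^a-z0-9\s]")  # note: this also drops non-ASCII letters


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions, and punctuation."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    text = NON_WORD_PATTERN.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace


print(clean_tweet("@Store Prices are UP!! https://t.co/abc123 #COVID19"))
# prices are up covid19
```

One could apply it to the OriginalTweet column before saving the txt files, then rerun the baseline to see whether the validation accuracy moves.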

We have learned:

  • What the fasttext library is
  • How to preprocess a dataset to be used in fasttext
  • How to fit a baseline model on the data
  • How to tune hyperparameters to improve on the baseline results

Thank you for reading to the end! I hope you find this helpful, and good luck with your text classification tasks.

Links:

Linkedin: https://www.linkedin.com/in/sergei-issaev/

Github: https://github.com/sergeiissaev

Kaggle: https://www.kaggle.com/sergei416

Medium: https://medium.com/@sergei740

Twitter: https://twitter.com/realSergAI

Learn more about Vooban: https://vooban.com/en


Published via Towards AI
