Simple Text Classification Using fastText
Last Updated on November 5, 2023 by Editorial Team
Author(s): Sergei Issaev
Originally published on Towards AI.
Introduction
Natural language processing is being applied to business use cases at a rapidly growing rate. One of the simplest AI automations that can transform a business is text classification, since in many cases text data is still classified manually, a slow and time-consuming process.
Text classification involves developing a model capable of assigning labels to an input text. These models can be trained as a supervised learning problem, adjusting their weights based on previously seen examples and their corresponding labels.
To perform text classification in Python, we can use fastText. fastText is an open-source, lightweight library (written in C++ with Python bindings) for quickly and easily creating text classification models.
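If you don't already have the library, it is distributed on PyPI (installation specifics may vary with your environment):
pip install fasttext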
The dataset I will use for this demo is available here. It consists of Coronavirus-related tweets and their associated sentiments. There are five possible classes: extremely negative, negative, neutral, positive, and extremely positive. We will use fastText to build a model that classifies tweets from this dataset.
One thing to note is that there is a gray area between adjacent labels: where exactly is the "line" that separates positive from extremely positive, or negative from extremely negative? Some tweets fall right between two classes, and the ground-truth label for those is inherently subjective. As a result, it is unlikely that a validation accuracy of 100% will ever be attained.
If you would like to try out the code yourself, the Kaggle notebook is available here.
Processing
Below are some of the key preprocessing functions used to prepare the data for input into the fastText model. Note that these steps do not clean or modify the original text in any way; I will focus solely on setting up the data in the proper format for the model.
- process_data subselects the columns of interest (the text and its associated label are all we need). For the label, we add the prefix "__label__", since fastText expects labels in this format.
- split_data splits the data into a train set, a validation set, and a test set.
- save_data_as_txt creates the txt files fastText requires: each line contains a label followed by its text, which is what this function writes out.
# Imports used throughout this article.
import csv
import random
from typing import Any, Dict, Tuple

import fasttext
import numpy as np
import pandas as pd

def process_data(df_train: pd.DataFrame, df_test: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # Keep only the label and text columns, and prefix each label with
    # "__label__" (spaces replaced by underscores), as fastText expects.
    df_train = df_train[["Sentiment", "OriginalTweet"]].copy()
    df_test = df_test[["Sentiment", "OriginalTweet"]].copy()
    df_train["Sentiment"] = df_train["Sentiment"].apply(lambda x: "__label__" + "_".join(x.split()))
    df_test["Sentiment"] = df_test["Sentiment"].apply(lambda x: "__label__" + "_".join(x.split()))
    return df_train, df_test
def split_data(df_train: pd.DataFrame, df_test: pd.DataFrame, train_fraction: float) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # Carve a validation set off the end of the training data.
    split_point = round(df_train.shape[0] * train_fraction)
    df_train, df_val = df_train.iloc[:split_point], df_train.iloc[split_point:]
    return df_train, df_val, df_test
def save_data_as_txt(df_train: pd.DataFrame, df_validation: pd.DataFrame, df_test: pd.DataFrame) -> None:
    # Write each split as "__label__<label> <text>" lines: space-separated,
    # no header, no quoting. The filepaths are globals defined below.
    for df, filepath in [(df_train, train_filepath), (df_validation, validation_filepath), (df_test, test_filepath)]:
        df.to_csv(filepath, index=False, sep=" ", header=None, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")
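For reference, each line of the resulting txt files should look roughly like this (an illustrative, made-up row, not an actual tweet from the dataset):
__label__Extremely_Negative the shelves at my local store are completely empty again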
Now we simply set some paths and call the above functions, and the data is ready for fastText to train a model!
df_train = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv", encoding="latin1")
df_test = pd.read_csv("/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_test.csv", encoding="latin1")
train_filepath = "df_train.txt"
validation_filepath = "df_val.txt"
test_filepath = "df_test.txt"
df_train, df_test = process_data(df_train=df_train, df_test=df_test)
df_train, df_validation, df_test = split_data(df_train=df_train, df_test=df_test, train_fraction=0.9)
save_data_as_txt(df_train=df_train, df_validation=df_validation, df_test=df_test)
Fit Model
Actually, all the hard work is already out of the way. To fit the model, simply call:
model = fasttext.train_supervised(input=train_filepath)
And voilà! We have a trained model.
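As a quick sanity check, we can ask the model to label a single string (the example text is made up; predict is part of the fastText Python API and returns the top label with its probability):
# predict returns the top-k labels and their probabilities (k=1 by default).
labels, probabilities = model.predict("the shelves at my local store are completely empty again")
print(labels, probabilities)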
Although our model is trained, we don't yet know how well it performs. fastText's model.test returns the number of examples along with precision and recall at k=1; since each tweet has exactly one label, precision at 1 is simply the accuracy. Let's write a function that computes this for each split:
def obtain_accuracies(model):
    # model.test returns (number of examples, precision@1, recall@1) for a file.
    train_results = model.test(path=train_filepath)
    validation_results = model.test(path=validation_filepath)
    test_results = model.test(path=test_filepath)
    return train_results, validation_results, test_results
Calling this function with
train_results, validation_results, test_results = obtain_accuracies(model=model)
print(train_results, validation_results, test_results)
reveals a baseline training accuracy of 71.0%, a validation accuracy of 54.1%, and a test accuracy of 48.4%. Not too shabby for a baseline model!
Also, as mentioned in the introduction, if the model classifies an "extremely negative" text as "negative", this counts as a mistake when calculating the accuracy. Many of the errors made by the model are "off-by-one" in this sense, meaning the ground truth was an adjacent label.
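If you want to see how the errors break down by class, the Python binding also offers a per-label report (a brief sketch; test_label returns precision, recall, and F1 for each label):
# Per-label precision, recall, and F1 on the validation set.
for label, scores in model.test_label(path=validation_filepath).items():
    print(label, scores)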
Hyperparameter Tuning
So far, we have a baseline model with decent results. How can we do better? Luckily, fasttext.train_supervised accepts several hyperparameters that we can adjust to see whether performance improves.
This is where our validation set comes in handy: we will try several different sets of hyperparameters and evaluate the resulting models on the validation set. The set of parameters with the highest validation accuracy is the one we will use for our final model.
Which sets of hyperparameters should we try? One could set them manually and iterate by hand (or use a hyperparameter tuning library such as Optuna), but I am too lazy for that and prefer random search.
def create_training_params(baseline: bool = False) -> Dict[str, Any]:
    # With baseline=True, return an empty dict so fastText uses its defaults.
    if baseline:
        return {}
    # Sample each hyperparameter uniformly from a hand-picked range.
    epoch = random.randint(2, 120)
    wordNgrams = random.randint(1, 6)
    lr = np.random.uniform(0.02, 0.5)
    dim = random.randint(50, 200)
    minn = random.randint(0, 5)
    maxn = random.randint(0, 5)
    minCount = random.randint(1, 5)
    ws = random.randint(2, 10)
    lrUpdateRate = random.randint(50, 200)
    bucket = random.randint(200000, 20000000)
    return {
        "epoch": epoch,
        "wordNgrams": wordNgrams,
        "lr": lr,
        "dim": dim,
        "minn": minn,
        "maxn": maxn,
        "minCount": minCount,
        "lrUpdateRate": lrUpdateRate,
        "ws": ws,
        "bucket": bucket,
        "thread": 12,
    }
This function will return a set of randomly selected parameters.
We can try out a single set of random parameters by simply running:
model = fasttext.train_supervised(input=train_filepath, **create_training_params())
Let's add a couple of extra lines to create a more complete hyperparameter tuning pipeline. Below, we declare the number of search iterations and initialize best_accuracy to the baseline model's validation accuracy and best_params to an empty dict (the default parameters).
Then, we run a loop where each iteration randomly generates parameters, and trains and evaluates the model. If the new accuracy is greater than the previous record, we overwrite the best accuracy and save the best_params for future reference.
iterations = 10
best_accuracy, best_params = validation_results[1], {}
for it in range(iterations):
    # Train with a fresh random set of parameters and keep the best
    # validation accuracy (index 1 of the results tuple is precision@1).
    params = create_training_params()
    model = fasttext.train_supervised(input=train_filepath, **params)
    train_results, validation_results, test_results = obtain_accuracies(model=model)
    if validation_results[1] > best_accuracy:
        best_accuracy = validation_results[1]
        best_params = params
    print(f"Best accuracy so far: {best_accuracy}")
    print(f"Best params: {best_params}")
By doing so, we were able to obtain a final accuracy of 56.1%!
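With the winning hyperparameters in hand, we can retrain a final model and persist it for later use (a minimal sketch; the filename here is arbitrary, while save_model and load_model are part of the fastText API):
# Retrain using the best parameters found, then save the model to disk.
final_model = fasttext.train_supervised(input=train_filepath, **best_params)
final_model.save_model("fasttext_covid_tweets.bin")
# Reload later with: fasttext.load_model("fasttext_covid_tweets.bin")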
Conclusion
Digging around the other notebooks using this dataset, I found a Naive Bayes solution with an accuracy of 70% and a BERT solution with an accuracy of 88%. Clearly, both are much better than what we obtained with fastText. However, those implementations applied a significant amount of text preprocessing to the dataset, something we did not do.
If the text data is not clean, fastText may latch onto garbage tokens when searching for patterns. A further step to improve performance would be to apply preprocessing methods to the dataset; see the sketch below. Additionally, the raw dataframe contains more columns that may be useful as model input (we only used the text and label columns). I will leave these next steps up to you, since the goal of this article is to provide a generic overview of the fastText library.
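As a starting point, a cleaning pass might look something like this (clean_tweet is a hypothetical helper I'm introducing here; the exact steps should be tuned to your data):
import re

def clean_tweet(text: str) -> str:
    # Lowercase, strip URLs and @mentions, drop non-alphanumerics,
    # and collapse runs of whitespace.
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())

df_train["OriginalTweet"] = df_train["OriginalTweet"].apply(clean_tweet)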
We have learned:
- What the fastText library is
- How to preprocess a dataset for use with fastText
- How to fit a baseline model on the data
- How to tune hyperparameters to improve on the baseline results
Thank you for reading to the end! I hope you find this helpful, and good luck with your text classification tasks.
Links:
Linkedin: https://www.linkedin.com/in/sergei-issaev/
Github: https://github.com/sergeiissaev
Kaggle: https://www.kaggle.com/sergei416
Medium: https://medium.com/@sergei740
Twitter: https://twitter.com/realSergAI
Learn more about Vooban: https://vooban.com/en