
Tools and Techniques I Used for Cleanlab’s Data-centric AI Competition 2023

Last Updated on July 17, 2023 by Editorial Team

Author(s): Giorgos Papachristoudis

Originally published on Towards AI.

I had so much fun participating in Cleanlab’s Data-centric AI (DCAI) competition! You can read the competition announcement here. This event comprised two distinct contests: one focused on text and the other on images. The shared objective of these contests was to accurately classify data points within a clean test set. However, the training data presented typical real-world challenges, including label errors and outliers.

Being a data-centric competition, the emphasis was not on deploying large, resource-intensive models. Instead, the spotlight was on innovative methods to enhance and cleanse the training data. The data contained two categories of errors: incorrectly labeled points (label errors) and points deviating from the problem’s underlying distribution (outliers).

The philosophy of data-centric approaches embraces simplicity, utilizing straightforward techniques or models to unearth data issues. Once these issues have been identified, the course of action involves either eliminating or, optimally, fixing these examples. In this article, I’m excited to take you behind the scenes of my journey to the top spot in the competition. I will share the unique blend of techniques, tools, and models that became my allies in tackling the challenges. So, let’s explore how we made it to 1st place together!

1. Data Description

The datasets we worked with were of two modalities: a text dataset consisting of Amazon reviews with an associated star rating, and an image dataset consisting of alphanumeric character images. In both competitions, the metric of success was accuracy on the "clean" test set.

The text data looks like this:

I have been a Maximum PC reader since its beginning and still read it cover to cover. Awesome magazine. Love it, 4

In other words, we have the review and its rating from 1 to 5 (1 being the worst and 5 the best). Unfortunately, a nontrivial chunk of reviews is assigned the wrong label, like the following:

Love this magazine — the best GF magazine I have seen by far, 1

The goal is to find those reviews and either discard them or correct their rating and retrain the model on cleaner data.

The image data looks like this:

Each image is a 60×60 array of white (value: 1.0) and black (value: 0.0) pixels, representing one of the 26 letters of the English alphabet or one of the 7 digits. Digits 0, 1, and 9 are excluded, most likely so that learning is easier (since digits 0, 1, and 9 can easily be confused with letters o, l, and q, respectively).

2. Text competition

2.1 Baseline model

Since we entered this part of the contest pretty late, we limited our experiments to tree models. Experimenting with different models and using hyperparameter grid search, we found that a random forest classifier with 250 trees and a minimum of 3 samples per leaf node performed best.

For feature preprocessing, we used a column transformer with a numeric and a text processor.
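
The original code snippet is not reproduced here; below is a minimal sketch of what such a pipeline could look like, assuming a hypothetical "review" text column and the handcrafted numeric columns described in the next section.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical column names; the handcrafted numeric features are described in Sec. 2.2
text_features = "review"
numerical_features = ["rating_1_keyword_count", "rating_5_keyword_count"]

preprocessor = ColumnTransformer(
    transformers=[
        ("txt", TfidfVectorizer(), text_features),    # text processor
        ("num", "passthrough", numerical_features),   # numeric processor (trees need no scaling)
    ]
)

model = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        # Hyperparameters found via the grid search mentioned above
        ("clf", RandomForestClassifier(n_estimators=250, min_samples_leaf=3)),
    ]
)
```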

For text_features, we used the review text, while for numerical_features we used features that we derived from a clean version of a review — more on this topic later. We experimented with different parameters of TfidfVectorizer (e.g., allowing bigrams, trigrams, increasing the vocabulary size, binarizing the transformation, etc.), but not only did we observe no gains, we also saw some drop in performance. The baseline model gave us 62.9% validation accuracy. From a quick inspection, we believed that this was due to the high presence of mislabeled examples in our training data (we estimated mislabeled reviews to represent 16–20% of the data).

2.2 Handcrafting features

Using the results from the baseline model, we did an extensive study of the most frequent ngrams per rating. For example, for one-star and two-star reviews, we identified ngrams like ‘not as promised’, ‘have not yet’, ‘have to wait’, ‘just a waste’, ‘not a good read’, ‘hate’, etc. For five-star reviews, we identified common ngrams like ‘as promised’, ‘like … much’, ‘great deal’, ‘cannot beat’, and so on. To reduce surface-level variation, we converted all contractions to their full versions (e.g., “won’t” → “will not”). We did this for each of the 5 review ratings and computed the absolute number of times a handcrafted keyword appears in a review. We found this to be indicative of the polarity of the review. For example, the review

Love love my magazine, I get so excited!, 5

is more polarized than the review

I am new to the magazine and love it so far!, 4

In other words, the number of times a keyword appears is linked to the polarity of the review. Lastly, we constructed aggregate features that sum all keyword occurrences per rating, giving a rough estimate of how many keywords associated with each rating appear in a review.
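
A rough sketch of this feature construction is shown below; the keyword lists are a small, hypothetical subset of the n-grams we actually mined per rating.

```python
import re

KEYWORDS_PER_RATING = {
    1: ["not as promised", "just a waste", "not a good read", "hate"],
    5: ["as promised", "great deal", "cannot beat", "love"],
    # ... similar (longer) lists for ratings 2, 3 and 4
}

CONTRACTIONS = {"won't": "will not", "can't": "cannot", "don't": "do not"}

def expand_contractions(text: str) -> str:
    # Normalize contractions so keyword matching is consistent
    for short, full in CONTRACTIONS.items():
        text = re.sub(short, full, text, flags=re.IGNORECASE)
    return text

def keyword_counts(review: str) -> dict:
    """Aggregate keyword-occurrence counts per rating for a single review."""
    clean = expand_contractions(review.lower())
    return {
        f"rating_{r}_keyword_count": sum(clean.count(kw) for kw in kws)
        for r, kws in KEYWORDS_PER_RATING.items()
    }
```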

2.3 Cleaning the data

Since the presence of mislabeled data creates a serious bottleneck in our classifier’s performance, the next step was to clean the data. We used cleanlab’s find_label_issues method, which identifies potentially bad labels.

The main two arguments of this method are the labels (provided as ground truth) and the predicted probabilities per example. For multiclass classification problems with K classes, each element in labels takes a value between 0 and K-1, while pred_probs is an N×K probability matrix, where N is the number of examples and each row sums to 1. In other words, pred_probs[i, k] represents the probability the model assigns to example i belonging to class k: P(y[i]=k|x[i]). The idea behind flagging label issues is pretty simple: assuming a model has been trained on (somewhat) clean data, it will be capable of predicting the class of an example correctly. More mathematically put, argmax_{k ∈ 0…K-1} P(y[i]=k|x[i]) will most likely give us the correct label for example x[i]. In contrast, if an example's label (as denoted by labels[i]) is wrong, we would have labels[i] ≠ argmax_{k ∈ 0…K-1} P(y[i]=k|x[i]).

Cleanlab goes a step further and provides a quality score for each label, between 0 and 1 (0 meaning the label is most likely incorrect, 1 most likely correct). There are three methods to score label quality: “self_confidence”, “normalized_margin”, and “confidence_weighted_entropy”. The idea behind all three is pretty intuitive: the more confident a model is about the label of an example, the more the probability mass is concentrated around this label. If we assume that example i's label is k, then “self_confidence” simply returns pred_probs[i, k]. If the label is incorrect (by the model's understanding of the world), we would expect pred_probs[i, k] to be small. This method, however, does not take into account the confidence of the model with respect to the other classes (labels). Method “normalized_margin” does this by returning the gap between pred_probs[i, k] and max_{k' ≠ k} pred_probs[i, k']. This method compares the prediction score at the ground-truth label against the most confident prediction among the remaining classes. Lastly, method “confidence_weighted_entropy” does something similar to “normalized_margin”, but takes into account the whole probability mass landscape as represented by the entropy. We see this behavior in the toy example below:

Both models would identify this review as mislabeled. However, the model on the right is much more confident in doing so.

Even though the review is positive, it is mislabeled as a 1-star review. The model on the left is pretty uncertain in its predictions (it spreads its probability mass fairly evenly across ratings), while the one on the right is pretty certain (it predicts the label to be 5 with probability 80%). Since the ground-truth label is 1, the quality score under “self_confidence” for both models would be pred_probs[i, 1] = 0.1. In other words, the “self_confidence” method only focuses on the absolute confidence of the classifier regarding the ground-truth label (1 in this example). That is not great, as we completely miss the fact that the leftmost classifier does not give very confident responses overall. In contrast, method “normalized_margin” computes the difference between pred_probs[i, 1] and max_{k' ≠ 1} pred_probs[i, k']. In the (left) uncertain model, that is 0.1 − 0.3 = −0.2, while in the right (confident) model that is 0.1 − 0.8 = −0.7. Under the confident model, this label is much more confidently marked as incorrect. Something very similar happens when we compute quality scores with the “confidence_weighted_entropy” method.
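
To make this concrete, here is a tiny worked example using probability vectors chosen to match the margins above (the exact numbers in the figure may differ slightly):

```python
import numpy as np

label_idx = 0  # the (incorrect) ground-truth label "1 star" sits at index 0
uncertain = np.array([0.10, 0.20, 0.20, 0.20, 0.30])  # left model
confident = np.array([0.10, 0.05, 0.02, 0.03, 0.80])  # right model

for name, p in [("uncertain", uncertain), ("confident", confident)]:
    self_confidence = p[label_idx]
    # gap between the score at the given label and the best alternative
    normalized_margin = p[label_idx] - np.delete(p, label_idx).max()
    print(name, self_confidence, round(normalized_margin, 2))
# uncertain 0.1 -0.2
# confident 0.1 -0.7
```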

  • Note #1: In our example above, we derived negative scores when we used the “normalized_margin” method, while we said above that label quality scores lie between 0 and 1. Don't worry about that. There's some rescaling going on in the underlying code, but this rescaling does not alter the underlying process described here!
  • Note #2: When we introduced the find_label_issues method, we made a very optimistic assumption: that the model has been trained on (somewhat) clean data. This is obviously not true for real data, as there are various sources of noise. To get around this, the documentation of find_label_issues suggests using out-of-sample predicted probabilities as pred_probs. The reason is that out-of-sample probabilities provide a less biased estimate for each data point, since the data point in question is not used during training. You can use sklearn's cross_val_predict method to generate out-of-sample probabilities.

Of course, if the folds used for model training contain noisy data, cross_val_predict cannot work miracles, but it is still a good starting point for generating out-of-sample probabilities and consequently identifying erroneous examples via the find_label_issues method.
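
A minimal sketch of this workflow, assuming model, X, and labels come from the baseline pipeline and dataset above:

```python
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Out-of-sample predicted probabilities: each row is predicted by a model
# that never saw that row during training
pred_probs = cross_val_predict(model, X, labels, cv=5, method="predict_proba")

# Indices of likely label errors, worst offenders first
ranked_issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",  # or "normalized_margin", "confidence_weighted_entropy"
)
```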

2.4 Using pretrained models

In the text part of the competition, we only had ~12.7K examples, a considerable chunk of which was mislabeled. As a result, the model overfit to the noise as well. We employed a bert-base-multilingual-uncased model (LiYuan/amazon-review-sentiment-analysis) finetuned on hundreds of thousands of product reviews in six languages. The base model already possesses a very good general understanding of language, and its finetuning on customer reviews gives it an additional edge in gauging the polarity of a review.

2.4.1 Producing more unbiased out-of-sample probabilities

Since the BERT model has been trained on much more (and potentially cleaner) data, we computed zero-shot predictions for the reviews in our data using the pretrained model. The idea here is that the BERT model is finetuned on data that are close enough to ours, so transfer learning should work. Indeed, that was the case. We re-ran find_label_issues, this time replacing pred_probs with the output of the pretrained model. A manual inspection of the first 100 examples with the largest issues, as identified by cleanlab's method, was really revealing.

Here we show the top 5 mislabeled examples as identified by find_label_issues when pred_probs is computed from a large pretrained model. The rating of these 5 reviews is 1, but the model predicts 5.

The lower bias of these pred_probs really improves the results of the find_label_issues method. We then removed the most problematic examples flagged by find_label_issues and re-ran our classifier. The results improved considerably: test accuracy increased to 80–84%.
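
For reference, a rough sketch of how these zero-shot pred_probs could be produced (batching and device handling are simplified; list_of_reviews is a hypothetical variable holding the raw review texts):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "LiYuan/amazon-review-sentiment-analysis"  # predicts 1 to 5 star ratings
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

@torch.no_grad()
def zero_shot_pred_probs(reviews, batch_size=32):
    probs = []
    for i in range(0, len(reviews), batch_size):
        batch = tokenizer(
            reviews[i : i + batch_size],
            padding=True, truncation=True, return_tensors="pt",
        )
        logits = model(**batch).logits
        probs.append(torch.softmax(logits, dim=-1))
    return torch.cat(probs).numpy()

# These probabilities can then replace the cross-validated pred_probs
# passed to find_label_issues
pred_probs = zero_shot_pred_probs(list_of_reviews)
```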

2.4.2 Using the output of pretrained models as additional features

Lastly, we reached the 89.3% mark by including the output of the pretrained model as additional features along with the numerical handcrafted features and the tf-idf features. Since we used a tree-based model, feature scaling was not necessary. We repeated this exercise with other pretrained models as well: we combined the pretrained model cardiffnlp/twitter-roberta-base-sentiment-latest (a RoBERTa model trained on 124M tweets) with the LiYuan pretrained model, but we actually observed a small drop in performance.
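
A small sketch of how the pretrained model's probabilities could be appended as extra numeric columns next to the handcrafted ones (df is a hypothetical DataFrame of reviews and handcrafted features; the tf-idf features are still produced inside the pipeline):

```python
# pred_probs comes from the pretrained model sketch above (shape: N x 5)
bert_cols = [f"bert_prob_{star}" for star in range(1, 6)]
for i, col in enumerate(bert_cols):
    df[col] = pred_probs[:, i]

# These columns are then added to the numeric part of the ColumnTransformer
numerical_features = ["rating_1_keyword_count", "rating_5_keyword_count"] + bert_cols
```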

2.5 cleanlab — what worked and what didn't

  • The find_label_issues method is extremely helpful! Its usefulness, though, greatly depends on the quality of pred_probs (“garbage in, garbage out”).
  • For this reason, we found that applying find_label_issues iteratively produces better results: we obtain out-of-sample predictions under a baseline model, apply find_label_issues, carefully remove the flagged examples (some manual inspection might be necessary in this step), retrain the model on the cleaner data, obtain better out-of-sample predictions under this improved model, re-run find_label_issues, and repeat until we no longer see meaningful changes in the model output (a sketch of this loop follows the list).
  • Removing data is, in principle, worse than correcting labels, since we throw away precious data. However, when classification is subjective, correcting examples can lead to worse performance. In our context, what I consider a 3- or 4-star review, someone else might consider a 5-star review. We especially observed this with ratings that are close together (i.e., 1 vs. 2, and 4 vs. 5): we saw several cases where the model predicted a rating that was only one off from the ground-truth rating. This was less of an issue for the image competition, as the alphanumeric character in an image is a more objective call.
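
Here is a minimal sketch of that iterative loop (the pipeline object, the number of rounds, and the simple drop-all policy are assumptions; in the competition we inspected flagged examples manually):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

def iterative_clean(model, X, y, n_rounds=3):
    """X: feature array/DataFrame, y: integer labels in 0..K-1."""
    X_clean, y_clean = X.copy(), np.asarray(y).copy()
    for _ in range(n_rounds):
        # Out-of-sample probabilities from the current (cleaner) data
        pred_probs = cross_val_predict(
            clone(model), X_clean, y_clean, cv=5, method="predict_proba"
        )
        # Boolean mask of suspect labels
        issues = find_label_issues(labels=y_clean, pred_probs=pred_probs)
        if not issues.any():
            break
        # Here we simply drop the flagged rows; manual review is preferable
        X_clean, y_clean = X_clean[~issues], y_clean[~issues]
    return X_clean, y_clean
```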

Below, we provide a summary of the things we tried and the resulting performance:

3. Image competition

3.1 Baseline model

For a baseline model, we used the following CNN network:
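
The original architecture listing isn't reproduced here; the sketch below is a hypothetical reconstruction consistent with the description that follows (60×60 single-channel input, AvgPool2d, 33 output classes, and roughly 351K trainable parameters under these assumed layer sizes):

```python
import torch.nn as nn

class BaselineCNN(nn.Module):
    def __init__(self, n_classes: int = 33):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),  # 60x60 -> 30x30
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),  # 30x30 -> 15x15
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 15 * 15, 23), nn.ReLU(),
            nn.Linear(23, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```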

We found that AvgPool2d gave the model more robustness than MaxPool2d. We speculate this is because average pooling computes the average over a window. In many images, the part of the image that corresponds to the character is a contiguous region of white pixels (value 1.0), while other parts contain randomly placed white and black pixels. Max pooling does not distinguish between these two cases, while average pooling does (it outputs small values for regions with randomly placed white and black pixels and large values for regions containing the character). This model has ~351K trainable parameters. We train for 16 epochs and make sure the validation and training splits are stratified so that we maintain the relative class frequencies. Stratification is very important here as we have many classes (33), and the most prevalent class (digit 2) is ~2 times more frequent than the least common one (letter u). This first baseline gave us 61.4% validation accuracy.

3.2 Removing OOD data

In contrast to the text competition, we found an additional source of noise coming from out-of-distribution (OOD) data. These are data that have nothing to do with the underlying problem or, more formally, examples whose distribution differs from the data-generating distribution of the underlying problem. These data need to be removed from the dataset.

For this reason, we used cleanlab's OutOfDistribution class. The idea behind OutOfDistribution is simple. First, a NearestNeighbors model is built from the training data. Then, we compute the KNN distance of each data point, defined as the average distance between the data point and its K nearest neighbors from the training data. OOD examples tend to be farther from their K nearest neighbors than their non-OOD counterparts. Of course, the quality of the output depends on how unbiased the predicted probabilities or the features representing the input (in this case, images) are. Unfortunately, the output of OutOfDistribution contained more than just OOD examples, potentially due to a lack of high-quality representations of the input. In the batch of images identified as OOD by OutOfDistribution, we observe many images (in green color) to be normal.

Titles in green mean that there's a match between the prediction and the ground-truth label, while titles in red mean that there's a mismatch. OOD examples are circled in red.

In the figure above, the number inside the square brackets represents the image id, the character next to it is the ground-truth label for this image, and the character inside the parentheses is the model's prediction. For example, for the image in the upper-right corner, [9039] m (pred: w) means that the image id is 9039, the ground-truth label is m, and the prediction is w. This example is also an OOD example: you will notice that it represents a character that does not belong to any of the English alphabet characters or any of the 7 digits.

For this part, we made a lot of manual effort: for the 700 examples most likely to be OOD according to cleanlab's OutOfDistribution class, we inspected them manually and identified 270 truly OOD examples. Removing these OOD examples led to better model performance on the test data.
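
For reference, a minimal sketch of the OOD scoring step described above, assuming the images are flattened into raw pixel feature vectors (train_images is a hypothetical N×60×60 array):

```python
import numpy as np
from cleanlab.outlier import OutOfDistribution

# Flatten each 60x60 image into a 3600-dimensional feature vector
features = train_images.reshape(len(train_images), -1).astype("float32")

ood = OutOfDistribution()
ood_scores = ood.fit_score(features=features)  # lower score = more likely OOD

# Inspect the most suspicious candidates manually (we reviewed ~700 of them)
candidate_idx = np.argsort(ood_scores)[:700]
```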

3.3 Identifying mislabeled examples

In this stage, we used cleanlab's find_label_issues method. We confirmed that the top examples identified by the method indeed had issues. For example, the batch with the most problematic examples is shown below:

Top examples identified by cleanlab’s find_label_issues method as mislabeled are indeed mislabeled.

3.3.1 Removing mislabeled examples

When we simply remove the mislabeled examples, we go from 8.9K to 7.6K examples. On the one hand, this makes for a better model as it will train on cleaner data. On the other hand, it is more susceptible to overfitting (since it will be trained on less data).

Removing OOD and mislabeled examples leads to an increase in validation accuracy from 61.4% to 70.8%.

3.3.2 Correcting mislabeled examples

Removing examples might result in a cleaner dataset, but it also leads to a smaller training set. Given that the process of acquiring data is usually very expensive, it is a waste of resources to throw data away. In this stage, we instead corrected the mislabeled examples with the labels predicted by the model trained on clean data. The validation accuracy increased to 83%!
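
A small sketch of this correction step (issue_idx and clean_model_pred_probs are assumed to come from the previous steps; labels is an integer array):

```python
# Replace the labels flagged as issues with the predictions of the model
# trained on the cleaned data
corrected_labels = labels.copy()
corrected_labels[issue_idx] = clean_model_pred_probs[issue_idx].argmax(axis=1)
```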

3.4 Increasing the data size (data augmentation)

As we mentioned in Sec. 3.1, the model has ~351K trainable parameters. With a dataset of less than 10K data points, it is hard for the model not to overfit the training data. Having inspected the data, we realized that each alphanumeric character shows recurring patterns of noise: the character is either injected with black-pixel noise, rotated, or shifted. We tried several different image transformations (as shown below), from simple random rotation (rnd_rotation_tx_01), to random cropping (rnd_crop_tx_03), to more composite transformations like random rotation followed by black-pixel noise injection (rnd_rotation_and_black_pixel_noise_tx_10), and so on.

After extensive experimentation, we only kept the following image transformations: rnd_rotation_tx_01, center_crop_tx_04, black_pixel_noise_tx_07, and rnd_rotation_and_black_pixel_noise_tx_10. In fact, we found that creating data from transformations not relevant to the underlying data distribution (like gaussian_blur_tx_06) hurts performance. To give an idea of what an image transformation looks like, below is an example of an image to which we apply a random rotation followed by injecting black pixels at random locations at varying degrees of noise.

Original image on the left. The remaining six images are generated from the original one after we apply random rotation and varying degrees of black pixel noise.
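
A rough sketch of this kind of transformation (angle ranges and noise fractions here are assumptions; the actual competition transforms may have been implemented differently):

```python
import torch
from torchvision import transforms

class RandomBlackPixelNoise:
    """Sets a random fraction of pixels to black (value 0.0)."""
    def __init__(self, max_fraction: float = 0.2):
        self.max_fraction = max_fraction

    def __call__(self, img: torch.Tensor) -> torch.Tensor:
        fraction = torch.rand(1).item() * self.max_fraction
        mask = torch.rand_like(img) < fraction
        return torch.where(mask, torch.zeros_like(img), img)

rnd_rotation_and_black_pixel_noise = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    RandomBlackPixelNoise(max_fraction=0.2),
])

# augmented = rnd_rotation_and_black_pixel_noise(image_tensor)  # 1 x 60 x 60 float tensor
```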

With this process, we can create as many examples as needed. In fact, we experimented with augmented sets of size anywhere between 40K and 200K. With this size, we increased validation accuracy to 98.2% and test accuracy to 99.8%.

3.5 Finetuning large pretrained models

In all of our experiments so far, we used a relatively medium-sized CNN (~351K parameters) trained from scratch and achieved pretty decent results. As a last step, we tried a large pretrained ResNet50 model. The architecture of a ResNet50 model is shown below.

ResNet50 model architecture.

This model consists of ~24M parameters. To finetune it on our data, we had to make some small modifications (since the original ResNet50 has been trained on RGB images, while our data are binary, single-channel images) and add an MLP layer to project the model's output to our space of 33 alphanumeric characters.
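
A rough sketch of one possible version of these modifications (alternatively, the grayscale images could simply be repeated across three channels to keep the pretrained first layer; the exact head used in the competition may have differed):

```python
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained weights (newer torchvision API; older versions use pretrained=True)
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Original ResNet50 expects 3-channel RGB input; our images are single-channel
resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Replace the 1000-class ImageNet head with a projection to the 33 classes
resnet.fc = nn.Linear(resnet.fc.in_features, 33)
```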

We noticed that this much bigger model showed slightly better performance than the simpler convolutional network we used, but it was (as expected) several orders of magnitude slower, and we could only run it on a GPU (we used a free Google Colab instance for this).

Below, we provide a summary of the things we tried and the resulting performance:

4. Conclusion

The philosophy behind data-centric approaches is to focus on cleaning and augmenting the data. Better-quality data often yield higher performance on the underlying task than much more complicated models do. The cleanlab library can prove very helpful in doing that. With routines like find_label_issues and OutOfDistribution, we can identify problematic examples and take action. We found empirically through the contest that an iterative approach gives the best results. In other words, we can identify a first batch of problematic examples with cleanlab using a baseline model, clean the data, use this cleaner version to train a better model, and keep identifying problematic examples as the quality of our model's responses improves.

On another note, even though I didn't use it for the competition, I know that Cleanlab offers a data correction tool called Cleanlab Studio that can automatically find and fix data issues without having to write any code or make all of the manual effort I had to put in during this competition. A final comment is that exploratory data analysis is always key to diving deep into a problem and identifying important patterns. This, in combination with understanding the inner workings of models and data-cleaning processes (instead of using them as black boxes), can help you understand their strengths and limitations and point you in the right direction for further improvements.


Published via Towards AI
