Six Warnings You Ignore That Might Put Your Image Classification Dataset at Risk
Last Updated on October 22, 2021 by Editorial Team
Author(s): Gaurav Sharma
Deep Learning
“Opportunity never knocks twice,” as the saying goes. In the hands of image annotators, this clear-cut checklist will help data scientists address gaps in their training datasets that were neglected or overlooked during the image cleaning process.
The obligation of an image annotator working on an image classification assignment is not just to complete the picture labelling task at hand, but also to tell data scientists about the following alarms, which, if not handled immediately, may introduce unanticipated risks into the datasets.
1. There is an excessive amount of “duplication”
Duplication means that many pictures in the dataset repeat, within the same class or across classes.
It might be due to a variety of factors, such as the data scientist scraping the same image-heavy page several times, or identical photographs being available on two distinct web pages.
Alternatively, the open dataset that the data scientist handed to the labelling team for custom labels was not properly cleaned.
Whatever the reason, repeated images make it difficult for a Data Scientist's Machine Learning model to generalize, because it keeps learning the same information over and over.
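One quick way to surface exact duplicates is to hash every file and flag any hash that occurs more than once. Below is a minimal sketch in Python; the `dataset/` folder layout and the `find_exact_duplicates` helper are illustrative, and near-duplicates (resized or re-encoded copies) would need a perceptual hash, such as the `imagehash` library provides, instead.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(dataset_dir: str) -> dict:
    """Group image files by the MD5 hash of their raw bytes.
    Any hash mapped to more than one path is an exact duplicate."""
    groups = defaultdict(list)
    for path in Path(dataset_dir).rglob("*"):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png", ".bmp"}:
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}

# "dataset/" is a hypothetical root folder containing the images.
for paths in find_exact_duplicates("dataset/").values():
    print("Duplicate group:", [str(p) for p in paths])
```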
2. Images that are blurry (unless the entire dataset is blurry)
When dealing with a computer vision use case, the Machine Learning model cannot extract descriptive information or characteristics about the object of interest from blurry or pixelated pictures, simply because the visual detail is not there.
As a result, labellers must tell the Data Scientist about the situation and allow them to take the appropriate action.
But here's the catch: if the whole dataset is blurry, it is possible that the Data Scientist is working on a production use case that requires image blurriness; in that case, just confirm with the Data Scientist.
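A common, lightweight proxy for blur is the variance of the Laplacian: sharp images have plenty of edge detail and therefore a high variance, while blurry ones score low. The sketch below uses OpenCV; the 100.0 cutoff is an assumption that must be tuned per dataset.

```python
import cv2

def blur_score(image_path: str) -> float:
    """Variance of the Laplacian of the grayscale image.
    Low values indicate little edge detail, i.e. likely blur."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise ValueError(f"Could not read {image_path}")
    return cv2.Laplacian(gray, cv2.CV_64F).var()

# The 100.0 cutoff is only a starting point, not a universal rule.
if blur_score("sample.jpg") < 100.0:
    print("sample.jpg looks blurry -- flag it for the Data Scientist")
```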
3. There are too many ambiguous instances
The quality of the inputs supplied to a Machine Learning model sets the ceiling on what it can learn for a given task: garbage in, garbage out.
If the Data Scientist gives the annotation team a dataset with too many ambiguous instances, pictures that cannot be confidently assigned to a single class, the data labellers just need to express their concerns to the Data Scientist and ask him or her for the next best set of instructions.
4. Bias in the dataset toward a specific class
This is the warning that demands extreme vigilance from data labellers, so they should keep it in mind while labelling image classification datasets or any other computer vision dataset.
If they see that one class has an excessive number of images in comparison to the other class or classes, they must notify the Data Scientists team as soon as possible. Otherwise, the dataset will be used to train a Machine Learning model that favours the class with the most pictures over the others.
In other words, the Machine Learning model will favour that specific class, and once that AI model is deployed, the bias might result in a loss of income or a public relations setback.
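Imbalance is easy to quantify before labelling even finishes. The sketch below assumes the usual one-folder-per-class layout and an arbitrary 10x alarm ratio; both are assumptions for illustration, not part of any particular workflow.

```python
from pathlib import Path

def class_counts(dataset_dir: str) -> dict:
    """Count image files per class, assuming one sub-folder per class."""
    return {
        d.name: sum(1 for f in d.iterdir() if f.is_file())
        for d in Path(dataset_dir).iterdir()
        if d.is_dir()
    }

counts = class_counts("dataset/")  # hypothetical folder layout
largest, smallest = max(counts.values()), min(counts.values())
print(counts)
# The 10x ratio is an arbitrary alarm threshold; tune it per project.
if smallest and largest / smallest > 10:
    print("Severe class imbalance -- notify the Data Scientists team")
```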
5. The object of interest or the class to be labelled appears blurry
This situation shows up more often at the class level than at the level of individual pictures. So, while doing the picture labelling task, if the data labeller notices that the object of interest or the class(es) to be classified seems hazy or indistinct across the images, they should simply notify the Data Scientist and seek his or her opinion on how to proceed.
The Data Scientists team may then decide to replace or delete those pictures from the ongoing collection.
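Because this warning operates at the class level, one way to spot it is to aggregate the Laplacian blur score from warning 2 per class folder and look for a class whose average sits far below the rest. This is a sketch under the same assumptions as before (one sub-folder per class, OpenCV available, hypothetical `dataset/` path):

```python
from pathlib import Path
from statistics import mean

import cv2

def mean_blur_per_class(dataset_dir: str) -> dict:
    """Average Laplacian variance per class folder. A class whose
    average is far below the others is likely blurry as a whole."""
    scores = {}
    for class_dir in Path(dataset_dir).iterdir():
        if not class_dir.is_dir():
            continue
        values = []
        for f in class_dir.glob("*.jpg"):
            gray = cv2.imread(str(f), cv2.IMREAD_GRAYSCALE)
            if gray is not None:
                values.append(cv2.Laplacian(gray, cv2.CV_64F).var())
        if values:
            scores[class_dir.name] = mean(values)
    return scores

print(mean_blur_per_class("dataset/"))
```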
6. The designated object of interest or class is only partially visible.
“Half knowledge is dangerous,” as the saying goes, and this is true for every computer vision dataset in the world. If the object of interest is only partially visible in the image, it might hamper the overall result.
In this case, the image annotator should notify the Data Scientist, so that she or he may take the necessary steps to address these missing-context pictures in the Image Classification Dataset.
Endnote
I'm hoping that the next time a Data Scientist assigns an Image Classification task, he or she will pass these warning signs on to the data annotation team. Doing so will ultimately help organizations' Machine Learning teams build datasets that offer a genuine and complete picture of the objects of interest. Cogito Tech LLC provides accurate, high-quality training datasets for ML and AI models.