Towards Artificial Intelligence — Overcoming Data Challenges
Author(s): Ramkumar Hariharan
Originally published on Towards AI.
The many varieties of messy data, and their fixes
“Data Mining is what’s mine is mine and what’s yours is also mine”, Sydney Brenner
From issues of insufficient data and missing values to imbalances and outliers, getting your data Machine Learning (ML) ready is pivotal. Some of these data cleaning or preprocessing steps are limited in scope and are applicable only in certain scenarios or with specific data types. Others are more generic and widely applicable concepts.
I. Size Matters, even in the World of Data Science
When it comes to creating predictive machine learning (ML) models that extrapolate well to new data, you need a decent amount of good, clean data. The size of the accessible training data will, in turn, partly determine the type of models you can build: interpretable vs. black-box, linear vs. non-linear, the extent of non-linearity achievable, and finally how well their predictions extrapolate to data never seen during training.
With this in mind, it’s sometimes useful to think of data science projects as belonging to distinct “data economic” strata, each set by arbitrary bounds on data size:
In the data world, there are data billionaires and data millionaires, but also the penury-stricken data “hundred-aires” and data “ten-aires”. Throughout my biomedical Ph.D., I largely belonged to these last two categories: a hapless data destitute constantly trying to procure more data.
What solutions exist to deal with having barely enough data to train your ML model?
While there’s fuzziness around the definition of “not enough” data, it’s usually better to have vastly more observations (think patients, for example) than variables or features (think patient age, sex, height, weight).
A low-hanging-fruit next step is to check whether more data is available. This may well require overcoming hurdles to its access, or costs associated with its generation (say, by running more experiments in the laboratory or field). Regardless of the upshot of your additional data-gathering pursuits, there’s value in applying data augmentation, especially with images.
For building a classifier or object detection model in computer vision, image augmentation often adds value. There are multiple routes to augment the existing image data: transforming images in multiple ways to capture additional, different points of view, using Generative Adversarial Networks (GANs) to synthesize more data using the existing images as a seed, or sometimes merely copy-pasting the existing images. This can be paired with creative ML model training. Typically, we use some form of Transfer Learning or Fine-Tuning to leverage pre-trained Artificial Neural Networks built and optimized on other, unrelated large image datasets. Sterling examples of such fine-tunable pre-trained models are the ResNets, and several candidates can be found in a model zoo. Several groups have reported record-breaking results with such data augmentation combined with transfer learning approaches.
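As a rough sketch of this recipe, the snippet below pairs a few torchvision augmentations with fine-tuning of a pre-trained ResNet. The data/train folder layout, the frozen backbone, and the hyperparameters are illustrative assumptions, not a prescription:

```python
# A minimal sketch: image augmentation plus transfer learning with torchvision.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Augmentations: random crops, flips, and color jitter synthesize
# extra "points of view" from the same underlying images.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Hypothetical folder of images arranged as data/train/class_name/image.jpg.
train_data = datasets.ImageFolder("data/train", transform=train_transforms)
loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)

# Fine-tuning: start from an ImageNet-pretrained ResNet, freeze the
# backbone, and replace the classification head for our own classes.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, len(train_data.classes))

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for images, labels in loader:  # a single pass, for illustration
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```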
Would such data augmentation solutions work as well with structured, tabular data? (Structured data has defined features or columns, each with a distinct meaning, such as the size of a house, its number of rooms, etc.) There has been some recent research in this field, and a few innovative solutions seem to work. However, with tabular data, you largely have to tackle data insufficiency at the level of model training, and make only cautiously optimistic claims about the model’s broad applicability.
With structured tabular data that only runs into dozens of rows, the ML models that usually work well are versions of linear models and their regularized cousins such as lasso, ridge, or elastic net.
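As an illustration, here is a minimal scikit-learn sketch comparing these regularized cousins; the synthetic dataset stands in for your own small table of dozens of rows:

```python
# A minimal sketch: regularized linear models on a small tabular dataset.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

# 60 rows, 8 features: the "data ten-aire" regime discussed above.
X, y = make_regression(n_samples=60, n_features=8, noise=10.0, random_state=0)

for name, model in [
    ("lasso", LassoCV(cv=5)),
    ("ridge", RidgeCV()),
    ("elastic net", ElasticNetCV(cv=5)),
]:
    scores = cross_val_score(model, X, y, cv=5)  # R^2 by default
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```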
II. Data and its Imbalance
In one of his recent talks, Andrew Ng emphasized the importance of data quality over quantity. Extending this: prima facie, it may appear you have access to tens of thousands or even millions of data observations. But if most of the data or image files belong to just one kind or class, then only a very small part of the vast, multi-dimensional data universe is getting sampled. With such image datasets, data transformations of the kind we talked about previously will partially offset the problem.
Structured tabular data whose rows (= data observations) belong overwhelmingly to one class (we’re talking machine learning classifier models here) will warrant re-adjusting or re-balancing. Several ways exist to tackle this data imbalance beast, the goal being to train the ML model equally with instances from all the different classes.
Thus, under-sampling methods discard rows from the dominant class (say, data from normal, healthy individuals), while oversampling approaches synthesize extra rows of the less represented class (say, data from diabetics in the population). Hybrid techniques use varying proportions of over- and under-sampling. The vast majority of these elegant data re-balancing methods are implemented in the Python package imbalanced-learn. Encouragingly, this package works seamlessly with the most popular Python-based ML toolbox, scikit-learn, and is under active development.
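A minimal sketch with imbalanced-learn, using a synthetic 95/5 class split as a stand-in for real data:

```python
# A minimal sketch: over- and under-sampling with imbalanced-learn.
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic two-class data: roughly 95% "healthy", 5% "diabetic".
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# Oversample the minority class with synthetic points (SMOTE)...
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# ...or discard rows from the majority class instead.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("after undersampling:", Counter(y_under))
```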
III. The Missing Value Problem
Structured tabular data with multiple fields or features can harbor varying proportions of missing or null values. Addressing such gaps in the data, which most ML algorithms will not accept as input, often calls for sophistication.
If there are too many missing values, say 75% or more, in a field or row, the straightforward solution is to delete or drop the offending row or column. The caveat: “if you can afford to do so”. For example, if a column with a large percentage of missing values happens to be important for your prediction or model, then it needs to be salvaged. Such never-to-be-deleted data features, and those containing only a small proportion of missing values, can benefit from careful data imputation.
Sometimes, the reason behind the missing values in some features may be known. For example, people wishing to conceal aspects of their social behavior are more likely than others to leave parts of a data-generating questionnaire unfilled. This knowledge can be used to subjectively fill in such missing-value fields.
Several imputation (data filling-in) methods exist for both categorical and continuous-valued data columns. Missing categorical values are often best remedied by mapping the NAs to a distinct category or level. With continuous values, computing the mean (or median) of the remaining values in the training data and using it to impute the missing entries is often highly effective.
However, further up the sophistication scale, there are packages that let you predict and impute missing values by applying ML methods to the other features.
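As a rough sketch, scikit-learn covers both ends of this spectrum: SimpleImputer for the simple fills described above, and the (still experimental) IterativeImputer for model-based imputation. The tiny arrays below are purely illustrative:

```python
# A minimal sketch: simple and model-based imputation with scikit-learn.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, np.nan], [7.0, 8.0]])

# Mean imputation for continuous columns.
print(SimpleImputer(strategy="mean").fit_transform(X))

# Model-based imputation: each feature with missing values is
# predicted from the other features.
print(IterativeImputer(random_state=0).fit_transform(X))

# For categoricals, map NAs to their own distinct level.
cats = np.array([["red"], ["blue"], [np.nan]], dtype=object)
print(SimpleImputer(strategy="constant", fill_value="missing").fit_transform(cats))
```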
IV. Training ML Models with Unlabelled Data
A vastly different challenge is that of no labels: large datasets running into tens of thousands or even millions of rows may arrive without any. In the absence of the target variable (the Y label), does training an ML model even make sense? Well, yes and no. In datasets with at least a small amount of labeled data among the sea of unlabelled points, semi-supervised learning is a viable option.
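For a flavor of this, here is a minimal sketch with scikit-learn's SelfTrainingClassifier, where 90% of the labels are hidden behind the library's -1 marker (the split is an illustrative assumption):

```python
# A minimal sketch: semi-supervised learning via self-training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Hide 90% of the labels; -1 marks "unlabelled" for scikit-learn.
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1

# The base classifier must expose predict_proba for self-training.
model = SelfTrainingClassifier(SVC(probability=True)).fit(X, y_partial)
print("accuracy on true labels:", model.score(X, y))
```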
One other way to use unlabelled data, especially if it’s images or text (sentiment prediction), is to solicit annotation or labeling. Crowdsourcing the images or text labeling task on Amazon MTurk can work, as long as budget and data privacy considerations can be met. There’s also an increasing number of companies that offer custom-tailored data labeling services.
A tangential but potentially useful endeavor that can be accomplished with unlabelled data is unsupervised learning. While this seldom yields any target-variable predictions, it might well uncover insightful, inherent structure in the data. The existence of customer groups or clusters in retail datasets illustrates this line of inquiry.
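As a small illustration, a k-means sketch on a synthetic "retail-like" table; the features and the choice of three clusters are assumptions for demonstration:

```python
# A minimal sketch: discovering customer clusters with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic customers described by, say, spend and visit frequency.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```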
V. Outliers as the Bad Eggs of the Data World
The last, but not in any way the least, oft-encountered data challenge is the presence of outliers. These represent the “bad eggs” of the data nest. To throw out these bad eggs, a three-pronged strategy is often effective: (a) detect the outliers, (b) try to understand the process that created them, and (c) decide what to do with them.
In the case of univariate or single-feature data, the extremely large or small values usually constitute the outliers, and there are formal methods to fish for them. But with multi-dimensional data, detecting outliers suddenly turns challenging, with no universally accepted consensus on the optimal technique. A reasonable, subjective approach involves compressing the feature columns using Principal Component Analysis (PCA), followed by plotting the first two or three principal components to visually spot any outliers.
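A minimal sketch of this PCA-and-plot approach, with three outliers deliberately planted in synthetic data:

```python
# A minimal sketch: eyeballing multi-dimensional outliers via PCA.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
X[:3] += 8  # plant three obvious outliers

pcs = PCA(n_components=2).fit_transform(X)
plt.scatter(pcs[:, 0], pcs[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()  # the three planted points sit far from the main cloud
```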
There also exist standard statistical methods based on linear models: residuals or errors from the model that lie several standard deviations away from the mean can point to the presence of outliers. If you are looking for a Python-based software toolbox to do all of this, PyOD offers a meaningful starting point.
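For a taste of PyOD, here is a minimal sketch with its KNN detector; the synthetic data and planted outliers are illustrative:

```python
# A minimal sketch: outlier detection with PyOD's KNN detector.
import numpy as np
from pyod.models.knn import KNN

rng = np.random.RandomState(0)
# 200 inliers plus 5 points shifted well away from the cloud.
X = np.vstack([rng.normal(size=(200, 4)), rng.normal(loc=6, size=(5, 4))])

detector = KNN().fit(X)
print("points flagged as outliers:", int(detector.labels_.sum()))  # 1 = outlier
```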
Perhaps the really helpful, but at times somewhat philosophical, step is the next one: investigating the source of the outliers. For example, finding nonsensical strings or other strange characters in a table of numbers likely represents a data entry error. True outliers, however, are not simple measurement errors. Rather, they signify “rarely if ever happens” kinds of values, where the cause may not be obvious. A form of outlier detection called anomaly detection is used in applications like identifying fraudulent transactions or fake product listings.
In the third step, what exactly can you do after spotting such a bunch of mysterious outlier values? One obvious solution is to remove them and proceed with the remaining, hopefully clean, data. Another is to winsorize the extreme values: choose a threshold, such as the 90th percentile, and cap every value beyond it at the value of that threshold.
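As a rough sketch, SciPy's winsorize helper can do this capping; the limits below are illustrative:

```python
# A minimal sketch: winsorizing extreme values with SciPy.
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1, 2, 3, 4, 5, 100])  # 100 is an extreme value
# Cap the top 20% of values (here, the single largest one) at the
# next-highest value; leave the lower tail untouched.
print(winsorize(x, limits=(0, 0.2)))  # -> [1 2 3 4 5 5]
```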
Yet another way to deal with outliers is to leave them in and pretend they don’t exist. For better or worse, some ML models (tree-based methods, for example) are less affected by the presence of extreme values than others. Very often, the devil is in the details, and it’s hard to prescribe a one-size-fits-all method for finding and dealing with outliers. Clichéd and unsurprising as it sounds, experience is king.
In addition to the data-related issues discussed above, one may sometimes need to standardize, normalize, and transform both tabular and image data, filter on signal-to-noise ratios, and so on. Many of these steps are specific to images, videos, text, or tables. These, alongside their solutions, will be discussed in depth in Part II of this article.