Evaluating Synthetic Data using Machine Learning
Last Updated on January 7, 2023 by Editorial Team
Author(s): Varatharajah Vaseekaran
Originally published on Towards AI.
Adversarial validation to evaluate synthetic data
“Poor accuracy score” is a phrase that might cause nightmares for many data science professionals when building machine learning models for classification problems. However, a poor accuracy score can be a blessing in rare scenarios, especially when performing adversarial validation.
Adversarial validation is conducted to evaluate if two datasets come from the same distribution or not. Generally, it is used to measure data shifts in training and test sets used in a machine learning problem. Similarly, adversarial validation can be used to evaluate the quality of the synthetic data.
Synthetic Data: An Overview
In simple terms, synthetic data are artificially generated data that mathematically and statistically represent real-world data. Synthetic data are created using algorithms such as SMOTE, ADASYN, Variational Autoencoders, and GANs, and can be used as a substitute for real-world data when performing data analysis and building machine learning models.
Synthetic data can support data privacy because it masks sensitive information, which makes it invaluable in the financial and medical sectors. It also saves the cost and human labor needed to gather, process, and label massive datasets, as a small set of well-labeled data can be used to generate large volumes of synthetic data.
Since there are many ways to generate synthetic data, there needs to be a proper evaluation method to measure the quality of synthetic data relative to real-world data. For such evaluation, this article focuses on adversarial validation.
Introduction to Adversarial Validation
Machine learning has many exciting and innovative applications: from detecting cats and dogs to accurately highlighting tumors in MRI images. In this article, we’ll have a look at how machine learning can be used to determine the similarities between two datasets, i.e., adversarial validation.
The theory behind adversarial validation is quite simple: a classification model is trained to distinguish between two datasets, i.e., the train and test sets. A label is created for each row, indicating whether the row is from the train set or not, and the new labels are used as targets for training the model.
In a typical classification problem, high accuracy indicates that the model is performing well. In adversarial validation, the reading is reversed: a lower accuracy is the desirable outcome. With two equally sized groups, an accuracy close to 50% means the model does no better than guessing when asked whether a row comes from the train set, which suggests that the train and test sets have similar distributions. If the accuracy is high, the model separates the two sets easily, so it can be concluded that the train and test sets have different distributions.
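The idea can be sketched on toy data. The snippet below is a minimal illustration, not the setup used later in this article: the helper name adversarial_accuracy, the randomly generated arrays, and the choice of scikit-learn's LogisticRegression as the classifier are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def adversarial_accuracy(set_a, set_b):
    """Cross-validated accuracy of a classifier asked to separate set_a from set_b."""
    X = np.vstack([set_a, set_b])
    # Origin label: 1 for rows of set_a, 0 for rows of set_b.
    y = np.concatenate([np.ones(len(set_a), dtype=int), np.zeros(len(set_b), dtype=int)])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()


rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 4))
test_same = rng.normal(0.0, 1.0, size=(500, 4))     # same distribution as train
test_shifted = rng.normal(1.5, 1.0, size=(500, 4))  # mean shifted on every feature

acc_same = adversarial_accuracy(train, test_same)
acc_shifted = adversarial_accuracy(train, test_shifted)
print(f"same distribution:    {acc_same:.2f}")
print(f"shifted distribution: {acc_shifted:.2f}")
```

When the two sets share a distribution, the score should hover around chance; when one set is shifted, the classifier separates them easily and the score rises well above it.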
Evaluating Synthetic Data with Adversarial Validation
Adversarial validation can be used to determine the quality of synthetic data as well. Instead of using a train and test set, the real data and the synthetic data are used to train a machine-learning model. If the model performs poorly, it indicates that the synthetic data and the real data have similar properties; if the model performs exceptionally well, the real and synthetic data are easy to tell apart and therefore differ noticeably from each other.
For this experiment, synthetic data is generated with SMOTE (Synthetic Minority Over-sampling Technique), a popular oversampling algorithm, and the generated data is combined with the real data to train the model.
The experiment uses the Nearest Earth Objects dataset, a relatively simple dataset comprising details such as the diameter, distance from Earth, and miss distance of asteroids that are verified by NASA. Each asteroid is labeled as hazardous to Earth or not, and only the hazardous asteroids are considered for this experiment.
Initially, the data is loaded, and the unnecessary fields of the data are removed.
Since all the features of the data are numerical, RobustScaler is used to scale the data.
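As a rough illustration of what this scaling does, the sketch below re-implements the default behaviour of scikit-learn's RobustScaler (centering on the median and dividing by the 25th to 75th percentile range) in plain NumPy. The function name robust_scale and the toy feature matrix are assumptions for the example, not part of the original notebook.

```python
import numpy as np


def robust_scale(X):
    """Center each column on its median and divide by its interquartile range."""
    median = np.median(X, axis=0)
    q25, q75 = np.percentile(X, [25, 75], axis=0)
    iqr = q75 - q25
    iqr[iqr == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - median) / iqr


rng = np.random.default_rng(0)
# Hypothetical numeric features on very different scales.
X = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(200, 2))
X[:3, 0] = [500.0, 650.0, 800.0]  # a few extreme values in the first column

X_scaled = robust_scale(X)
print(np.median(X_scaled, axis=0))
```

Because the median and the interquartile range are barely affected by a handful of extreme rows, this kind of scaling is less sensitive to outliers than standardizing with the mean and standard deviation.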
The pre-processed data is used to generate synthetic data. The minority data (that is, the hazardous asteroids) are used to generate synthetic data using SMOTE. After generating synthetic data, the non-hazardous asteroids are removed.
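To make the generation step more concrete, here is a simplified NumPy sketch of the interpolation idea behind SMOTE: each synthetic row is placed at a random point on the segment between a minority row and one of its nearest minority neighbours. It is not the implementation from any SMOTE library, and the function name smote_like, the parameter defaults, and the random stand-in data are assumptions for illustration.

```python
import numpy as np


def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new rows by interpolating minority rows toward near neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is never its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest rows

    base = rng.integers(0, len(X_min), size=n_new)  # row each new point starts from
    pick = rng.integers(0, k, size=n_new)           # which neighbour to move toward
    target = neighbours[base, pick]
    gap = rng.random((n_new, 1))                    # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[target] - X_min[base])


rng = np.random.default_rng(1)
X_hazardous = rng.normal(size=(100, 3))  # hypothetical stand-in for scaled minority rows
X_synth = smote_like(X_hazardous, n_new=250)
print(X_synth.shape)
```

Since every synthetic row lies between two real minority rows, the generated data stays inside the region already covered by the real data, which is one reason a classifier may struggle to tell the two apart.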
Now the hazardous label is removed, and a separate data frame is created, which consists only of the synthetic data.
A new label (is_synth) is created for the real data, which consists only of hazardous asteroids, and the synthetic data. This label indicates whether a particular row of data is synthetic or not. Then, both the synthetic and real data are merged to create the final training data.
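The labeling and merging step might look like the following pandas sketch. Only the is_synth column name comes from the text above; the frame names, the feature column names, the random values, and the final shuffle are assumptions added for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ["feature_a", "feature_b"]  # hypothetical column names
real_df = pd.DataFrame(rng.normal(size=(5, 2)), columns=cols)
synth_df = pd.DataFrame(rng.normal(size=(3, 2)), columns=cols)

# Tag each row with its origin: 0 for real, 1 for synthetic.
real_df["is_synth"] = 0
synth_df["is_synth"] = 1

# Stack both frames, then shuffle so the row order does not reveal the origin.
final_df = (
    pd.concat([real_df, synth_df], ignore_index=True)
    .sample(frac=1.0, random_state=0)
    .reset_index(drop=True)
)
print(final_df["is_synth"].value_counts())
```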
After creating the final data, it is split into train and test sets and scaled, and a classifier is trained on the training set. LightGBM, a powerful gradient boosting library, provides the classification model. Once the model is trained, the test set is used to evaluate its performance.
The model achieves an accuracy of 68.67%. This indicates that the model has some trouble telling synthetic rows from real ones. Therefore, it can be concluded that SMOTE works reasonably well at generating synthetic examples for the hazardous asteroid data.
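When reading an adversarial accuracy, it can help to compare it with the score a trivial model would reach by always predicting the larger class, since that baseline is 50% only when the real and synthetic groups are the same size. The helper below is a small sketch of that check; the function name and the row counts are hypothetical and are not taken from this experiment.

```python
def majority_baseline(n_real, n_synth):
    """Accuracy obtained by always predicting the larger of the two classes."""
    return max(n_real, n_synth) / (n_real + n_synth)


# Hypothetical row counts, not taken from the article's experiment.
balanced = majority_baseline(1000, 1000)
skewed = majority_baseline(1000, 3000)
print(balanced, skewed)
```

The closer the measured accuracy sits to this baseline, the harder it is for the model to separate the two groups.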
The repository for the workings of this article can be found here.
In data-centric AI development, synthetic data is of utmost importance. There are many tools and libraries available to generate synthetic data. However, evaluating the quality of the generated data can be difficult, and adversarial validation offers one way to do it.
Adversarial validation is generally performed to evaluate the data shift between the training data and the data at inference. This article provides a practical implementation of using adversarial validation to determine the quality of the synthetic data with respect to the real data.
We build a machine-learning classification model using synthetic data and real data. New labels are created for the data, stating whether each row is synthetic or not. During evaluation, if the model performs well (a high score), it can clearly separate the real and synthetic data, so the synthetic data differs from the real data. If the model has a poor score, it can be concluded that the synthetic data and the real data are similar.
I hope you have learned a relatively simple method to evaluate synthetic data. I hope you enjoyed the article, and I would love to hear your feedback about this article, as it would help me to improve. Cheers!