Evaluating Synthetic Data using Machine Learning
Last Updated on January 7, 2023 by Editorial Team
Author(s): Varatharajah Vaseekaran
Originally published on Towards AI.
Adversarial validation to evaluate synthetic data
"Poor accuracy score" is a phrase that might cause nightmares for many data science professionals when building machine learning models for classification problems. However, a poor accuracy score can be a blessing in rare scenarios, especially when performing adversarial validation.
Adversarial validation evaluates whether two datasets come from the same distribution. Generally, it is used to measure data shift between the training and test sets of a machine learning problem. In the same way, adversarial validation can be used to evaluate the quality of synthetic data.
Synthetic Data: An Overview
In simple terms, synthetic data are artificially generated data that mathematically and statistically represent real-world data. Synthetic data are created using algorithms (e.g., SMOTE, ADASYN, Variational Autoencoders, GANs, etc.) and can be used as a substitute for real-world data when performing data analysis and building machine learning models.
Synthetic data helps protect data privacy because it masks sensitive information, which makes it invaluable in the financial and medical sectors. Synthetic data also saves the cost and human labor needed to gather, process, and label massive datasets, since a small set of well-labeled examples can be used to generate large volumes of synthetic data.
Since there are many ways to generate synthetic data, there needs to be a proper evaluation method to measure the quality of synthetic data relative to real-world data. For such evaluation, this article focuses on adversarial validation.
You can get more insights on synthetic data and on how to generate synthetic data using powerful and open-sourced GAN implementations by referring to this article:
GANs for Synthetic Data Generation
Introduction to Adversarial Validation
Machine learning has many exciting and innovative applications: from detecting cats and dogs to accurately highlighting tumors in MRI images. In this article, we'll have a look at how machine learning can be used to determine the similarities between two datasets, i.e., adversarial validation.
The theory behind adversarial validation is quite simple: a classification model is trained to distinguish between two datasets, i.e., the train and test sets. A label is created for each row, indicating whether the row comes from the train set or not, and these new labels are used as the targets for training the model.
In a general classification problem, high accuracy indicates the model is performing well. But for adversarial validation, a lower accuracy score, one close to chance level (about 50% when the two sets are the same size), is the desirable outcome. A low accuracy score means that the model has trouble distinguishing between the two classes of data (from the training set or not), which shows that the distributions of the train set and the test set are similar to each other. If high accuracy is obtained, the model has no trouble distinguishing between the train and the test set, and it can be concluded that the two sets have different distributions.
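The idea can be illustrated with a minimal sketch on randomly generated data. Everything here is an assumption made for illustration: the helper name adversarial_accuracy, the choice of logistic regression, and the toy Gaussian datasets are not taken from the article's experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def adversarial_accuracy(set_a, set_b):
    """Mean cross-validated accuracy of a classifier that tries to tell set_a from set_b."""
    X = np.vstack([set_a, set_b])
    # Origin label: 0 for rows from set_a, 1 for rows from set_b
    y = np.concatenate([
        np.zeros(len(set_a), dtype=int),
        np.ones(len(set_b), dtype=int),
    ])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()


rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
test_same = rng.normal(loc=0.0, scale=1.0, size=(500, 3))     # same distribution as train
test_shifted = rng.normal(loc=2.0, scale=1.0, size=(500, 3))  # mean shifted on every feature

acc_same = adversarial_accuracy(train, test_same)
acc_shifted = adversarial_accuracy(train, test_shifted)
print(f"same distribution:    {acc_same:.2f}")
print(f"shifted distribution: {acc_shifted:.2f}")
```

With equally sized sets, the first score should sit near chance level and the second should be much higher, which is the pattern adversarial validation relies on.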
Evaluating Synthetic Data with Adversarial Validation
Adversarial validation can be used to determine the quality of synthetic data as well. Instead of using a train and test set, the real data and the synthetic data are used to train a machine-learning model. If the model performs poorly, it indicates that the synthetic data and the real data have similar properties. If the model performs exceptionally well, it shows that the synthetic data is easy to tell apart from the real data, that is, the two differ substantially.
For this experiment, synthetic data is generated with SMOTE (Synthetic Minority Over-sampling Technique), a popular algorithm for this purpose, and the generated data is used together with the real data to train the model.
The Nearest Earth Objects dataset is used to train the model. It is a relatively simple dataset comprising details such as the diameter, distance from Earth, and miss distance of asteroids that are verified by NASA. Each asteroid is labeled according to whether it is hazardous to Earth, and only the hazardous asteroids are considered for this experiment.
Initially, the data is loaded, and the unnecessary fields of the data are removed.
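A minimal sketch of this clean-up step is shown below, using a tiny in-memory table in place of the real file. The column names and values are hypothetical stand-ins, and which fields count as unnecessary is an assumption; the article does not list them.

```python
import pandas as pd

# Toy stand-in for the asteroid table; column names and values are illustrative only
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["A", "B", "C", "D"],
    "est_diameter_max": [0.3, 1.2, 0.8, 0.1],
    "relative_velocity": [48000.0, 73000.0, 61000.0, 25000.0],
    "miss_distance": [5.5e7, 6.1e7, 3.2e7, 4.3e7],
    "hazardous": [False, True, True, False],
})

# Drop identifier-style fields that describe the row rather than the asteroid's properties
features = raw.drop(columns=["id", "name"])

# Encode the boolean target as 0/1 so every remaining column is numerical
features["hazardous"] = features["hazardous"].astype(int)
print(features)
```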
Since all the features of the data are numerical, RobustScaler is used to scale the data.
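As a small illustration of what this scaling does, the sketch below applies scikit-learn's RobustScaler with its default settings to a toy matrix and reproduces the result by hand. The toy values are made up; the point is that the median and interquartile range are barely affected by the single extreme value.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature matrix with one extreme value in the first column
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [100.0, 50.0],
])

# Defaults: subtract the median, divide by the interquartile range (25th to 75th percentile)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# The same transform written out by hand, column by column
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
X_manual = (X - median) / (q75 - q25)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```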
SMOTE is then applied to the pre-processed data, with the hazardous asteroids as the minority class, to generate synthetic hazardous examples. After the synthetic data is generated, the non-hazardous asteroids are removed.
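To make the generation step more concrete, here is a simplified sketch of the interpolation idea behind SMOTE: each synthetic point is placed at a random position on the line between a minority row and one of its nearest minority neighbors. This is not the library implementation used for the experiment; the function smote_like, its parameters, and the random stand-in data are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new points by interpolating between minority rows and their neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each row's nearest neighbor is itself
    # (assumes the minority rows contain no exact duplicates)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, idx = nn.kneighbors(minority)
    neighbors = idx[:, 1:]  # drop the self-match in column 0

    base = rng.integers(0, len(minority), size=n_new)  # random starting rows
    pick = rng.integers(0, k, size=n_new)              # which neighbor to move toward
    partner = neighbors[base, pick]
    gap = rng.random((n_new, 1))                       # interpolation fraction in [0, 1)
    return minority[base] + gap * (minority[partner] - minority[base])


rng = np.random.default_rng(42)
real_hazardous = rng.normal(size=(200, 4))  # stand-in for the scaled hazardous rows
synthetic = smote_like(real_hazardous, n_new=200)
print(synthetic.shape)
```

Because every synthetic point is a blend of two real minority rows, it never falls outside the range of the real data on any feature, which is one reason a classifier may find SMOTE output hard to separate from the original rows.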
Now the hazardous label is removed, and a separate data frame is created, which consists only of the synthetic data.
A new label (is_synth) is added to both the real data, which now consists only of hazardous asteroids, and the synthetic data. This label indicates whether a particular row is synthetic or not. Then, the synthetic and real data are merged to create the final training data.
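The labeling and merging step could look roughly like the following in pandas. The two small frames are random stand-ins for the real and synthetic data, and the feature names are hypothetical; only the is_synth label comes from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = ["est_diameter_max", "relative_velocity", "miss_distance"]  # hypothetical feature names

# Stand-ins for the two frames described above
real_df = pd.DataFrame(rng.normal(size=(5, 3)), columns=cols)
synth_df = pd.DataFrame(rng.normal(size=(5, 3)), columns=cols)

real_df["is_synth"] = 0   # real hazardous asteroids
synth_df["is_synth"] = 1  # rows generated by SMOTE

# Stack both frames and shuffle so the origin label is not encoded in the row order
final_df = (
    pd.concat([real_df, synth_df], ignore_index=True)
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)
print(final_df["is_synth"].value_counts())
```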
After the final data is created, it is split into train and test sets and scaled, and a classifier is trained on the training set. LightGBM, a powerful gradient boosting library, provides the classification model. Once the model is trained, the test set is used to evaluate its performance.
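The split, scale, and train steps might be sketched as follows. Note two assumptions: scikit-learn's GradientBoostingClassifier stands in for LightGBM so the sketch has a single dependency (LightGBM's scikit-learn wrapper, LGBMClassifier, should slot into the same pipeline), and the data is randomly generated with both classes drawn from the same distribution, so the printed accuracy says nothing about the asteroid data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
# Stand-in data: "synthetic" rows drawn from the same distribution as the "real" rows
real = rng.normal(size=(1000, 4))
synth = rng.normal(size=(1000, 4))
X = np.vstack([real, synth])
is_synth = np.concatenate([
    np.zeros(len(real), dtype=int),
    np.ones(len(synth), dtype=int),
])

# Stratify so both splits keep the same share of real and synthetic rows
X_train, X_test, y_train, y_test = train_test_split(
    X, is_synth, test_size=0.2, stratify=is_synth, random_state=42
)

# The pipeline fits the scaler on the training split only, then fits the classifier
model = make_pipeline(RobustScaler(), GradientBoostingClassifier(random_state=42))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"adversarial accuracy: {accuracy:.4f}")
```

Wrapping the scaler and the classifier in one pipeline keeps test-set statistics out of the scaling step, so the reported accuracy reflects only what the model learned from the training split.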
The model achieves an accuracy of 68.67%. This indicates that the model has some trouble telling synthetic rows from real ones. Therefore, it can be concluded that the SMOTE algorithm works well for generating synthetic examples of hazardous asteroid data.
The repository for the workings of this article can be found here.
Final Words
In today's data-centric AI development, synthetic data is of utmost importance. There are many tools and libraries available to generate synthetic data. However, evaluating the quality of the generated data can be problematic, and adversarial validation offers one way to address this.
Adversarial validation is generally performed to evaluate the data shift between the training data and the data at inference. This article provides a practical implementation of using adversarial validation to determine the quality of the synthetic data with respect to the real data.
We build a machine-learning classification model using synthetic data and real data. New labels are created for the data, stating whether a particular row is synthetic or not. During the evaluation, if the model performs well (having a high score), it means that the model can clearly separate the real and synthetic data, and therefore the synthetic data differs from the real data. If the model has a poor score, it can be concluded that the synthetic data and the real data are similar.
I hope you have learned a relatively simple method to evaluate synthetic data. I hope you enjoyed the article, and I would love to hear your feedback about this article, as it would help me to improve. Cheers!