Evaluating Synthetic Data Using Machine Learning

Last Updated on July 25, 2023 by Editorial Team

Author(s): Varatharajah Vaseekaran

Originally published on Towards AI.

Adversarial validation to evaluate synthetic data

“Poor accuracy score” is a phrase that might cause nightmares for many data science professionals when building machine learning models for classification problems. However, a poor accuracy score can be a blessing in rare scenarios, especially when performing adversarial validation.

Adversarial validation checks whether two datasets come from the same distribution. It is most often used to measure data shift between the training and test sets of a machine learning problem. The same technique can be used to evaluate the quality of synthetic data.

Synthetic Data: An Overview

In simple terms, synthetic data are artificially generated data that mathematically and statistically represent real-world data. Synthetic data are created using algorithms (e.g., SMOTE, ADASYN, Variational Autoencoders, GANs, etc.) and can be used as a substitute for real-world data when performing data analysis and building machine learning models.

Synthetic data helps protect data privacy because it masks sensitive information, which makes it invaluable in the financial and medical sectors. It also reduces the cost and human labor needed to gather, process, and label massive datasets, since a small set of well-labeled samples can be used to generate large volumes of synthetic data.

Since there are many ways to generate synthetic data, there needs to be a proper evaluation method to measure the quality of synthetic data relative to real-world data. For such evaluation, this article focuses on adversarial validation.

You can get more insights on synthetic data and on how to generate synthetic data using powerful and open-sourced GAN implementations by referring to this article:

GANs for Synthetic Data Generation

A practical guide to generating synthetic data using open-sourced GAN implementations.

pub.towardsai.net

Introduction to Adversarial Validation

Machine learning has many exciting and innovative applications: from detecting cats and dogs to accurately highlighting tumors in MRI images. In this article, we’ll have a look at how machine learning can be used to determine the similarities between two datasets, i.e., adversarial validation.

The theory behind adversarial validation is quite simple: a classification model is trained to distinguish between two datasets, i.e., the train and test sets. Each row is given a label indicating whether it comes from the train set or not, and these new labels are used as targets for training the model.

In a general classification problem, high accuracy indicates that the model is performing well. In adversarial validation, the reading is reversed: a lower accuracy score is the better outcome. An accuracy close to chance level (about 50% when the two classes are the same size) means that the model struggles to tell the two classes apart (from the train set or not), which shows that the train and test sets have similar distributions. If high accuracy is obtained, the model has no trouble distinguishing between the train and test sets, so it can be concluded that the two sets have different distributions.
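The idea can be sketched in a few lines, assuming NumPy and scikit-learn are available. The helper name `adversarial_accuracy`, the logistic-regression classifier, and the toy Gaussian samples are illustrative assumptions and are not taken from the original experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def adversarial_accuracy(set_a, set_b):
    """Mean cross-validated accuracy of a classifier asked to tell set_a from set_b."""
    X = np.vstack([set_a, set_b])
    # Label 1 for rows that come from set_a, 0 for rows that come from set_b.
    y = np.concatenate([np.ones(len(set_a)), np.zeros(len(set_b))])
    clf = LogisticRegression(max_iter=1000)
    # For classifiers, an integer cv uses stratified folds, so every fold
    # contains rows from both sets.
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()


rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
test_same = rng.normal(loc=0.0, scale=1.0, size=(500, 3))     # same distribution
test_shifted = rng.normal(loc=3.0, scale=1.0, size=(500, 3))  # shifted distribution

same_acc = adversarial_accuracy(train, test_same)
shifted_acc = adversarial_accuracy(train, test_shifted)
print(f"same distribution:    {same_acc:.2f}")
print(f"shifted distribution: {shifted_acc:.2f}")
```

With equally sized sets, the first score should sit near chance level and the second close to 1, which is the pattern described above.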

Evaluating Synthetic Data with Adversarial Validation

Adversarial validation can be used to determine the quality of synthetic data as well. Instead of a train and a test set, the real data and the synthetic data are used to train a machine-learning model. If the model performs poorly, the synthetic data and the real data have similar properties. If the model performs exceptionally well, the real data and the synthetic data are clearly distinguishable from each other.

For this experiment, synthetic data is generated with SMOTE (Synthetic Minority Over-sampling Technique), a popular oversampling algorithm, and the generated data is used together with the real data to train the model.

The Nearest Earth Objects dataset is used to train the model. It is a relatively simple dataset comprising details such as the diameter, distance from Earth, and miss distance of asteroids that are verified by NASA. Each asteroid is labeled as hazardous to Earth or not, and only the hazardous asteroids are considered for this experiment.

Initially, the data is loaded, and the unnecessary fields of the data are removed.
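A rough sketch of this step with pandas follows. The column names are assumptions about the dataset's layout, and a tiny hand-made frame with made-up values stands in for the real file, which would normally be read with `pd.read_csv`.

```python
import pandas as pd

# Hand-made stand-in for the real dataset. Every column name and value
# here is an assumption used for illustration only.
raw = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "name": ["A", "B", "C", "D"],
        "est_diameter_min": [0.1, 0.3, 0.05, 1.2],
        "est_diameter_max": [0.2, 0.6, 0.1, 2.5],
        "miss_distance": [5.4e7, 2.1e7, 6.8e7, 1.2e7],
        "orbiting_body": ["Earth"] * 4,
        "hazardous": [False, True, False, True],
    }
)

# Identifier and constant columns carry no signal for a classifier.
drop_cols = ["id", "name", "orbiting_body"]
data = raw.drop(columns=drop_cols)

# Everything except the target should now be numeric.
feature_cols = [c for c in data.columns if c != "hazardous"]
print(data[feature_cols].dtypes)
```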

Since all the features of the data are numerical, RobustScaler is used to scale the data.
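As a small illustration, scikit-learn's `RobustScaler` centers each column on its median and divides by its interquartile range, so a single extreme row has little influence on how the other rows are scaled. The feature matrix below is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up feature matrix; the last row is an extreme outlier in both columns.
X = np.array(
    [
        [1.0, 10.0],
        [2.0, 20.0],
        [3.0, 30.0],
        [4.0, 40.0],
        [100.0, 5000.0],
    ]
)

scaler = RobustScaler()  # default: median centering, 25th-75th percentile range
X_scaled = scaler.fit_transform(X)

print("medians used for centering:", scaler.center_)
print("interquartile ranges used for scaling:", scaler.scale_)
print(X_scaled)
```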

The pre-processed data is used to generate synthetic data. The minority data (that is, the hazardous asteroids) are used to generate synthetic data using SMOTE. After generating synthetic data, the non-hazardous asteroids are removed.
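In practice, a library implementation would normally be used here, for example `SMOTE().fit_resample(X, y)` from imbalanced-learn. The sketch below is not that implementation. It is a simplified NumPy version of the interpolation idea behind SMOTE, written only to show how new minority rows are formed. The function name `smote_like` and the toy data are assumptions.

```python
import numpy as np


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new rows by interpolating between minority rows and their neighbours.

    Simplified illustration of the SMOTE idea. It builds a full pairwise
    distance matrix, so it is only suitable for small demos.
    """
    rng = np.random.default_rng(seed)
    n = len(minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between all minority rows.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # row to start from
    pick = rng.integers(0, k, size=n_new)  # which of its k neighbours to move toward
    partner = neighbours[base, pick]
    gap = rng.random((n_new, 1))           # position along the connecting segment
    return minority[base] + gap * (minority[partner] - minority[base])


rng = np.random.default_rng(42)
hazardous = rng.normal(size=(50, 4))  # toy stand-in for the scaled hazardous rows
synthetic = smote_like(hazardous, n_new=200)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two real rows, it always stays inside the per-feature range of the real data.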

Now the hazardous label is removed, and a separate data frame is created, which consists only of the synthetic data.

A new label (is_synth) is created for the real data, which consists only of hazardous asteroids, and the synthetic data. This label indicates whether a particular row of data is synthetic or not. Then, both the synthetic and real data are merged to create the final training data.
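The labeling and merging step might look like the following with pandas. The `is_synth` label comes from the text above. The feature names and the handful of made-up rows are assumptions for illustration.

```python
import pandas as pd

feature_cols = ["est_diameter_min", "miss_distance"]  # assumed feature names

# Made-up rows standing in for the real hazardous and the synthetic data.
real_hazardous = pd.DataFrame(
    [[0.3, 2.1e7], [1.2, 1.2e7], [0.8, 3.0e7]], columns=feature_cols
)
synthetic = pd.DataFrame(
    [[0.7, 1.7e7], [1.0, 2.2e7]], columns=feature_cols
)

# 0 marks a real row, 1 marks a synthetic row.
real_labeled = real_hazardous.assign(is_synth=0)
synth_labeled = synthetic.assign(is_synth=1)

combined = pd.concat([real_labeled, synth_labeled], ignore_index=True)
# Shuffle so that real and synthetic rows are interleaved.
combined = combined.sample(frac=1.0, random_state=0).reset_index(drop=True)
print(combined)
```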

After creating the final data, the data is split into train and test sets and scaled, and then a classifier is trained on the training set. LightGBM, a powerful gradient boosting library, is selected as the classification model. Once the model is trained, the test set is used to evaluate its performance.
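A minimal end-to-end sketch of this step follows, under several assumptions. To keep dependencies small, scikit-learn's `GradientBoostingClassifier` is swapped in for LightGBM (`lightgbm.LGBMClassifier` exposes the same `fit`/`predict` interface). The "real" rows are random numbers, and the "synthetic" rows come from a crude interpolation between random pairs of real rows rather than from SMOTE itself.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Toy "real" rows and crude "synthetic" rows (interpolation between random
# pairs of real rows, a simplification of SMOTE).
real = rng.normal(size=(400, 4))
i, j = rng.integers(0, len(real), size=(2, 400))
synth = real[i] + rng.random((400, 1)) * (real[j] - real[i])

X = np.vstack([real, synth])
# is_synth target: 0 for real rows, 1 for synthetic rows.
y = np.concatenate([np.zeros(len(real), dtype=int), np.ones(len(synth), dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = RobustScaler().fit(X_train)
clf = GradientBoostingClassifier(random_state=0)
clf.fit(scaler.transform(X_train), y_train)

accuracy = accuracy_score(y_test, clf.predict(scaler.transform(X_test)))
print(f"adversarial accuracy: {accuracy:.2%}")
```

The closer the printed accuracy is to chance level, the harder it is for the classifier to separate the two sources.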

The model achieves an accuracy of 68.67%. This indicates that the model has some trouble telling synthetic data from real data. Therefore, it can be concluded that SMOTE works reasonably well in generating synthetic examples for the hazardous asteroid data.

The repository for the workings of this article can be found here.

Final Words

In today's data-centric AI development, synthetic data is of utmost importance. There are many tools and libraries available to generate synthetic data. However, evaluating the quality of the generated data can be difficult, and adversarial validation offers one way to do it.

Adversarial validation is generally performed to evaluate the data shift between the training data and the data at inference. This article provides a practical implementation of using adversarial validation to determine the quality of the synthetic data with respect to the real data.

We build a machine-learning classification model using synthetic data and real data. New labels are created for the data, stating whether a particular row is synthetic or not. During evaluation, if the model performs well (a high score), it can clearly separate the real and synthetic data, so the synthetic data differs from the real data. If the model scores poorly, it can be concluded that the synthetic data and the real data are similar.

I hope you have learned a relatively simple method to evaluate synthetic data. I hope you enjoyed the article, and I would love to hear your feedback about this article, as it would help me to improve. Cheers!
