Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Unlock the full potential of AI with Building LLMs for Productionβ€”our 470+ page guide to mastering LLMs with practical projects and expert insights!

Publication

Innovations in Analytics: Elevating Data Quality with GenAI
Data Engineering   Latest   Machine Learning

Innovations in Analytics: Elevating Data Quality with GenAI

Last Updated on October 31, 2024 by Editorial Team

Author(s): Jonas Dieckmann

Originally published on Towards AI.

Data analytics has become a key driver of commercial success in recent years. The ability to turn large data sets into actionable insights can mean the difference between a successful campaign and missed opportunities. However, data quality is still a major challenge: if the data that is fed into a model lacks quality/consistency, the resulting output will also be of low quality. This is well exemplified by the popular saying β€œgarbage-in, garbage-out”.

Image Credits: Pixabay

Although AI is often in the spotlight, the focus on strong data foundations and effective data strategies is often overlooked. In this article, we’ll explore how AI can directly improve these foundations through:

  1. Automating data harmonization
  2. Dynamic labeling and classification
  3. Generating synthetic data

Rather than dealing with flawed data, we’re using GenAI to enhance data quality from the start. This approach also sets the stage for more effective AI applications later on.

The rise of (Generative) AI

Many industries are undergoing significant changes thanks to AI technologies. In marketing, for example, AI helps organizations extract actionable insights from vast data sets, leading to targeted campaigns and better customer engagement. According to Gartner’s Hype Cycle, GenAI is at the peak, showcasing its potential to transform analytics.ΒΉ

Hype Cycle for Emerging Technologies 2023 (source: Gartner)

Despite AI’s potential, the quality of input data remains crucial. Inaccurate or incomplete data can distort results and undermine AI-driven initiatives, emphasizing the need for clean data. For marketers and digital innovators, handling inconsistent data from various sources can be a major barrier to unlocking AI’s full potential.

Flipping the paradigm: Using AI to enhance data quality

What if we could change the way we think about data quality? Instead of seeing it as a prerequisite for using AI, we could use AI to improve data quality itself. By leveraging GenAI, we can streamline and automate data-cleaning processes:

Clean data to use AI? Clean data through GenAI!

Three ways to use GenAI for better data

Improving data quality can make it easier to apply machine learning and AI to analytics projects and answer business questions. Here are three ways to use ChatGPTΒ² to enhance data foundations:

#1 Harmonize: Making data cleaner through AI

A core challenge in analytics is maintaining data quality and integrity. Algorithms can automatically clean and preprocess data using techniques like outlier and anomaly detection. GenAI can now assist in direct data mapping and cleaning by identifying and fixing inconsistencies.

For example, a healthcare organization aggregating market data from different sources might face issues with varying naming conventions.

Example prompt use case #1. Image by author

GenAI can automatically detect and correct these discrepancies, resulting in a clean and reliable mapping dataset. This not only saves analysts significant time on manual data checks but also obviates the need for complex regular expressions with traditional methods.

GPT-4o mini response use case #1. Image by author

#2 Label: Enabling the use of previously unusable data

Organizations often have large amounts of data that are unused due to low quality or lack of labeling. GenAI can help by automatically clustering similar data points and inferring labels from unlabeled data, obtaining valuable insights from previously unusable sources.

Natural Language Processing (NLP) is an example of where traditional methods can struggle with complex text data. For instance, extracting numeric details from clinical sector articles can be misleading if the numbers don’t refer to actual quantities. GenAI prompts can address such challenges effectively.

Example prompt use case #2. Image by author

The result is straightforward but accurate in this case. Numeric extraction is just one example of how labeling can be powerful. Clearly, GenAI is a strong tool for extracting precise details or classifications from text data.

GPT-4o mini response use case #2. Image by author

#3 Generate: Use of LLMs to generate sample data

GenAI can also generate synthetic data to train AI models. Large Language Models (LLMs) can produce realistic sample data, helping address data scarcity in fields where data availability is limited.

For example, a pharmaceutical company developing a drug for a niche market can use LLMs to create synthetic patient profiles, medical histories, and treatment outcomes. This approach not only enhances data diversity but also alleviates privacy concerns related to sensitive patient data.

Example prompt use case #3. Image by author

This approach not only increases data diversity but also addresses privacy concerns related to sharing sensitive patient information. It can also be extended to other applications, such as targeting audiences for marketing campaigns, creating examples for fraud detection, and more.

GPT-4o mini response use case #3. Image by author

Automating Data Quality Enhancement via APIs

To fully harness GenAI’s potential for improving data quality, it’s crucial to integrate this technology in an automated and seamless way. Manually copying datasets into prompts and processing responses is not practical.

Using APIs, like ChatGPT’s API, within your coding environment can streamline this process by incorporating AI-driven data quality enhancements directly into your workflows. For guidance on using OpenAI’s API with Colab or Databricks, you can refer to my other article. The results from these automated requests can be directly written back to your data storage.

Example processing flow: utilizing databricks to communicate with APIs to improve data. Image by author

Automated harmonization, labeling, and data generation

By establishing data pipelines, organizations can utilize GenAI as new data enters their systems. For instance, when new datasets come in, the API can automatically apply data harmonization algorithms or identify patterns to infer labels. This removes the need for manual data cleaning and preprocessing, freeing up data engineers to concentrate on more valuable tasks. Although GenAI shows great promise, it’s important to recognize data privacy issues with public APIs.

Integrating the API into your data pipeline enables you to generate diverse and realistic datasets directly in your training notebook. The API can also create synthetic data to fill gaps in existing datasets, supporting more robust AI model development. This automated data generation not only speeds up research but also minimizes privacy concerns.

Conclusion

Integrating GenAI APIs into data quality workflows offers a powerful way to automate data cleaning, labeling, and generation. This seamless integration helps organizations fully leverage GenAI’s capabilities without manual intervention, making data management more efficient and improving overall data quality.

In summary, the intersection of AI and data quality marks a significant turning point in analytics. GenAI’s ability to boost data quality and provide actionable insights has the potential to transform the industry. By rethinking traditional approaches and using AI to enhance data, organizations can unlock new opportunities for innovation and growth. As we move forward, it’s clear that the future of analytics will be shaped by those who embrace the power of GenAI.

Jonas Dieckmann – Medium

Read writing from Jonas Dieckmann on Medium. team lead @ philips | passionate about data science, agile work & digital…

medium.com

I hope you find it useful. Let me know your thoughts! And feel free to connect on LinkedIn https://www.linkedin.com/in/jonas-dieckmann/ and/or to follow me here on Medium.

References

[1] Gartner (2023): Hype Cycle for Emerging Technologies
https://www.gartner.com/en/newsroom/press-releases/2023-08-16-gartner-places-generative-ai-on-the-peak-of-inflated-expectations-on-the-2023-hype-cycle-for-emerging-technologies

[2] OpenAI β€” ChatGPT: https://chatgpt.com/

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.

Published via Towards AI

Feedback ↓