When Scripts Aren’t Enough: Building Sustainable Enterprise Data Quality

Author(s): Richie Bachala

Originally published on Towards AI.

Beyond Scale: Data Quality for AI Infrastructure

The trajectory of AI over the past decade has been driven largely by the scale of data available for training and the ability to process it with increasingly powerful compute & experimental models. From speech recognition breakthroughs to large-scale language models, the story of AI is fundamentally a story of data.

The Scaling Hypothesis: Bigger Data, Better AI?

I’ll say it again — the story of artificial intelligence over the past decade is fundamentally a story about data.

What began as a series of experiments in speech recognition has evolved into an understanding of how AI systems learn and grow.

The key insight? Scale matters, but quality matters more.

Early AI researchers discovered something remarkable: when you feed neural networks more data, they continue to improve. This wasn’t just true for speech recognition — it held across language processing, computer vision, and even mathematical reasoning. This observation led to what we now call the Scaling Hypothesis.

Think of it as a perfectly balanced chemical reaction. Three critical ingredients must scale together:

Larger neural networks

More training data

Increased computing power

If any one ingredient falls short while the others grow, progress stalls. But when the three are scaled in harmony, model performance keeps improving predictably, following a power law rather than leveling off.

  • Analogous to a chemical reaction where all reagents must scale proportionally
  • Scaling laws observed across multiple domains (language, images, video, math)

Source: Kaplan et al. (2020) — “Scaling Laws for Neural Language Models”

https://arxiv.org/pdf/2001.08361

  • An influential early study documenting empirical scaling laws for language models
  • Published by OpenAI
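
As a rough sketch of what those laws look like, the single-variable trends reported in the paper take a power-law form. The notation is an assumption based on the paper's conventions: N is the number of non-embedding parameters, D the dataset size in tokens, C_min the compute at the compute-efficient frontier, and L the test loss. The constants N_c, D_c, C_c^min and the exponents are fitted values that should be taken from the paper itself rather than from this summary.

```latex
% Single-variable scaling trends (each holds when the other two factors are not the bottleneck)
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}

% Combined dependence on model size and dataset size
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
```

In the combined form, growing N while D stays fixed leaves the D_c/D term as a floor on the loss, which is the stalling behavior described above. The fitted exponents are small (on the order of 0.05 to 0.1), so every constant-factor reduction in loss requires a multiplicative increase in scale.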

The Data Quality Conundrum

Not all data is created equal. The internet may offer trillions of words, but much of it is:

  • Repetitive content
  • SEO-optimized fluff
  • AI-generated text
  • Low-value information

This has led to concerns about whether AI will eventually run out of useful training data, a question I often hear put to AI product leads on podcasts.
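
To make the "repetitive content" problem concrete, here is a minimal sketch of exact-duplicate filtering after light normalization. It is an illustration only: the function names and sample documents are hypothetical, and real training pipelines typically add near-duplicate detection and quality scoring on top of something like this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so trivial variants match."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def dedupe(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized content hashes."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample: the first two entries differ only in case, spacing, and punctuation.
docs = [
    "Data quality matters.",
    "data   quality matters!",
    "Scale matters, but quality matters more.",
]
print(dedupe(docs))
```

Hashing normalized text only catches copies that differ superficially. It says nothing about whether the surviving text is low-value or machine-generated, which is the harder part of the problem.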

Data Quality First

Source: Reddit

In my opinion —

  • Prioritize data quality before AI model selection
  • Ensure data readiness for intended use cases
  • Validate data compatibility with chosen AI solutions

Check out this fun Reddit thread to see why. It cuts across many industries: banks, AdTech, manufacturing, Big Tech, data sellers, and more.

Source: https://old.reddit.com/r/dataengineering/comments/1f5lqfi/how_serious_is_your_org_about_data_quality/

Role of High-Quality Data

One potential solution I hear from almost everyone is synthetic data generation. Models can be trained to generate high-quality data from scratch, effectively creating new learning materials that reflect the underlying distribution of real-world knowledge. This approach has been successful in domains like chess, where AI agents learn by playing against themselves and train without human-generated data or usage, observability, or player log data.

Another promising approach is reinforcement learning and reasoning models, which allow AI to improve by reflecting on its own thought processes. This method not only expands the available training data but also enhances model efficiency and problem-solving abilities.

I’ve been a data engineering guy for the last decade, so my first response to bad data is a technical one: more cleaning scripts, better validation rules, improved monitoring dashboards.

The Technical Reflex

Picture this common scenario: Bad data appears in a report. Our immediate response?

  • Write a cleaning script
  • Add validation rules
  • Create monitoring alerts
  • Build data quality dashboards

Like these:

Figure: The process

The DQ dimensions that I’ve been prioritizing:

Figure: DQ dimensions, the 9 that are important for my enterprise
Figure: DQ standard dashboard for sharing updates to show progress. Image by author, for illustrative purposes.
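
As an illustration of the "validation rules" reflex, the sketch below scores a small batch of records against a few rules and flags the ones that fall under a threshold, the kind of number a DQ dashboard would display. Everything here is assumed for the example: the record fields, the rule names, the three dimensions shown (completeness, validity, uniqueness), and the 0.7 threshold. The nine dimensions in the figure above are not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict[str, Optional[str]]


@dataclass
class Rule:
    name: str
    dimension: str  # label used to group rules on a dashboard
    check: Callable[[Record], bool]


def email_present(record: Record) -> bool:
    """Completeness: the email field exists and is not empty."""
    return bool(record.get("email"))


def country_is_iso2(record: Record) -> bool:
    """Validity: country looks like a two-letter uppercase code."""
    country = record.get("country")
    return country is not None and len(country) == 2 and country.isalpha() and country.isupper()


def pass_rates(records: list[Record], rules: list[Rule]) -> dict[str, float]:
    """Share of records passing each rule, keyed by rule name."""
    if not records:
        return {rule.name: 0.0 for rule in rules}
    return {
        rule.name: sum(rule.check(r) for r in records) / len(records)
        for rule in rules
    }


def unique_ratio(records: list[Record], key: str) -> float:
    """Uniqueness: distinct key values divided by record count."""
    if not records:
        return 0.0
    return len({r.get(key) for r in records}) / len(records)


def failing(scores: dict[str, float], threshold: float) -> list[str]:
    """Names of checks whose score is below the threshold."""
    return sorted(name for name, value in scores.items() if value < threshold)


# Hypothetical customer records with typical defects.
records: list[Record] = [
    {"customer_id": "c1", "email": "a@example.com", "country": "US"},
    {"customer_id": "c2", "email": None, "country": "usa"},
    {"customer_id": "c2", "email": "b@example.com", "country": "DE"},
    {"customer_id": "c4", "email": "", "country": "FR"},
]
rules = [
    Rule("email_present", "completeness", email_present),
    Rule("country_is_iso2", "validity", country_is_iso2),
]

scores = pass_rates(records, rules)
scores["customer_id_unique"] = unique_ratio(records, "customer_id")
print(scores)
print("below threshold:", failing(scores, threshold=0.7))
```

A script like this reports that emails are missing. It cannot say why the field was left empty at entry time.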

The Enterprise Reality Check

Here’s the truth that I’ve learned the hard way: The best technical solution can’t fix a process problem.

Consider these common scenarios:

  • A perfect validation script can’t fix inconsistent data entry practices
  • The most robust ETL pipeline can’t resolve disagreements about business rules
  • Real-time quality monitoring can’t replace clear data ownership

Path to Maturity

The path to maturity in data engineering often looks like this:

  1. Junior: “I’ll fix it with code”
  2. Mid-level: “I’ll build a system to prevent it”
  3. Senior: “Let’s understand why this happens”
  4. Lead: “We need to change how we work”
Image by Author

The best technical solution can’t fix a broken process.

Why Technical Band-Aids Fail

These solutions work… until they don’t. And here’s why:

They treat symptoms, not causes

  • Clean data today, same issues tomorrow
  • Growing maze of scripts and rules

They miss the human element

  • Data entry remains error-prone
  • Business processes stay broken
  • Communication gaps persist
  • No ownership of quality

The Limits of Data: Are We Nearing a Ceiling?

A pressing question in AI development is whether we will hit a ceiling due to data limitations. Some argue that while scaling has driven progress so far, we may eventually exhaust high-quality training data, leading to diminishing returns. Others believe that innovations in reasoning models, reinforcement learning, and self-supervised learning will continue pushing the boundaries of AI capabilities.

Another challenge is data integration and consistency. Large AI models require data that is well-structured, diverse, and representative of real-world complexity. Many enterprises, including ours, struggle with fragmented data sources, inconsistencies, and a lack of proper governance, which hinder AI implementation, let alone performance.

Additionally, the computing costs associated with handling vast amounts of data remain a significant factor. AI model training requires extensive computational resources, with companies investing billions in AI clusters. Managing these costs efficiently is crucial to sustaining AI advancements.

Contextual Data Integration

  • Combine different data sources meaningfully
  • Focus on creating value through data relationships
  • Maintain privacy while leveraging data connections
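
One way to read the last bullet is to join sources on a keyed hash of the identifier instead of the raw value. The sketch below assumes two hypothetical sources (a CRM export and a support system) that both key on email; the secret, field names, and helper functions are illustrative. Keyed hashing is pseudonymization, not anonymization, so it reduces exposure without removing it.

```python
import hashlib
import hmac

# Hypothetical placeholder; a real key belongs in a secrets manager, not in code.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 hash so sources can be joined without sharing raw IDs."""
    normalized = identifier.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def join_on_token(left: list[dict], right: list[dict]) -> list[dict]:
    """Inner-join two lists of rows on their shared 'token' field."""
    right_by_token = {row["token"]: row for row in right}
    return [
        {**row, **right_by_token[row["token"]]}
        for row in left
        if row["token"] in right_by_token
    ]


# Each source tokenizes its own identifiers before the data leaves the system.
crm = [
    {"token": pseudonymize("Ana@Example.com"), "segment": "enterprise"},
    {"token": pseudonymize("bo@example.com"), "segment": "smb"},
]
support = [
    {"token": pseudonymize(" ana@example.com "), "open_tickets": 3},
]

joined = join_on_token(crm, support)
print(joined)
```

Normalizing before hashing matters here: without the `strip().lower()` step, the same customer entered two different ways would never match.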

Additionally, concerns about bias in AI models stem from biased training data. If historical data contains systemic biases, AI models will learn and reinforce these biases unless explicitly corrected. Techniques such as bias mitigation, fairness auditing, and explainability methods are essential to ensure that AI systems remain equitable and trustworthy.
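
As a small example of what a fairness audit can measure, the sketch below computes per-group selection rates and the gap between them, often called the demographic parity difference. The group labels and predictions are made up, and this is only one of several fairness metrics; which one is appropriate depends on the use case.

```python
from collections import defaultdict


def selection_rates(groups: list[str], predictions: list[int]) -> dict[str, float]:
    """Share of positive predictions (1) per group."""
    if len(groups) != len(predictions):
        raise ValueError("groups and predictions must have the same length")
    totals: dict[str, int] = defaultdict(int)
    positives: dict[str, int] = defaultdict(int)
    for group, prediction in zip(groups, predictions):
        totals[group] += 1
        positives[group] += prediction
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_difference(rates: dict[str, float]) -> float:
    """Gap between the highest and lowest group selection rates (0 means parity)."""
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs for two groups.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, predictions)
print(rates)
print("parity gap:", demographic_parity_difference(rates))
```

A large gap is a prompt to look at the training data and the process that produced it, not a verdict on its own.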

Another growing concern is AI-generated data pollution — as AI models generate more content, the internet is becoming saturated with synthetic text. Ensuring that future models are trained on high-quality, human-authored content rather than AI-generated noise is a challenge that must be addressed.

Looking Ahead

In my opinion, if current trends continue, AI models will soon reach and exceed human-level performance in many professional domains. However, the next phase of AI development will depend not just on increasing model size but on innovative ways to utilize and refine data. Whether through synthetic data, better data curation, or novel foundational architectures with embedded flexible processes, the relationship between data and AI will remain at the heart of progress.

Path Forward

Key Lessons

1. Start with Process, Not Code

Technical solutions should enable good processes, not compensate for bad ones. Before opening your IDE, ask:

  • Who owns this data?
  • Why is quality breaking down?
  • Where does the problem really start?
  • What business processes are involved?

2. Build Bridges, Not Just Solutions

Sustainable quality requires both technical and organizational changes.

  • Talk to business users
  • Understand their workflows
  • Learn their pain points
  • Make them partners in solutions

3. Create Sustainable Systems

Your value as a data engineer grows when you bridge technical and business needs.

  • Document processes, not just code
  • Train people, not just implement tools
  • Build accountability frameworks
  • Establish ownership
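
Ownership and documentation can themselves be made checkable. The sketch below is one hypothetical way to do that: a small registry that records who owns each dataset and reports which entries have no owner, no quality checks, or no process documentation. The class, field names, and sample entries are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetContract:
    name: str
    owner: str = ""                      # accountable person or team
    quality_checks: list[str] = field(default_factory=list)
    process_doc: str = ""                # where the business process is described


def governance_gaps(contracts: list[DatasetContract]) -> dict[str, list[str]]:
    """Map each dataset name to the governance fields it is missing."""
    gaps: dict[str, list[str]] = {}
    for contract in contracts:
        missing = []
        if not contract.owner.strip():
            missing.append("owner")
        if not contract.quality_checks:
            missing.append("quality_checks")
        if not contract.process_doc.strip():
            missing.append("process_doc")
        if missing:
            gaps[contract.name] = missing
    return gaps


# Hypothetical registry entries.
contracts = [
    DatasetContract(
        name="orders",
        owner="sales-ops",
        quality_checks=["order_id is unique"],
        process_doc="docs/orders.md",
    ),
    DatasetContract(name="customers", quality_checks=["email present"]),
]

print(governance_gaps(contracts))
```

The code is trivial on purpose. The hard part is getting a named person or team to agree to appear in the `owner` field.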

In Enterprises, sometimes the best solution isn’t more code — it’s better processes.

Remember: Your technical skills are still crucial, but they’re most powerful when applied in support of well-designed processes and clear organizational responsibilities.

Useful links: The “Who Does What” Guide To Enterprise Data Quality | by Michael Segner | Towards Data Science | Medium

The future of AI depends not just on having more data, but on having better data. Organizations that prioritize data quality and build flexible data infrastructure will be best positioned to leverage AI’s potential.

The organizations that master these elements will be the ones that lead in the AI-driven future. In the end, the story of AI isn’t just about algorithms or computing power — it’s about the quality of the data that drives them.

Thanks for reading.

https://x.com/richiebachala


Published via Towards AI
