When Scripts Aren’t Enough: Building Sustainable Enterprise Data Quality

Author(s): Richie Bachala

Originally published on Towards AI.

Beyond Scale: Data Quality for AI Infrastructure

The trajectory of AI over the past decade has been driven largely by the scale of data available for training and the ability to process it with increasingly powerful compute & experimental models. From speech recognition breakthroughs to large-scale language models, the story of AI is fundamentally a story of data.

The Scaling Hypothesis: Bigger Data, Better AI?

I’ll say it again — the story of artificial intelligence over the past decade is fundamentally a story about data.

What began as a series of experiments in speech recognition has evolved into an understanding of how AI systems learn and grow.

The key insight? Scale matters, but quality matters more.

Early AI researchers discovered something remarkable: when you feed neural networks more data, they continue to improve. This wasn’t just true for speech recognition — it held across language processing, computer vision, and even mathematical reasoning. This observation led to what we now call the Scaling Hypothesis.

Think of it as a perfectly balanced chemical reaction. Three critical ingredients must scale together:

Larger neural networks

More training data

Increased computing power

If any one ingredient falls short while the others grow, progress stalls. But when all three scale in harmony, model performance improves steadily and predictably, following the power-law trends documented in the scaling-law literature.

As in a chemical reaction, all reagents must scale proportionally.
Similar scaling laws have been observed across multiple domains (language, images, video, and math).

Source: Kaplan et al. (2020) — “Scaling Laws for Neural Language Models”

https://arxiv.org/pdf/2001.08361

An influential early study documenting empirical scaling laws for neural language models
Published by researchers at OpenAI
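
As a rough guide to what "scaling laws" means in that paper: test loss is reported to fall approximately as a power law in each of the three ingredients, as long as the other two are not the bottleneck. The notation below is a paraphrase of the paper's, so check the original for the exact definitions and fitted values.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) \approx \left(\frac{C_c^{\min}}{C_{\min}}\right)^{\alpha_C^{\min}}
```

Here L is the test loss, N the number of model parameters, D the dataset size in tokens, and C_min the training compute, with N_c, D_c, and C_c^min as fitted constants. The fitted exponents are small (well below 1), so each doubling of one ingredient buys a steady but modest reduction in loss, and only while the other two keep pace.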

The Data Quality Conundrum

Not all data is created equal. The internet may offer trillions of words, but much of it is:

Repetitive content
SEO-optimized fluff
AI-generated text
Low-value information

This has led to concerns about whether AI will eventually run out of useful training data. It’s a question I often hear put to AI product leads on podcasts.
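
To make the first and last items in the list above concrete, here is a minimal sketch of a corpus filter that drops exact duplicates (after light normalization) and very short snippets. The function names and the `min_words` threshold are my own illustrative choices, and nothing here attempts to detect SEO fluff or AI-generated text, which are much harder problems.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_corpus(docs: list[str], min_words: int = 5) -> list[str]:
    """Keep the first copy of each document and drop very short ones.

    min_words is an illustrative threshold, not a recommended value.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to carry much information
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "Scaling laws relate loss to model size, data, and compute.",
    "scaling laws relate   loss to model size, data, and compute.",
    "Click here!",
    "Data ownership must be assigned before quality rules are written.",
]
cleaned = filter_corpus(docs)
print(len(cleaned))
```

Exact-match hashing only catches verbatim repeats. Near-duplicate detection would need a different technique, which this sketch deliberately leaves out.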

Data Quality First

In my opinion:

Prioritize data quality before AI model selection
Ensure data readiness for intended use cases
Validate data compatibility with chosen AI solutions

For a sense of why, check out this fun Reddit thread. It cuts across many industries: banks, AdTech, manufacturing, Big Tech, data sellers, and more.

Source: https://old.reddit.com/r/dataengineering/comments/1f5lqfi/how_serious_is_your_org_about_data_quality/

Role of High-Quality Data

One potential solution I hear everyone sharing is synthetic data generation. Models can be trained to generate high-quality data from scratch, effectively creating new learning materials that reflect the underlying distribution of real-world knowledge. This approach has been successful in domains like chess, where AI agents learn by playing against themselves and so train without human-generated data or usage, observability, or player log data.

Another promising approach is reinforcement learning and reasoning models, which allow AI to improve by reflecting on its own thought processes. This method not only expands the available training data but also enhances model efficiency and problem-solving abilities.

I’ve been a data engineer for the last decade, so my reflex when I see bad data is to reach for a technical fix: more cleaning scripts, better validation rules, improved monitoring dashboards.

The Technical Reflex

Picture this common scenario: Bad data appears in a report. Our immediate response?

Write a cleaning script
Add validation rules
Create monitoring alerts
Build data quality dashboards

Like these, starting with the DQ dimensions that I’ve been prioritizing:

[Figure: DQ Dimensions: 9 that are important for my Enterprise]

[Figure: DQ Standard Dashboard for sharing updates to show progress. By the author, for illustrative purposes.]
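
The reflex described above, validation rules plus a monitoring alert, can be sketched in a few lines. This is a minimal illustration, not a recommended framework: the `Rule` class, the two rules, the sample customer records, and the 0.25 alert threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable

Record = dict[str, Any]


@dataclass
class Rule:
    name: str
    check: Callable[[Record], bool]  # returns True when the record passes


# Hypothetical rules for a hypothetical customer record.
RULES = [
    Rule("email_present", lambda r: bool(r.get("email"))),
    Rule(
        "age_in_range",
        lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    ),
]


def failure_rates(records: list[Record], rules: list[Rule]) -> dict[str, float]:
    """Return the fraction of records failing each rule."""
    total = len(records)
    if total == 0:
        return {rule.name: 0.0 for rule in rules}
    return {
        rule.name: sum(not rule.check(r) for r in records) / total
        for rule in rules
    }


def alerts(rates: dict[str, float], threshold: float = 0.25) -> list[str]:
    """Names of rules whose failure rate exceeds the illustrative threshold."""
    return [name for name, rate in rates.items() if rate > threshold]


records = [
    {"email": "a@example.com", "age": 34},
    {"email": "", "age": 29},
    {"email": "c@example.com", "age": 250},
    {"email": None, "age": 41},
]
rates = failure_rates(records, RULES)
print(rates)
print(alerts(rates))
```

Note what the alert can and cannot say: it reports that half the sample records lack an email, but not why the field is being left empty at the point of entry.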

The Enterprise Reality Check

Here’s the truth that I’ve learned the hard way: the best technical solution can’t fix a process problem.

Consider these common scenarios:

A perfect validation script can’t fix inconsistent data entry practices
The most robust ETL pipeline can’t resolve disagreements about business rules
Real-time quality monitoring can’t replace clear data ownership

Path to Maturity

The path to maturity in data engineering often looks like this:

Junior: “I’ll fix it with code”
Mid-level: “I’ll build a system to prevent it”
Senior: “Let’s understand why this happens”
Lead: “We need to change how we work”

The best technical solution can’t fix a broken process.

Why Technical Band-Aids Fail

These solutions work… until they don’t. And here’s why:

They treat symptoms, not causes

Clean data today, same issues tomorrow
Growing maze of scripts and rules

They miss the human element

Data entry remains error-prone
Business processes stay broken
Communication gaps persist
No ownership of quality

The Limits of Data: Are We Nearing a Ceiling?

A pressing question in AI development is whether we will hit a ceiling due to data limitations. Some argue that while scaling has driven progress so far, we may eventually exhaust high-quality training data, leading to diminishing returns. Others believe that innovations in reasoning models, reinforcement learning, and self-supervised learning will continue pushing the boundaries of AI capabilities.

Another challenge is data integration and consistency. Large AI models require data that is well-structured, diverse, and representative of real-world complexity. Many enterprises, including ours, struggle with fragmented data sources, inconsistencies, and a lack of proper governance, all of which hinder AI implementation, let alone performance.

Additionally, the computing costs associated with handling vast amounts of data remain a significant factor. AI model training requires extensive computational resources, with companies investing billions in AI clusters. Managing these costs efficiently is crucial to sustaining AI advancements.

Contextual Data Integration

Combine different data sources meaningfully
Focus on creating value through data relationships
Maintain privacy while leveraging data connections
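
One way to read the three points above together is to join two sources on a pseudonymous key instead of a raw identifier. The sketch below assumes two made-up sources (a CRM extract and a support-ticket extract) that share an email address, and it uses a keyed hash from the Python standard library. The key, field names, and data are all illustrative.

```python
import hashlib
import hmac

# Assumption: in practice the key would come from a secrets manager,
# never from source code.
SECRET_KEY = b"demo-key-do-not-use"


def pseudonymize(identifier: str) -> str:
    """Keyed hash of a normalized identifier, so sources can be joined
    without passing the raw value around."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()


# Two hypothetical sources that share an email address as a natural key.
crm = [
    {"email": "Ana@example.com", "segment": "enterprise"},
    {"email": "bo@example.com", "segment": "smb"},
]
tickets = [
    {"email": "ana@example.com ", "open_tickets": 3},
    {"email": "cy@example.com", "open_tickets": 1},
]


def to_keyed(rows, value_field):
    """Map pseudonymous key -> the one field we want to share."""
    return {pseudonymize(r["email"]): r[value_field] for r in rows}


segments = to_keyed(crm, "segment")
open_counts = to_keyed(tickets, "open_tickets")

# Inner join on the pseudonymous key; raw emails never appear in the result.
joined = [
    {"customer_key": key, "segment": segments[key], "open_tickets": open_counts[key]}
    for key in segments.keys() & open_counts.keys()
]
print(joined)
```

A keyed hash is pseudonymization, not anonymization: anyone holding the key can recompute the mapping, so the key itself needs an owner and access controls.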

Additionally, concerns about bias in AI models stem from biased training data. If historical data contains systemic biases, AI models will learn and reinforce these biases unless explicitly corrected. Techniques such as bias mitigation, fairness auditing, and explainability methods are essential to ensure that AI systems remain equitable and trustworthy.
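
As a small example of what a fairness audit can measure, the sketch below computes per-group selection rates and the gap between them, often called the demographic parity difference. It is one metric among many and the data is made up for illustration. A real audit would look at several metrics and at how the training data was collected.

```python
from collections import defaultdict


def selection_rates(groups: list[str], predictions: list[int]) -> dict[str, float]:
    """Share of positive predictions (1) per group."""
    totals: dict[str, int] = defaultdict(int)
    positives: dict[str, int] = defaultdict(int)
    for group, pred in zip(groups, predictions):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(rates: dict[str, float]) -> float:
    """Gap between the highest and lowest group selection rates."""
    return max(rates.values()) - min(rates.values())


# Made-up data for illustration only.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]

rates = selection_rates(groups, predictions)
gap = demographic_parity_difference(rates)
print(rates, gap)
```

A gap of zero means every group receives positive predictions at the same rate. What counts as an acceptable gap is a policy decision, not something the code can settle.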

Another growing concern is AI-generated data pollution — as AI models generate more content, the internet is becoming saturated with synthetic text. Ensuring that future models are trained on high-quality, human-authored content rather than AI-generated noise is a challenge that must be addressed.

Looking Ahead

In my opinion, if current trends continue, AI models will soon reach and exceed human-level performance in many professional domains. However, the next phase of AI development will depend not just on increasing model size but on innovative ways to utilize and refine data. Whether through synthetic data, better data curation, or novel foundational architectures with embedded flexible processes, the relationship between data and AI will remain at the heart of progress.

Path Forward

Key Lessons

1. Start with Process, Not Code

Technical solutions should enable good processes, not compensate for bad ones. Before opening your IDE, ask:

Who owns this data?
Why is quality breaking down?
Where does the problem really start?
What business processes are involved?

2. Build Bridges, Not Just Solutions

Sustainable quality requires both technical and organizational changes:

Talk to business users
Understand their workflows
Learn their pain points
Make them partners in solutions

3. Create Sustainable Systems

Your value as a data engineer grows when you bridge technical and business needs.

Document processes, not just code
Train people, not just implement tools
Build accountability frameworks
Establish ownership
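
Even the ownership point can be given a lightweight technical home once the organizational decision has been made. The sketch below is a hypothetical ownership registry that routes a data quality issue to a named team and makes unowned datasets visible. The dataset names, teams, and addresses are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    team: str
    contact: str


# Hypothetical registry: dataset name -> accountable owner.
OWNERS = {
    "crm.customers": Owner(team="Sales Ops", contact="sales-ops@example.com"),
    "billing.invoices": Owner(team="Finance", contact="finance-data@example.com"),
}


def route_issue(dataset: str, issue: str) -> str:
    """Return a message addressed to the dataset's owner, or flag the gap."""
    owner = OWNERS.get(dataset)
    if owner is None:
        return f"UNOWNED: '{dataset}' has no registered owner; issue: {issue}"
    return f"To {owner.team} <{owner.contact}>: {dataset}: {issue}"


print(route_issue("crm.customers", "50% of records missing email"))
print(route_issue("web.clickstream", "duplicate session ids"))
```

The registry only records a decision that people have already made. Filling it in is the process work, and the code is the easy part.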

In Enterprises, sometimes the best solution isn’t more code — it’s better processes.

Remember: Your technical skills are still crucial, but they’re most powerful when applied in support of well-designed processes and clear organizational responsibilities.

Useful links: The “Who Does What” Guide To Enterprise Data Quality, by Michael Segner (Towards Data Science, on Medium)

The future of AI depends not just on having more data, but on having better data. Organizations that prioritize data quality and build flexible data infrastructure will be best positioned to leverage AI’s potential.

The organizations that master these elements will be the ones that lead in the AI-driven future. In the end, the story of AI isn’t just about algorithms or computing power — it’s about the quality of the data that drives them.

Thanks for reading.

https://x.com/richiebachala

Published via Towards AI
