When Scripts Aren’t Enough: Building Sustainable Enterprise Data Quality
Author(s): Richie Bachala
Originally published on Towards AI.
Beyond Scale: Data Quality for AI Infrastructure
The trajectory of AI over the past decade has been driven largely by the scale of data available for training and by the ability to process it with increasingly powerful compute and experimental models. From speech recognition breakthroughs to large-scale language models, the story of AI is fundamentally a story of data.
The Scaling Hypothesis: Bigger Data, Better AI?
I’ll say it again — the story of artificial intelligence over the past decade is fundamentally a story about data.
What began as a series of experiments in speech recognition has evolved into an understanding of how AI systems learn and grow.
The key insight? Scale matters, but quality matters more.
Early AI researchers discovered something remarkable: when you feed neural networks more data, they continue to improve. This wasn’t just true for speech recognition — it held across language processing, computer vision, and even mathematical reasoning. This observation led to what we now call the Scaling Hypothesis.
Think of it as a perfectly balanced chemical reaction. Three critical ingredients must scale together:
- Larger neural networks
- More training data
- Increased computing power
If any one ingredient falls short while the others grow, progress stalls. But when all three scale in harmony, model performance improves smoothly and predictably, following power-law trends rather than hitting an early plateau.
- Analogous to a chemical reaction where all reagents must scale proportionally
- Scaling laws observed across multiple domains (language, images, video, math)
Source: Kaplan et al. (2020), “Scaling Laws for Neural Language Models”
https://arxiv.org/pdf/2001.08361
- An early and influential formal study documenting empirical scaling laws for language models
- Published by OpenAI
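The "ingredients must scale together" point can be made concrete with a small sketch. It uses the joint power-law form that Kaplan et al. report for loss as a function of model size N and dataset size D. The constants below are illustrative placeholders chosen for the demo, not the paper's fitted values, and compute (the third ingredient) is left out to keep the example short.

```python
def joint_loss(n_params, n_tokens, n_c=1e14, d_c=1e13, alpha_n=0.08, alpha_d=0.10):
    """Loss as a joint power law in model size (N) and dataset size (D).

    Form: L(N, D) = ((N_c / N) ** (alpha_N / alpha_D) + D_c / D) ** alpha_D
    The default constants are illustrative placeholders, not fitted values.
    """
    model_term = (n_c / n_params) ** (alpha_n / alpha_d)
    data_term = d_c / n_tokens
    return (model_term + data_term) ** alpha_d


def data_limited_floor(n_tokens, d_c=1e13, alpha_d=0.10):
    """Loss that remains at fixed data even if the model grows without bound."""
    return (d_c / n_tokens) ** alpha_d


if __name__ == "__main__":
    fixed_tokens = 1e10
    # Grow the model while holding data fixed: gains shrink toward a floor.
    for n_params in (1e8, 1e10, 1e12):
        loss = joint_loss(n_params, fixed_tokens)
        print(f"N={n_params:.0e}, D fixed: loss={loss:.3f}")
    print(f"floor at fixed D: {data_limited_floor(fixed_tokens):.3f}")
    # Scale data along with the model: the floor moves down.
    print(f"N and D both scaled: loss={joint_loss(1e12, 1e12):.3f}")
```

Under these assumptions, growing only the model yields smaller and smaller gains, because the data term sets a floor that no amount of extra parameters can cross.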
The Data Quality Conundrum
Not all data is created equal. The internet may offer trillions of words, but much of it is:
- Repetitive content
- SEO-optimized fluff
- AI-generated text
- Low-value information
This has led to concerns about whether AI will eventually run out of useful training data. It is a question I often hear put to AI product leads on podcasts.
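Of the problems listed above, repetitive content is the easiest to illustrate in code. The following is a minimal sketch, assuming documents are plain strings and that "duplicate" means identical after lowercasing and stripping punctuation and extra whitespace. The function names are hypothetical, and real corpora usually also need near-duplicate detection, which this does not attempt.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace so trivial variants match."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


if __name__ == "__main__":
    corpus = ["Hello, World!", "hello   world", "Different text"]
    print(deduplicate(corpus))
```

Hashing the normalized text keeps memory use low on large corpora, at the cost of catching only exact matches after normalization.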
Data Quality First
In my opinion:
- Prioritize data quality before AI model selection
- Ensure data readiness for intended use cases
- Validate data compatibility with chosen AI solutions
Check this fun thread on Reddit to see why: it cuts across many industries, including banks, AdTech, manufacturing, Big Tech, and data sellers.
Role of High-Quality Data
One potential solution I hear from almost everyone is synthetic data generation. Models can be trained to generate high-quality data from scratch, effectively creating new learning material that reflects the underlying distribution of real-world knowledge. This approach has been successful in domains like chess, where AI agents learn by playing against themselves and so train without human-generated data such as usage, observability, or player logs.
Another promising approach is reinforcement learning and reasoning models, which allow AI to improve by reflecting on its own thought processes. This method not only expands the available training data but also enhances model efficiency and problem-solving abilities.
I’ve been a data engineer for the last decade, so my first response to bad data is a technical one: more cleaning scripts, better validation rules, improved monitoring dashboards.
The Technical Reflex
Picture this common scenario: Bad data appears in a report. Our immediate response?
- Write a cleaning script
- Add validation rules
- Create monitoring alerts
- Build data quality dashboards
[Figure: examples of these technical fixes]
[Figure: DQ dimensions that I’ve been prioritizing]
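As a hedged illustration of the technical reflex, here is what a first-pass set of validation rules might look like. The three dimensions scored here (completeness, uniqueness, validity) are commonly cited ones and are an assumption on my part, not necessarily the dimensions the author prioritizes. The field names and the email pattern are hypothetical.

```python
import re

# Deliberately loose pattern; an assumption for the demo, not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def _non_empty(rows, field):
    """Values of `field` that are present and not blank."""
    return [r[field] for r in rows if r.get(field) not in (None, "")]


def completeness(rows, field):
    """Share of rows where `field` is filled in."""
    return len(_non_empty(rows, field)) / len(rows) if rows else 0.0


def uniqueness(rows, field):
    """Share of filled-in values of `field` that are distinct."""
    values = _non_empty(rows, field)
    return len(set(values)) / len(values) if values else 0.0


def validity(rows, field, pattern):
    """Share of filled-in values of `field` that match `pattern`."""
    values = _non_empty(rows, field)
    return sum(1 for v in values if pattern.match(v)) / len(values) if values else 0.0


if __name__ == "__main__":
    rows = [
        {"customer_id": "C1", "email": "ana@example.com"},
        {"customer_id": "C2", "email": "not-an-email"},
        {"customer_id": "C2", "email": ""},
        {"customer_id": "C4", "email": "li@example.com"},
    ]
    print("email completeness:", completeness(rows, "email"))
    print("customer_id uniqueness:", uniqueness(rows, "customer_id"))
    print("email validity:", validity(rows, "email", EMAIL_PATTERN))
```

Scores like these tell you that something is wrong, not why it went wrong, which is where the rest of this article picks up.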
The Enterprise Reality Check
Here’s the truth that I’ve learned the hard way: The best technical solution can’t fix a process problem.
Consider these common scenarios:
- A perfect validation script can’t fix inconsistent data entry practices
- The most robust ETL pipeline can’t resolve disagreements about business rules
- Real-time quality monitoring can’t replace clear data ownership
Path to Maturity
The path to maturity in data engineering often looks like this:
- Junior: “I’ll fix it with code”
- Mid-level: “I’ll build a system to prevent it”
- Senior: “Let’s understand why this happens”
- Lead: “We need to change how we work”
The best technical solution can’t fix a broken process.
Why Technical Band-Aids Fail
These solutions work… until they don’t. And here’s why:
They treat symptoms, not causes
- Clean data today, same issues tomorrow
- Growing maze of scripts and rules
They miss the human element
- Data entry remains error-prone
- Business processes stay broken
- Communication gaps persist
- No ownership of quality
The Limits of Data: Are We Nearing a Ceiling?
A pressing question in AI development is whether we will hit a ceiling due to data limitations. Some argue that while scaling has driven progress so far, we may eventually exhaust high-quality training data, leading to diminishing returns. Others believe that innovations in reasoning models, reinforcement learning, and self-supervised learning will continue pushing the boundaries of AI capabilities.
Another challenge is data integration and consistency. Large AI models require data that is well-structured, diverse, and representative of real-world complexity. Many enterprises, including ours, struggle with fragmented data sources, inconsistencies, and a lack of proper governance, all of which hinder AI implementation, let alone performance.
Additionally, the computing costs associated with handling vast amounts of data remain a significant factor. AI model training requires extensive computational resources, with companies investing billions in AI clusters. Managing these costs efficiently is crucial to sustaining AI advancements.
Contextual Data Integration
- Combine different data sources meaningfully
- Focus on creating value through data relationships
- Maintain privacy while leveraging data connections
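One way to combine sources while limiting exposure of raw identifiers is to join on a keyed hash instead of the identifier itself. The sketch below is a minimal illustration under several assumptions: rows are plain dictionaries, the join key is an email address, the right-hand source has at most one row per key, and the secret is managed elsewhere. All function and field names are hypothetical. Note that pseudonymization of this kind reduces exposure but is not the same as anonymization.

```python
import hashlib
import hmac


def pseudonymize(value, secret):
    """Keyed hash of an identifier so sources can be joined without sharing the raw value."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


def join_on_pseudonym(left, right, key, secret):
    """Inner-join two lists of dicts on a pseudonymized key, dropping the raw key."""
    index = {}
    for row in right:
        token = pseudonymize(row[key], secret)
        index[token] = {k: v for k, v in row.items() if k != key}

    joined = []
    for row in left:
        token = pseudonymize(row[key], secret)
        if token in index:
            merged = {k: v for k, v in row.items() if k != key}
            merged.update(index[token])
            merged["subject_token"] = token  # stable join handle, not the raw email
            joined.append(merged)
    return joined


if __name__ == "__main__":
    crm = [{"email": "Ana@Example.com", "segment": "retail"}]
    orders = [
        {"email": "ana@example.com ", "total": 120},
        {"email": "bob@example.com", "total": 5},
    ]
    for row in join_on_pseudonym(crm, orders, "email", b"demo-secret"):
        print(row)
```

The value comes from the relationship between the two sources (segment plus order total), while the raw email never appears in the joined output.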
Additionally, concerns about bias in AI models stem from biased training data. If historical data contains systemic biases, AI models will learn and reinforce these biases unless explicitly corrected. Techniques such as bias mitigation, fairness auditing, and explainability methods are essential to ensure that AI systems remain equitable and trustworthy.
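A fairness audit can start with something as simple as comparing outcomes across groups. The sketch below computes the demographic parity gap, which is one common fairness metric among several and is used here only as an example. The predictions and group labels are made up for illustration.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Positive-prediction rate for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(predictions, groups):
    """Difference between the highest and lowest group selection rates."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


if __name__ == "__main__":
    preds = [1, 1, 1, 0, 1, 0, 0, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(selection_rates(preds, groups))
    print("parity gap:", demographic_parity_gap(preds, groups))
```

A large gap is a prompt for investigation rather than proof of bias; which metric is appropriate depends on the use case.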
Another growing concern is AI-generated data pollution — as AI models generate more content, the internet is becoming saturated with synthetic text. Ensuring that future models are trained on high-quality, human-authored content rather than AI-generated noise is a challenge that must be addressed.
Looking Ahead
In my opinion, if current trends continue, AI models will soon reach and exceed human-level performance in many professional domains. However, the next phase of AI development will depend not just on increasing model size but on innovative ways to utilize and refine data. Whether through synthetic data, better data curation, or novel foundational architectures with embedded flexible processes, the relationship between data and AI will remain at the heart of progress.
Path Forward
Key Lessons
1. Start with Process, Not Code
Technical solutions should enable good processes, not compensate for bad ones. Before opening your IDE, ask:
- Who owns this data?
- Why is quality breaking down?
- Where does the problem really start?
- What business processes are involved?
2. Build Bridges, Not Just Solutions
Sustainable quality requires both technical and organizational changes:
- Talk to business users
- Understand their workflows
- Learn their pain points
- Make them partners in solutions
3. Create Sustainable Systems
Your value as a data engineer grows when you bridge technical and business needs.
- Document processes, not just code
- Train people, not just implement tools
- Build accountability frameworks
- Establish ownership
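Code can support these process goals without replacing them. As one hedged example, a lightweight registry can make missing ownership and missing process documentation visible. The `DatasetRecord` fields and the `ownership_gaps` helper are hypothetical names for illustration, not a reference to any particular catalog tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    name: str
    owner: Optional[str] = None        # accountable person or team
    process_doc: Optional[str] = None  # link or path to the documented process


def ownership_gaps(records):
    """Map each dataset name to the accountability fields it is missing."""
    gaps = {}
    for record in records:
        missing = [f for f in ("owner", "process_doc") if not getattr(record, f)]
        if missing:
            gaps[record.name] = missing
    return gaps


if __name__ == "__main__":
    registry = [
        DatasetRecord("orders", owner="sales-ops", process_doc="wiki/orders"),
        DatasetRecord("leads"),
    ]
    print(ownership_gaps(registry))
```

The output is a to-do list for conversations with the business, not a fix in itself.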
In enterprises, sometimes the best solution isn’t more code; it’s better processes.
Remember: Your technical skills are still crucial, but they’re most powerful when applied in support of well-designed processes and clear organizational responsibilities.
Useful links: “The ‘Who Does What’ Guide To Enterprise Data Quality” by Michael Segner, Towards Data Science (Medium)
The future of AI depends not just on having more data, but on having better data. Organizations that prioritize data quality and build flexible data infrastructure will be best positioned to leverage AI’s potential.
The organizations that master these elements will be the ones that lead in the AI-driven future. In the end, the story of AI isn’t just about algorithms or computing power — it’s about the quality of the data that drives them.
Thanks for reading.
Published via Towards AI