

Quality Data Drives the Success of Machine Learning and Artificial Intelligence

Last Updated on July 18, 2020 by Editorial Team

Author(s): Mohua Sen

For an AI/ML application to perform analysis and generate insights, you need to provide it with useful, high-quality data.

History tells us that the 16th century saw the rise of Western civilization. During this time, Spain and Portugal explored the Indian Ocean and opened worldwide oceanic trade routes, and the Portuguese were given permission by the local sultans to settle in the wealthy Bengal Sultanate. Large parts of the New World became Spanish and Portuguese colonies, and as the Portuguese became the masters of Asia’s and Africa’s Indian Ocean trade, the Spanish opened trade across the Pacific Ocean, linking the Americas with Asia.

Another link was forged between minds and machines during this era. The French philosopher, scientist, and metaphysician René Descartes (1596–1650) imagined a world where machines could make decisions. Then, in 1956, the American computer and cognitive scientist John McCarthy coined the term Artificial Intelligence (AI), defining it as “the science and engineering of making intelligent machines.” AI is the ability of a computer program or a machine to think and learn.

Fast forward to the present: in 2020, AI is used widely across sectors. Be it supporting organizations in making well-thought-out decisions, something as routine as sorting our emails, helping credit risk managers, or detecting financial fraud, this branch of technology, teamed with advanced data analytics, has all the markings of a revolutionary effect.

AI scenarios showcase the technology’s remarkable computational power, but practical, operative applications begin with data. Data is the foundation of the advanced analytics algorithms that form the backbone of AI/ML models, and it must be supplied in a form the algorithm understands. The main function of AI/ML algorithms is to unlock the concealed information in the data; if the available data is of poor quality, the algorithm will produce incorrect insights, which can end in revenue loss for the project or organization. A Forrester report, “AI Experiences A Reality Check,” indicates that data quality is one of the foremost challenges to achieving the desired results from AI/ML systems in enterprises. According to Forrester, most organizations lack a clear understanding of the right data needed for ML models, and hence businesses often struggle with data preparation.

Human beings learn from experience: hitting a finger on a hot plate as a child teaches us, through perception, how to deal with hot plates in the future. Machines, on the contrary, follow instructions. They need to be trained and programmed to do things; the machines that put parts together in a car factory, for example, are simply following instructions. Machine learning is a process that ties both together: learning from experience and following instructions. Here the experience comes from data, so we need good-quality data to make it effective. And to control the quality of data, one needs rules in place. So how is good data defined?

When describing good data quality, we should focus on its important dimensions. Not every dimension is relevant for every field, but one should understand them clearly when thinking about enhancing data quality.

Completeness — The level at which desired data attributes are supplied. Your data does not need to be 100% complete, but focus on a few questions: Are any values missing? Is the data captured to its full extent at the source? For data to be useful, you need to see the whole picture, not just part of it. For example, every employee record has a location.

Accuracy — The degree to which data matches the agreed source. For example, the initial base salary reflects the amount on the contract.

Uniqueness — The extent to which data is stored uniquely in one place and not duplicated; e.g., there must not be multiple records for the same employee. Each record should be unique based on a given criterion; otherwise, the risk of accessing outdated information increases.

Integrity — Data is traceable back to the source. It is the extent to which data adheres to defined business rules, accepted values, and accepted formats; e.g., employee gender is F, M, or U.

Consistency — The extent to which identical data has the same value wherever it is stored or displayed. For example, the aggregated base salary by cost center is consistent between systems.

Timeliness — Does the data represent reality from the required point in time? Data should be refreshed, allowing for an acceptable system “lag,” when values change; e.g., base salary is updated within x days after a promotion.
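Several of these dimensions can be checked mechanically. The sketch below is a minimal illustration using pandas, assuming a hypothetical employee extract; the column names, sample values, and accepted gender codes (F, M, or U, as above) are illustrative assumptions, not a prescribed schema.

```python
import pandas as pd

# Hypothetical employee extract used to illustrate the checks.
employees = pd.DataFrame({
    "employee_id": [101, 102, 102, 104],
    "location":    ["NYC", "London", "London", None],
    "gender":      ["F", "M", "M", "X"],
    "base_salary": [70000, 65000, 65000, 80000],
})

# Completeness: are any required values missing?
missing = employees[["employee_id", "location"]].isna().sum()

# Uniqueness: are there multiple records for the same employee?
duplicates = employees["employee_id"].duplicated().sum()

# Integrity: does gender adhere to the accepted values F, M, or U?
invalid_gender = (~employees["gender"].isin(["F", "M", "U"])).sum()

print(missing.to_dict())   # {'employee_id': 0, 'location': 1}
print(duplicates)          # 1
print(invalid_gender)      # 1
```

Checks like these, run at the source before data reaches downstream models, are one simple way to turn the dimensions above into enforceable rules.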

So, to have good-quality data, a data quality assessment should first be performed to confirm these dimensions, and a remediation process should then be put in place to prevent data issues at the source. According to research, inaccurate or incomplete data can lead to a 20% drop in productivity; conversely, companies that focused on high-quality data saw revenue increases of around 20%.

We can see that high-quality data is the need of the hour, and every organization should establish a data quality assessment process at the source itself so that all downstream applications receive healthy data. The far-reaching influence of AI/ML models may be overlooked or delayed because of poor data quality. Data quality and master data management are among the most important cost-reduction levers in this competitive era. We should remember the 1–10–100 rule: “It costs: $1 to verify the accuracy of data at the point of entry, $10 to correct or clean up data in batch form, and $100 (or more) per record if nothing is done at the initial level.”
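To make the rule concrete, here is a rough back-of-the-envelope sketch, assuming a hypothetical batch of 50,000 records; the batch size is an illustrative assumption, and the per-record costs are the rule’s own figures.

```python
# The 1-10-100 rule applied to a hypothetical batch of 50,000 records.
records = 50_000
cost_verify_at_entry = records * 1    # $1 per record at the point of entry
cost_batch_cleanup   = records * 10   # $10 per record cleaned up later in batch
cost_do_nothing      = records * 100  # $100 (or more) per record if never fixed

print(cost_verify_at_entry)  # 50000
print(cost_batch_cleanup)    # 500000
print(cost_do_nothing)       # 5000000
```

The tenfold jump at each stage is the point: verifying at entry is two orders of magnitude cheaper than living with bad records.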


References

1. Vadime Elisseeff (1998). The Silk Roads: Highways of Culture and Commerce. Berghahn Books. ISBN 978-1-57181-221-6.
2. Nanda, J. N. (2005). Bengal: The Unique State. Concept Publishing Company. p. 10. ISBN 978-81-8069-149-2. “Bengal […] was rich in the production and export of grain, salt, fruit, liquors and wines, precious metals, and ornaments besides the output of its handlooms in silk and cotton. Europe referred to Bengal as the richest country to trade with.”
3. “Portuguese, The,” Banglapedia. Archived from the original on 1 April 2017.
4. “Portal: Modern history,” Wikipedia.
5. “16th century.”
6. “What is AI? / Basic Questions.”
7. “Artificial intelligence,” Simple English Wikipedia.
8. “Data Is The Foundation For Artificial Intelligence,” October 2018.
9. “The 5 Key Reasons Why Data Quality Is So Important.”
10. “The Cost of Quality: The 1-10-100 Rule.”
11. “Forrester Infographic: AI Experiences A Reality Check,” May 2019.

Quality Data Drives the success of Machine Learning and Artificial Intelligence was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

