Quality Data Drives the success of Machine Learning and Artificial Intelligence

Last Updated on July 18, 2020 by Editorial Team

Author(s): Mohua Sen

For an AI/ML application to perform analysis and generate insights, you need to provide useful, quality data.

History says the 16th century was the time during which the rise of Western civilization occurred. During this time, Spain and Portugal explored the Indian Ocean and opened worldwide oceanic trade routes, and the Portuguese were given permission by the Indian sultans to settle in the wealthy Bengal Sultanate. Large parts of the New World became Spanish and Portuguese colonies, and as the Portuguese became the masters of Asia’s and Africa’s Indian Ocean trade, the Spanish opened trade across the Pacific Ocean, linking the Americas with Asia.

Another kind of link, between minds and machines, took shape around this time. The French philosopher, scientist, and metaphysician René Descartes (1596–1650) imagined a world in which machines could make decisions. Then, in 1956, the American computer scientist and cognitive scientist John McCarthy coined the term Artificial Intelligence (AI), which he defined as “the science and engineering of making intelligent machines.” AI is the ability of a computer program or a machine to think and learn.

Fast forward to 2020, and AI is used widely across sectors. Whether it is supporting organizations in making well-thought-out decisions, doing something as routine as sorting our emails, or helping credit risk managers and detecting financial fraud, this branch of technology, teamed with advanced data analytics, has all the markings of a revolutionary effect.

These AI scenarios show the technology’s unbelievable computational power, but practical, operative applications begin with data. Data is the foundation of any advanced analytics algorithm, and these algorithms are the backbone of AI/ML models. Data must be supplied in the form that the algorithm understands. The main function of AI/ML algorithms is to unlock the information concealed in the data. If the available data is of poor quality, the algorithm will produce incorrect insights, which might end in revenue loss for the project or organization. A Forrester report on “AI Experiences A Reality Check” indicates that data quality is one of the biggest challenges to achieving the desired results from AI/ML systems in enterprises. According to Forrester, most organizations lack a clear understanding of the data their ML models need, and hence businesses often struggle to prepare data as needed.
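To make the point about incorrect insights concrete, here is a minimal sketch in plain Python. It is not taken from the Forrester report or any real project: the data points are invented, and `fit_slope` is a hypothetical helper that fits a straight line by ordinary least squares. It simply compares what a model "learns" from clean data with what it learns when a single value is mistyped.

```python
# Illustrative only: the numbers below are made up, not taken from any study.

def fit_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return covariance / variance


# Clean data follows y = 2x exactly.
clean = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]

# Same data with one mistyped value: 10 entered as 100.
dirty = clean[:-1] + [(5, 100)]

clean_slope = fit_slope(clean)
dirty_slope = fit_slope(dirty)

print(f"slope learned from clean data: {clean_slope:.2f}")
print(f"slope learned from dirty data: {dirty_slope:.2f}")
```

In this toy case, one bad value out of five moves the learned slope from 2 to 20. Real models and datasets are far larger, but the mechanism is the same: the algorithm faithfully learns whatever the data says, errors included.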

Human beings learn from experience. I remember learning things this way when I was younger: touching a hot plate with my finger taught me, through perception, how to deal with it in the future. Machines, by contrast, follow instructions. They need to be trained or programmed to do things. For example, a car manufacturing company has machines that put different parts together; they are programmed and are simply following instructions. Machine learning, however, is a process in which both things are tied together: learning from experience and following instructions. The only difference here is that the machine is “learning from data,” so we need good-quality data to make it effective. And to control the quality of data, one needs rules in place. So how is good data defined?

When describing good data quality, we should focus on its most important dimensions. Not every dimension is relevant to every field, but one should have a clear understanding of these dimensions when thinking about how to enhance the quality of data.

Completeness: The level at which the desired data attributes are supplied. Your data does not need to be 100% complete, but you need to keep the focus on a few questions: Are any values missing? Is the data being captured to its full extent at the source? For data to be useful, you need to see the whole picture, not just part of it. For example, all employees have a location.

Accuracy: The degree to which data matches the agreed source. For example, the initial base salary reflects the amount on the contract.

Uniqueness: The extent to which data is stored in one place and not duplicated, e.g., there must not be multiple records for the same employee. Each record should be unique based on a given criterion; otherwise, the risk of accessing outdated information increases.

Integrity: Data is traceable back to the source. It is the extent to which data adheres to defined business rules, accepted values, and accepted formats, e.g., employee gender is F, M, or U.

Consistency: The extent to which identical data has the same value wherever it is stored or displayed. For example, the aggregated base salary by cost center is consistent between systems.

Timeliness: Does the data represent reality as of the required point in time? The data should be refreshed within an acceptable system ‘lag’ when values change, e.g., base salary updated within x days after a promotion.
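Several of the dimensions above can be turned into simple, measurable checks. The sketch below is one possible way to do that in plain Python, not a standard tool: the employee records, the field names, the `assess` and `share` helpers, and the 30-day value standing in for "x days" are all assumptions made for illustration. Accuracy and consistency are left out because they need a second source (the contract, or another system) to compare against.

```python
from datetime import date

# Hypothetical employee records; field names and values are invented for illustration.
employees = [
    {"id": 1, "location": "Pune", "gender": "F",
     "promoted": date(2020, 6, 1), "salary_updated": date(2020, 6, 5)},
    {"id": 2, "location": None, "gender": "M",
     "promoted": date(2020, 3, 1), "salary_updated": date(2020, 5, 1)},
    {"id": 2, "location": "Delhi", "gender": "M",
     "promoted": date(2020, 3, 1), "salary_updated": date(2020, 5, 1)},
    {"id": 3, "location": "Mumbai", "gender": "X",
     "promoted": date(2020, 1, 10), "salary_updated": date(2020, 1, 15)},
]

ACCEPTED_GENDERS = {"F", "M", "U"}  # accepted values from the integrity example
MAX_LAG_DAYS = 30                   # assumed stand-in for the "x days" lag


def share(records, passes):
    """Fraction of records for which the check `passes` returns True."""
    return sum(1 for r in records if passes(r)) / len(records)


def assess(records):
    """Score each dimension between 0 (all records fail) and 1 (all pass)."""
    ids = [r["id"] for r in records]
    return {
        # Completeness: every employee has a location.
        "completeness": share(records, lambda r: r["location"] is not None),
        # Uniqueness: distinct employee ids relative to the number of records.
        "uniqueness": len(set(ids)) / len(ids),
        # Integrity: gender is one of the accepted values.
        "integrity": share(records, lambda r: r["gender"] in ACCEPTED_GENDERS),
        # Timeliness: salary updated within the allowed lag after promotion.
        "timeliness": share(
            records,
            lambda r: (r["salary_updated"] - r["promoted"]).days <= MAX_LAG_DAYS,
        ),
    }


report = assess(employees)
for dimension, score in report.items():
    print(f"{dimension}: {score:.0%}")
```

Scoring each dimension separately, rather than producing a single pass/fail flag, makes it easier to see which kind of problem dominates and where remediation should start.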

So, to have good-quality data, a data quality assessment first needs to be performed against these dimensions, and a remediation process should then be put in place to prevent data issues at the source. According to research, inaccurate or incomplete data can lead to a 20% drop in productivity, while companies that focused on high-quality data saw a revenue increase of around 20%.

We can see that high-quality data is the need of the hour, and every organization should establish a data quality assessment process at the source itself so that all downstream applications receive healthy data. The far-reaching influence of AI/ML models might be overlooked or delayed because of poor data quality. Data quality and master data management are of the utmost importance for reducing cost in this competitive era. We should remember the 1–10–100 rule: “It costs: $1 to verify the accuracy of data at the point of entry, $10 to correct or clean up data in batch form, and $100 (or more) per record if nothing is done at the initial level.”
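The arithmetic behind the 1–10–100 rule is easy to work through. The sketch below uses the per-record costs quoted in the rule; the record count and the `total_cost` helper are hypothetical, and the calculation assumes that cost scales linearly with the number of records, which is a simplification.

```python
# Per-record costs quoted in the 1-10-100 rule (US dollars).
COST_PER_RECORD = {"verify_at_entry": 1, "clean_in_batch": 10, "do_nothing": 100}


def total_cost(bad_records, stage):
    """Cost of handling `bad_records` at a stage, assuming cost scales linearly."""
    return bad_records * COST_PER_RECORD[stage]


bad_records = 5_000  # hypothetical volume, chosen only for illustration

for stage in COST_PER_RECORD:
    print(f"{stage}: ${total_cost(bad_records, stage):,}")
```

For 5,000 records this gives $5,000 at the point of entry, $50,000 for batch clean-up, and $500,000 if nothing is done, which is why checking at the source is the cheapest place to act.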

References

  1. Vadime Elisseeff (1998). The Silk Roads: Highways of Culture and Commerce. Berghahn Books. ISBN 978-1-57181-221-6.
  2. Nanda, J. N. (2005). Bengal: the unique state. Concept Publishing Company. p. 10. ISBN 978-81-8069-149-2. Bengal […] was rich in the production and export of grain, salt, fruit, liquors and wines, precious metals, and ornaments besides the output of its handlooms in silk and cotton. Europe referred to Bengal as the richest country to trade with.
  3. “Portuguese, The - Banglapedia.” en.banglapedia.org. Archived from the original on 1 April 2017.
  4. “Portal: Modern history - Wikipedia.” en.wikipedia.org/wiki/Portal:Modern_history
  5. “16th century.” en.wikipedia.org.
  6. “What is AI? / Basic Questions.” jmc.stanford.edu/artificial-intelligence.
  7. “Artificial intelligence - Simple English Wikipedia.” simple.wikipedia.org/wiki/Artificial_intelligence
  8. “Data Is The Foundation For Artificial Intelligence.” www.forbes.com. Oct 2018.
  9. “The 5 Key Reasons Why Data Quality Is So Important.” cerasis.com/data-quality.
  10. “The Cost of Quality: The 1–10–100 Rule.” www.makingstrategyhappen.com.
  11. “Forrester Infographic: AI Experiences A Reality Check.” www.forrester.com/report/. May 2019.


Quality Data Drives the success of Machine Learning and Artificial Intelligence was originally published in Towards AI - Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI
