Data Scrubbing: How Cleaning Your Data Can Shape Better Machine Learning Models
Last Updated on October 20, 2024 by Editorial Team
Author(s): Souradip Pal
Originally published on Towards AI.
Discover the importance of data scrubbing, how it refines datasets, and the techniques to prepare data for machine learning, including feature selection, row compression, and one-hot encoding.
Picture this: You're at the farmer's market, and you come across a basket of fresh apples. But hold on, some of them have bruises, a few have wormholes, and others are oddly shaped. You can't make a delicious pie with these as they are, right? You'll need to sort through, clean up, and trim off the bad parts before you get to the juicy core. Well, working with datasets is much the same. Before we can build accurate machine learning models or glean valuable insights, we need to "scrub" our data, a process known as data scrubbing.
In this article, we'll dive deep into the techniques of data scrubbing, including feature selection, row compression, and handling missing data, showing you how the cleanup process is a critical step before putting your dataset to work.
Note: All images used in this blog were generated by DALL-E.
Data scrubbing is the process of cleaning, refining, and organizing raw datasets to make them usable and efficient for analysis and modeling. Just like washing and cutting fruit before making a smoothie, you have to remove irrelevant, incomplete, or duplicated data.
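To make that definition concrete, here is a minimal sketch in Python using pandas. The toy dataset, its column names, and the `scrub` helper are all hypothetical, invented purely for illustration. Dropping incomplete rows is also only one way to handle missing data, chosen here because it is the simplest to show.

```python
import pandas as pd

# Hypothetical toy dataset: the columns and values are made up for illustration.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4],
        "age": [34, 45, 45, None, 29],
        "favorite_color": ["red", "blue", "blue", "green", "red"],
        "purchased": [1, 0, 0, 1, 1],
    }
)


def scrub(df, irrelevant_columns):
    """Return a cleaned copy of df; the input frame is left unchanged."""
    cleaned = df.drop(columns=irrelevant_columns)  # remove irrelevant features
    cleaned = cleaned.drop_duplicates()            # remove duplicated rows
    cleaned = cleaned.dropna()                     # remove incomplete rows
    return cleaned.reset_index(drop=True)


# Assumption: "favorite_color" is treated as irrelevant to the prediction task.
clean = scrub(raw, irrelevant_columns=["favorite_color"])
print(clean)
```

In this sketch, the repeated row for customer 2 and the row with a missing age are removed, so three of the five original rows remain. Whether a column counts as irrelevant, or whether a missing value should be dropped rather than filled in, depends on the dataset and the model you plan to train.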