Data Scrubbing: How Cleaning Your Data Can Shape Better Machine Learning Models
Last Updated on October 20, 2024 by Editorial Team
Author(s): Souradip Pal
Originally published on Towards AI.
Discover the importance of data scrubbing, how it refines datasets, and the techniques to prepare data for machine learning, including feature selection, row compression, and one-hot encoding.
Picture this: You’re at the farmer’s market, and you come across a basket of fresh apples. But hold on, some of them have bruises, a few have wormholes, and others are oddly shaped. You can’t make a delicious pie with these as they are, right? You’ll need to sort through, clean up, and trim off the bad parts before you get to the juicy core. Well, working with datasets is much the same. Before we can build accurate machine learning models or glean valuable insights, we need to “scrub” our data — a process known as data scrubbing.
In this article, we’ll dive deep into the techniques of data scrubbing, including feature selection, row compression, and handling missing data, showing you how the cleanup process is a critical step before putting your dataset to work.
Note: All images used in this blog were generated by DALL-E.
Data scrubbing is the process of cleaning, refining, and organizing raw datasets to make them usable and efficient for analysis and modeling. Just like washing and cutting fruit before making a smoothie, you have to remove irrelevant, incomplete, or duplicated data.
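As a rough illustration of that definition, here is a minimal sketch in plain Python. The toy dataset, its column names, and the `scrub` helper are all hypothetical, invented for this example rather than taken from the article. It drops a column assumed to be irrelevant, skips rows with missing values, and removes exact duplicates.

```python
# Hypothetical toy dataset: column names and values are illustrative only.
raw_rows = [
    {"id": 1, "age": 34, "income": 52000, "favorite_color": "blue"},
    {"id": 2, "age": None, "income": 48000, "favorite_color": "red"},   # incomplete row
    {"id": 3, "age": 29, "income": 61000, "favorite_color": "green"},
    {"id": 3, "age": 29, "income": 61000, "favorite_color": "green"},  # duplicated row
]


def scrub(rows, irrelevant_columns):
    """Drop irrelevant columns, then incomplete rows, then exact duplicates."""
    cleaned = []
    seen = set()
    for row in rows:
        # 1. Remove columns assumed to carry no signal for the model.
        kept = {k: v for k, v in row.items() if k not in irrelevant_columns}
        # 2. Skip rows that still have a missing value.
        if any(value is None for value in kept.values()):
            continue
        # 3. Skip rows identical to one we have already kept.
        key = tuple(sorted(kept.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(kept)
    return cleaned


# Which columns count as "irrelevant" is an assumption made for this sketch.
clean_rows = scrub(raw_rows, irrelevant_columns={"favorite_color"})
for row in clean_rows:
    print(row)
```

Dropping incomplete rows is only one possible choice; filling in missing values is another, and which is appropriate depends on the dataset. In practice, a library such as pandas offers the same three steps through `DataFrame.drop`, `DataFrame.dropna`, and `DataFrame.drop_duplicates`.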
Published via Towards AI