Part 4: Data Manipulation in Data Cleaning
Last Updated on March 11, 2026 by Editorial Team
Author(s): Raj kumar
Originally published on Towards AI.

There is an assumption many teams carry without fully examining it.
Data cleaning feels responsible.
It feels corrective.
It feels like a necessary step to improve data quality before analysis or machine learning begins.
But data cleaning is not neutral.
Every cleaning decision alters the structure, distribution, and meaning of the dataset. When rows are removed, values are replaced, missing fields are filled, or outliers are excluded, the dataset is no longer a raw representation of reality. It becomes a curated version of it.
Once those changes are applied, the original state of the data is rarely revisited. The cleaned dataset becomes the new baseline, and all downstream analytics, dashboards, and AI models inherit its assumptions.
Data Cleaning in the Analytical Lifecycle
By the time we reach the cleaning stage in a real-world data pipeline, the data has already passed through multiple transformation layers:
- It has been imported from operational systems.
- It has been inspected for structure, completeness, and anomalies.
- It has been selected to narrow the focus to relevant rows and columns.
Cleaning is where the process shifts from observation to intervention.
Up to this point, we have primarily evaluated the data. During cleaning, we begin editing it.
This distinction is critical in production environments. Editing introduces judgment. Judgment introduces bias. And bias, if undocumented, becomes structural.
Cross-Industry Impact of Data Cleaning Decisions
Across domains such as banking transactions, insurance claims, retail purchases, healthcare records, agricultural yield data, and large-scale digital platforms, data cleaning determines what qualifies as acceptable evidence.
Is a missing value treated as zero?
Is an extreme transaction classified as fraud or removed as noise?
Is a duplicate record an operational error or a legitimate reversal?
What appears to be “noise” may represent rare but important signal.
What appears to be “outlier behavior” may be the very pattern a risk model needs to detect.
In machine learning systems, data cleaning directly influences:
- Feature distributions
- Model bias
- Prediction stability
- Generalization performance
- Regulatory explainability
Cleaning is not merely technical preprocessing. It is a functional and strategic decision layer within the data lifecycle.
A Deliberate Approach to Cleaning Operations
The operations covered below — dropping null values, filling missing data, replacing values, removing duplicates, resetting indexes, renaming columns, converting data types, and handling outliers — are not just syntactic tools in Pandas.
They are mechanisms for reshaping reality.
Each operation must be applied deliberately, with an understanding of its technical consequences and business implications. When implemented carefully, cleaning improves reliability and clarity. When applied casually, it introduces distortion that surfaces later as unexplained model drift, inconsistent KPIs, or regulatory questions.
That is why, in mature analytics and AI environments, data cleaning is not treated as a routine step. It is treated as a controlled intervention.
Let’s walk through each of these operations, carefully and deliberately.
Dropping Null Values: df.dropna()
df.dropna()
Dropping nulls feels efficient. It removes incomplete rows and simplifies analysis.
But missing data is rarely random.
When we drop rows with null values, we may be systematically removing:
- Low-income customers
- Incomplete applications
- Rare edge cases
- Operational failures
In regulated environments, dropping nulls without justification can create hidden bias.
The real question is not “Can we drop nulls?” It is “What pattern do these nulls represent?”
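A minimal sketch of answering that question, using hypothetical column names: before committing to `dropna()`, inspect which rows it would discard and whether the missingness clusters in one segment.

```python
import pandas as pd

# Hypothetical data: income is missing mostly for one customer segment
df = pd.DataFrame({
    "income": [25000.0, None, 48000.0, None, 90000.0],
    "segment": ["low", "low", "mid", "low", "high"],
})

# Rows that df.dropna() would silently discard
dropped = df[df["income"].isnull()]
kept = df.dropna(subset=["income"])

# Check whether missingness clusters in one segment before dropping
missing_by_segment = dropped["segment"].value_counts()
```

Here both discarded rows come from the "low" segment — exactly the systematic removal described above.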
Filling Null Values: df.fillna(value)
df.fillna(0)
Filling missing values appears constructive. It preserves dataset size.
But replacing null with zero changes meaning.
Is zero a real value? Or is it a placeholder that now behaves like truth?
Many model distortions originate from inappropriate imputation decisions made casually.
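A small sketch of the difference, with made-up amounts: zero-filling shifts the distribution, while a flagged median fill at least records that imputation happened.

```python
import pandas as pd

df = pd.DataFrame({"amount": [100.0, None, 300.0, None]})

# Zero-fill: the two imputed zeros drag the mean from 200 down to 100
zero_filled = df["amount"].fillna(0)

# Flag first, then fill with the median so imputation stays visible
df["amount_was_missing"] = df["amount"].isnull()
median_filled = df["amount"].fillna(df["amount"].median())
```

Neither choice is "correct" in general; the point is that each fill strategy is an assumption that should be recorded, not a neutral repair.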
Forward Fill Method: df.ffill()
df.ffill()
Forward fill assumes continuity. This may make sense in time-series data like stock prices or sensor readings. It may not make sense for customer risk scores or claim statuses.
Forward propagation embeds the assumption that the last observed value remains valid.
That assumption must be defensible.
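A small illustration with invented sensor readings, where carrying the last value forward is plausible. (Note that in pandas 2.x, `fillna(method='ffill')` is deprecated in favor of `ffill()`.)

```python
import pandas as pd

# Sensor readings with gaps; forward fill assumes the last reading stays valid
readings = pd.Series([20.0, None, None, 23.0])

filled = readings.ffill()  # carries 20.0 forward into both gaps
```

The same call applied to a claim-status column would silently assert that a status never changed between observations — precisely the assumption that must be defensible.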
Replacing Values: df.replace(old_value, new_value)
df.replace("N/A", None)
Replacement looks harmless. Standardizing values is necessary. But replacement logic often spreads through pipelines without documentation.
Replacing “UNKNOWN” with null removes nuance.
Replacing categories merges distinctions that may matter later.
Cleaning should simplify without oversimplifying.
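One way to keep replacement logic documented, sketched with hypothetical status codes: hold the mapping in a single reviewable constant instead of scattering `replace()` calls through the pipeline.

```python
import pandas as pd

df = pd.DataFrame({"status": ["ACTIVE", "N/A", "UNKNOWN", "active"]})

# One explicit, reviewable mapping; "UNKNOWN" is deliberately kept
# distinct from missing because it may carry meaning
STATUS_MAP = {"N/A": None, "active": "ACTIVE"}
df["status"] = df["status"].replace(STATUS_MAP)
```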
Removing Duplicates: df.drop_duplicates()
df.drop_duplicates()
Duplicates are rarely accidental.
In financial systems, duplicates may represent:
- Reversals
- Adjustments
- Legitimate repeat transactions
- Parallel submissions
Dropping duplicates without understanding origin can erase valid behavior.
Deduplication requires business context, not just syntax.
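A sketch with invented transactions showing why the deduplication key matters: exact-duplicate removal and key-based removal give different answers, and only business context says which is right.

```python
import pandas as pd

df = pd.DataFrame({
    "txn_id": ["T1", "T1", "T2"],
    "type":   ["payment", "reversal", "payment"],
    "amount": [100.0, -100.0, 50.0],
})

# Exact-duplicate removal keeps the reversal: the two T1 rows differ
exact = df.drop_duplicates()

# Deduplicating on txn_id alone would erase the reversal entirely
by_id = df.drop_duplicates(subset=["txn_id"])
```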
Resetting Index: df.reset_index(drop=True)
df.reset_index(drop=True)
Resetting the index seems cosmetic. But the index often carries structural meaning — transaction order, ingestion sequence, original identifiers.
Dropping the index can remove traceability if done carelessly.
Cleaning should preserve lineage.
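A small sketch of preserving that lineage: calling `reset_index()` without `drop=True` keeps the original index as a column, so traceability survives the cosmetic cleanup. The identifiers here are hypothetical.

```python
import pandas as pd

# Hypothetical source row identifiers carried in the index
df = pd.DataFrame({"amount": [10, 20, 30]}, index=[101, 205, 333])

# Keep the old index as a column instead of discarding it
traced = df.reset_index().rename(columns={"index": "source_row_id"})
```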
Renaming Columns: df.rename(columns={'old_name': 'new_name'})
df.rename(columns={'amt': 'amount'})
Renaming improves clarity. It also standardizes vocabulary. However, renaming must remain consistent across systems. If two pipelines rename the same column differently, reconciliation becomes fragile.
Column names are contracts between teams.
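One lightweight way to honor that contract, sketched with hypothetical column names: define a single shared mapping that every pipeline imports, so renames cannot drift apart between systems.

```python
import pandas as pd

# Shared vocabulary, defined once and reused by every pipeline
COLUMN_CONTRACT = {"amt": "amount", "cust_id": "customer_id"}

df = pd.DataFrame({"amt": [10.0], "cust_id": ["C1"]})
df = df.rename(columns=COLUMN_CONTRACT)
```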
Setting Column as Index: df.set_index('column')
df.set_index('transaction_id')
Setting an index changes how data is accessed and joined. If the chosen column is not truly unique, future merges can silently fail or duplicate records.
Indexing decisions should be deliberate, not convenient.
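A defensive sketch: verify uniqueness before indexing, since pandas will happily set a non-unique index that later joins inherit.

```python
import pandas as pd

df = pd.DataFrame({
    "transaction_id": ["T1", "T2", "T2"],
    "amount": [10.0, 20.0, 20.0],
})

# Guard the uniqueness assumption instead of trusting it
if df["transaction_id"].is_unique:
    df = df.set_index("transaction_id")
else:
    # Surface the offending identifiers for review
    duplicated_ids = df.loc[
        df["transaction_id"].duplicated(), "transaction_id"
    ].tolist()
```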
Converting Data Types: df['column'].astype('int64')
df['amount'] = df['amount'].astype('int64')
Type conversion enforces structure. But forcing a type may hide malformed values. Coercion can mask anomalies that inspection would otherwise surface.
Type enforcement should follow validation, not precede it.
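A sketch of validating first: coerce to numeric, inspect what failed, and only then enforce the type.

```python
import pandas as pd

df = pd.DataFrame({"amount": ["100", "200.5", "N/A"]})

# astype('int64') here would raise; coercion surfaces the bad value instead
numeric = pd.to_numeric(df["amount"], errors="coerce")
bad_values = df.loc[numeric.isnull(), "amount"].tolist()
```

The malformed "N/A" is now visible for review rather than being silently forced into a number or crashing the pipeline.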
Handling Outliers with Quantiles
df = df[
(df['column'] < df['column'].quantile(0.95)) &
(df['column'] > df['column'].quantile(0.05))
]
Outlier removal is one of the most controversial cleaning steps.
Extreme values may represent:
- Fraud
- Operational breakdown
- Rare but legitimate events
- Data corruption
Removing them improves statistical neatness but may reduce real-world accuracy.
Especially in risk modeling, outliers are often the signal.
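A more conservative alternative, sketched with toy amounts: flag outliers instead of deleting them, so a risk team can still see the extreme transaction.

```python
import pandas as pd

df = pd.DataFrame({"amount": [10.0, 12.0, 11.0, 9.0, 10000.0]})

# Flag instead of drop: downstream consumers decide what to exclude
lower = df["amount"].quantile(0.05)
upper = df["amount"].quantile(0.95)
df["is_outlier"] = ~df["amount"].between(lower, upper)
```

No rows are removed; the extreme value stays in the dataset, marked for scrutiny rather than erased.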
A Unified Cleaning Workflow (End-to-End Example)
Below is a single structured cleaning workflow that reflects how experienced teams clean data responsibly.
import pandas as pd
df = pd.read_csv("transactions.csv")
# Rename columns first so every later step uses the standard vocabulary
df = df.rename(columns={'amt': 'amount'})
# Replace placeholder values before type coercion
df = df.replace("N/A", None)
# Coerce to numeric; malformed values become NaN instead of raising
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
# Preserve rows but flag missing data
df['missing_amount_flag'] = df['amount'].isnull()
# Remove exact duplicates after review
df = df.drop_duplicates()
# Enforce types only after validation (to_numeric already yields float64)
df['amount'] = df['amount'].astype('float64')
# Handle outliers conservatively
lower = df['amount'].quantile(0.05)
upper = df['amount'].quantile(0.95)
df = df[(df['amount'] >= lower) & (df['amount'] <= upper)]
# Reset index for consistency
df = df.reset_index(drop=True)
print("Cleaning completed.")
This workflow does not aim for perfection. It aims for defensibility.
Why Data Cleaning Is Often Where Bias Is Introduced
Data cleaning feels technical, but its impact is deeply behavioral.
When we clean data, we reshape its distribution. We change how often categories appear. We remove records that look unusual. We smooth extremes. We standardize irregularities. Each of these actions makes the dataset more orderly, but also more curated.
And curation always carries perspective.
Edge cases disappear. Rare events shrink. Minority segments lose representation. Outliers that once signaled risk, fraud, operational failure, or innovation may quietly vanish. The dataset begins to look stable, predictable, and statistically comfortable.
Then that cleaned dataset enters production.
At that point, the cleaning logic becomes invisible. It runs automatically. It is rarely questioned. Over time, teams forget which assumptions were introduced and why. The cleaned version of reality becomes the only version anyone sees.
Most analytical distortions that surface months later are not caused by complex models or advanced algorithms. They originate from cleaning rules that were written quickly, never revisited, and slowly hardened into system behavior.
Bias rarely begins in modeling. It begins in preparation.
Closing Thoughts and What Comes Next
Data cleaning is not about making data look tidy. It is about making decisions.
It forces us to decide which imperfections we are willing to tolerate, which inconsistencies we will standardize, and which records we are comfortable excluding. Those decisions shape every metric, every dashboard, and every prediction that follows.
In Part 5: Data Manipulation in Data Transformation, we move to the next layer of influence. We will examine how raw operational data is converted into engineered features, derived metrics, and structured signals that power analytics platforms and machine learning systems. This is where transformation logic begins to amplify or constrain what the data can ultimately reveal.
If this perspective resonated with your experience, consider engaging with the series. Clap to support thoughtful and responsible data practices, follow along as we move from foundational manipulation to advanced transformation, share it with teams managing production data pipelines and machine learning systems, and use the comments to reflect on cleaning decisions you’ve had to explain, defend, or revisit in real-world environments.
Because the most influential data manipulation does not happen during modeling.
It happens quietly, and often permanently, during cleaning.
Note: Article content contains the views of the contributing authors and not Towards AI.