Part 5: Data Manipulation in Data Transformation

Last Updated on March 11, 2026 by Editorial Team

Author(s): Raj kumar

Originally published on Towards AI.


By the time we reach transformation in a data pipeline, the dataset usually appears stable. It has been imported with structure, inspected with skepticism, selected with intent, and cleaned through deliberate intervention. At this stage, many teams feel that the foundational risk has already been addressed. The data looks controlled. The inconsistencies have been handled. The structure makes sense.

But this is precisely where a different kind of influence begins.

Transformation is not about correcting data. It is about defining what the data will ultimately represent. Raw operational records rarely align directly with analytical meaning. Banking systems store transactions, balances, and events. Insurance platforms record claims and policy activity. Retail systems log orders and returns. Healthcare systems track encounters and treatments. None of these are analytical constructs by themselves.

Analytics operates on derived representations. Risk bands. Exposure ratios. Behavioral segments. Aggregated metrics. Engineered features.

Those representations do not emerge automatically from models. They are constructed through transformation.

And once constructed, they become structural. In banking and financial services, transformation logic influences how creditworthiness is summarized, how fraud thresholds are calculated, and how regulatory capital metrics are derived. In insurance, it shapes how exposure is measured and how premium fairness is evaluated. In retail and digital platforms, it determines how customer behavior is interpreted and prioritized. In healthcare systems, it influences how patient risk is categorized.

Transformation is often described as feature engineering. That description is incomplete. It is more accurately a layer of interpretation embedded directly into data systems.

Once transformation logic moves from experimentation into production pipelines, dashboards, and feature stores, it becomes difficult to question. The derived representation gradually replaces the raw reality.

This part of the series examines data transformation as a structural decision layer — one that shapes signal, bias, interpretability, and long-term system behavior far more than most teams initially recognize.

Data Transformation Techniques

1. Apply Function to Column

When you need to transform values in a single column, apply() with a lambda function is the go-to tool. This is useful for mathematical transformations, string manipulations, or any custom logic you need to apply row by row.

import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({'column': [1, 2, 3, 4, 5]})

# Apply function to double each value
df['new_column'] = df['column'].apply(lambda x: x*2)
print(df)

Output:

column new_column
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
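
The same pattern covers string manipulation. A small sketch with a hypothetical name column:

# Hypothetical string column: strip whitespace and normalize capitalization
df_names = pd.DataFrame({'name': ['  alice ', 'BOB', ' carol']})
df_names['clean_name'] = df_names['name'].apply(lambda s: s.strip().title())
print(df_names)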

2. Apply Function to Multiple Columns

When you need to apply the same transformation to several columns at once, select those columns and use apply(). This is more concise than applying the function to each column separately.

# Create sample data with multiple columns
df = pd.DataFrame({
'col1': [1, 2, 3, 4],
'col2': [5, 6, 7, 8]
})

# Apply function to multiple columns
df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda x: x*2)
print(df)

Output:

col1 col2
0 2 10
1 4 12
2 6 14
3 8 16
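
For simple element-wise arithmetic like this, pandas' vectorized operators are usually faster than apply(); this one-liner is equivalent:

# Vectorized equivalent of the apply() call above - usually faster for arithmetic
df[['col1', 'col2']] = df[['col1', 'col2']] * 2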

3. Map Values

map() is perfect for replacing values based on a dictionary mapping. This is commonly used for label encoding, categorizing values, or replacing codes with meaningful names. It's cleaner and faster than multiple if-else statements.

# Create sample data
df = pd.DataFrame({'column': ['old1', 'old2', 'old1', 'old2']})

# Map old values to new values
df['column'] = df['column'].map({'old1': 'new1', 'old2': 'new2'})
print(df)

Output:

column
0 new1
1 new2
2 new1
3 new2
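
One caveat worth remembering: map() returns NaN for any value that is missing from the dictionary. If you want to keep unmapped values unchanged, fall back on the original column:

# Values absent from the mapping become NaN
df = pd.DataFrame({'column': ['old1', 'old2', 'old3']})
mapped = df['column'].map({'old1': 'new1', 'old2': 'new2'})

# Keep unmapped values ('old3') instead of NaN
df['column'] = mapped.fillna(df['column'])
print(df)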

4. Binning

Binning (also called discretization) converts continuous numerical data into categorical bins. This is useful for grouping ages into ranges, income into brackets, or any continuous variable into meaningful categories. pd.cut() creates intervals and assigns labels.

# Create sample data
df = pd.DataFrame({'column': [5, 15, 35, 60, 85, 95]})

# Bin the data into categories
df['binned'] = pd.cut(df['column'],
bins=[0, 25, 50, 75, 100],
labels=['Low', 'Medium', 'High', 'Very High'])
print(df)

Output:

column binned
0 5 Low
1 15 Low
2 35 Medium
3 60 High
4 85 Very High
5 95 Very High
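
pd.cut() uses fixed bin edges, so bins can hold very different numbers of rows. When you want equal-frequency bins instead, pd.qcut() splits at quantiles:

# Quantile-based binning: each bin holds roughly the same number of rows
df['quantile_binned'] = pd.qcut(df['column'], q=2, labels=['Bottom half', 'Top half'])
print(df)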

5. One-Hot Encoding

One-hot encoding converts categorical variables into binary columns (0s and 1s). Each unique category becomes its own column. This is essential for machine learning algorithms that require numerical input. pd.get_dummies() automates this process.

# Create sample data
df = pd.DataFrame({'categorical_column': ['A', 'B', 'A', 'C', 'B']})

# One-hot encode
df_encoded = pd.get_dummies(df, columns=['categorical_column'])
print(df_encoded)

Output:

categorical_column_A categorical_column_B categorical_column_C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
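
Two options worth knowing here. Recent pandas versions return boolean columns from pd.get_dummies(); passing dtype=int restores the 0/1 output shown above. And for linear models, drop_first=True removes one redundant column to avoid multicollinearity:

# dtype=int forces 0/1 output; drop_first=True drops the redundant first category
df_encoded = pd.get_dummies(df, columns=['categorical_column'],
                            drop_first=True, dtype=int)
print(df_encoded)  # only the _B and _C columns remain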

6. Normalize Data

Normalization (Min-Max scaling) rescales data to the range [0, 1]. This keeps all features on a comparable scale for distance-based algorithms like KNN or neural networks. The formula: (x - min) / (max - min).

# Create sample data
df = pd.DataFrame({'values': [10, 20, 30, 40, 50]})

# Normalize data
df['normalized'] = (df['values'] - df['values'].min()) / (df['values'].max() - df['values'].min())
print(df)

Output:

values normalized
0 10 0.00
1 20 0.25
2 30 0.50
3 40 0.75
4 50 1.00
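
In production pipelines the same scaling is usually done with scikit-learn, so the fitted min and max can be reused on unseen data. A minimal sketch, assuming scikit-learn is installed:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler once so the same min/max can be applied to new data later
scaler = MinMaxScaler()
df['normalized_sklearn'] = scaler.fit_transform(df[['values']]).ravel()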

7. Standardize Data

Standardization (Z-score normalization) transforms data to have mean 0 and standard deviation 1. This is crucial for algorithms sensitive to feature scales, such as SVM, logistic regression, and PCA. Formula: (x - mean) / std. Note that pandas' .std() uses the sample standard deviation (ddof=1) by default.

# Create sample data
df = pd.DataFrame({'values': [10, 20, 30, 40, 50]})

# Standardize data
df['standardized'] = (df['values'] - df['values'].mean()) / df['values'].std()
print(df)

Output:

values standardized
0 10 -1.264911
1 20 -0.632456
2 30 0.000000
3 40 0.632456
4 50 1.264911
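
Note that scikit-learn's StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample version (ddof=1), so the two outputs differ slightly. A minimal sketch, assuming scikit-learn is installed:

from sklearn.preprocessing import StandardScaler

# StandardScaler uses the population std (ddof=0), so its values differ
# slightly from the pandas version above
scaler = StandardScaler()
df['standardized_sklearn'] = scaler.fit_transform(df[['values']]).ravel()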

8. Log Transformation

Log transformation reduces right-skewed distributions and handles exponential growth patterns. It’s particularly useful for financial data, population growth, or any data with multiplicative relationships. Use np.log() for natural log.

# Create sample data with skewed distribution
df = pd.DataFrame({'values': [1, 10, 100, 1000, 10000]})

# Apply log transformation
df['log_transformed'] = np.log(df['values'])
print(df)

Output:

values log_transformed
0 1 0.000000
1 10 2.302585
2 100 4.605170
3 1000 6.907755
4 10000 9.210340
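
One caveat: np.log() is undefined at zero and for negative values. For count-like data that can contain zeros, np.log1p() (which computes log(1 + x)) is a common safer alternative, and np.expm1() inverts it:

# log1p maps 0 to 0 safely: log1p(x) == log(1 + x)
df['log1p_transformed'] = np.log1p(df['values'])

# expm1 reverses log1p exactly
recovered = np.expm1(df['log1p_transformed'])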

9. Exponential Transformation

Exponential transformation is the inverse of log transformation. It’s used to reverse log-transformed data or to model exponential growth. np.exp() raises e to the power of each value.

# Create sample data
df = pd.DataFrame({'values': [0, 1, 2, 3, 4]})

# Apply exponential transformation
df['exp_transformed'] = np.exp(df['values'])
print(df)

Output:

values exp_transformed
0 0 1.000000
1 1 2.718282
2 2 7.389056
3 3 20.085537
4 4 54.598150

10. Square Root Transformation

Square root transformation is a moderate transformation that reduces right-skewness (less aggressive than log). It’s useful for count data or when you want to reduce the impact of outliers without compressing the data as much as log does.

# Create sample data
df = pd.DataFrame({'values': [1, 4, 9, 16, 25, 100]})

# Apply square root transformation
df['sqrt_transformed'] = np.sqrt(df['values'])
print(df)

Output:

values sqrt_transformed
0 1 1.0
1 4 2.0
2 9 3.0
3 16 4.0
4 25 5.0
5 100 10.0

PRODUCTION-READY CODE

Now, here's a comprehensive, production-grade implementation that brings all of these transformations together:

import pandas as pd
import numpy as np
from typing import Dict, List, Optional
import warnings
warnings.filterwarnings('ignore')


class DataTransformer:
    """
    A comprehensive data transformation toolkit for data preprocessing.

    This class provides all common data transformation methods used in
    data science and machine learning pipelines.
    """

    def __init__(self, df: pd.DataFrame):
        """
        Initialize the DataTransformer with a DataFrame.

        Args:
            df: Input pandas DataFrame
        """
        self.df = df.copy()
        self.original_df = df.copy()

    def apply_function_single_column(self,
                                     column: str,
                                     function,
                                     new_column: Optional[str] = None
                                     ) -> pd.DataFrame:
        """
        Apply a custom function to a single column.

        Args:
            column: Name of the column to transform
            function: Lambda or function to apply
            new_column: Name for the new column (if None, overwrites original)

        Returns:
            Transformed DataFrame
        """
        target_col = new_column if new_column else column
        self.df[target_col] = self.df[column].apply(function)
        print(f"✓ Applied function to '{column}' → '{target_col}'")
        return self.df

    def apply_function_multiple_columns(self,
                                        columns: List[str],
                                        function
                                        ) -> pd.DataFrame:
        """
        Apply a custom function to multiple columns simultaneously.

        Args:
            columns: List of column names to transform
            function: Lambda or function to apply

        Returns:
            Transformed DataFrame
        """
        self.df[columns] = self.df[columns].apply(function)
        print(f"✓ Applied function to columns: {columns}")
        return self.df

    def map_values(self,
                   column: str,
                   mapping: Dict
                   ) -> pd.DataFrame:
        """
        Map values in a column using a dictionary.

        Args:
            column: Column name to map
            mapping: Dictionary of old_value: new_value pairs

        Returns:
            Transformed DataFrame
        """
        self.df[column] = self.df[column].map(mapping)
        print(f"✓ Mapped values in '{column}' using {len(mapping)} mappings")
        return self.df

    def bin_data(self,
                 column: str,
                 bins: List,
                 labels: List[str],
                 new_column: Optional[str] = None
                 ) -> pd.DataFrame:
        """
        Bin continuous data into categorical bins.

        Args:
            column: Column to bin
            bins: List of bin edges
            labels: List of labels for each bin
            new_column: Name for binned column (default: 'column_binned')

        Returns:
            Transformed DataFrame
        """
        target_col = new_column if new_column else f'{column}_binned'
        self.df[target_col] = pd.cut(self.df[column], bins=bins, labels=labels)
        print(f"✓ Binned '{column}' into {len(labels)} categories → '{target_col}'")
        return self.df

    def one_hot_encode(self,
                       columns: List[str],
                       drop_first: bool = False
                       ) -> pd.DataFrame:
        """
        One-hot encode categorical columns.

        Args:
            columns: List of categorical columns to encode
            drop_first: Whether to drop first category (avoid multicollinearity)

        Returns:
            Transformed DataFrame with encoded columns
        """
        self.df = pd.get_dummies(self.df, columns=columns, drop_first=drop_first)
        print(f"✓ One-hot encoded: {columns} (drop_first={drop_first})")
        return self.df

    def normalize(self,
                  columns: List[str],
                  suffix: str = '_normalized'
                  ) -> pd.DataFrame:
        """
        Normalize columns to [0, 1] range using Min-Max scaling.

        Args:
            columns: List of columns to normalize
            suffix: Suffix for new normalized columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            min_val = self.df[col].min()
            max_val = self.df[col].max()
            self.df[f'{col}{suffix}'] = (self.df[col] - min_val) / (max_val - min_val)
        print(f"✓ Normalized columns: {columns}")
        return self.df

    def standardize(self,
                    columns: List[str],
                    suffix: str = '_standardized'
                    ) -> pd.DataFrame:
        """
        Standardize columns to mean=0, std=1 using Z-score normalization.

        Args:
            columns: List of columns to standardize
            suffix: Suffix for new standardized columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            mean_val = self.df[col].mean()
            std_val = self.df[col].std()
            self.df[f'{col}{suffix}'] = (self.df[col] - mean_val) / std_val
        print(f"✓ Standardized columns: {columns}")
        return self.df

    def log_transform(self,
                      columns: List[str],
                      suffix: str = '_log'
                      ) -> pd.DataFrame:
        """
        Apply natural logarithm transformation.

        Args:
            columns: List of columns to transform
            suffix: Suffix for new log-transformed columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            # Add a small constant to avoid log(0)
            self.df[f'{col}{suffix}'] = np.log(self.df[col] + 1e-10)
        print(f"✓ Log transformed columns: {columns}")
        return self.df

    def exp_transform(self,
                      columns: List[str],
                      suffix: str = '_exp'
                      ) -> pd.DataFrame:
        """
        Apply exponential transformation.

        Args:
            columns: List of columns to transform
            suffix: Suffix for new exp-transformed columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            self.df[f'{col}{suffix}'] = np.exp(self.df[col])
        print(f"✓ Exponential transformed columns: {columns}")
        return self.df

    def sqrt_transform(self,
                       columns: List[str],
                       suffix: str = '_sqrt'
                       ) -> pd.DataFrame:
        """
        Apply square root transformation.

        Args:
            columns: List of columns to transform
            suffix: Suffix for new sqrt-transformed columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            self.df[f'{col}{suffix}'] = np.sqrt(self.df[col])
        print(f"✓ Square root transformed columns: {columns}")
        return self.df

    def reset(self) -> pd.DataFrame:
        """
        Reset DataFrame to original state.

        Returns:
            Original DataFrame
        """
        self.df = self.original_df.copy()
        print("✓ Reset to original DataFrame")
        return self.df

    def get_dataframe(self) -> pd.DataFrame:
        """Get the current transformed DataFrame."""
        return self.df

    def summary(self) -> None:
        """Print summary statistics of the current DataFrame."""
        print("\n" + "="*60)
        print("DATAFRAME SUMMARY")
        print("="*60)
        print(f"Shape: {self.df.shape}")
        print(f"Columns: {list(self.df.columns)}")
        print("\nData types:")
        print(self.df.dtypes)
        print("\nFirst few rows:")
        print(self.df.head())
        print("="*60 + "\n")

# ============================================================================
# DEMONSTRATION: All Transformations in Action
# ============================================================================
def main():
    """
    Comprehensive demonstration of all data transformation techniques.
    """
    print("\n" + "="*60)
    print("DATA TRANSFORMATION TOOLKIT - COMPLETE DEMONSTRATION")
    print("="*60 + "\n")

    # Create comprehensive sample dataset
    np.random.seed(42)
    sample_data = pd.DataFrame({
        'age': [22, 35, 45, 28, 52, 61, 33, 29, 44, 38],
        'income': [35000, 55000, 85000, 42000, 95000, 120000, 48000, 51000, 78000, 62000],
        'score': [65, 72, 88, 70, 91, 95, 68, 74, 85, 79],
        'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
        'status': ['active', 'inactive', 'active', 'pending', 'active', 'inactive', 'pending', 'active', 'inactive', 'active'],
        'count': [5, 15, 35, 60, 85, 95, 12, 42, 68, 78]
    })

    print("Original Dataset:")
    print(sample_data)
    print("\n")

    # Initialize transformer
    transformer = DataTransformer(sample_data)

    # 1. Apply function to single column (double the age)
    print("\n1. APPLY FUNCTION TO SINGLE COLUMN")
    print("-" * 60)
    transformer.apply_function_single_column('age', lambda x: x * 2, 'age_doubled')

    # 2. Apply function to multiple columns (convert to thousands)
    print("\n2. APPLY FUNCTION TO MULTIPLE COLUMNS")
    print("-" * 60)
    transformer.apply_function_multiple_columns(['income', 'count'], lambda x: x / 1000)

    # 3. Map values (map status codes)
    print("\n3. MAP VALUES")
    print("-" * 60)
    status_mapping = {'active': 1, 'inactive': 0, 'pending': 2}
    transformer.map_values('status', status_mapping)

    # 4. Binning (age groups)
    print("\n4. BINNING DATA")
    print("-" * 60)
    transformer.bin_data('age',
                         bins=[0, 30, 50, 70, 100],
                         labels=['Young', 'Middle', 'Senior', 'Elderly'],
                         new_column='age_group')

    # 5. One-hot encoding
    print("\n5. ONE-HOT ENCODING")
    print("-" * 60)
    transformer.one_hot_encode(['category'], drop_first=False)

    # 6. Normalize
    print("\n6. NORMALIZE DATA")
    print("-" * 60)
    transformer.normalize(['score'])

    # 7. Standardize
    print("\n7. STANDARDIZE DATA")
    print("-" * 60)
    transformer.standardize(['age_doubled'])

    # 8. Log transformation
    print("\n8. LOG TRANSFORMATION")
    print("-" * 60)
    transformer.log_transform(['income'])

    # 9. Exponential transformation
    print("\n9. EXPONENTIAL TRANSFORMATION")
    print("-" * 60)
    transformer.exp_transform(['count'])

    # 10. Square root transformation
    print("\n10. SQUARE ROOT TRANSFORMATION")
    print("-" * 60)
    transformer.sqrt_transform(['score'])

    # Display final summary
    print("\n" + "="*60)
    print("FINAL TRANSFORMED DATASET")
    print("="*60)
    transformer.summary()

    # Save to CSV for further use
    output_file = 'transformed_data.csv'
    transformer.get_dataframe().to_csv(output_file, index=False)
    print(f"✓ Saved transformed data to: {output_file}\n")

    return transformer.get_dataframe()

if __name__ == "__main__":
    # Run the comprehensive demonstration
    final_df = main()

    print("\n" + "="*60)
    print("TRANSFORMATION COMPLETE!")
    print("="*60)
    print("\nYou can now use the DataTransformer class for your own data:")
    print("\n    transformer = DataTransformer(your_dataframe)")
    print("    transformer.normalize(['column1', 'column2'])")
    print("    result = transformer.get_dataframe()")
    print("\n" + "="*60 + "\n")

Key Takeaways for Production Use:

  1. Error Handling: Add try-except blocks for robust production code
  2. Logging: Implement proper logging instead of print statements
  3. Validation: Check for null values, data types before transformations
  4. Documentation: Keep detailed docstrings for maintenance
  5. Testing: Write unit tests for each transformation method
  6. Scalability: For large datasets, consider using Dask or PySpark
  7. Pipeline Integration: This class can be integrated with sklearn pipelines (see the sketch after this list)
  8. Version Control: Track which transformations were applied and when
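
To make takeaway 7 concrete, here is a minimal sketch of wrapping a DataFrame-level transformation as a scikit-learn pipeline step. It assumes scikit-learn is available; add_log_income is a hypothetical helper, not a method of the class above:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical helper: adds a log-transformed income column
def add_log_income(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['income_log'] = np.log1p(df['income'])
    return df

# FunctionTransformer lets a DataFrame-level function act as a pipeline step
pipeline = Pipeline([
    ('log_income', FunctionTransformer(add_log_income)),
])

# Using the sample_data DataFrame from the demonstration above
transformed = pipeline.fit_transform(sample_data)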

This guide covers the core data transformation techniques with practical examples and production-ready code.

Why Transformation Requires Governance

Transformation logic frequently begins in exploratory analysis. Over time, it migrates into production pipelines. Eventually, it becomes embedded into feature stores, dashboards, regulatory reports, and AI systems.


At that point, transformation defines how reality is represented inside the organization.

When KPIs shift unexpectedly, when model performance drifts, or when regulatory reconciliation becomes difficult, reviewing transformation history often reveals the underlying cause.

Models operate within the representation we construct. Transformation defines that representation.

Closing Thoughts

Across the first five parts of this series, a consistent pattern should now be visible. Importing established the structure. Inspection challenged trust. Selection narrowed the population under consideration. Cleaning intervened and reshaped distributions. Transformation now defines how that curated data will be represented.

Each stage moves further away from raw operational reality and closer to analytical abstraction.

Every derived feature carries an assumption. Every scaling choice encodes influence. Every threshold introduces a boundary. Every ratio reflects a belief about how two variables relate to each other. These decisions often feel technical, but their consequences are organizational.

Models do not create meaning independently. They optimize within the representation provided to them. When performance shifts, when KPIs drift, or when regulatory questions surface, the root cause is frequently traced not to algorithmic complexity, but to changes in representation introduced earlier in the lifecycle.

In regulated environments such as banking and financial services, this affects capital calculations, credit decisions, and reporting consistency. In insurance, it influences pricing and exposure modeling. In retail, it shapes segmentation and targeting. In healthcare, it impacts risk stratification and operational prioritization.

Transformation logic often begins as a practical step in analysis. Over time, it becomes embedded into production pipelines. Eventually, it becomes invisible infrastructure — assumed to be correct because it has existed long enough.

That is why transformation deserves the same scrutiny as modeling. It defines the space within which models operate.

In Part 6, we will move into string and text operations — an area where seemingly small inconsistencies introduce instability and ambiguity long before numeric features are even considered.

If you work in analytics, data engineering, regulated reporting, or production AI systems, this layer is worth examining carefully. And if this perspective reflects challenges you have encountered in real systems, continue the journey through the remaining parts of the series and share your own experiences. These are not theoretical concerns. They are patterns that surface repeatedly in production environments.

If this discussion resonates with your work, consider supporting the series so it reaches others operating in similar environments. A thoughtful clap helps surface responsible data practices to a wider audience. Following the series ensures you do not miss the next parts as we move deeper into production-grade data manipulation. And if you have encountered transformation decisions that later required explanation, correction, or defense, share them in the comments. The most valuable insights in data systems rarely come from theory alone — they emerge from real implementation experience.

Because the most consequential data decisions are rarely the most visible ones. They are the representational choices that quietly define what reality looks like inside our systems.
