Part 5: Data Manipulation in Data Transformation

Last Updated on March 11, 2026 by Editorial Team

Author(s): Raj kumar

Originally published on Towards AI.


By the time we reach transformation in a data pipeline, the dataset usually appears stable. It has been imported with structure, inspected with skepticism, selected with intent, and cleaned through deliberate intervention. At this stage, many teams feel that the foundational risk has already been addressed. The data looks controlled. The inconsistencies have been handled. The structure makes sense.

But this is precisely where a different kind of influence begins.

Transformation is not about correcting data. It is about defining what the data will ultimately represent. Raw operational records rarely align directly with analytical meaning. Banking systems store transactions, balances, and events. Insurance platforms record claims and policy activity. Retail systems log orders and returns. Healthcare systems track encounters and treatments. None of these are analytical constructs by themselves.

Analytics operates on derived representations. Risk bands. Exposure ratios. Behavioral segments. Aggregated metrics. Engineered features.

Those representations do not emerge automatically from models. They are constructed through transformation.

And once constructed, they become structural. In banking and financial services, transformation logic influences how creditworthiness is summarized, how fraud thresholds are calculated, and how regulatory capital metrics are derived. In insurance, it shapes how exposure is measured and how premium fairness is evaluated. In retail and digital platforms, it determines how customer behavior is interpreted and prioritized. In healthcare systems, it influences how patient risk is categorized.

Transformation is often described as feature engineering. That description is incomplete. It is more accurately a layer of interpretation embedded directly into data systems.

Once transformation logic moves from experimentation into production pipelines, dashboards, and feature stores, it becomes difficult to question. The derived representation gradually replaces the raw reality.

This part of the series examines data transformation as a structural decision layer — one that shapes signal, bias, interpretability, and long-term system behavior far more than most teams initially recognize.

Data Transformation Techniques

1. Apply Function to Column

When you need to transform values in a single column, apply() with a lambda function is the go-to tool. This is useful for mathematical transformations, string manipulations, or any custom logic you need to apply row by row.

import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({'column': [1, 2, 3, 4, 5]})

# Apply function to double each value
df['new_column'] = df['column'].apply(lambda x: x*2)
print(df)

Output:

column new_column
0 1 2
1 2 4
2 3 6
3 4 8
4 5 10
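
The same pattern covers string manipulation. A small sketch with a hypothetical name column:

# Hypothetical string column: strip whitespace and normalize capitalization
df_names = pd.DataFrame({'name': ['  alice ', 'BOB', ' carol']})
df_names['clean_name'] = df_names['name'].apply(lambda s: s.strip().title())
print(df_names)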

2. Apply Function to Multiple Columns

When you need to apply the same transformation to several columns at once, select those columns and use apply(). This is more concise than applying the function to each column separately.

# Create sample data with multiple columns
df = pd.DataFrame({
'col1': [1, 2, 3, 4],
'col2': [5, 6, 7, 8]
})

# Apply function to multiple columns
df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda x: x*2)
print(df)

Output:

col1 col2
0 2 10
1 4 12
2 6 14
3 8 16
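
For simple element-wise arithmetic like this, pandas' vectorized operators are usually faster than apply(); this one-liner is equivalent:

# Vectorized equivalent of the apply() call above - usually faster for arithmetic
df[['col1', 'col2']] = df[['col1', 'col2']] * 2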

3. Map Values

map() is perfect for replacing values based on a dictionary mapping. This is commonly used for label encoding, categorizing values, or replacing codes with meaningful names. It's cleaner and faster than multiple if-else statements.

# Create sample data
df = pd.DataFrame({'column': ['old1', 'old2', 'old1', 'old2']})

# Map old values to new values
df['column'] = df['column'].map({'old1': 'new1', 'old2': 'new2'})
print(df)

Output:

column
0 new1
1 new2
2 new1
3 new2
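
One caveat worth remembering: map() returns NaN for any value that is missing from the dictionary. If you want to keep unmapped values unchanged, fall back on the original column:

# Values absent from the mapping become NaN
df = pd.DataFrame({'column': ['old1', 'old2', 'old3']})
mapped = df['column'].map({'old1': 'new1', 'old2': 'new2'})

# Keep unmapped values ('old3') instead of NaN
df['column'] = mapped.fillna(df['column'])
print(df)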

4. Binning

Binning (also called discretization) converts continuous numerical data into categorical bins. This is useful for grouping ages into ranges, income into brackets, or any continuous variable into meaningful categories. pd.cut() creates intervals and assigns labels.

# Create sample data
df = pd.DataFrame({'column': [5, 15, 35, 60, 85, 95]})

# Bin the data into categories
df['binned'] = pd.cut(df['column'],
bins=[0, 25, 50, 75, 100],
labels=['Low', 'Medium', 'High', 'Very High'])
print(df)

Output:

column binned
0 5 Low
1 15 Low
2 35 Medium
3 60 High
4 85 Very High
5 95 Very High
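
pd.cut() uses fixed bin edges, so bins can hold very different numbers of rows. When you want equal-frequency bins instead, pd.qcut() splits at quantiles:

# Quantile-based binning: each bin holds roughly the same number of rows
df['quantile_binned'] = pd.qcut(df['column'], q=2, labels=['Bottom half', 'Top half'])
print(df)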

5. One-Hot Encoding

One-hot encoding converts categorical variables into binary columns (0s and 1s). Each unique category becomes its own column. This is essential for machine learning algorithms that require numerical input. pd.get_dummies() automates this process.

# Create sample data
df = pd.DataFrame({'categorical_column': ['A', 'B', 'A', 'C', 'B']})

# One-hot encode
df_encoded = pd.get_dummies(df, columns=['categorical_column'])
print(df_encoded)

Output:

categorical_column_A categorical_column_B categorical_column_C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
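
Two options worth knowing here. Recent pandas versions return boolean columns from pd.get_dummies(); passing dtype=int restores the 0/1 output shown above. And for linear models, drop_first=True removes one redundant column to avoid multicollinearity:

# dtype=int forces 0/1 output; drop_first=True drops the redundant first category
df_encoded = pd.get_dummies(df, columns=['categorical_column'],
                            drop_first=True, dtype=int)
print(df_encoded)  # only the _B and _C columns remain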

6. Normalize Data

Normalization (Min-Max scaling) rescales data to the range [0, 1]. This keeps all features on a comparable scale for distance-based algorithms like KNN or neural networks. The formula: (x - min) / (max - min).

# Create sample data
df = pd.DataFrame({'values': [10, 20, 30, 40, 50]})

# Normalize data
df['normalized'] = (df['values'] - df['values'].min()) / (df['values'].max() - df['values'].min())
print(df)

Output:

values normalized
0 10 0.00
1 20 0.25
2 30 0.50
3 40 0.75
4 50 1.00
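
In production pipelines the same scaling is usually done with scikit-learn, so the fitted min and max can be reused on unseen data. A minimal sketch, assuming scikit-learn is installed:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler once so the same min/max can be applied to new data later
scaler = MinMaxScaler()
df['normalized_sklearn'] = scaler.fit_transform(df[['values']]).ravel()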

7. Standardize Data

Standardization (Z-score normalization) transforms data to have mean 0 and standard deviation 1. This is crucial for algorithms sensitive to feature scales, such as SVM, logistic regression, and PCA. Formula: (x - mean) / std. Note that pandas' .std() uses the sample standard deviation (ddof=1) by default.

# Create sample data
df = pd.DataFrame({'values': [10, 20, 30, 40, 50]})

# Standardize data
df['standardized'] = (df['values'] - df['values'].mean()) / df['values'].std()
print(df)

Output:

values standardized
0 10 -1.264911
1 20 -0.632456
2 30 0.000000
3 40 0.632456
4 50 1.264911
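
Note that scikit-learn's StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample version (ddof=1), so the two outputs differ slightly. A minimal sketch, assuming scikit-learn is installed:

from sklearn.preprocessing import StandardScaler

# StandardScaler uses the population std (ddof=0), so its values differ
# slightly from the pandas version above
scaler = StandardScaler()
df['standardized_sklearn'] = scaler.fit_transform(df[['values']]).ravel()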

8. Log Transformation

Log transformation reduces right-skewed distributions and handles exponential growth patterns. It’s particularly useful for financial data, population growth, or any data with multiplicative relationships. Use np.log() for natural log.

# Create sample data with skewed distribution
df = pd.DataFrame({'values': [1, 10, 100, 1000, 10000]})

# Apply log transformation
df['log_transformed'] = np.log(df['values'])
print(df)

Output:

values log_transformed
0 1 0.000000
1 10 2.302585
2 100 4.605170
3 1000 6.907755
4 10000 9.210340
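
One caveat: np.log() is undefined at zero and for negative values. For count-like data that can contain zeros, np.log1p() (which computes log(1 + x)) is a common safer alternative, and np.expm1() inverts it:

# log1p maps 0 to 0 safely: log1p(x) == log(1 + x)
df['log1p_transformed'] = np.log1p(df['values'])

# expm1 reverses log1p exactly
recovered = np.expm1(df['log1p_transformed'])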

9. Exponential Transformation

Exponential transformation is the inverse of log transformation. It’s used to reverse log-transformed data or to model exponential growth. np.exp() raises e to the power of each value.

# Create sample data
df = pd.DataFrame({'values': [0, 1, 2, 3, 4]})

# Apply exponential transformation
df['exp_transformed'] = np.exp(df['values'])
print(df)

Output:

values exp_transformed
0 0 1.000000
1 1 2.718282
2 2 7.389056
3 3 20.085537
4 4 54.598150

10. Square Root Transformation

Square root transformation is a moderate transformation that reduces right-skewness (less aggressive than log). It’s useful for count data or when you want to reduce the impact of outliers without compressing the data as much as log does.

# Create sample data
df = pd.DataFrame({'values': [1, 4, 9, 16, 25, 100]})

# Apply square root transformation
df['sqrt_transformed'] = np.sqrt(df['values'])
print(df)

Output:

values sqrt_transformed
0 1 1.0
1 4 2.0
2 9 3.0
3 16 4.0
4 25 5.0
5 100 10.0

PRODUCTION-READY CODE

Now, here's a comprehensive, production-grade implementation that brings all of these transformations together:

import pandas as pd
import numpy as np
from typing import Dict, List, Optional
import warnings
warnings.filterwarnings('ignore')


class DataTransformer:
    """
    A comprehensive data transformation toolkit for data preprocessing.

    This class provides all common data transformation methods used in
    data science and machine learning pipelines.
    """

    def __init__(self, df: pd.DataFrame):
        """
        Initialize the DataTransformer with a DataFrame.

        Args:
            df: Input pandas DataFrame
        """
        self.df = df.copy()
        self.original_df = df.copy()

    def apply_function_single_column(self,
                                     column: str,
                                     function,
                                     new_column: Optional[str] = None
                                     ) -> pd.DataFrame:
        """
        Apply a custom function to a single column.

        Args:
            column: Name of the column to transform
            function: Lambda or function to apply
            new_column: Name for the new column (if None, overwrites original)

        Returns:
            Transformed DataFrame
        """
        target_col = new_column if new_column else column
        self.df[target_col] = self.df[column].apply(function)
        print(f"✓ Applied function to '{column}' → '{target_col}'")
        return self.df

    def apply_function_multiple_columns(self,
                                        columns: List[str],
                                        function
                                        ) -> pd.DataFrame:
        """
        Apply a custom function to multiple columns simultaneously.

        Args:
            columns: List of column names to transform
            function: Lambda or function to apply

        Returns:
            Transformed DataFrame
        """
        self.df[columns] = self.df[columns].apply(function)
        print(f"✓ Applied function to columns: {columns}")
        return self.df

    def map_values(self,
                   column: str,
                   mapping: Dict
                   ) -> pd.DataFrame:
        """
        Map values in a column using a dictionary.

        Args:
            column: Column name to map
            mapping: Dictionary of old_value: new_value pairs

        Returns:
            Transformed DataFrame
        """
        self.df[column] = self.df[column].map(mapping)
        print(f"✓ Mapped values in '{column}' using {len(mapping)} mappings")
        return self.df

    def bin_data(self,
                 column: str,
                 bins: List,
                 labels: List[str],
                 new_column: Optional[str] = None
                 ) -> pd.DataFrame:
        """
        Bin continuous data into categorical bins.

        Args:
            column: Column to bin
            bins: List of bin edges
            labels: List of labels for each bin
            new_column: Name for binned column (default: 'column_binned')

        Returns:
            Transformed DataFrame
        """
        target_col = new_column if new_column else f'{column}_binned'
        self.df[target_col] = pd.cut(self.df[column], bins=bins, labels=labels)
        print(f"✓ Binned '{column}' into {len(labels)} categories → '{target_col}'")
        return self.df

    def one_hot_encode(self,
                       columns: List[str],
                       drop_first: bool = False
                       ) -> pd.DataFrame:
        """
        One-hot encode categorical columns.

        Args:
            columns: List of categorical columns to encode
            drop_first: Whether to drop first category (avoid multicollinearity)

        Returns:
            Transformed DataFrame with encoded columns
        """
        self.df = pd.get_dummies(self.df, columns=columns, drop_first=drop_first)
        print(f"✓ One-hot encoded: {columns} (drop_first={drop_first})")
        return self.df

    def normalize(self,
                  columns: List[str],
                  suffix: str = '_normalized'
                  ) -> pd.DataFrame:
        """
        Normalize columns to [0, 1] range using Min-Max scaling.

        Args:
            columns: List of columns to normalize
            suffix: Suffix for new normalized columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            min_val = self.df[col].min()
            max_val = self.df[col].max()
            self.df[f'{col}{suffix}'] = (self.df[col] - min_val) / (max_val - min_val)
        print(f"✓ Normalized columns: {columns}")
        return self.df

    def standardize(self,
                    columns: List[str],
                    suffix: str = '_standardized'
                    ) -> pd.DataFrame:
        """
        Standardize columns to mean=0, std=1 using Z-score normalization.

        Args:
            columns: List of columns to standardize
            suffix: Suffix for new standardized columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            mean_val = self.df[col].mean()
            std_val = self.df[col].std()
            self.df[f'{col}{suffix}'] = (self.df[col] - mean_val) / std_val
        print(f"✓ Standardized columns: {columns}")
        return self.df

    def log_transform(self,
                      columns: List[str],
                      suffix: str = '_log'
                      ) -> pd.DataFrame:
        """
        Apply natural logarithm transformation.

        Args:
            columns: List of columns to transform
            suffix: Suffix for new log-transformed columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            # Add a small constant to avoid log(0)
            self.df[f'{col}{suffix}'] = np.log(self.df[col] + 1e-10)
        print(f"✓ Log transformed columns: {columns}")
        return self.df

    def exp_transform(self,
                      columns: List[str],
                      suffix: str = '_exp'
                      ) -> pd.DataFrame:
        """
        Apply exponential transformation.

        Args:
            columns: List of columns to transform
            suffix: Suffix for new exp-transformed columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            self.df[f'{col}{suffix}'] = np.exp(self.df[col])
        print(f"✓ Exponential transformed columns: {columns}")
        return self.df

    def sqrt_transform(self,
                       columns: List[str],
                       suffix: str = '_sqrt'
                       ) -> pd.DataFrame:
        """
        Apply square root transformation.

        Args:
            columns: List of columns to transform
            suffix: Suffix for new sqrt-transformed columns

        Returns:
            Transformed DataFrame
        """
        for col in columns:
            self.df[f'{col}{suffix}'] = np.sqrt(self.df[col])
        print(f"✓ Square root transformed columns: {columns}")
        return self.df

    def reset(self) -> pd.DataFrame:
        """
        Reset DataFrame to original state.

        Returns:
            Original DataFrame
        """
        self.df = self.original_df.copy()
        print("✓ Reset to original DataFrame")
        return self.df

    def get_dataframe(self) -> pd.DataFrame:
        """Get the current transformed DataFrame."""
        return self.df

    def summary(self) -> None:
        """Print summary statistics of the current DataFrame."""
        print("\n" + "="*60)
        print("DATAFRAME SUMMARY")
        print("="*60)
        print(f"Shape: {self.df.shape}")
        print(f"Columns: {list(self.df.columns)}")
        print("\nData types:")
        print(self.df.dtypes)
        print("\nFirst few rows:")
        print(self.df.head())
        print("="*60 + "\n")

# ============================================================================
# DEMONSTRATION: All Transformations in Action
# ============================================================================
def main():
    """
    Comprehensive demonstration of all data transformation techniques.
    """
    print("\n" + "="*60)
    print("DATA TRANSFORMATION TOOLKIT - COMPLETE DEMONSTRATION")
    print("="*60 + "\n")

    # Create comprehensive sample dataset
    np.random.seed(42)
    sample_data = pd.DataFrame({
        'age': [22, 35, 45, 28, 52, 61, 33, 29, 44, 38],
        'income': [35000, 55000, 85000, 42000, 95000, 120000, 48000, 51000, 78000, 62000],
        'score': [65, 72, 88, 70, 91, 95, 68, 74, 85, 79],
        'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C'],
        'status': ['active', 'inactive', 'active', 'pending', 'active', 'inactive', 'pending', 'active', 'inactive', 'active'],
        'count': [5, 15, 35, 60, 85, 95, 12, 42, 68, 78]
    })

    print("Original Dataset:")
    print(sample_data)
    print("\n")

    # Initialize transformer
    transformer = DataTransformer(sample_data)

    # 1. Apply function to single column (double the age)
    print("\n1. APPLY FUNCTION TO SINGLE COLUMN")
    print("-" * 60)
    transformer.apply_function_single_column('age', lambda x: x * 2, 'age_doubled')

    # 2. Apply function to multiple columns (convert to thousands)
    print("\n2. APPLY FUNCTION TO MULTIPLE COLUMNS")
    print("-" * 60)
    transformer.apply_function_multiple_columns(['income', 'count'], lambda x: x / 1000)

    # 3. Map values (map status codes)
    print("\n3. MAP VALUES")
    print("-" * 60)
    status_mapping = {'active': 1, 'inactive': 0, 'pending': 2}
    transformer.map_values('status', status_mapping)

    # 4. Binning (age groups)
    print("\n4. BINNING DATA")
    print("-" * 60)
    transformer.bin_data('age',
                         bins=[0, 30, 50, 70, 100],
                         labels=['Young', 'Middle', 'Senior', 'Elderly'],
                         new_column='age_group')

    # 5. One-hot encoding
    print("\n5. ONE-HOT ENCODING")
    print("-" * 60)
    transformer.one_hot_encode(['category'], drop_first=False)

    # 6. Normalize
    print("\n6. NORMALIZE DATA")
    print("-" * 60)
    transformer.normalize(['score'])

    # 7. Standardize
    print("\n7. STANDARDIZE DATA")
    print("-" * 60)
    transformer.standardize(['age_doubled'])

    # 8. Log transformation
    print("\n8. LOG TRANSFORMATION")
    print("-" * 60)
    transformer.log_transform(['income'])

    # 9. Exponential transformation
    print("\n9. EXPONENTIAL TRANSFORMATION")
    print("-" * 60)
    transformer.exp_transform(['count'])

    # 10. Square root transformation
    print("\n10. SQUARE ROOT TRANSFORMATION")
    print("-" * 60)
    transformer.sqrt_transform(['score'])

    # Display final summary
    print("\n" + "="*60)
    print("FINAL TRANSFORMED DATASET")
    print("="*60)
    transformer.summary()

    # Save to CSV for further use
    output_file = 'transformed_data.csv'
    transformer.get_dataframe().to_csv(output_file, index=False)
    print(f"✓ Saved transformed data to: {output_file}\n")

    return transformer.get_dataframe()

if __name__ == "__main__":
    # Run the comprehensive demonstration
    final_df = main()

    print("\n" + "="*60)
    print("TRANSFORMATION COMPLETE!")
    print("="*60)
    print("\nYou can now use the DataTransformer class for your own data:")
    print("\n    transformer = DataTransformer(your_dataframe)")
    print("    transformer.normalize(['column1', 'column2'])")
    print("    result = transformer.get_dataframe()")
    print("\n" + "="*60 + "\n")

Key Takeaways for Production Use:

  1. Error Handling: Add try-except blocks for robust production code
  2. Logging: Implement proper logging instead of print statements
  3. Validation: Check for null values, data types before transformations
  4. Documentation: Keep detailed docstrings for maintenance
  5. Testing: Write unit tests for each transformation method
  6. Scalability: For large datasets, consider using Dask or PySpark
  7. Pipeline Integration: This class can be integrated with sklearn pipelines (see the sketch after this list)
  8. Version Control: Track which transformations were applied and when
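
To make takeaway 7 concrete, here is a minimal sketch of wrapping a DataFrame-level transformation as a scikit-learn pipeline step. It assumes scikit-learn is available; add_log_income is a hypothetical helper, not a method of the class above:

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical helper: adds a log-transformed income column
def add_log_income(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df['income_log'] = np.log1p(df['income'])
    return df

# FunctionTransformer lets a DataFrame-level function act as a pipeline step
pipeline = Pipeline([
    ('log_income', FunctionTransformer(add_log_income)),
])

# Using the sample_data DataFrame from the demonstration above
transformed = pipeline.fit_transform(sample_data)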

This guide covers the core data transformation techniques with practical examples and production-ready code.

Why Transformation Requires Governance

Transformation logic frequently begins in exploratory analysis. Over time, it migrates into production pipelines. Eventually, it becomes embedded into feature stores, dashboards, regulatory reports, and AI systems.


At that point, transformation defines how reality is represented inside the organization.

When KPIs shift unexpectedly, when model performance drifts, or when regulatory reconciliation becomes difficult, reviewing transformation history often reveals the underlying cause.

Models operate within the representation we construct. Transformation defines that representation.

Closing Thoughts

Across the first five parts of this series, a consistent pattern should now be visible. Importing established the structure. Inspection challenged trust. Selection narrowed the population under consideration. Cleaning intervened and reshaped distributions. Transformation now defines how that curated data will be represented.

Each stage moves further away from raw operational reality and closer to analytical abstraction.

Every derived feature carries an assumption. Every scaling choice encodes influence. Every threshold introduces a boundary. Every ratio reflects a belief about how two variables relate to each other. These decisions often feel technical, but their consequences are organizational.

Models do not create meaning independently. They optimize within the representation provided to them. When performance shifts, when KPIs drift, or when regulatory questions surface, the root cause is frequently traced not to algorithmic complexity, but to changes in representation introduced earlier in the lifecycle.

In regulated environments such as banking and financial services, this affects capital calculations, credit decisions, and reporting consistency. In insurance, it influences pricing and exposure modeling. In retail, it shapes segmentation and targeting. In healthcare, it impacts risk stratification and operational prioritization.

Transformation logic often begins as a practical step in analysis. Over time, it becomes embedded into production pipelines. Eventually, it becomes invisible infrastructure — assumed to be correct because it has existed long enough.

That is why transformation deserves the same scrutiny as modeling. It defines the space within which models operate.

In Part 6, we will move into string and text operations — an area where seemingly small inconsistencies introduce instability and ambiguity long before numeric features are even considered.

If you work in analytics, data engineering, regulated reporting, or production AI systems, this layer is worth examining carefully. And if this perspective reflects challenges you have encountered in real systems, continue the journey through the remaining parts of the series and share your own experiences. These are not theoretical concerns. They are patterns that surface repeatedly in production environments.

If this discussion resonates with your work, consider supporting the series so it reaches others operating in similar environments. A thoughtful clap helps surface responsible data practices to a wider audience. Following the series ensures you do not miss the next parts as we move deeper into production-grade data manipulation. And if you have encountered transformation decisions that later required explanation, correction, or defense, share them in the comments. The most valuable insights in data systems rarely come from theory alone — they emerge from real implementation experience.

Because the most consequential data decisions are rarely the most visible ones. They are the representational choices that quietly define what reality looks like inside our systems.
