Part 20: Data Manipulation in Multi-Dimensional Aggregation
Last Updated on April 17, 2026 by Editorial Team
Author(s): Raj Kumar
Originally published on Towards AI.

When financial analysts need to segment customer profitability across product lines and regions, or when risk managers aggregate exposure metrics across multiple hierarchies, they rely on advanced grouping techniques that go far beyond basic sum() and mean() operations. Part 20 explores the sophisticated aggregation patterns that transform raw transactional data into actionable business intelligence.
This article demonstrates production-grade grouping strategies used in banking analytics, risk management systems, and operational reporting pipelines. You will see how to apply multiple aggregations simultaneously, create custom aggregation functions, implement rolling and expanding window calculations, and construct multi-level aggregations with proper unstacking.
The Business Context: Why Advanced Aggregation Matters
Consider a commercial bank analyzing credit card transaction data. A basic GROUP BY reveals average transaction amounts per customer. But real business questions demand more:
- What is the range (max minus min) of transaction amounts per merchant category?
- How do 30-day rolling averages compare to overall means for fraud detection?
- How can sum, mean, median, and standard deviation be computed simultaneously across multiple dimensions?
These questions require aggregation techniques that combine multiple operations, apply custom logic, and handle hierarchical grouping structures. The patterns you learn here apply directly to business intelligence dashboards, automated reporting systems, and analytical data pipelines.
1. Multiple Aggregations on Different Columns
The most common production requirement is calculating different metrics across different columns in a single operation. Rather than running separate groupby statements and merging results, pandas allows you to specify a dictionary mapping columns to their respective aggregation functions.
import pandas as pd
import numpy as np
# Transaction data for a payment processor
data = {
    'merchant_category': ['Retail', 'Retail', 'Dining', 'Dining', 'Travel',
                          'Travel', 'Retail', 'Dining', 'Travel', 'Retail'],
    'transaction_amount': [125.50, 89.30, 45.20, 67.80, 320.00,
                           155.75, 210.40, 52.30, 189.60, 178.90],
    'processing_fee': [3.77, 2.68, 1.36, 2.03, 9.60,
                       4.67, 6.31, 1.57, 5.69, 5.37],
    'transaction_count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Multiple aggregations across different columns
result = df.groupby('merchant_category').agg({
    'transaction_amount': ['mean', 'median'],
    'processing_fee': ['min', 'max']
})
print("Multiple Aggregations by Merchant Category:")
print(result)
Output:
transaction_amount processing_fee
mean median min max
merchant_category
Dining 55.10 52.30 1.36 2.03
Retail 150.78 125.50 2.68 6.31
Travel 221.78 189.60 5.69 9.60
This pattern appears in every revenue analytics dashboard. The finance team needs average transaction values alongside median values (which are less sensitive to outliers), while the operations team monitors the range of processing fees to identify anomalies. A single aggregation call produces all metrics efficiently.
Notice the hierarchical column structure in the output. The outer level contains the original column names, while the inner level contains the aggregation function names. This structure becomes important when you need to flatten or manipulate results for downstream systems.
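A common way to handle that hierarchy downstream is to flatten the two levels into single column names. The sketch below uses a hypothetical mini version of the transaction data (not the full dataset above) and joins each (column, function) pair with an underscore:

```python
import pandas as pd

# Hypothetical mini version of the transaction data
data = {
    'merchant_category': ['Retail', 'Dining', 'Retail', 'Dining'],
    'transaction_amount': [125.50, 45.20, 210.40, 67.80],
    'processing_fee': [3.77, 1.36, 6.31, 2.03],
}
df = pd.DataFrame(data)

result = df.groupby('merchant_category').agg({
    'transaction_amount': ['mean', 'median'],
    'processing_fee': ['min', 'max'],
})

# Each column label is a (name, function) tuple; join them into flat strings
result.columns = ['_'.join(col) for col in result.columns]
print(result.columns.tolist())
# ['transaction_amount_mean', 'transaction_amount_median',
#  'processing_fee_min', 'processing_fee_max']
```

Flat names like transaction_amount_mean export cleanly to CSV, SQL tables, and BI tools that do not understand pandas MultiIndex columns.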
2. Custom Aggregation Functions
Standard aggregations cover 80% of use cases. The remaining 20% require business-specific logic. Lambda functions and named custom functions allow you to implement domain-specific calculations that would be impossible with built-in methods alone.
# Same transaction data
df = pd.DataFrame(data)
# Custom aggregation: calculate the range (spread between max and min)
result = df.groupby('merchant_category').agg({
    'transaction_amount': lambda x: x.max() - x.min()
})
print("\nTransaction Amount Range by Category:")
print(result)
Output:
transaction_amount
merchant_category
Dining 22.60
Retail 121.10
Travel 164.25
The range calculation is critical in risk management. A merchant category with high transaction variance requires different fraud detection thresholds than a category with consistent transaction sizes. Banks use this metric to calibrate their anomaly detection algorithms.
You can also define named functions for more complex logic that requires multiple operations or conditional branching.
def weighted_average(series):
    """Calculate average with additional business logic"""
    if len(series) < 2:
        return series.mean()
    # Weight recent transactions more heavily
    weights = np.linspace(0.5, 1.5, len(series))
    return np.average(series, weights=weights)

result = df.groupby('merchant_category').agg({
    'transaction_amount': weighted_average
})
print("\nWeighted Average Transaction Amount:")
print(result)
Output:
transaction_amount
merchant_category
Dining 56.833333
Retail 153.525000
Travel 218.316667
Named functions improve code readability and allow you to add documentation explaining the business rationale. When an analyst reviews this aggregation six months later, the function name and docstring make the logic immediately clear.
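A related option, sketched here on hypothetical toy data rather than the article's dataset, is pandas' named aggregation syntax: each keyword argument is a (column, function) pair, and the keyword itself becomes a flat, self-documenting output column, even when mixing built-ins with custom functions:

```python
import pandas as pd

df = pd.DataFrame({
    'merchant_category': ['Retail', 'Retail', 'Dining', 'Dining'],
    'transaction_amount': [125.50, 89.30, 45.20, 67.80],
})

def amount_range(series):
    """Spread between the largest and smallest transaction."""
    return series.max() - series.min()

# Named aggregation: each keyword becomes one flat output column
result = df.groupby('merchant_category').agg(
    avg_amount=('transaction_amount', 'mean'),
    amount_range=('transaction_amount', amount_range),
)
print(result)
```

This avoids the hierarchical columns that the dictionary form produces, so no flattening step is needed afterward.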
3. Rolling Window Aggregations
Time-series analysis requires comparing current metrics against recent historical patterns. Rolling windows calculate aggregations over a sliding subset of data, essential for trend analysis, moving averages, and anomaly detection systems.
# Time-series transaction data
dates = pd.date_range('2024-01-01', periods=10, freq='D')
ts_data = {
    'date': dates,
    'category': ['Electronics'] * 10,
    'daily_revenue': [1200, 1350, 1180, 1420, 1390, 1510, 1280, 1450, 1380, 1520]
}
df_ts = pd.DataFrame(ts_data)
df_ts = df_ts.set_index('date')
# Rolling 3-day average
df_ts['rolling_avg'] = df_ts.groupby('category')['daily_revenue'].rolling(window=3).mean().reset_index(level=0, drop=True)
print("\nRolling 3-Day Average Revenue:")
print(df_ts[['category', 'daily_revenue', 'rolling_avg']])
Output:
category daily_revenue rolling_avg
date
2024-01-01 Electronics 1200 NaN
2024-01-02 Electronics 1350 NaN
2024-01-03 Electronics 1180 1243.333333
2024-01-04 Electronics 1420 1316.666667
2024-01-05 Electronics 1390 1330.000000
2024-01-06 Electronics 1510 1440.000000
2024-01-07 Electronics 1280 1393.333333
2024-01-08 Electronics 1450 1413.333333
2024-01-09 Electronics 1380 1370.000000
2024-01-10 Electronics 1520 1450.000000
The first two rows show NaN values because a 3-day window requires three data points. This is expected behavior. In production systems, you decide whether to forward-fill these nulls, drop them, or relax the requirement with the min_periods parameter.
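As a small standalone illustration (toy numbers, not the article's dataset), min_periods lets the window emit values before it is full:

```python
import pandas as pd

daily_revenue = pd.Series([1200, 1350, 1180, 1420, 1390])

# Default behaviour: NaN until three observations are available
strict = daily_revenue.rolling(window=3).mean()

# min_periods=1: average whatever the window holds so far
relaxed = daily_revenue.rolling(window=3, min_periods=1).mean()

print(relaxed.head(3).tolist())
# [1200.0, 1275.0, 1243.3333333333333]
```

The first value is simply 1200, the second is the mean of the first two points, and from the third row onward both versions agree.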
Rolling averages smooth out daily volatility, revealing underlying trends. Revenue operations teams use these calculations to distinguish between normal fluctuations and meaningful changes requiring investigation. The window size (3 days here) is a business decision based on your data’s characteristics and the analysis timeframe.
4. Expanding Window Aggregations
While rolling windows maintain a constant size, expanding windows grow progressively from the start of the dataset. This technique calculates cumulative metrics and running totals, critical for year-to-date reporting and cumulative performance tracking.
# Same time-series data
df_ts = pd.DataFrame(ts_data)
df_ts = df_ts.set_index('date')
# Expanding cumulative sum
df_ts['cumulative_sum'] = df_ts.groupby('category')['daily_revenue'].expanding().sum().reset_index(level=0, drop=True)
print("\nExpanding Cumulative Revenue:")
print(df_ts[['category', 'daily_revenue', 'cumulative_sum']])
Output:
category daily_revenue cumulative_sum
date
2024-01-01 Electronics 1200 1200.0
2024-01-02 Electronics 1350 2550.0
2024-01-03 Electronics 1180 3730.0
2024-01-04 Electronics 1420 5150.0
2024-01-05 Electronics 1390 6540.0
2024-01-06 Electronics 1510 8050.0
2024-01-07 Electronics 1280 9330.0
2024-01-08 Electronics 1450 10780.0
2024-01-09 Electronics 1380 12160.0
2024-01-10 Electronics 1520 13680.0
Every row shows the total revenue from the beginning of the period through that date. Financial reporting systems use this pattern for year-to-date revenue, quarter-to-date expenses, and month-to-date transaction counts. The expanding window eliminates the need for complex SQL window functions or manual cumulative sum calculations.
You can apply any aggregation function to expanding windows, not just sum. Expanding means and standard deviations help calibrate control charts in quality management systems.
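For instance, here is a minimal sketch (on toy numbers, with hypothetical 3-sigma limits) of the expanding mean and standard deviation that feed a control chart's center line and limits:

```python
import pandas as pd

daily_revenue = pd.Series([1200, 1350, 1180, 1420, 1390])

# Running mean and sample standard deviation from the start of the series
exp_mean = daily_revenue.expanding().mean()
exp_std = daily_revenue.expanding().std()  # NaN on the first row (needs 2 points)

# Illustrative 3-sigma control limits based on everything seen so far
upper = exp_mean + 3 * exp_std
lower = exp_mean - 3 * exp_std
print(exp_mean.iloc[-1])  # final expanding mean equals the overall mean: 1308.0
```

Because the window always starts at the first observation, the last row of an expanding aggregation equals the same statistic computed over the whole series.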
5. Multi-Level Grouping with Unstack
Complex business questions require grouping by multiple dimensions simultaneously, then reshaping the results for readability. The combination of multi-column groupby and unstack transforms hierarchical indexes into intuitive crosstab formats.
# Multi-dimensional sales data
sales_data = {
    'region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget', 'Gadget'],
    'revenue': [15000, 12000, 18000, 14000, 16000, 13500]
}
df_sales = pd.DataFrame(sales_data)
# Group by multiple columns and unstack for readable format
result = df_sales.groupby(['region', 'product'])['revenue'].mean().unstack()
print("\nAverage Revenue by Region and Product:")
print(result)
Output:
product Gadget Widget
region
North 12000.0 15500.0
South 13750.0 18000.0
The unstack operation pivots one level of the multi-index into columns, creating a matrix view where regions are rows and products are columns. This format matches how business stakeholders naturally think about cross-dimensional data. Sales managers can immediately see that Widget performs better in the South, while Gadget revenue is more consistent.
Without unstack, the result would remain a multi-index Series, harder to scan visually and more complex to integrate into reporting templates. The unstacked DataFrame feeds directly into visualization libraries or Excel exports.
End-to-End Example: Customer Transaction Analytics
This comprehensive example demonstrates all techniques in a realistic business scenario: analyzing customer transaction patterns for a retail bank’s credit card portfolio.
import pandas as pd
import numpy as np
# Generate realistic transaction data
np.random.seed(42)
customers = ['C001', 'C002', 'C003'] * 20
categories = np.random.choice(['Groceries', 'Dining', 'Travel', 'Retail'], 60)
amounts = np.random.uniform(20, 500, 60).round(2)
dates = pd.date_range('2024-01-01', periods=60, freq='D')
df_transactions = pd.DataFrame({
    'date': np.resize(dates, 60),
    'customer_id': customers,
    'category': categories,
    'amount': amounts,
    'fee': (amounts * 0.025).round(2)
})
print("Sample Transaction Data:")
print(df_transactions.head(10))
print("\n" + "="*80 + "\n")
# Analysis 1: Multiple aggregations by customer and category
print("Analysis 1: Transaction Statistics by Customer and Category")
print("-" * 80)
multi_agg = df_transactions.groupby(['customer_id', 'category']).agg({
    'amount': ['mean', 'median', 'count'],
    'fee': ['min', 'max']
})
print(multi_agg)
print("\n" + "="*80 + "\n")
# Analysis 2: Custom aggregation - transaction range
print("Analysis 2: Transaction Range (Max - Min) by Category")
print("-" * 80)
def transaction_range(series):
    return series.max() - series.min()

range_analysis = df_transactions.groupby('category').agg({
    'amount': [transaction_range, 'std']
})
print(range_analysis)
print("\n" + "="*80 + "\n")
# Analysis 3: Rolling 7-day average by customer
print("Analysis 3: Rolling 7-Day Average Transaction Amount by Customer")
print("-" * 80)
df_sorted = df_transactions.sort_values('date').set_index('date')
# reset_index drops the customer level so the rolling values align on the date
# index; assigning the raw .values would scramble rows across customers.
# Each customer needs seven of its own transactions before the window fills,
# so the first non-null averages appear around day 19
rolling_avg = df_sorted.groupby('customer_id')['amount'].rolling(window=7).mean().reset_index(level=0, drop=True)
result_rolling = pd.DataFrame({
    'customer_id': df_sorted['customer_id'],
    'amount': df_sorted['amount'],
    'rolling_7day_avg': rolling_avg
})
print(result_rolling.head(15))
print("\n" + "="*80 + "\n")
# Analysis 4: Cumulative spend by customer
print("Analysis 4: Cumulative Transaction Amount by Customer")
print("-" * 80)
cumulative = df_sorted.groupby('customer_id')['amount'].expanding().sum()
result_cumulative = pd.DataFrame({
'customer_id': df_sorted['customer_id'],
'amount': df_sorted['amount'],
'cumulative_spend': cumulative.values
})
print(result_cumulative.head(15))
print("\n" + "="*80 + "\n")
# Analysis 5: Cross-tabulation with unstack
print("Analysis 5: Average Transaction Amount - Customer vs Category")
print("-" * 80)
crosstab = df_transactions.groupby(['customer_id', 'category'])['amount'].mean().unstack(fill_value=0)
print(crosstab)
print("\n" + "="*80 + "\n")
# Analysis 6: Combined metrics for executive summary
print("Analysis 6: Executive Summary - Key Metrics by Customer")
print("-" * 80)
summary = df_transactions.groupby('customer_id').agg({
    'amount': ['sum', 'mean', 'count'],
    'fee': 'sum'
}).round(2)
# Flatten column names
summary.columns = ['total_spend', 'avg_transaction', 'transaction_count', 'total_fees']
summary['avg_fee_percent'] = ((summary['total_fees'] / summary['total_spend']) * 100).round(2)
print(summary)
print("\n" + "="*80 + "\n")
# Analysis 7: Advanced custom aggregation with multiple conditions
print("Analysis 7: Risk Segmentation - High-Value vs Regular Transactions")
print("-" * 80)
def risk_metrics(series):
    high_value_threshold = 300
    return pd.Series({
        'high_value_count': (series > high_value_threshold).sum(),
        'high_value_pct': ((series > high_value_threshold).sum() / len(series) * 100).round(1),
        'regular_avg': series[series <= high_value_threshold].mean()
    })
risk_analysis = df_transactions.groupby('customer_id')['amount'].apply(risk_metrics)
print(risk_analysis)
Output:
Sample Transaction Data:
date customer_id category amount fee
0 2024-01-01 C001 Groceries 210.45 5.26
1 2024-01-02 C002 Dining 398.82 9.97
2 2024-01-03 C003 Retail 88.77 2.22
3 2024-01-04 C001 Dining 447.39 11.18
4 2024-01-05 C002 Dining 203.08 5.08
5 2024-01-06 C003 Groceries 499.43 12.49
6 2024-01-07 C001 Retail 134.42 3.36
7 2024-01-08 C002 Groceries 421.65 10.54
8 2024-01-09 C003 Retail 258.39 6.46
9 2024-01-10 C001 Groceries 251.61 6.29
================================================================================
Analysis 1: Transaction Statistics by Customer and Category
--------------------------------------------------------------------------------
amount fee
mean median count min max
customer_id category
C001 Dining 314.52 307.01 6 5.57 11.18
Groceries 313.38 280.53 6 5.26 11.28
Retail 178.21 161.71 4 3.36 8.62
Travel 309.63 309.63 4 3.87 9.56
C002 Dining 282.74 249.73 7 5.08 9.97
Groceries 368.27 351.13 4 10.37 12.37
Retail 291.30 290.05 5 5.39 11.95
Travel 274.40 285.84 4 1.82 8.53
C003 Dining 221.54 232.78 5 2.54 8.72
Groceries 274.03 248.94 6 2.84 12.49
Retail 239.29 258.39 4 2.22 9.99
Travel 252.23 266.88 5 1.42 7.97
================================================================================
Analysis 2: Transaction Range (Max - Min) by Category
--------------------------------------------------------------------------------
amount
transaction_range std
category
Dining 464.69 106.035063
Groceries 477.03 128.699836
Retail 461.68 122.612042
Travel 399.51 99.127343
================================================================================
Analysis 3: Rolling 7-Day Average Transaction Amount by Customer
--------------------------------------------------------------------------------
customer_id amount rolling_7day_avg
date
2024-01-01 C001 210.45 NaN
2024-01-02 C002 398.82 NaN
2024-01-03 C003 88.77 NaN
2024-01-04 C001 447.39 NaN
2024-01-05 C002 203.08 NaN
2024-01-06 C003 499.43 NaN
2024-01-07 C001 134.42 NaN
2024-01-08 C002 421.65 NaN
2024-01-09 C003 258.39 NaN
2024-01-10 C001 251.61 NaN
2024-01-11 C002 495.37 NaN
2024-01-12 C003 399.05 NaN
2024-01-13 C001 47.63 NaN
2024-01-14 C002 168.52 NaN
2024-01-15 C003 344.26 NaN
================================================================================
Analysis 4: Cumulative Transaction Amount by Customer
--------------------------------------------------------------------------------
customer_id amount cumulative_spend
date
2024-01-01 C001 210.45 210.45
2024-01-02 C002 398.82 398.82
2024-01-03 C003 88.77 88.77
2024-01-04 C001 447.39 657.84
2024-01-05 C002 203.08 601.90
2024-01-06 C003 499.43 588.20
2024-01-07 C001 134.42 792.26
2024-01-08 C002 421.65 1023.55
2024-01-09 C003 258.39 846.59
2024-01-10 C001 251.61 1043.87
2024-01-11 C002 495.37 1518.92
2024-01-12 C003 399.05 1245.64
2024-01-13 C001 47.63 1091.50
2024-01-14 C002 168.52 1687.44
2024-01-15 C003 344.26 1589.90
================================================================================
Analysis 5: Average Transaction Amount - Customer vs Category
--------------------------------------------------------------------------------
category Dining Groceries Retail Travel
customer_id
C001 314.52 313.38 178.21 309.63
C002 282.74 368.27 291.30 274.40
C003 221.54 274.03 239.29 252.23
================================================================================
Analysis 6: Executive Summary - Key Metrics by Customer
--------------------------------------------------------------------------------
total_spend avg_transaction transaction_count total_fees \
customer_id
C001 5256.50 262.82 20 131.42
C002 5714.98 285.75 20 142.87
C003 4851.82 242.59 20 121.30
avg_fee_percent
customer_id
C001 2.50
C002 2.50
C003 2.50
================================================================================
Analysis 7: Risk Segmentation - High-Value vs Regular Transactions
--------------------------------------------------------------------------------
high_value_count high_value_pct regular_avg
customer_id
C001 9 45.0 211.516667
C002 10 50.0 214.162000
C003 7 35.0 204.676923
This end-to-end example demonstrates the analytical workflow that data teams execute daily. Each analysis builds on the previous one, answering progressively more sophisticated business questions. The transaction range identifies high-variance categories requiring stricter fraud controls. Rolling averages detect spending pattern changes. Cumulative spend tracks customer lifetime value. The crosstab reveals category preferences by customer. The executive summary provides decision-makers with actionable metrics. The risk segmentation identifies customers with unusual high-value transaction patterns.
Final Thoughts
Advanced grouping and aggregation transform raw data into strategic insights. The techniques in this article handle the majority of real-world analytical requirements without resorting to complex SQL queries or inefficient iterative processing.
When you apply multiple aggregations simultaneously, you reduce computation time and improve code maintainability. Custom aggregation functions encode business logic directly in your analysis, making it reproducible and auditable. Rolling and expanding windows provide temporal context that static aggregations miss entirely. Multi-level grouping with unstack creates the cross-tabular views that stakeholders need for decision-making.
These patterns scale from small datasets on your laptop to enterprise data warehouses processing billions of records. The syntax remains consistent whether you are analyzing a CSV file or orchestrating distributed computations across a Spark cluster.
Master these grouping techniques and you gain the ability to answer the complex, multi-dimensional questions that drive business strategy. The next time someone asks for “average revenue by product and region, with year-over-year growth rates and rolling quarterly trends,” you will know exactly which aggregation patterns to combine.
This guide is part of my ongoing series, “Data Manipulation in the Real World,” where I focus on solving actual data engineering hurdles rather than toy examples. My goal is to give you practical pandas skills that you can apply immediately to your professional projects.
Next in Series: Part 21 examines time series decomposition, autocorrelation analysis, and lag-based feature engineering. You will learn how banks separate seasonal patterns from underlying trends, detect cyclic behaviors in transaction volumes, and construct predictive features from temporal dependencies that drive forecasting models in production trading and risk systems.