
The Essential Guide to ML Evaluation Metrics for Regression

Author(s): Ayo Akinkugbe

Originally published on Towards AI.

Photo by Europeana on Unsplash

Introduction

Machine learning models are only as good as our ability to measure them. Though a perfect model isn’t always possible, a good enough model is. But how do we determine what counts as good enough for an ML model? This is where evaluation metrics come into play. There are various metrics for various scenarios, and sometimes for specific tasks and models. In production ML systems, choosing the right metric is important. ML models can be designed to perform a variety of tasks, ranging from regression and classification to unsupervised learning, generative tasks, and reinforcement learning.

Regression is a fundamental machine learning task used to predict continuous outcomes based on one or more predictor variables. Common examples of regression tasks include forecasting sales, predicting house prices, or estimating patient recovery times. Choosing the right metric to evaluate a model’s performance is important. This post provides an exhaustive exploration of regression evaluation metrics, simplifying each with practical case studies. This post covers the following metrics:

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Mean Squared Log Error (MSLE)
  • Root Mean Squared Error (RMSE)
  • Root Mean Squared Log Error (RMSLE)
  • Mean Absolute Percentage Error (MAPE)
  • Symmetric Mean Absolute Percentage Error (sMAPE)
  • Weighted Mean Absolute Percentage Error (wMAPE)
  • Mean Absolute Scaled Error (MASE)
  • Mean Squared Prediction Error (MSPE)
  • Mean Directional Accuracy (MDA)
  • Median Absolute Deviation (MAD)
  • R² Score (Coefficient of Determination)
  • D² Absolute Error Score
  • Mean Poisson Deviance (MPD)
  • Mean Gamma Deviance (MGD)
  • Explained Variance Score
Photo by Birmingham Museums Trust on Unsplash

Mean Absolute Error (MAE)

MAE measures the average absolute difference between predicted values and actual values. It gives equal weight to all errors, regardless of their direction.

MAE = (1/n) · Σ |yᵢ − ŷᵢ|, where n is the number of observations, yᵢ is the actual value, and ŷᵢ is the predicted value.

Imagine you’re predicting house prices. MAE tells you, on average, how many dollars off your predictions are, regardless of whether you overestimated or underestimated.
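As a quick sketch, MAE can be computed with scikit-learn’s mean_absolute_error; the prices below are purely hypothetical:

from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted house prices, in dollars
y_true = [310_000, 455_000, 290_000, 520_000]
y_pred = [300_000, 480_000, 275_000, 510_000]

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: ${mae:,.0f}")  # average absolute miss, in the same units as the target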

Case Study:

A real estate company built a model to predict home prices in Seattle. Their model had an MAE of $25,000. This means, on average, their predictions were off by $25,000 (either too high or too low).

Use When:

  • You want errors to be interpreted in the same units as the output variable
  • Large and small errors should be treated with equal importance
  • Outliers should not have an outsized influence on your evaluation

Significance:

MAE is intuitive and easy to explain to non-technical stakeholders. It’s particularly useful in business contexts where the actual magnitude of error matters, such as financial forecasting.

Mean Squared Error (MSE)

MSE measures the average of the squared differences between predicted and actual values. It penalizes larger errors more heavily than smaller ones.

If you’re predicting delivery times, MSE will penalize a prediction that’s off by 20 minutes much more than one that’s off by 5 minutes.
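A minimal sketch with scikit-learn’s mean_squared_error, using made-up delivery times in minutes:

from sklearn.metrics import mean_squared_error

# Hypothetical delivery times, in minutes
y_true = [30, 45, 60, 25]
y_pred = [35, 40, 80, 27]

mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.1f} minutes squared")  # the single 20-minute miss dominates the average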

Case Study:

A logistics company developed a model to predict package delivery times. Their model had an MSE of 225, meaning the average squared error was 225 minutes². This indicates that some predictions had relatively large errors, which were heavily penalized by the squaring operation.

Use When:

  • Larger errors are more problematic than smaller ones
  • You’re particularly concerned about outliers
  • You’re doing mathematical optimization (MSE has nice mathematical properties which include continuity and a well-behaved derivative)

Significance:

MSE is widely used in statistical modeling and machine learning. Its mathematical properties make it suitable for optimization algorithms, but its squared units can make interpretation challenging for non-technical audiences.

Mean Squared Log Error (MSLE)

MSLE applies the natural logarithm (typically log(1 + x), so that zero values are handled) to the actual and predicted values before calculating the mean squared error. It penalizes underestimation more than overestimation.

If you’re predicting sales volumes, MSLE will penalize you more for predicting 50 units when the actual was 100 (underestimation) than for predicting 150 when the actual was 100 (overestimation).
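The asymmetry is easy to see in a small sketch with scikit-learn’s mean_squared_log_error; the numbers are hypothetical:

from sklearn.metrics import mean_squared_log_error

y_true = [100]

# The absolute error is 50 units in both cases, but underestimation costs more
print(mean_squared_log_error(y_true, [50]))   # ~0.47
print(mean_squared_log_error(y_true, [150]))  # ~0.16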

Case Study:

An e-commerce platform used MSLE to evaluate their sales forecasting model. With highly variable sales volumes (from 10 to 10,000 units), MSLE helped them focus on the relative errors rather than absolute differences, which would have been dominated by high-volume products.

Use When:

  • Target variable spans multiple orders of magnitude
  • You care more about relative errors than absolute errors
  • Underestimation is more problematic than overestimation
  • Data is right-skewed

Significance:

MSLE is particularly valuable for datasets with exponential growth patterns or when the target variable has a wide range of values. It’s commonly used in sales, revenue, and count predictions.

Root Mean Squared Error (RMSE)

RMSE is simply the square root of the Mean Squared Error. It brings the error metric back to the same units as the original data.

If you’re predicting temperature, RMSE tells you the typical size of your error in degrees, but with larger errors penalized more heavily.
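A minimal sketch, taking the square root of scikit-learn’s mean_squared_error on hypothetical temperatures (newer scikit-learn releases also expose a root_mean_squared_error helper):

import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical temperatures, in degrees Celsius
y_true = [21.0, 18.5, 25.0, 30.0]
y_pred = [20.0, 19.0, 22.0, 31.5]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.2f} degrees")  # back in the original units, with large misses weighted more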

Case Study:

A weather forecasting service evaluated their temperature prediction model using RMSE. Their model achieved an RMSE of 2.5°C, meaning that while most predictions were close, some significant misses occurred that increased the overall error.

Use When:

  • You want error in the same units as your target variable (unlike MSE)
  • Larger errors should be penalized more than smaller ones
  • You need a metric that’s widely recognized in your field

Significance:

RMSE is one of the most popular regression metrics and is often the default choice in many applications. It combines the mathematical advantages of MSE with the interpretability of having the same units as the target variable.

Root Mean Squared Log Error (RMSLE)

RMSLE is the square root of the Mean Squared Log Error. It maintains the properties of MSLE but returns values in a scale that’s closer to the original data.

If you’re predicting product demand, RMSLE helps you understand your typical relative error while penalizing underestimation more heavily than overestimation. It is useful when over-forecasting is less costly than under-forecasting.
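Since scikit-learn does not ship RMSLE directly, a common sketch is to take the square root of mean_squared_log_error (the demand figures below are hypothetical):

import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical product demand spanning very different scales
y_true = [12, 950, 10_300]
y_pred = [15, 800, 11_000]

rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
print(f"RMSLE: {rmsle:.3f}")  # roughly the typical relative (log-scale) error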

Case Study:

In a Kaggle competition for store sales prediction, RMSLE was used as the evaluation metric. This allowed competitors to focus on getting the relative scale of predictions right across both high-volume and low-volume products.

Use When:

  • Dealing with data that spans multiple orders of magnitude
  • You want to penalize underestimation more than overestimation
  • You want a more interpretable version of MSLE

Significance:

RMSLE is particularly important in competitions and applications where the target variable has an exponential or power-law distribution, such as sales forecasting, population predictions, or epidemic modeling.

Photo by British Library on Unsplash

Mean Absolute Percentage Error (MAPE)

MAPE measures the average percentage difference between predicted and actual values. It expresses error as a percentage of the actual value.

If you’re forecasting revenue, MAPE tells you the average percentage by which your predictions missed the mark.
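A minimal sketch with scikit-learn’s mean_absolute_percentage_error, which returns a fraction rather than a percentage (the revenue figures are hypothetical):

from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical weekly revenue
y_true = [120_000, 95_000, 150_000]
y_pred = [110_000, 100_000, 160_000]

mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAPE: {mape:.1%}")  # the :.1% format converts the returned fraction to a percentage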

Case Study:

A retail chain used MAPE to evaluate their revenue forecasting model. With a MAPE of 12%, they knew that, on average, their weekly revenue predictions were off by 12% of the actual revenue.

Use When:

  • You want to understand error in percentage terms
  • Comparing performance across different scales
  • Communicating results to business stakeholders

Significance:

MAPE is highly intuitive for business contexts where percentage errors are more meaningful than absolute errors. However, it has limitations when actual values are close to or equal to zero.

Symmetric Mean Absolute Percentage Error (sMAPE)

sMAPE is a variation of MAPE that treats over-forecasting and under-forecasting more symmetrically. It uses the average of the actual and predicted values in the denominator.

sMAPE gives you a percentage error that doesn’t unfairly penalize overestimation or underestimation.
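scikit-learn has no built-in sMAPE, so here is a small NumPy sketch of one common formulation (definitions vary slightly across sources; the data is hypothetical):

import numpy as np

def smape(y_true, y_pred):
    # Absolute error divided by the average of |actual| and |predicted|, expressed as a percentage
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.abs(y_true - y_pred) / denom) * 100

print(smape([100, 200, 0.5], [110, 180, 0.7]))  # bounded above by 200%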

Case Study:

In the M4 Forecasting Competition, sMAPE was one of the primary evaluation metrics. It allowed fair comparison of forecasting methods across diverse datasets with different scales and characteristics.

Use When:

  • You need a percentage error that treats overestimation and underestimation more equally
  • Your actual values might be close to or equal to zero
  • Comparing different forecasting methods

Significance:

sMAPE addresses some of the mathematical limitations of MAPE, particularly when dealing with values near zero or when comparing methods that tend to bias in different directions.

Weighted Mean Absolute Percentage Error (wMAPE)

wMAPE calculates the sum of all absolute errors divided by the sum of all actual values. It effectively weights errors by the magnitude of the actual values.

wMAPE gives more importance to errors in predicting larger values than smaller values.
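A small NumPy sketch of one common wMAPE formulation (the demand numbers are hypothetical):

import numpy as np

def wmape(y_true, y_pred):
    # Sum of absolute errors divided by the sum of (absolute) actual values, as a percentage
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100

# One high-volume and one low-volume product: the high-volume error dominates the score
print(wmape([10_000, 50], [9_000, 100]))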

Case Study:

A manufacturing company used wMAPE to evaluate their inventory forecasting model. This metric gave more weight to high-volume products, which had a greater impact on their overall inventory costs.

Use When:

  • Errors in larger values are more important than errors in smaller values
  • You want to avoid the division-by-zero problem in MAPE
  • Aggregating errors across multiple items with different scales

Significance:

wMAPE is particularly useful in supply chain, inventory management, and financial forecasting where the impact of errors is proportional to the magnitude of the values being predicted.

Mean Absolute Scaled Error (MASE)

MASE compares your model’s performance to a naive forecast (typically using the previous value as the prediction). It scales the errors relative to the naive method’s performance.

MASE = MAE of the model / MAE of the naive forecast, where the denominator is the mean absolute error of the one-step naive forecast on the historical data.

MASE tells you how much better (or worse) your model is compared to simply using the last observed value as your prediction.
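A small NumPy sketch of one common MASE formulation, scaling the model’s MAE by the MAE of a one-step naive forecast on the training series (all numbers are hypothetical):

import numpy as np

def mase(y_true, y_pred, y_train):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    y_train = np.asarray(y_train, float)
    naive_mae = np.mean(np.abs(np.diff(y_train)))  # error of "predict the previous value"
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

history = [100, 102, 101, 105, 107]           # hypothetical historical values
print(mase([110, 112], [108, 113], history))  # values below 1 beat the naive forecast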

Case Study:

A financial services company evaluated their stock price prediction model using MASE. With a MASE of 0.85, they knew their model was 15% better than simply using yesterday’s price as today’s prediction.

Use When:

  • You want to compare performance against a simple benchmark
  • Dealing with time series data
  • Your data has seasonal patterns or trends
  • Other percentage-based errors (like MAPE) are problematic due to zero or near-zero values

Significance:

MASE provides a scale-free error metric that works well across different datasets and avoids the mathematical problems of some other metrics. It’s particularly valuable in time series forecasting.

Mean Squared Prediction Error (MSPE)

MSPE is essentially the same as MSE but is sometimes used specifically in the context of out-of-sample prediction evaluation. MSPE measures how well your model predicts new, unseen data points.
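A minimal sketch of the idea on synthetic data: fit on a training split, then compute the squared error only on the held-out test split:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Purely synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

mspe = mean_squared_error(y_test, model.predict(X_test))  # MSE on unseen data
print(f"MSPE (out-of-sample MSE): {mspe:.3f}")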

Case Study:

A healthcare analytics team used MSPE to evaluate their patient readmission risk model on a test dataset. The MSPE helped them understand how well their model would generalize to new patients.

Use When:

  • Specifically evaluating predictive performance on test data
  • Larger errors should be penalized more heavily
  • Distinguishing between in-sample fit and out-of-sample prediction

Significance:

While mathematically identical to MSE, the term MSPE emphasizes the focus on prediction performance rather than model fit, which is an important distinction in applied machine learning.

Photo by Birmingham Museums Trust on Unsplash

Mean Directional Accuracy (MDA)

MDA measures the percentage of times that your model correctly predicts the direction of change (up or down) compared to the previous value. For instance, if you’re predicting stock prices, MDA tells you how often your model correctly predicts whether the price will go up or down, regardless of the magnitude.

MDA averages an indicator over the time steps: the indicator equals 1 when the predicted direction of change matches the actual direction, and 0 otherwise.
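There is no standard library function for MDA, so here is a small NumPy sketch of one common formulation, comparing each prediction’s implied direction against the previous actual value (the prices are hypothetical):

import numpy as np

def mda(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    actual_dir = np.sign(np.diff(y_true))         # did the series actually move up or down?
    pred_dir = np.sign(y_pred[1:] - y_true[:-1])  # which way did the model say it would move?
    return np.mean(actual_dir == pred_dir)

prices = [100, 103, 101, 104, 102]  # hypothetical actual prices
preds  = [101, 102, 104, 105, 101]  # hypothetical predictions
print(mda(prices, preds))           # 0.75: direction correct on 3 of 4 steps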

Case Study:

An investment firm evaluated their market trend prediction model using MDA. With an MDA of 68%, they knew their model correctly predicted market direction more than two-thirds of the time, which was valuable for trading strategies even if the exact price predictions weren’t perfect.

Use When:

  • The direction of change is more important than the exact value
  • In financial forecasting and trading models
  • Evaluating trend predictions

Significance:

MDA is particularly important in financial applications where predicting the direction correctly can be more valuable than predicting the exact magnitude. It focuses on a different aspect of predictive performance than error-based metrics.

Median Absolute Deviation (MAD)

MAD measures the median of the absolute deviations from the median of the errors. It’s a robust statistic that’s less influenced by outliers than mean-based metrics.

MAD tells you the typical size of your error, but isn’t skewed by occasional very large errors.
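A small NumPy sketch following the definition above (the travel times are hypothetical); note that scikit-learn’s related median_absolute_error instead takes the median of the absolute errors directly:

import numpy as np

def mad(y_true, y_pred):
    # Median of the absolute deviations of the errors from their own median
    errors = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return np.median(np.abs(errors - np.median(errors)))

# Hypothetical travel times with one extreme event
y_true = [30, 32, 29, 31, 90]
y_pred = [31, 30, 30, 33, 35]
print(mad(y_true, y_pred))  # barely affected by the single extreme error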

Case Study:

A traffic prediction system used MAD to evaluate performance because occasional extreme traffic events (accidents, sports games) would skew mean-based metrics. MAD provided a more stable measure of typical prediction accuracy.

Use When:

  • Data contains outliers
  • You want a robust measure of typical error
  • Median is a better measure of central tendency than the mean for your data

Significance:

MAD is particularly valuable in domains where outliers are common but not the primary focus of prediction quality, such as traffic prediction, demand forecasting with occasional spikes, or any domain with heavy-tailed error distributions.

Mean Poisson Deviance (MPD)

MPD is a specialized metric based on the Poisson probability distribution. It’s appropriate for count data where the variance is expected to equal the mean.

MPD is designed specifically for evaluating predictions of count data, like the number of customer arrivals, disease cases, or product sales.
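A minimal sketch with scikit-learn’s mean_poisson_deviance on hypothetical daily case counts (predictions must be strictly positive):

from sklearn.metrics import mean_poisson_deviance

# Hypothetical daily case counts and model predictions
y_true = [4, 7, 0, 12, 3]
y_pred = [3.5, 8.0, 0.5, 10.0, 4.0]

print(mean_poisson_deviance(y_true, y_pred))  # lower is better; 0 means a perfect fit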

Case Study:

An epidemiology team used MPD to evaluate their model predicting the number of new disease cases across different regions. This metric was appropriate because disease counts typically follow a Poisson distribution.

Use When

  • Predicting count data (non-negative integers)
  • Variance of data is approximately equal to the mean
  • In fields like epidemiology, call center management, or inventory of discrete items

Significance:

MPD is derived from statistical theory and provides an appropriate loss function for Poisson-distributed data. It’s particularly important in fields where count data prediction is common.

Mean Gamma Deviance (MGD)

MGD is based on the Gamma probability distribution and is appropriate for continuous, positive data with variance proportional to the square of the mean.

MGD is designed for evaluating predictions of positive, continuous quantities where larger values have larger variability, such as insurance claim amounts or rainfall volumes.
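A minimal sketch with scikit-learn’s mean_gamma_deviance on hypothetical claim amounts (both actuals and predictions must be strictly positive):

from sklearn.metrics import mean_gamma_deviance

# Hypothetical insurance claim amounts, in dollars
y_true = [1_200.0, 450.0, 9_800.0, 2_300.0]
y_pred = [1_000.0, 500.0, 7_500.0, 2_600.0]

print(mean_gamma_deviance(y_true, y_pred))  # lower is better; sensitive to relative errors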

Case Study:

An insurance company used MGD to evaluate their model for predicting claim amounts. Since larger claims naturally had more variability, MGD provided a more appropriate evaluation than metrics assuming constant variance.

Use When:

  • Predicting positive, continuous values
  • Variance of data increases with the mean
  • In fields like insurance, hydrology, or finance dealing with skewed distributions

Significance:

MGD is derived from statistical theory for Gamma-distributed data. It’s particularly valuable in domains where the coefficient of variation (standard deviation divided by mean) is roughly constant.

R² Score (Coefficient of Determination)

R² measures the proportion of variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1 (or can be negative for very poor models).

R² tells you what percentage of the variation in the target variable the model explains. An R² of 0.7 means your model explains 70% of the variation in the data.
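A minimal sketch with scikit-learn’s r2_score (the prices are hypothetical):

from sklearn.metrics import r2_score

# Hypothetical house prices, in thousands of dollars
y_true = [310, 455, 290, 520, 380]
y_pred = [300, 480, 275, 510, 395]

print(r2_score(y_true, y_pred))  # share of variance explained; 1.0 is a perfect fit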

Case Study:

A team analyzing factors affecting house prices built a regression model with an R² of 0.82. They could confidently state that their model, which included features like square footage, neighborhood, and number of bedrooms, explained 82% of the variation in house prices in their dataset.

Use When:

  • You want to understand how much of the variance your model captures
  • Comparing different models on the same dataset
  • Communicating model performance to stakeholders familiar with statistics

Significance:

R² is perhaps the most widely recognized regression metric across fields. It provides an intuitive scale from 0 to 1 (though it can be negative for very poor models), making it easy to interpret. However, it can be misleadingly high when overfitting or when using many features.

D² Absolute Error Score

D² is similar to R² but uses absolute errors instead of squared errors. It measures the improvement over predicting the median (rather than the mean).

D² tells you how much better your model is compared to simply predicting the median value for every observation.
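A minimal sketch with scikit-learn’s d2_absolute_error_score, available in recent scikit-learn releases (the recovery times are hypothetical):

from sklearn.metrics import d2_absolute_error_score

# Hypothetical patient recovery times, in days
y_true = [10, 14, 21, 9, 30]
y_pred = [12, 15, 18, 10, 26]

print(d2_absolute_error_score(y_true, y_pred))  # 1.0 is perfect; 0.0 matches "always predict the median"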

Case Study:

A healthcare researcher developed a model to predict patient recovery times. With a D² of 0.65, they could explain that their model reduced the absolute error by 65% compared to simply predicting the median recovery time for all patients.

Use When:

  • Data contains outliers that would overly influence R²
  • Median is a better central tendency measure than the mean for your data
  • You want a metric based on absolute errors rather than squared errors

Significance:

D² provides an alternative to R² that is more robust to outliers and may better represent model performance for skewed distributions. It’s particularly useful in fields where absolute errors are more interpretable than squared errors.

Explained Variance Score

The Explained Variance Score measures the proportion of variance in the dependent variable that is explained by the model. It’s similar to R² but focuses specifically on variance explanation.

This metric shows how much of the variability in the target variable a model captures, without penalizing systematic bias as heavily as R².
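The difference from R² shows up when predictions carry a constant offset, as in this small sketch with scikit-learn’s explained_variance_score (the temperatures are hypothetical):

from sklearn.metrics import explained_variance_score, r2_score

# Hypothetical temperatures: predictions track the pattern but sit 2 degrees low
y_true = [20.0, 25.0, 22.0, 28.0]
y_pred = [18.0, 23.0, 20.0, 26.0]

print(explained_variance_score(y_true, y_pred))  # 1.0: the pattern of variation is fully captured
print(r2_score(y_true, y_pred))                  # lower, because the constant bias is penalized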

Case Study:

A climate scientist developed a model to predict temperature variations. The model had an Explained Variance Score of 0.75, indicating it captured 75% of the temperature variability, even though it consistently predicted temperatures that were slightly lower than actual (a systematic bias that would reduce R²).

Use When:

  • You want to focus on capturing variance patterns rather than absolute prediction accuracy
  • Systematic bias is less important than capturing the patterns of variation
  • Comparing models that might have different systematic biases

Significance:

Explained Variance provides insight into how well your model captures the patterns in your data, even if there are systematic offsets. It can be particularly useful when the pattern of variation is more important than the absolute values.

Conclusion

Regression metrics are not one-size-fits-all. The appropriate choice depends on your data characteristics, the specific problem you’re solving, and the needs of your stakeholders. By understanding the strengths, weaknesses, and appropriate use cases for each metric, you can make more informed decisions when evaluating and improving your regression models.

It’s often valuable to consider multiple metrics simultaneously, as they can provide complementary insights into a model’s performance. A model that performs well across several relevant metrics is likely to be more robust and useful in real-world applications.

It is also important to note that the field of predictive modeling continues to evolve, with new metrics and variations being developed to address specific challenges.


Published via Towards AI
