

Understanding the Essence of Metrics in Machine Learning: Measuring the Success of Algorithms

Last Updated on January 25, 2024 by Editorial Team

Author(s): Vladimir Artus

Originally published on Towards AI.

Photo by Ricardo Arce on Unsplash

Machine learning is a rapidly growing field that enables computers to automatically extract valuable patterns from data and make predictions based on them. But how can we evaluate the quality of machine learning algorithms? This is where metrics in machine learning come into play: tools that allow us to measure and compare the performance of models.

Machine learning metrics are a crucial part of the model evaluation process. They help answer questions like: How well does the model generalize to new data? What levels of accuracy and recall does it achieve? What kinds of errors does it make? The answers to these questions help us make informed decisions about selecting models, optimizing parameters, and comparing different approaches.

In this article, we’ll explore the main metrics in machine learning that evaluate the quality of classification and regression models, along with their interpretations. We won’t dive deep into complex examples; instead, our goal is to foster an intuitive understanding of the basics, as if you were sketching metrics on a notebook sheet.

Let’s start with some key classification metrics

When I first encountered this topic, it was not immediately clear to me how these metrics work. The multicolored squares labeled TN, FN, TP, and FP did not make the concept intuitive, and the puzzle did not come together until I drew squares with bullseyes in my notebook and took several “shots” at them 😊

So, in a binary classification problem, we feed objects one by one into our trained algorithm, and it predicts whether each object belongs to the desired class or not. For example, we give the algorithm an image, and it determines whether the patient has pneumonia.

Image created by Author

When there was no pneumonia and the algorithm did not trigger, it is a True Negative

When there was no pneumonia and the algorithm triggered, it is a False Positive

When there was pneumonia and the algorithm triggered, it is a True Positive

When there was pneumonia and the algorithm did not trigger, it is a False Negative
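In code, the four outcomes reduce to a simple comparison of the actual label with the prediction. Here is a minimal Python sketch (the function name and labels are mine, for illustration):

def outcome(has_pneumonia: bool, predicted_pneumonia: bool) -> str:
    # Compare the ground truth with the algorithm's prediction
    if has_pneumonia and predicted_pneumonia:
        return "True Positive"
    if has_pneumonia and not predicted_pneumonia:
        return "False Negative"
    if not has_pneumonia and predicted_pneumonia:
        return "False Positive"
    return "True Negative"

print(outcome(False, True))  # False Positive (a Type I error)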

Once again, we will redraw our picture with targets, superimposing shots on them.

Image created by Author

False Positives and False Negatives in statistics are referred to as Type I and Type II errors, respectively.

Image created by Author

Now, let’s delve into the metrics themselves.

Image created by Author

Accuracy — measures the ratio of correct predictions to the total number of objects considered. In our example, these are 3 True Negatives and 1 True Positive out of 9 objects, i.e., 4/9.

Precision — measures the ratio of correct positive predictions to all positive predictions made. In the example, it is 1 to 3.

Recall (Sensitivity) — measures the ratio of correct positive predictions to the number of actual positive objects. In the example, it is 1 to 4.

Specificity — measures the ratio of correct negative predictions to the number of actual negative objects. In the example, it is 3 to 5.

Negative Predictive Value — reflects the ratio of correct negative predictions to the total number of negative predictions made. In the example, it is 3 to 6.
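To tie these numbers together, here is a minimal Python sketch computing all five metrics from the confusion counts of our example (TP = 1, FP = 2, FN = 3, TN = 3; the variable names are mine):

# Confusion counts from the nine "shots" in the example
TP, FP, FN, TN = 1, 2, 3, 3

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 4/9 ≈ 0.444
precision   = TP / (TP + FP)                   # 1/3 ≈ 0.333
recall      = TP / (TP + FN)                   # 1/4 = 0.250
specificity = TN / (TN + FP)                   # 3/5 = 0.600
npv         = TN / (TN + FN)                   # 3/6 = 0.500

print(accuracy, precision, recall, specificity, npv)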

Let’s say the model always answers “yes”. In this case, recall will be 100%, but precision is likely to be low. On the other hand, if the model makes only one, but accurate, positive prediction, then precision will reach 100%, but recall will be small. Thus, when evaluating a model, we can easily end up in a situation where high precision is achieved at the expense of low recall, and vice versa. At the same time, the arithmetic mean of these two metrics will look deceptively optimistic and may lead to an incorrect interpretation of the model’s overall performance. Therefore, for a more balanced assessment, the harmonic mean of precision and recall is generally used, also known as the F-score.

The F-score is calculated simply:

F1 = 2 · (precision · recall) / (precision + recall)

Its advantage over the arithmetic mean is that it does not allow the metric to grow at the expense of only one variable. Here is how the F-score behaves with one of the variables fixed. For example, we will fix recall; since the contributions of the two variables are symmetric, it does not matter which one we freeze.

Some readers may find it easier to look at plots, but the Internet is already full of them; the purpose of this article is to give the reader a different perspective on these values, and perhaps this representation will turn out to be more intuitive for some.

change of F-score and mean, with a fixed recall = 5
change of F-score and mean, with a fixed recall = 25
change of F-score and mean, with a fixed recall = 75

At the same time, when one of the variables is tens, hundreds, or thousands of times larger than the other, the F-score stays on the scale (order of magnitude) of the smaller variable, adjusted slightly upward, while the arithmetic mean takes the scale of the larger one, adjusted slightly downward.
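A small sketch to see this behavior numerically: we hold recall fixed and sweep precision, comparing the harmonic mean (F1) with the arithmetic mean (the chosen values are illustrative):

# F1 stays pinned near the smaller of the two values,
# while the arithmetic mean keeps climbing with precision
recall = 0.25
for precision in (0.1, 0.5, 0.9):
    f1 = 2 * precision * recall / (precision + recall)
    mean = (precision + recall) / 2
    print(f"precision={precision:.1f}  F1={f1:.3f}  mean={mean:.3f}")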

Now, let’s look at some regression metrics

For clarity, we will consider N (N = 4) points a, b, c, d, although, of course, there can be any number of them. Values with index 1 are the real values, and values with index 2 are the ones predicted by our algorithm. Naturally, the values predicted by the algorithm lie on one straight line, while the real values rarely do.

Image created by Author

If we add up the absolute error at each point and divide by the total number of points, we get the Mean Absolute Error (MAE). This metric measures how much our prediction deviates, on average, from the real values across all points.

The formula for calculating MAE in the general case looks like this:

MAE = (1/N) · Σ |realᵢ − predictedᵢ|

And in our case it is as follows:

MAE = (|a1 − a2| + |b1 − b2| + |c1 − c2| + |d1 − d2|) / 4

For example, suppose that the real values are:

a1 = 5, b1 = 2, c1 = 7, d1 = 4,

and the predicted values are:

a2 = 3, b2 = 4, c2 = 5, d2 = 7.

Then the calculation of MAE will look like this:

MAE = (|3−5| + |4−2| + |5−7| + |7−4|) / 4 = (2 + 2 + 2 + 3) / 4 = 9 / 4 = 2.25
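The same calculation as a minimal Python sketch on the four example points (the list names are mine):

# Real values (index 1) and predicted values (index 2)
actual    = [5, 2, 7, 4]
predicted = [3, 4, 5, 7]

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)  # 2.25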

If at each point, instead of the absolute value of the error, we take the square of the error, we get the Mean Squared Error (MSE):

MSE = (1/N) · Σ (realᵢ − predictedᵢ)²

And on our four points:

MSE = ((a1 − a2)² + (b1 − b2)² + (c1 − c2)² + (d1 − d2)²) / 4

In our example,

MSE = ((3−5)² + (4−2)² + (5−7)² + (7−4)²) / 4 = (4 + 4 + 4 + 9) / 4 = 21 / 4 = 5.25

If we now take the square root of the resulting value, we get a root-mean-square error (RMSE).

RMSE = √MSE

In our example, we take the square root of the already calculated MSE:

RMSE = √5.25 ≈ 2.29
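A quick check of both values in Python, reusing the same four points (a minimal sketch):

# Real and predicted values from the example
actual    = [5, 2, 7, 4]
predicted = [3, 4, 5, 7]

mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = mse ** 0.5
print(mse, rmse)  # 5.25  ≈ 2.29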

Sometimes it is convenient to look at the coefficient of determination, R-squared. It shows how well the model fits the data and usually lies in the range from 0 to 1. A value closer to 1 indicates a better fit of the model to the data, while a value closer to 0 indicates that the model does not explain the variation in the data.

R² = 1 − Σ (realᵢ − predictedᵢ)² / Σ (realᵢ − m)²,

where m is the average of the real values, i.e., m = (a1 + b1 + c1 + d1) / N.

In our four-point case:

R² = 1 − ((a1 − a2)² + (b1 − b2)² + (c1 − c2)² + (d1 − d2)²) / ((a1 − m)² + (b1 − m)² + (c1 − m)² + (d1 − m)²)

In our example, we have already calculated the numerator of the fraction: it is equal to 21. And m is simply the mean of all the real values:

m = (5 + 2 + 7 + 4) / 4 = 4.5,

so the denominator is (5 − 4.5)² + (2 − 4.5)² + (7 − 4.5)² + (4 − 4.5)² = 0.25 + 6.25 + 6.25 + 0.25 = 13.

Then R-squared = 1 − 21/13 ≈ −0.615

Interestingly, the result turned out to be negative, although earlier we said that R-squared should be in the range from 0 to 1. In fact, it can technically be negative, and this means that simply predicting the average of the real values would do better than our model. In essence, R² shows how much better our model predicts than the simple average.
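Verifying this in Python on the same points (a minimal sketch):

# R² = 1 − SS_res / SS_tot on the four example points
actual    = [5, 2, 7, 4]
predicted = [3, 4, 5, 7]

m = sum(actual) / len(actual)                                  # 4.5
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # 21
ss_tot = sum((a - m) ** 2 for a in actual)                     # 13
r2 = 1 - ss_res / ss_tot
print(r2)  # ≈ -0.615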

Another very convenient metric is the Mean Absolute Percentage Error (MAPE). When calculating this metric, we look at the error at each point relative to the magnitude of the actual value:

MAPE = (1/N) · Σ |(realᵢ − predictedᵢ) / realᵢ| · 100%

On our four points:

MAPE = (|(a1 − a2)/a1| + |(b1 − b2)/b1| + |(c1 − c2)/c1| + |(d1 − d2)/d1|) / 4 · 100%

To better understand it, it will be convenient to look at the image again

Image created by Author

In our example

MAPE = (|(5−3)/5| + |(2−4)/2| + |(7−5)/7| + |(4−7)/4|) / 4 · 100 = (0.4 + 1 + 0.2857 + 0.75) / 4 · 100 ≈ 0.6089 · 100 ≈ 60.89%

Thus, in this example, the value of MAPE is about 60.89%.
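The same computation as a minimal Python sketch:

# MAPE on the four example points, in percent
actual    = [5, 2, 7, 4]
predicted = [3, 4, 5, 7]

mape = sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual) * 100
print(mape)  # ≈ 60.89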

If there are zeros among the real values, it is convenient to use the SMAPE metric (Symmetric Mean Absolute Percentage Error), which measures the percentage deviation between predicted and real values symmetrically. SMAPE avoids the division-by-zero problem that arises when some of the real values are zero.

SMAPE = (1/N) · Σ |realᵢ − predictedᵢ| / ((|realᵢ| + |predictedᵢ|) / 2) · 100%

On our four points:

SMAPE = (|a1 − a2| / ((|a1| + |a2|)/2) + |b1 − b2| / ((|b1| + |b2|)/2) + |c1 − c2| / ((|c1| + |c2|)/2) + |d1 − d2| / ((|d1| + |d2|)/2)) / 4 · 100%

In our example:

SMAPE = (|5−3| / ((|5|+|3|)/2) + |2−4| / ((|2|+|4|)/2) + |7−5| / ((|7|+|5|)/2) + |4−7| / ((|4|+|7|)/2)) / 4 · 100 = (2/4 + 2/3 + 2/6 + 3/5.5) / 4 · 100 ≈ 0.5114 · 100 ≈ 51.14%
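And a minimal Python sketch for SMAPE on the same points:

# SMAPE on the four example points, in percent
actual    = [5, 2, 7, 4]
predicted = [3, 4, 5, 7]

smape = sum(
    abs(a - p) / ((abs(a) + abs(p)) / 2)
    for a, p in zip(actual, predicted)
) / len(actual) * 100
print(smape)  # ≈ 51.14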

Conclusion:

Thank you so much for reading this article to the end.

I hope this article has helped someone look at some popular machine learning metrics from a new perspective and better understand their essence through an intuitive explanation. Evaluating and choosing the right metrics is an integral part of the modeling process and helps us measure the quality of our models.

Ultimately, the purpose of this article is to help someone look at metrics such as accuracy, precision, recall, specificity, negative predictive value, F1-score, and MAE, MSE, RMSE, MAPE, SMAPE, R-squared from a new perspective and contribute to a better understanding of their role and significance in machine learning. I wish you successful application of metrics in your projects and hope that they will help you make more informed decisions and achieve more accurate and reliable results.
