How to choose your loss function — where I disagree with Cassie Kozyrkov
Last Updated on June 24, 2022 by Editorial Team
Author(s): Christian Leschinski
Originally published on Towards AI.
Selecting the right loss function and evaluation metrics is important for the success of your data science project. But while a lot has been written about machine learning algorithms and overall trends in our industry, I have not come across much useful advice on this topic.
In a recent article, Cassie Kozyrkov explains the difference between loss functions and evaluation metrics. When it comes to data science and analytics, Cassie is a great communicator and her contribution to our community is undeniable. But this time she has made some claims I disagree with.
In this post, I will reply to some of her statements and give you my two cents on how to select these metrics in practice.
There are a lot of options to select your loss function
Cassie’s comment: “In practice, you’ll be importing someone else’s algorithm and so you’ll have to live with whichever loss function is already implemented in there. The one they chose is the one that’s easiest to optimize, not the one that’s most meaningful to your use case.”
The argument here is that machine learning libraries do not allow you to choose your loss metric and implementing a new machine learning algorithm would be too much of an effort in practice.
The short answer to this is twofold.
- You are free to choose another machine learning library that minimizes the loss function you want to use. For example, if you want to use a linear model, you do not need to minimize MSE: statsmodels offers a wide selection of generalized linear models, all with different loss functions.
- Many machine learning libraries actually do allow you to specify custom loss functions. With xgboost, lightGBM, and catboost, all of the most popular boosting frameworks let you specify a custom loss function. Pytorch and Tensorflow allow you to select from a wide range of loss functions, and so on.
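To make the second bullet concrete: boosting frameworks in the lightGBM style accept a custom objective as a function that returns the gradient and Hessian of the loss with respect to the raw scores. The sketch below computes these for binary logloss; the function name and call shape are illustrative, not any one library's exact API.

```python
import math

def binary_logloss_objective(raw_scores, labels):
    """Gradient and Hessian of binary logloss w.r.t. raw scores —
    the (grad, hess) pair that lightGBM-style custom objectives return."""
    grads, hesses = [], []
    for s, y in zip(raw_scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))  # sigmoid: raw score -> probability
        grads.append(p - y)             # d loss / d score
        hesses.append(p * (1.0 - p))    # d^2 loss / d score^2
    return grads, hesses
```

Swapping in a different business-appropriate loss is then a matter of deriving its gradient and Hessian, not reimplementing the whole algorithm.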
Indeed, sklearn does not allow you to specify custom loss functions, and SparkML does not offer much choice in that regard either. But overall, there are plenty of options for building models with different loss functions, and you do not have to implement anything yourself to use them.
In sum, the loss function can be changed and its selection can be very important for the quality of the model.
Most of the machine learning methods we use day to day are statistical learning methods based on statistical assumptions. If these assumptions are not fulfilled, the estimated model will perform badly.
For example, if you work with count data, you will most likely be better off minimizing the log-likelihood of a Poisson model than minimizing MSE. That is because MSE is the loss metric implied by the assumption that the errors are normally distributed. Normally distributed random variables can take positive and negative values, whereas count data is generally zero or positive. Furthermore, count data usually produces some very large values that would be extremely unlikely in a normal distribution.
Therefore, using MSE would lead to a severely misspecified model that generates bad predictions — no matter how much data it is trained on.
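Here is a small self-contained sketch of that point, on toy count data and with plain gradient descent (in a real project you would reach for statsmodels or sklearn's PoissonRegressor instead): the MSE-optimal line predicts a negative "count", while the Poisson model with a log link cannot.

```python
import math

# Toy count data with the typical exponential-looking growth.
x = [0, 1, 2, 3, 4]
y = [1, 1, 3, 5, 12]
n = len(x)

# Ordinary least squares (the MSE-optimal line), closed form.
mx, my = sum(x) / n, sum(y) / n
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx  # about -0.8: a negative "count" at x = 0

# Poisson regression with a log link: minimize the negative log-likelihood
# sum(mu - y * log(mu)) with mu = exp(a + b * x), by plain gradient descent.
a, b = 0.0, 0.0
lr = 0.001
for _ in range(50_000):
    grad_a = grad_b = 0.0
    for xi, yi in zip(x, y):
        mu = math.exp(a + b * xi)
        grad_a += mu - yi          # d NLL / d a
        grad_b += (mu - yi) * xi   # d NLL / d b
    a -= lr * grad_a
    b -= lr * grad_b

print(intercept)     # negative: impossible as a count
print(math.exp(a))   # the Poisson fit at x = 0 stays positive
```

The two models disagree not because one saw more data, but because only one of them encodes the right distributional assumption.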
In general, if one wants the best model in a statistical sense, one should choose the loss function that corresponds best to the error distribution.
The choice of your evaluation metrics should be determined by your business problem
Cassie’s comment: “Only a newbie insists on using their loss function for performance evaluation; professionals usually have two or more scoring functions in play.”
Evaluation metrics are mostly a tool for data scientists to understand different aspects of their models’ performance, and they are often used to make modeling decisions.
I don’t see a reason why a loss function should not be used as one of these metrics. On the contrary, the statistical accuracy of the predictions generated by the model is often critical to achieving your business goal. Since loss functions usually measure predictive accuracy, including them in the set of metrics that determine your decisions sounds like a good idea to me.
Cassie argues that an evaluation metric should be understandable by humans. This can easily be achieved for a loss function such as logloss (or maybe the likelihood from a Poisson regression) by putting it in relation to the same metric for a naive model without any features. Then you can report, for example, that your logloss is two-thirds of that of a naive model — which is basically what a “Pseudo R2” does.
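A minimal sketch of that comparison, on toy labels and predictions (all numbers here are illustrative): compute the logloss of your model and of a naive model that always predicts the base rate, then report a McFadden-style pseudo R².

```python
import math

def logloss(y_true, p_pred):
    """Mean binary logloss (negative log-likelihood per observation)."""
    eps = 1e-15  # clip to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y = [1, 0, 0, 1, 0, 0, 0, 1]
model_probs = [0.8, 0.2, 0.1, 0.7, 0.3, 0.2, 0.1, 0.6]

# Naive baseline: always predict the base rate, ignoring all features.
base_rate = sum(y) / len(y)
naive_probs = [base_rate] * len(y)

# McFadden-style pseudo R^2: 1 - LL_model / LL_naive.
pseudo_r2 = 1 - logloss(y, model_probs) / logloss(y, naive_probs)
```

"Our model removes about 60% of the baseline's uncertainty" is a statement a stakeholder can follow, even if the raw logloss number means nothing to them.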
In the end, the success of any real-world data science project is going to be measured in terms of its business impact. Therefore, if possible, you should judge the quality of your model based on an estimate of the business impact it is expected to create. The KPIs your stakeholders really care about are clicks, sales, revenue, or churn — not an evaluation metric like RMSE or F1 score.
In cases where it is not possible to optimize directly for your business KPIs, you have no other option but to resort to evaluation metrics. However, your decisions should be driven by those criteria that fit your business problem.
Are you trying to rank customers according to their probability of churning, but you don’t care that much about the exact predicted probability? Then AUC-ROC may be a good choice because it is essentially based on ranking.
Do you want to predict which price you should offer to a customer in order to optimize your revenue? Well, in that case, the expected revenue at a certain price is the conversion probability multiplied by the price. So you really care about getting that conversion probability right. Therefore, you may want to use logloss as your evaluation function, because it is the log-likelihood in the case of a binary decision problem and it utilizes the exact probability predictions of your model.
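The sketch below (toy labels and probabilities, illustrative only) shows why the ranking metric is the wrong tool for the pricing case: two models with identical AUC-ROC can have very different logloss, because a monotone transform of the scores preserves the ranking while destroying the calibration that a revenue estimate depends on.

```python
import math

def auc(y_true, scores):
    """AUC via the rank definition: the probability that a randomly
    chosen positive is scored above a randomly chosen negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def logloss(y_true, probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, probs)) / len(y_true)

y = [1, 0, 1, 0, 0]
calibrated = [0.9, 0.2, 0.7, 0.3, 0.1]        # well-calibrated probabilities
overconfident = [p ** 0.2 for p in calibrated]  # monotone transform: same ranking,
                                                # probabilities pushed toward 1
```

Multiply either probability vector by a price and you get an expected-revenue estimate; the overconfident one badly overstates revenue, yet AUC-ROC cannot tell the two models apart. Logloss can.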
Differences between loss function and evaluation metric may limit your performance
As we just discussed, it is not wrong, but in fact very reasonable, to use logloss as your evaluation metric AND your loss function if this aligns with your business problem.
If your evaluation metric aligns with your business problem, it could even be ideal to use that evaluation metric as your loss function. This is because, mathematically, two different functions usually do not attain their optima at the same parameters.
If you minimize logloss, your AUC-ROC will, in general, not be maximal. If you minimize MSE, you will not minimize MAE. These metrics may be correlated, but the correlation will not be perfect. So, if MAE is what you and your business stakeholders care about, then please minimize MAE.
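The MSE/MAE case can be shown in a few lines (toy data with one outlier, as is common with revenue-like quantities): for a constant prediction, MSE is minimized by the mean and MAE by the median, so the two losses genuinely pick different models.

```python
# Skewed toy data: one large outlier dominates the mean.
y = [1, 2, 2, 3, 100]

mean = sum(y) / len(y)           # 21.6: the MSE-optimal constant prediction
median = sorted(y)[len(y) // 2]  # 2: the MAE-optimal constant prediction

def mse(pred):
    return sum((yi - pred) ** 2 for yi in y) / len(y)

def mae(pred):
    return sum(abs(yi - pred) for yi in y) / len(y)
```

Each constant is optimal under its own loss and clearly suboptimal under the other, which is exactly why a mismatch between loss function and evaluation metric leaves performance on the table.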
In an ideal world, you would make the action you take based on your predictions part of your model and you would directly calculate the implied (negative) clicks, sales, revenue, etc., and use them as your loss function.
Unfortunately, it is often not possible to turn the evaluation metric that is best for your business objective into a loss function. This is because evaluation metrics are often not differentiable, so they don’t lend themselves to numerical optimization easily. Therefore, many of them cannot be used as loss metrics. But if your loss function and your evaluation metric align, this is not generally a mistake but a way to improve the performance of your model. Just imagine you could directly maximize your revenue!
Nuanced understanding vs rules of thumb
Cassie’s comment: “The loss function you’ll end up leaning on is a matter of machine convenience, not appropriateness to your business problem or real-world interpretation.”
If you made it this far, you have already read all my arguments for why it is worth thinking carefully about the loss function you choose.
In my experience, the key to successful machine learning products is to develop a deep and nuanced understanding of the business problem and business KPIs, as well as your data structure, different machine learning algorithms and their implementations (including computational complexity), and the development efforts associated with different solutions.
As data scientists, we have to find the best solution according to the interplay of all these factors. That’s what makes our job so hard — but, at the same time, particularly interesting.