From Classification to Ordinal Regression
Author(s): Topaz Gilad
Unlock the Potential of Your Labels
"Is a lion closer to being a giraffe or an elephant?"
It is a question no one has ever asked. Classifying lions, elephants, and giraffes is a straightforward classification task, and as such it can mostly be addressed with a cross-entropy loss. Should the task of classifying someone as a child/adult/elderly be addressed in the same way?
In this blog post, we will overview best practice approaches and papers for:
(1) Addressing ordered classification.
(2) Turning coarse classification labels into continuous regression predictions.
(3) How to choose.
(4) How to evaluate.
The papers reviewed in this post focus on deep learning models, but the main concepts apply and may be adapted for other ML architectures as well.
Discrete Labels in a Continuous World
"The world is continuous, but the mind is discrete."
– David Mumford, ICM 2002
We often define categories when breaking down a real-world problem into an ML-based solution. However, the real target values may be continuous, or at least ordered. This is something to consider and even leverage in the design of your ML model. Are you facing what seems like a classification problem? Take a moment to understand the hidden relations between your "classes".
Let's list some classification examples:
– Online ratings: 1–5 stars.
– Medical diagnosis index: stage 1/2/3.
– Face pose estimation [1]: 45/90/180 degrees.
– Age estimation, and so on.
Know Your Data
When approaching the design of an ML solution, some of the first steps are:
(a) Understand what data you currently have and what data you are likely to obtain in a reasonable time. In some cases, you will have continuous measurements associated with your data, such as blood measurements or exact age. More often than not, those will not be available: manual annotation of fine-grained labels is extremely difficult and time-consuming, especially where expert human opinion is required, so in many cases all you will have are coarse categorical labels.
(b) Explore the domain. Ask your domain expert to clarify the relations between the target classes and to spell out every bit of prior knowledge or assumption they hold. Question the format of the output they ask for. For example: would a smooth real number between 0 and 1 be a more useful product than a failed/passed/excellent student score predictor?
If your classes are indeed independent, this is not the blog post for you!
However, if they are dependent, ask yourself:
(1) Do the labels have an order?
(2) Do you only care about the order, or are some labels closer to their nearest labels than others? In other words, are the "steps" between labels uniform in size?
Regression to the Rescue
Logistic Regression
Most are familiar with logistic regression to discriminate between classes. While the target labels are discrete, the estimated class confidence is continuous.
Let's consider the following case: the consequence of wrongly classifying class A as class B is not as severe as a mistake between A and C. One approach is to use a weighted loss, with a risk coefficient for each type of mistake [2].
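Here is a minimal PyTorch sketch of such a distance-weighted loss, in the spirit of the class-distance weighted cross-entropy from [2] (the linear distance weighting and the exponent are assumptions here, not the paper's exact formulation):

```python
import torch

def class_distance_weighted_ce(logits, targets, alpha=1.0, eps=1e-8):
    """Penalize probability mass on wrong classes in proportion to their
    ordinal distance from the true class (in the spirit of [2]).
    logits: (batch, num_classes); targets: (batch,) integer class indices."""
    probs = torch.softmax(logits, dim=1)                                 # (B, C)
    classes = torch.arange(logits.size(1), device=logits.device)         # (C,)
    dist = (classes.unsqueeze(0) - targets.unsqueeze(1)).abs().float()   # (B, C)
    # distance is 0 for the true class, so only wrong classes contribute
    loss = -(dist.pow(alpha) * torch.log(1.0 - probs + eps)).sum(dim=1)
    return loss.mean()

# Toy example: 4 ordered severity grades.
logits = torch.randn(8, 4, requires_grad=True)
targets = torch.randint(0, 4, (8,))
class_distance_weighted_ce(logits, targets).backward()
```

The farther a wrong class sits from the true one, the more heavily its probability mass is penalized.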
Ordinal Regression
Another approach is to encode the labels differently from plain one-hot vectors, as suggested in the NNRank paper [3]. For a hands-on example, I recommend Gruber's post and Kotwani's. Note that when modeling the problem as an aggregation of an ensemble of binary classifiers, inconsistencies in class predictions may occur; Cao et al. (2020) suggest CORAL as a solution to achieve rank consistency [4].
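To make the idea concrete, here is a minimal sketch of the extended binary encoding behind such approaches (a simplified illustration, not the full NNRank or CORAL implementation):

```python
import torch

def ordinal_encode(targets, num_classes):
    """Encode class k as k leading ones: e.g. with 5 classes,
    class 3 -> [1, 1, 1, 0] (one binary 'is rank > i?' task per threshold)."""
    thresholds = torch.arange(num_classes - 1)           # (C-1,)
    return (targets.unsqueeze(1) > thresholds).float()   # (B, C-1)

def ordinal_decode(probs, threshold=0.5):
    """Predicted rank = number of binary tasks answering 'yes'.
    Counting positives (rather than stopping at the first 'no')
    tolerates mild inconsistencies between the binary classifiers."""
    return (probs > threshold).sum(dim=1)

targets = torch.tensor([0, 2, 4])
encoded = ordinal_encode(targets, num_classes=5)
# A model with C-1 sigmoid outputs is trained against `encoded` with a BCE loss.
probs = torch.sigmoid(torch.randn(3, 4))
print(encoded)
print(ordinal_decode(probs))
```

Each of the C-1 outputs answers "is the rank greater than threshold i?", and the predicted rank is recovered by counting the positive answers.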
Linear Regression: Seeing Beyond the Available Labels
Netflix recently launched a double thumbs up, expanding the 3 categories (dislike/indifferent/like) into 4. This gives a better distinction inside what used to be the "liked" category: liked vs. liked a lot. Imagine you have many coarse "liked" votes from the past and only a few new "double thumbs" votes. Estimating a user satisfaction score (instead of a class) gives you better adaptivity to the new user inputs.
Qin et al. (BioeNet, 2020) showcase a strategy to leverage coarse labels into fine-grained linear regression [5]. One of the important things to consider when taking that path is to evaluate your inner-class order, which we will discuss next.
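As a naive starting point (and only that; the distribution-restoration strategy in [5] is more involved), you can map each coarse bucket to a continuous target and train a plain regressor. The bucket boundaries and the tiny model below are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical coarse age buckets (lower, upper) mapped to their midpoints,
# which then serve as continuous regression targets.
BUCKETS = {0: (0, 3), 1: (4, 14), 2: (15, 19), 3: (20, 29), 4: (30, 39)}

def bucket_to_target(bucket_ids):
    mids = torch.tensor([(lo + hi) / 2.0 for lo, hi in BUCKETS.values()])
    return mids[bucket_ids]

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
features = torch.randn(16, 128)           # embeddings from some backbone
bucket_ids = torch.randint(0, 5, (16,))   # coarse labels
targets = bucket_to_target(bucket_ids)

loss = nn.MSELoss()(model(features).squeeze(1), targets)
loss.backward()
```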
Another possible benefit of transforming your discrete integer target labels into real numbers is the positive effect of soft labels. As shown by the Google Brain team (2019), using soft labels not only reduces over-confidence but also improves calibration, even without temperature scaling [6-8].
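A simple way to obtain such soft labels over ordered classes is to spread the target probability mass to neighboring classes. The Gaussian spread below is an illustrative choice, not the exact recipe from [6]:

```python
import torch

def ordinal_soft_labels(targets, num_classes, sigma=1.0):
    """Replace one-hot targets with a normalized Gaussian centered on the
    true class, so neighboring classes receive some probability mass."""
    classes = torch.arange(num_classes).float()                        # (C,)
    dist2 = (classes.unsqueeze(0) - targets.unsqueeze(1).float()) ** 2  # (B, C)
    soft = torch.exp(-dist2 / (2 * sigma ** 2))
    return soft / soft.sum(dim=1, keepdim=True)

targets = torch.tensor([0, 2])
soft = ordinal_soft_labels(targets, num_classes=5)
# Train with cross-entropy against these soft targets, e.g.:
logits = torch.randn(2, 5)
loss = -(soft * torch.log_softmax(logits, dim=1)).sum(dim=1).mean()
```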
Make Sure You Are in the Right Direction
Shape and Location of Clusters
I am a big believer in analyzing your DNN's embedded feature space. Create a t-SNE of your test set feature space. Follow the illustrations in the figure below. With a cross-entropy loss, you expect each cluster to condense (minimize intra-class variance) and move away from the others (maximize inter-class variance). However, in ordinal regression, you expect to see the clusters in the right order of proximity in the feature space. For linear regression with an MSE loss, for example, you will expect to see not only the right order but also a continuous progression between classes, with less margin from one cluster to the next. If you still do see a noticeable margin, this may also indicate your test set is lacking border examples.
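A minimal sketch of such a feature-space visualization with scikit-learn (the embeddings and labels below are random placeholders; in practice, use your model's penultimate-layer features on the test set):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders for your test-set embeddings and their ordinal labels.
embeddings = np.random.randn(500, 128)
labels = np.random.randint(0, 5, size=500)

coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=8)
plt.colorbar(label="ordinal class")
plt.title("t-SNE of test-set embeddings: clusters should follow the label order")
plt.show()
```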
Relations Between Samples from the Same Category
If you can get your hands on a few finer-grained labeled samples, you can use them as a test set to see whether your model generalizes well enough to produce the right inner-class order. For example, imagine you are trying to estimate the age of a person, but the massive amount of labeled data you have consists of coarse labels (0–3 / 4–14 / teens / 20-something / 30-something, and so on). If you have a few samples of people from the same "age bucket" but with exact age labels (say age 4, age 8, age 14), check whether their locations in the feature space are ordered correctly and whether the output predictions are ordered as you expect.
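A quick way to quantify this inner-class order check is a rank correlation between the known fine-grained labels and the model's continuous predictions. The numbers below are hypothetical:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical fine-grained hold-out: a few samples from the same coarse
# bucket whose exact ages are known, plus the model's continuous predictions.
exact_ages = np.array([4, 8, 14])
predicted = np.array([5.1, 7.4, 12.9])   # model outputs for those samples

rho, _ = spearmanr(exact_ages, predicted)
print(f"Spearman rank correlation inside the bucket: {rho:.2f}")
# rho close to 1 means the model orders these samples correctly even though
# it never saw their exact labels during training.
```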
Bottom Line
Think about the relations between your target classes. They may be dependent or ordered.
Consider the different types of regression to gain more from your labels.
Visualize your feature space, and also test the inner-class order of the predicted outputs.
While, given a small set of possible classes, the classification approach may seem the obvious direction to go, considering the relations between the classes is key to boosting your ML and may sometimes be critical for success!
References
[1] Beyer, L., Hermans, A. and Leibe, B., 2015. Biternion nets: Continuous head pose regression from discrete training labels. In German Conference on Pattern Recognition (pp. 157–168). Springer, Cham.
[2] Polat, G., Ergenc, I., Kani, H.T., Alahdab, Y.O., Atug, O. and Temizel, A., 2022. Class distance weighted cross-entropy loss for ulcerative colitis severity estimation. arXiv preprint arXiv:2202.05167.
[3] Cheng, J., Wang, Z. and Pollastri, G., 2008. A neural network approach to ordinal regression. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence) (pp. 1279–1284). IEEE.
[4] Cao, W., Mirjalili, V. and Raschka, S., 2020. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognition Letters, 140, pp. 325–331.
[5] Qin, Z., Chen, J., Jiang, Z., Yu, X., Hu, C., Ma, Y., Miao, S. and Zhou, R., 2020. Learning fine-grained estimation of physiological states from coarse-grained labels by distribution restoration. Scientific Reports, 10(1), pp. 1–10.
[6] Müller, R., Kornblith, S. and Hinton, G.E., 2019. When does label smoothing help? Advances in Neural Information Processing Systems, 32.
[7] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321–1330). PMLR.
[8] Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D. and Lucic, M., 2021. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34, pp. 15682–15694.