The Non-Convexity Debate in Machine Learning
Unpacking Mean Squared Error's Impact on Logistic Regression
Author(s): Pratik Shukla
"If you can't, you must. If you must, you can." (Tony Robbins)
Table of Contents:
- Proof of Non-convexity of the Mean Squared Error function for Logistic Regression
- A Visual Look at MSE for Logistic Regression
- Resources and references
Introduction:
In this tutorial, we will see why it is not recommended to use the Mean Squared Error (MSE) function as the loss function in logistic regression. Our goal is to prove that the MSE function is not convex for logistic regression; once that is proven, we can conclude that MSE is a poor choice of loss function for this model.
Logistic regression is a popular method for binary classification in machine learning, statistics, and related fields. Because MSE is the standard loss function for linear regression, where it is convex, it is often assumed to be convex for logistic regression as well. However, composing the squared error with the sigmoid function makes the loss non-convex, which leads to challenges in model optimization and convergence. In this blog post, we will work through the proof of the non-convexity of MSE in logistic regression and examine the implications of this result for the performance and optimization of logistic regression models.
Convex Function:
When we plot the MSE loss function with respect to the weights of the logistic regression model, the curve we get is not convex. When the curve is not convex, it is very difficult to find the global minimum, because gradient-based optimizers can settle in a local minimum instead. The non-convexity of MSE with logistic regression is caused by the sigmoid activation function, which is non-linear and makes the relationship between the weights and the error complex.
Proof of Non-convexity of the Mean Squared Error function for Logistic Regression:
Let's mathematically prove that the Mean Squared Error function for logistic regression is not convex.
We saw in the previous tutorial that a function is said to be convex if its second derivative is ≥ 0 everywhere. So, here we will take the Mean Squared Error function and find its second derivative to see whether it is ≥ 0 for all inputs. If it is, then we can say that it is a convex function.
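In symbols, for a twice-differentiable function f(w):

f''(w) \geq 0 \;\; \text{for all } w \;\; \Longleftrightarrow \;\; f \text{ is convex}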
Step 1:
This is the first step in forward propagation. Here we are linearly transforming the input data.
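With a single feature x and weight w (and no bias term, to keep the algebra simple), the linear transformation is:

z = wx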
Step 2:
Since this is a binary classification problem, we are using the sigmoid function to generate the output. We will use the output from Step 1 to generate the results. Here we are considering only one feature (x), and the weight associated with it is denoted by w.
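\hat{y} = \sigma(z) = \frac{1}{1 + e^{-wx}}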
Step 3:
The mean squared error (MSE) is given by the following equation. Here, we are just considering a single example with only one feature.
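f(w) = (\hat{y} - y)^2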
Step 4:
Next, we are going to find the first derivative of our cost function f(w) with respect to w. But, as we can see in Step 3, our cost function f(w) is a function of ŷ. So, here we will use the chain rule.
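\frac{\partial f}{\partial w} = \frac{\partial f}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}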
Step 5:
In this step, we are calculating the first part of the partial derivative.
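\frac{\partial f}{\partial \hat{y}} = 2(\hat{y} - y)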
Step 6:
In this step, we are calculating the second part of the partial derivative.
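\frac{\partial \hat{y}}{\partial w} = \frac{\partial}{\partial w}\left(\frac{1}{1 + e^{-wx}}\right)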
Step 7:
To calculate the second part of the partial derivative, we will use the formula we derived in the previous chapter.
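\frac{\partial}{\partial w}\left(\frac{1}{p(w)}\right) = -\frac{p'(w)}{[p(w)]^2}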
Step 8:
In this step, we are defining the value of p(w) for Step 7.
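p(w) = 1 + e^{-wx}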
Step 9:
Here we are finding the partial derivative of p(w) to substitute into Step 7.
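p'(w) = -x e^{-wx}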
Step 10:
Next, we are calculating [p(w)]² to substitute into Step 7.
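[p(w)]^2 = (1 + e^{-wx})^2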
Step 11:
Here we are substituting the values of Step 9 and Step 10 into Step 7.
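\frac{\partial \hat{y}}{\partial w} = \frac{x e^{-wx}}{(1 + e^{-wx})^2}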
Step 12:
Next, we are just simplifying the equation we got in Step 11. The simplification is shown in previous chapters.
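\frac{\partial \hat{y}}{\partial w} = x \hat{y} (1 - \hat{y})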
Step 13:
Next, we are substituting the values of Step 5 and Step 12 into Step 4.
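\frac{\partial f}{\partial w} = 2(\hat{y} - y) \cdot x \hat{y} (1 - \hat{y})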
Step 14:
In this step, we are simplifying the equation in Step 13.
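\frac{\partial f}{\partial w} = 2x \, \hat{y} (1 - \hat{y})(\hat{y} - y)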
Step 15:
Next, we are rearranging the terms from Step 14.
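\frac{\partial f}{\partial w} = 2x \, (\hat{y}^2 - \hat{y}^3 - y\hat{y} + y\hat{y}^2)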
Step 16:
Now, let's say the equation in Step 15 is given by g(w).
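g(w) = 2x \, (\hat{y}^2 - \hat{y}^3 - y\hat{y} + y\hat{y}^2)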
Step 17:
Next, we are going to find the partial derivative of g(w) with respect to w. Note that by doing so, we are finding the second derivative of the cost function f(w). Since g(w) depends on w only through ŷ, we apply the chain rule again.
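\frac{\partial g}{\partial w} = 2x \cdot \frac{\partial}{\partial \hat{y}}(\hat{y}^2 - \hat{y}^3 - y\hat{y} + y\hat{y}^2) \cdot \frac{\partial \hat{y}}{\partial w}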
Step 18:
Here we are calculating the first part of the partial derivative shown in Step 17.
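\frac{\partial}{\partial \hat{y}}(\hat{y}^2 - \hat{y}^3 - y\hat{y} + y\hat{y}^2) = 2\hat{y} - 3\hat{y}^2 - y + 2y\hat{y}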
Step 19:
Next, for the second part, we are using the equation we derived in Step 12.
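\frac{\partial \hat{y}}{\partial w} = x \hat{y} (1 - \hat{y})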
Step 20:
Next, we are substituting the values of Step 18 and Step 19 into Step 17.
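\frac{\partial^2 f}{\partial w^2} = 2x \, (2\hat{y} - 3\hat{y}^2 - y + 2y\hat{y}) \cdot x \hat{y} (1 - \hat{y})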
Step 21:
Next, we are rearranging the terms in Step 20.
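\frac{\partial^2 f}{\partial w^2} = 2x^2 \, \hat{y} (1 - \hat{y}) \, (2\hat{y} - 3\hat{y}^2 - y + 2y\hat{y})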
Step 22:
Now, our goal is to find out whether the equation shown in Step 21 is ≥ 0 for every value of x or not. If it is ≥ 0 for each value of x, then f(w) is a convex function. But if that is not the case, then it's not a convex function.
Next, we are going to divide the equation in Step 21 into three parts. Then we'll check whether all three parts are ≥ 0 for all the values of x or not.
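(a) = 2x^2, \quad (b) = \hat{y}(1 - \hat{y}), \quad (c) = 2\hat{y} - 3\hat{y}^2 - y + 2y\hat{y}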
Step 23:
We can clearly say that part (a) of the equation is always going to be ≥ 0 for any value of x.
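(a) = 2x^2 \geq 0 \quad \text{for all } x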
Step 24:
Similarly, we can say that part (b) of the equation is always going to be ≥ 0 for any value of x, since the sigmoid output ŷ always lies strictly between 0 and 1. Check the blog on the Sigmoid function to get a clear understanding of this concept.
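(b) = \hat{y}(1 - \hat{y}) \geq 0, \quad \text{since } 0 < \hat{y} < 1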
Step 25:
Next, we need to find out whether part (c) of the equation will be ≥ 0 for every value of x or not. If it is, then we can confidently say that the error function is convex.
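(c) = 2\hat{y} - 3\hat{y}^2 - y + 2y\hat{y}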
Step 26:
Here, we are just rearranging the terms of part (c) from Step 25.
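(c) = 2\hat{y} - 3\hat{y}^2 + y(2\hat{y} - 1)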
Step 27:
Now, we know that this is a binary classification problem. So, there can be only two possible values for y (0 or 1). Let's first plug in y = 0 and evaluate the equation in Step 26.
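(c)\big|_{y=0} = 2\hat{y} - 3\hat{y}^2 = \hat{y}(2 - 3\hat{y})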
Here we can say that the output of the above equation is ≥ 0 only when ŷ is in the range [0, 2/3]. If the value of ŷ is in the range (2/3, 1], then the output of the above equation will be < 0. So, based on that, we can say that the above equation does not produce a value ≥ 0 for every possible input.
Step 28:
Next, let's evaluate the equation in Step 26 for y = 1.
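(c)\big|_{y=1} = 4\hat{y} - 3\hat{y}^2 - 1 = (3\hat{y} - 1)(1 - \hat{y})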
Step 29:
Here we can say that the resultant value is ≥ 0 only when the value of ŷ is in the range [1/3, 1]. If the value of ŷ is in the range [0, 1/3), then the given function gives a value < 0. So, we can say that the given function does not give results ≥ 0 for all the values of x. Therefore, we can say that the given function is not convex.
Since the second derivative is not going to be ≥ 0 for every value of x, we can say that the Mean Squared Error function is not a convex function for logistic regression.
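As a quick numerical sanity check, here is a minimal Python sketch that evaluates the second derivative from Step 21 for a single example with one feature; the specific values of w, x, and y are chosen purely for illustration:

```python
import numpy as np

def mse_second_derivative(w, x, y):
    """Second derivative of f(w) = (sigmoid(w * x) - y)**2, from Step 21."""
    y_hat = 1.0 / (1.0 + np.exp(-w * x))  # sigmoid output ŷ
    # 2x^2 * ŷ(1 - ŷ) * (2ŷ - 3ŷ^2 - y + 2yŷ)
    return 2 * x**2 * y_hat * (1 - y_hat) * (2 * y_hat - 3 * y_hat**2 - y + 2 * y * y_hat)

# For y = 0, part (c) changes sign at ŷ = 2/3:
print(mse_second_derivative(w=0.0, x=1.0, y=0.0))  # ŷ = 0.50 -> positive
print(mse_second_derivative(w=3.0, x=1.0, y=0.0))  # ŷ ≈ 0.95 -> negative
```

Because the second derivative takes both signs, the loss cannot be convex.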
Important Note:
If the value of the second derivative of the function is 0, then there is a possibility that the function is neither concave nor convex. But let's not worry too much about it here!
A Visual Look at MSE for Logistic Regression:
The Mean Squared Error function for logistic regression is given by:
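f(w) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2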
Now, we know that this is a binary classification problem. So, there can be only two possible values for yᵢ (0 or 1).
Step 1:
The value of the cost function when yᵢ = 0:
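f(w) = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i^2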
Step 2:
The value of the cost function when yᵢ = 1:
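f(w) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - 1)^2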
Now, let's consider only one training example.
Step 3:
Now, let's say we have only one training example. It means that n = 1. So, the value of the cost function when y = 0 is:
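f(w) = \hat{y}^2 = \left(\frac{1}{1 + e^{-wx}}\right)^2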
Step 4:
Similarly, with only one training example (n = 1), the value of the cost function when y = 1 is:
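f(w) = (\hat{y} - 1)^2 = \left(\frac{1}{1 + e^{-wx}} - 1\right)^2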
Step 5:
Now, let's plot the graph of the function in Step 3.
Step 6:
Now, let's plot the graph of the function in Step 4.
Step 7:
Let's put the graphs from Step 5 and Step 6 together.
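For readers who want to reproduce these curves, here is a minimal Python sketch, assuming a single feature fixed at x = 1 and weights swept over [-10, 10]:

```python
import numpy as np
import matplotlib.pyplot as plt

w = np.linspace(-10, 10, 500)          # sweep of weight values
x = 1.0                                # single input feature (assumed fixed)
y_hat = 1.0 / (1.0 + np.exp(-w * x))   # sigmoid output ŷ for each weight

plt.plot(w, y_hat**2, label="y = 0")         # Step 3: f(w) = ŷ^2
plt.plot(w, (y_hat - 1)**2, label="y = 1")   # Step 4: f(w) = (ŷ - 1)^2
plt.xlabel("w")
plt.ylabel("MSE loss")
plt.legend()
plt.show()
```

The combined plot makes the S-shaped, non-convex profile of both curves easy to see.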
The above graphs do not follow the definition of a convex function ("A function of a single variable is called a convex function if no line segment joining two points on the graph lies below the graph at any point"). So, we can say that the function is not convex.
Conclusion:
In conclusion, we have explored the proof of the non-convexity of the mean squared error in logistic regression and its implications for model optimization. We have seen that although the MSE function is convex for linear regression, it is non-convex for logistic regression, which can result in suboptimal or even misleading results. Alternative loss functions, most notably the cross-entropy loss, are convex for logistic regression and avoid these issues. As the field of machine learning continues to evolve, it is important to remain aware of the limitations and assumptions underlying our models, and to be open to exploring new approaches and techniques to improve their performance and reliability.
Citation:
For attribution in academic contexts, please cite this work as:
Shukla, et al., "Proving the Non-Convexity of the Mean Squared Error for Logistic Regression", Towards AI, 2023
BibTeX Citation:
@article{pratik_2023,
title={The Non-Convexity Debate in Machine Learning},
url={https://pub.towardsai.net/the-non-convexity-debate-in-machine-learning-e405687b17f6},
journal={Towards AI},
publisher={Towards AI Co.},
author={Shukla, Pratik},
editor={Dave, Binal},
year={2023},
month={Feb}
}