Proving the Convexity of Log-Loss for Logistic Regression
Unpacking the Log-Loss Error Function's Impact on Logistic Regression
Author(s): Pratik Shukla
"Courage is like a muscle. We strengthen it by use." - Ruth Gordon
Table of Contents:
- Proof of convexity of the log-loss function for logistic regression
- A visual look at BCE for logistic regression
- Resources and references
Introduction
In this tutorial, we will see why the log-loss function is well suited to logistic regression. Our goal is to prove that the log-loss function is a convex function of the model's weights. Once we prove that the log-loss is convex for logistic regression, we can establish that it is a good choice for the loss function, since convexity guarantees that any minimum found by an optimizer is a global minimum.
Logistic regression is a widely used statistical technique for modeling binary classification problems. In this method, the log-odds of the outcome variable are modeled as a linear combination of the predictor variables. To estimate the parameters of the model, the maximum likelihood method is used, which involves optimizing the log-likelihood function. The loss that is minimized is the negative sum of the per-observation log-likelihoods; this function is known as the log-loss, or binary cross-entropy loss. In this blog post, we will explore the convexity of the log-loss function and why it is an essential property for the optimization algorithms used in logistic regression. We will also provide a proof of the convexity of the log-loss function.
Proof of convexity of the log-loss function for logistic regression:
Let's mathematically prove that the log-loss function for logistic regression is convex.
We saw in the previous tutorial that a twice-differentiable function is convex if its second derivative is ≥ 0 everywhere (and strictly convex if it is > 0). So, here we'll take the log-loss function and find its second derivative to check whether it is ≥ 0 for all inputs. If it is, then we can say that it is a convex function.
Here we are going to consider the case of a single trial to simplify the calculations.
Step 1:
The following is a mathematical definition of the binary cross-entropy loss function (for a single trial).
Figure 1: Binary Cross-Entropy loss for a single trial
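Since the figures are not reproduced here, a standard way to write this loss in LaTeX notation, for a single trial with true label y and predicted probability ŷ (the notation in the original figure may differ slightly), is:

L(\hat{y}, y) = -\left[ y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \right]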
Step 2:
The following is the predicted value (ŷ) for logistic regression.
Figure 2: The predicted probability for the given example
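For reference, the predicted probability in logistic regression is the sigmoid of the linear score z:

\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}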
Step 3:
In the following image, z represents the linear transformation.
Figure 3: Linear transformation in forward propagation
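The later steps work with the term e^(-wx), so the linear transformation here is presumably the single-feature case with the bias term omitted (or absorbed into the weights):

z = w x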
Step 4:
Next, we substitute the expressions from Step 2 and Step 3 into the loss function from Step 1.
Figure 4: Binary Cross-Entropy loss for logistic regression for a single trial
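Substituting Step 2 and Step 3 into Step 1 gives the loss as a function of the weight w:

L(w) = -\left[ y \log\!\left(\frac{1}{1 + e^{-wx}}\right) + (1 - y)\log\!\left(1 - \frac{1}{1 + e^{-wx}}\right) \right]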
Step 5:
Next, we are simplifying the terms in Step 4.
Figure 5: Binary Cross-Entropy loss for logistic regression for a single trial
Step 6:
Next, we are further simplifying the terms in Step 5.
Figure 6: Binary Cross-Entropy loss for logistic regression for a single trial
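The simplification uses 1 - \frac{1}{1 + e^{-wx}} = \frac{e^{-wx}}{1 + e^{-wx}}, which gives:

L(w) = -\left[ y \log\!\left(\frac{1}{1 + e^{-wx}}\right) + (1 - y)\log\!\left(\frac{e^{-wx}}{1 + e^{-wx}}\right) \right]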
Step 7:
The following is the quotient rule for logarithms.
Figure 7: The quotient rule for logarithms
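In LaTeX notation, the quotient rule for logarithms is:

\log\!\left(\frac{a}{b}\right) = \log(a) - \log(b)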
Step 8:
Next, we are using the equation from Step 7 to further simplify Step 6.
Figure 8: Binary Cross-Entropy loss for logistic regression for a single trial
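Applying the quotient rule to both logarithm terms gives:

L(w) = -\left[ y\big(\log(1) - \log(1 + e^{-wx})\big) + (1 - y)\big(\log(e^{-wx}) - \log(1 + e^{-wx})\big) \right]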
Step 9:
In Step 8, the value of log(1) is going to be 0.
Figure 9: The value of log(1) = 0
Step 10:
Next, we are rewriting Step 8 with the remaining terms.
Figure 10: Binary Cross-Entropy loss for logistic regression for a single trial
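Dropping the log(1) = 0 term leaves:

L(w) = -\left[ -y \log(1 + e^{-wx}) + (1 - y)\big(\log(e^{-wx}) - \log(1 + e^{-wx})\big) \right]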
Step 11:
The following is the power rule for logarithms.
Figure 11: Power rule for logarithms
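In LaTeX notation, the power rule for logarithms is:

\log(a^{b}) = b \log(a)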
Step 12:
Next, we will use the power rule for logarithms to simplify the equation in Step 10.
Figure 12: Applying the power rule
Step 13:
Next, we are replacing the values in Step 10 with the values in Step 12.
Figure 13: Using the power rule for logarithms
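By the power rule (with natural logarithms, \log(e) = 1):

\log(e^{-wx}) = -wx \log(e) = -wx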
Step 14:
Next, we are substituting the value of Step 13 into Step 10.
Figure 14: Binary Cross-Entropy loss for logistic regression for a single trial
Step 15:
Next, we are multiplying Step 14 by (-1) on both sides.
Figure 15: Binary Cross-Entropy loss for logistic regression for a single trial
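Collecting terms after the substitution and the multiplication by (-1), the single-trial loss collapses to the following function of the weight w (the intermediate grouping shown in the original figures may differ):

f(w) = y \log(1 + e^{-wx}) + (1 - y)\big(wx + \log(1 + e^{-wx})\big) = wx\,(1 - y) + \log(1 + e^{-wx})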
Finding the First Derivative:
Step 16:
Next, we are going to find the first derivative of f(w) with respect to the weight w.
Figure 16: Finding the first derivative of f(w)
Step 17:
Here we are distributing the partial differentiation sign to each term.
Figure 17: Finding the first derivative of f(w)
Step 18:
Here we are applying the derivative rules.
Figure 18: Finding the first derivative of f(w)
Step 19:
Here we are finding the partial derivative of the last term of Step 18.
Figure 19: Finding the first derivative of f(w)
Step 20:
Here we are finding the partial derivative of the first term of Step 18.
Figure 20: Finding the first derivative of f(w)
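Assuming the simplified form f(w) = wx(1 - y) + \log(1 + e^{-wx}) from Step 15, the two term-by-term derivatives are (the grouping in the original figures may differ):

\frac{\partial}{\partial w} \log(1 + e^{-wx}) = \frac{-x\,e^{-wx}}{1 + e^{-wx}}, \qquad \frac{\partial}{\partial w}\big[ wx\,(1 - y) \big] = x\,(1 - y)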
Step 21:
Here we are putting together the results of Step 19 and Step 20.
Figure 21: Finding the first derivative of f(w)
Step 22:
Next, we are rearranging the terms of the equation in Step 21.
Figure 22: Finding the first derivative of f(w)
Step 23:
Next, we are rewriting the equation in Step 22.
Figure 23: Finding the first derivative of f(w)
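Putting the pieces together and rearranging, the first derivative can be written compactly in terms of the predicted probability:

f'(w) = x\,(1 - y) - \frac{x\,e^{-wx}}{1 + e^{-wx}} = x\left(\frac{1}{1 + e^{-wx}} - y\right) = x\,(\hat{y} - y)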
Finding the Second Derivative:
Step 24:
Next, we are going to find the second derivative of the function f(w).
Figure 24: Finding the second derivative of f(w)
Step 25:
Here we are distributing the partial derivative to each term.
Figure 25: Finding the second derivative of f(w)
Step 26:
Next, we are simplifying the equation in Step 25 to remove redundant terms.
Figure 26: Finding the second derivative of f(w)
Step 27:
Here is the derivative rule for 1/f(x).
Figure 27: The derivative rule for 1/f(x)
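In LaTeX notation, the reciprocal rule used here is:

\frac{d}{dw}\left(\frac{1}{p(w)}\right) = -\frac{p'(w)}{p(w)^{2}}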
Step 28:
Next, we are finding the relevant term to plug into Step 27.
Figure 28: Value of p(w) for the derivative of 1/p(w)
Step 29:
Here we are finding the partial derivative term for Step 27.
Figure 29: Value of p'(w) for the derivative of 1/p(w)
Step 30:
Here we are finding the squared term for Step 27.
Figure 30: Value of p(w)² for the derivative of 1/p(w)
Step 31:
Here we are putting together all the terms of Step 27.
Figure 31: Calculating the value of the derivative of 1/p(w)
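With p(w) = 1 + e^{-wx} and p'(w) = -x\,e^{-wx}, the reciprocal rule gives:

\frac{d}{dw}\left(\frac{1}{1 + e^{-wx}}\right) = -\frac{-x\,e^{-wx}}{(1 + e^{-wx})^{2}} = \frac{x\,e^{-wx}}{(1 + e^{-wx})^{2}}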
Step 32:
Here we are simplifying the equation in Step 31.
Figure 32: Calculating the value of the derivative of 1/p(w)
Step 33:
Next, we are putting together all the values in Step 26.
Figure 33: Finding the second derivative of f(w)
Step 34:
Next, we are further simplifying the terms in Step 33.
Figure 34: Finding the second derivative of f(w)
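Since f'(w) = \frac{x}{1 + e^{-wx}} - xy and the term -xy does not depend on w, the second derivative reduces to:

f''(w) = x \cdot \frac{d}{dw}\left(\frac{1}{1 + e^{-wx}}\right) = \frac{x^{2}\,e^{-wx}}{(1 + e^{-wx})^{2}}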
Alright! So, now we have the second derivative of the function f(w). Next, we need to find out whether it is ≥ 0 for every value of w (and every data point x). If it is, then we can say that the binary cross-entropy loss is convex for logistic regression.
As we can see, the squared terms in Step 34, x² in the numerator and (1 + e^(-wx))² in the denominator, are always going to be ≥ 0, because the square of any number is always ≥ 0 (and the denominator, being the square of 1 plus a positive quantity, is in fact strictly > 0).
Figure 35: The square of any term is always ≥ 0 for any value of x
Now, we need to determine whether or not the value of e^(-wx) is > 0. To do that, let's first find the range of the function e^(-wx) over the domain (-∞, +∞). To simplify the calculations, we will consider the function e^(-x) instead of e^(-wx). Note that scaling the input variable does not change the range of the function when the domain is all of (-∞, +∞) (and for w = 0 the value is simply e^0 = 1, which is also positive). Let's first plot the graph of e^(-x) to understand its range.
Figure 36: Graph of e^(-x) for the domain of [-10, 10]
From the above graph, we can derive the following conclusions:
1. As the value of x moves towards negative infinity (-∞), the value of e^(-x) moves towards infinity (+∞).
Figure 37: The value of e^(-x) as x approaches -∞
2. As the value of x moves towards 0, the value of e^(-x) moves towards 1.
Figure 38: The value of e^(-x) as x approaches 0
3. As the value of x moves towards positive infinity (+∞), the value of e^(-x) moves towards 0.
Figure 40: The value of e^(-x) as x approaches +∞
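In the notation of limits, the three observations read:

\lim_{x \to -\infty} e^{-x} = +\infty, \qquad e^{-0} = 1, \qquad \lim_{x \to +\infty} e^{-x} = 0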
So, we can say that the range of the function f(x) = e^(-x) is (0, +∞): the function approaches 0 but never actually reaches it. Based on this, the term e^(-wx) is always going to be > 0 for any values of w and x.
Alright! So, we have concluded that every term of the equation in Step 34 is ≥ 0 (and the denominator is never 0). Hence, the second derivative is ≥ 0 for all values of w, and we can say that the loss function f(w) is a convex function for logistic regression.
Important Note:
If the second derivative is 0 at some points (for example, when x = 0), the function is still convex as long as the second derivative is never negative; it just may not be strictly convex there. So, let's not worry too much about it!
A Visual Look at BCE for Logistic Regression:
The binary cross-entropy function for logistic regression is given by…
Figure 41: Binary Cross-Entropy Loss
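A standard way to write the binary cross-entropy loss over n training examples (an averaging factor 1/n may or may not appear in the original figure) is:

J = -\frac{1}{n} \sum_{i=1}^{n} \left[ Y_i \log(\hat{Y}_i) + (1 - Y_i)\log(1 - \hat{Y}_i) \right]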
Now, we know that this is a binary classification problem. So, there can be only two possible values for Yi (0 or 1).
Step 1:
The value of the cost function when Yi = 0.
Figure 42: Binary Cross-Entropy Loss when Y = 0
Step 2:
The value of the cost function when Yi = 1.
Figure 43: Binary Cross-Entropy Loss when Y = 1
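Plugging in the two possible labels, only one of the two terms survives in each case:

Y_i = 0:\quad J = -\frac{1}{n} \sum_{i=1}^{n} \log(1 - \hat{Y}_i), \qquad Y_i = 1:\quad J = -\frac{1}{n} \sum_{i=1}^{n} \log(\hat{Y}_i)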
Now, let's consider only one training example.
Step 3:
Now, let's say we have only one training example. It means that n = 1. So, the value of the cost function when Y = 0 is:
Figure 44: Binary Cross-Entropy Loss for a single training example when Y = 0
Step 4:
Now, let's say we have only one training example. It means that n = 1. So, the value of the cost function when Y = 1 is:
Figure 45: Binary Cross-Entropy Loss for a single training example when Y = 1
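With n = 1, the two cases reduce to a single term each (writing X for the predicted probability, as in the figure captions below):

Y = 0:\quad J = -\log(1 - X), \qquad Y = 1:\quad J = -\log(X)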
Step 5:
Now, let's plot the graph of the function in Step 3.
Figure 46: Graph of -log(1-X)
Step 6:
Now, let's plot the graph of the function in Step 4.
Figure 47: Graph of -log(X)
Step 7:
Let's put the graphs in Step 5 and Step 6 together.
Figure 48: Graph of -log(1-X) and -log(X)
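Since the graphs themselves are not reproduced here, the following is a minimal sketch (assuming NumPy and Matplotlib are installed) that recreates the two curves from Figures 46 and 47 on one plot, as in Figure 48:

import numpy as np
import matplotlib.pyplot as plt

# X is the predicted probability, kept strictly between 0 and 1 to avoid log(0).
X = np.linspace(0.001, 0.999, 500)

plt.plot(X, -np.log(1 - X), label="-log(1 - X)  (Y = 0)")  # loss when the true label is 0
plt.plot(X, -np.log(X), label="-log(X)  (Y = 1)")          # loss when the true label is 1
plt.xlabel("Predicted probability (X)")
plt.ylabel("Loss")
plt.legend()
plt.show()

Both curves bend upward everywhere, which is exactly the convex shape described by the line-segment definition below.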
The above graphs follow the definition of a convex function ("a function of a single variable is convex if no line segment joining two points on its graph lies below the graph at any point"). So, we can say that the function is convex.
Conclusion:
In conclusion, we have explored the concept of convexity and its importance in the optimization algorithms used in logistic regression. We have demonstrated that the log-loss function is convex, which implies that any local minimum of its optimization problem is also a global minimum. This property is crucial for ensuring the stability and convergence of the optimization algorithms used in logistic regression. By proving the convexity of the log-loss function, we have shown that the optimization problem in logistic regression is well-posed and can be solved efficiently using standard convex optimization methods. Moreover, the proof provides a deeper understanding of the mathematical foundations of logistic regression and lays the groundwork for further research and development in this field.
Citation:
For attribution in academic contexts, please cite this work as:
Shukla, et al., "Proving the Convexity of Log Loss for Logistic Regression", Towards AI, 2023
BibTex Citation:
@article{pratik_2023,
  title={Proving the Convexity of Log Loss for Logistic Regression},
  url={https://pub.towardsai.net/proving-the-convexity-of-log-loss-for-logistic-regression-49161798d0f3},
  journal={Towards AI},
  publisher={Towards AI Co.},
  author={Shukla, Pratik},
  editor={Dave, Binal},
  year={2023},
  month={Feb}
}