How did Binary Cross-Entropy Loss Come into Existence?

Last Updated on July 17, 2023 by Editorial Team

Author(s): Towards AI Editorial Team

Originally published on Towards AI.

Exploring the Genesis of Binary Cross-Entropy Loss Function

How did Binary Cross-Entropy Loss Come into Existence? — Image by Gerd Altmann from Pixabay

Author(s): Pratik Shukla

“You always pass failure on the way to success.” — Mickey Rooney

Introduction
Properties of the binary classification problem
Basics of Bernoulli trials
Probability of multiple independent events
Properties of logarithm
Why do we take a log of Bernoulli trails
Why do we take the negative log of Bernoulli trials
Single Bernoulli trial
Multiple independent Bernoulli trials
Scaling a function
Conclusion
References and resources

Introduction:

In this tutorial, we will derive the equation of the binary cross-entropy loss. The equation of the binary cross-entropy loss is given as follows.

Figure — 1: The equation of Binary Cross Entropy Loss

We will start with a single Bernoulli trial and make our way through the complicated mathematical formulas involved to derive the equation of the binary cross-entropy loss. Let’s start!

The Binary Cross-Entropy Loss function is a fundamental concept in the field of machine learning, particularly in the domain of deep learning. It is a mathematical function that measures the difference between predicted probabilities and actual binary labels in classification tasks. The Binary Cross-Entropy Loss function has become a staple in the training of neural networks, but its origins and development are not always well-known. In this blog, we will explore the history and evolution of the Binary Cross-Entropy Loss function, delving into its origins and discussing its applications in modern machine learning. By understanding the history of this crucial function, we can gain a better understanding of its purpose and potential for future research.

Properties of binary classification problem:

Each binary classification problem has the following properties:

Each example of the binary classification dataset belongs to one of the two complimentary classes.
The result of one example does not affect the result of any other examples. It means that each example is independent of the other.
All the examples are generated from the same underlying distribution.

In probability theory, if a dataset follows the 2nd and 3rd properties, then the examples are said to be independent and identically distributed (i.i.d). Considering the i.i.d. nature of the data helps to make the calculations much more straightforward.

Basics of Bernoulli trials:

Now, in binary classification problems, we have to predict the value only for one class because the probability of the negative class can be easily derived from it. It means that suppose we are performing a binary classification problem, and our outputs are dog or cat. We only need to predict the probability of a particular example being a dog. The probability of that specific example being a cat can be easily derived from it.

Figure — 2: Probability of an example being in class 0 and class 1

We can summarize this concept using the following formula.

Figure — 3: A summary of an example being in class 0 and class 1

Here we know that ŷ is the predicted value. A good binary classifier should produce a high value of ŷ (ideally 1) when the example has a positive label (y=1). Also, it should produce a low value of ŷ (ideally 0) when the example has a negative label (y=0). So, our goal should be the following.

Produce a high value of ŷ when y=1

Produce a low value of ŷ when y=0

Next, based on that, we can say that we need to maximize ŷ when y=1, and we need to minimize ŷ when y=0. We can say the same thing in other words as well.

Maximize ŷ when y=1

Maximize 1-ŷ when y=0

Instead of having two different equations for the positive and negative classes, we can create a simplified version that works for both classes.

Figure — 4: The equation for a single Bernoulli trial

The single-line expression shown in the above figure is known as a Bernoulli trial. We need to calculate it for each of the data points. After we have the results for each Bernoulli trial, we can get the final result by their multiplication.

Probability of multiple independent events:

For independent events, we can say the following.

Properties of logarithm:

1. Product rule:

2. Power rule:

Why do we take a log of Bernoulli trials?

Alright! So, as of now, we have arrived at the following equation.

Figure — 8: Equation for a single Bernoulli trial

Now, as we have discussed, we will have to find the value of each Bernoulli trial. After we have the value of each Bernoulli trial, we will find their product of them. Now, we know that since ŷ is the probability value, its value will always be between 0 and 1. So, when we multiply many values between 0 and 1, the final product might get infinitesimal or next to zero. So, to avoid this problem, we usually perform the log operation. Here note that the product rule of logarithms converts the multiplication into summation. So, instead of finding the product of values between 0 and 1, we find their summation. This is how we can avoid the problem of getting 0 as an output every time.

Now, let’s take an example to understand it in a better way.

We can see this problem and its solution in the following example.

Why do we take a negative log of Bernoulli trial?

Alright! So, after applying the natural logarithm to the Bernoulli Distribution equation, we get the following equation.

Figure — 9: Log of a single Bernoulli trial

Here, note that y is the actual label of the data, and y can take only one of the two possible values (0 or 1). The equation basically gives us the loss value. Remember that our ultimate goal is to find the optimum loss value (ideally 0).

Figure — 10: Summary of log of a Bernoulli trial

Based on the above equation, we can say that we have to minimize the value of log(ŷ) and log(1-ŷ). Now we know that ŷ represents the value of probability, so the value of ŷ can be in the interval of [0,1]. Now, let’s plot the graph of log(ŷ) and log(1-ŷ).

Figure — 11: Graphs of log(x) and log(1-x)

Here, our goal is to find the optimum loss value, which should ideally be 0. So, from the graphs, we can see that we have to maximize the function to find the optimum values here. We generally try to minimize the function or use the gradient descent algorithm in the machine learning algorithm. But, as shown in the above graphs, the curves are concave. So, here we’ll have to use the gradient ascent algorithm to reach the optimal points.

To keep the standard convention of using the gradient descent algorithm, we find the negative of the log function. So, instead of finding the function’s maximum, we will find the minimum of the function. Let’s see how it works.

Figure — 12: Negative log of a single Bernoulli trial

Figure — 13: Graphs of -log(x) and -log(1-x)

In the above graphs, we can see that we will find the optimum values at the minimum of the graphs. So, here we can use the gradient descent algorithm to find the optimal parameters.

Before we move on to multiple individual Bernoulli trials, let’s first see how this concept works for a single Bernoulli trial.

Single Bernoulli Trial:

Step — 1: Applying log

Here we are taking the original formula and applying a log to it.

Figure — 14: Applying log to the equation of a single Bernoulli trial

After applying the natural log to the Bernoulli Distribution, our formula is simplified into a sum of the log of probabilities.

Step — 2: Finding the negative of log

Next, we are finding the negative of the above-given formula.

Figure — 15: Finding the negative of the log of the equation of a single Bernoulli trial

Multiple Independent Bernoulli Trials:

Next, we have to apply this concept to multiple independent Bernoulli trials. As we discussed before, we will need to find the product of the probabilities of each independent Bernoulli trial.

Step — 1: Applying log

Next, we will apply the log to the product of the probabilities of each independent Bernoulli trial. Here note that using the log converts the product into summation.

Figure — 17: Log of multiple independent Bernoulli trials

Step — 2: Finding the negative of log

Next, we will find the negative log of the probabilities of each independent Bernoulli trial.

Figure — 18: Negative log of multiple independent Bernoulli trials

Scaling a Function:

Now, let’s understand how scaling a function does not change its maximum or minimum point with the help of an example. Let’s say our function is y=X². Now, we will scale this function by a factor of k. Let’s see the graphs of the original and scaled functions to find the minimum point for each.

Figure — 19: Functions with their minimum points

Here we can see that scaling a function does not change its maximum or minimum point. Now, let’s plot the graphs of these functions to look visually at them.

Figure — 20: Graphs of x², 10x², and 0.1x²

We can see that scaling a function does not change its minimum or maximum points. Now, we will divide our log-likelihood function by our dataset’s total number of examples (n).

Figure — 21: Binary Cross Entropy Loss function

The above formula is known as the Binary Cross Entropy function.

Conclusion:

In conclusion, the Binary Cross-Entropy Loss function has become a critical component of modern machine learning, particularly in deep learning applications. Its origins can be traced back to the early days of information theory and statistical learning, where researchers were interested in measuring the accuracy of classification models. Over the years, the Binary Cross-Entropy Loss function has undergone significant development and refinement, leading to its current widespread use in training neural networks. By understanding the history and evolution of this important function, we can gain a deeper appreciation for its purpose and potential for future research. As machine learning continues to evolve, it is likely that the Binary Cross-Entropy Loss function will continue to play a crucial role in advancing the field.

Citation:

For attribution in academic contexts, please cite this work as:

Shukla, et al., “How did Binary-Cross Entropy Loss Come into Existence?”, Towards AI, 2023

BibTex Citation:

@article{pratik_2023, 
 title={How did Binary Cross-Entropy Loss Come into Existence?}, 
 url={https://pub.towardsai.net/probability-vs-likelihood-a79335c985f7}, 
 journal={Towards AI}, 
 publisher={Towards AI Co.}, 
 author={Pratik, Shukla},
 editor={Binal, Dave}, 
 year={2023}, 
 month={Mar}
}

References and Resources:

Binary Entropy Function

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

How did Binary Cross-Entropy Loss Come into Existence?

Author(s): Towards AI Editorial Team

Exploring the Genesis of Binary Cross-Entropy Loss Function

Table of Contents:

Introduction:

Properties of binary classification problem:

Basics of Bernoulli trials:

Probability of multiple independent events:

Properties of logarithm:

1. Product rule:

2. Power rule:

Why do we take a log of Bernoulli trials?

Why do we take a negative log of Bernoulli trial?

Single Bernoulli Trial:

Step — 1: Applying log

Step — 2: Finding the negative of log

Multiple Independent Bernoulli Trials:

Step — 1: Applying log

Step — 2: Finding the negative of log

Scaling a Function:

Conclusion:

Citation:

BibTex Citation:

References and Resources:

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

GDPR CCPA Statement