Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Read by thought-leaders and decision-makers around the world. Phone Number: +1-650-246-9381 Email: [email protected]
228 Park Avenue South New York, NY 10003 United States
Website: Publisher: https://towardsai.net/#publisher Diversity Policy: https://towardsai.net/about Ethics Policy: https://towardsai.net/about Masthead: https://towardsai.net/about
Name: Towards AI Legal Name: Towards AI, Inc. Description: Towards AI is the world's leading artificial intelligence (AI) and technology publication. Founders: Roberto Iriondo, , Job Title: Co-founder and Advisor Works for: Towards AI, Inc. Follow Roberto: X, LinkedIn, GitHub, Google Scholar, Towards AI Profile, Medium, ML@CMU, FreeCodeCamp, Crunchbase, Bloomberg, Roberto Iriondo, Generative AI Lab, Generative AI Lab Denis Piffaretti, Job Title: Co-founder Works for: Towards AI, Inc. Louie Peters, Job Title: Co-founder Works for: Towards AI, Inc. Louis-François Bouchard, Job Title: Co-founder Works for: Towards AI, Inc. Cover:
Towards AI Cover
Logo:
Towards AI Logo
Areas Served: Worldwide Alternate Name: Towards AI, Inc. Alternate Name: Towards AI Co. Alternate Name: towards ai Alternate Name: towardsai Alternate Name: towards.ai Alternate Name: tai Alternate Name: toward ai Alternate Name: toward.ai Alternate Name: Towards AI, Inc. Alternate Name: towardsai.net Alternate Name: pub.towardsai.net
5 stars – based on 497 reviews

Frequently Used, Contextual References

TODO: Remember to copy unique IDs whenever it needs used. i.e., URL: 304b2e42315e

Resources

Take our 85+ lesson From Beginner to Advanced LLM Developer Certification: From choosing a project to deploying a working product this is the most comprehensive and practical LLM course out there!

Publication

Into the Logistic Regression
Latest

Into the Logistic Regression

Last Updated on April 4, 2021 by Editorial Team

Author(s): Satsawat Natakarnkitkul

DATA SCIENCE, MACHINEΒ LEARNING

Break down the concept of logistic regression and one-vs-all and one-vs-one approaches for multi-class classification

Previously I have written in-depths of linear regression both closed-form (equations) and gradient descent. You can read them from the belowΒ URL.

Closed-form and Gradient Descent Regression Explained with Python

In this article, I will focus on the Logistic Regression algorithm, break down the concept, think like a machine, and have a look at the concept behind multi-class classifiers using logistic regression.

Linear regression revisited…

Before driving to logistic regression, let’s quickly recap on linear regression.

Linear regression equation

The linear regression model works well for a regression problem, where the dependent variable is continuous and fails for classification because it treats the classes as numbers (0 and 1) and fits the best hyperplane that minimizes the distances between the points and the hyperplane, hence reduces the error (you can think the hyperplane as model equations for simplicity sake).

Linear regression in different problems (a) left figure is with regression (b) right figure is with a classification problem

As it can be seen, the prediction of linear regression can go (-∞, ∞). Moreover, it doesn’t give the probabilities as theΒ output.

This is where log-odds comeΒ in.

But what are odds and log-odds…

As mentioned earlier, linear regression cannot compute the probability. However, if we looked at probability, it can be easily converted from log-odds, by computing the logarithm of the odds ratio. Probability, odds ratios and log odds are all the same but expressed in different ways.

Probability is the likelihood of an event happens (i.e. 60% chance of students passed the MathΒ exam).

Odds, or odds of success or odds ratio, is a measure of association between an exposure and an outcome, or to simply put, it is the probability of success compared to the probability of failure (from the above example, that’s 0.6/0.4 =Β 1.5).

Log odds is the logarithm of the odds (ln(1.5) β‰…Β 0.405).

Remark: in the normal computation, you may use logarithm in any base, but need to be consistent.

Equation and conversion of probability to odds ratio andΒ log-odds

We can visualize the odds ratio against probability and log-odds against the odds ratio. Observe how odds ratio can range [0, ∞), whereas log odds can range between (-∞, ∞).

Visualization of (a) odds vs. probability and (b) log-odds vs.Β odds

Now let’s quickly see how we can interpret the oddsΒ ratio;

  • OR = 1, exposure does not affect odds of theΒ outcome
  • OR > 1, exposure associated with higher odds of theΒ outcome
  • OR < 1, exposure associated with lower odds of theΒ outcome

Using the above example, the odds of students passed math exams are 1.5 times as large as the odds of students failed math exams. Remember this is not the same as being 1.5 times as probable!

Back to Logistic Regression

I assume that we all know the definition of logistic regression, logistic model (or logit model) is the statistical method to model the probability of a certain class or event (i.e. students passing the exam, customer churn). In this model, we assume a linear relationship between the independent variables and the log-odds for the dependent variable, which can then represent as below equation.

Logistic equation

The function that converts log-odds to probability is the logistic function and the unit of measurement for the log-odds scale is called logit, which is from logistic unit (hence, theΒ name).

So from the equation above, ultimately, we try to predict the left path of the equation (not the right) because p(y=1|x) is what we want. So we can take the inverse of this logit function, then we will get something similar toΒ us.

Inverse logitΒ function

The above equation is the common sigmoid function, logistic sigmoid, which returns the class probability of p(y=1|x) from the inputs b0 + b1x1 + …

Logistic curve

As the logistic curve implied in the y-axis, this mapping is probabilities (0, 1). Ultimately, the reason we use the logarithm of an odds is it can take any positive or negative value (as previously plotted). Hence, logistic regression is a linear model for the log-odds.

Note that in the logistic equation, the parameters are chosen to maximize the likelihood of observing the sample values (or MLEβ€Šβ€”β€ŠMaximize Likelihood Estimation) instead of minimizing the sum of squared errors (or LSEβ€Šβ€”β€ŠLeast Square Estimation).

Recap: comparison between linear and logistic regression

Key differences between linear and logistic regression

What if we have more than two classes toΒ predict…

What I have discussed so far is only for binary logistic regression (dependent variable consists of only two classesβ€Šβ€”β€Šyes or no, 0 or 1). In this section, we will go through multinomial logistic regression, or we may know it as multi-class logistic regression.

Visualization of 4-class classification

To break this down, we already have a binary classification model (logistic regression), eventually, we can split the multi-class dataset into multiple binary classification datasets and fit the model toΒ each.

How does it work then …

Well.. have you ever in a situation where you need to make a guess (let’s imagine in the examinationβ€Šβ€”β€Šmultiple choices exam). You read the question, compare the choice, cut out the non-sense, compare the remaining choices and pickΒ one.

Actually, this can be applied to how the binary classification model works against multi-class problems. So it depends on how you make the comparison, assuming we have four classes: A, B, C, and D. You may choose to compare by A vs. [B, C, D] or compare A to each individual class (i.e. A vs B, A vs C). These two comparisons actually have names and they are One-vs-All and One-vs-One strategies.

One-vs-All Strategy

One-vs-All, OvA (or One-vs-Rest, OvR) is one of the strategies for using binary classification algorithms for multi-class classification.

In this approach, we will focus on one class as a positive class and the other classes are assumed to be a negative class (think of comparing one against the rest of the classes).

Visualization of one-vs-all strategy

For example, given a multi-class problem with four classes (as per the above visualization), then we can divide these into four binary classification datasets;

  • 1: blue vs [green, black,Β red]
  • 2: green vs [blue, black,Β red]
  • 3: black vs [blue, green,Β red]
  • 4: red vs [blue, green,Β black]

This strategy requires that each model predicts a probability-like score. The argmax of these scores is then used to predict a class. This strategy is normally used for algorithms suchΒ as;

  1. Logistic regression
  2. Perception
  3. Deep neural network with softmax function as an outputΒ layer

As such, the scikit-learn library implements the OvA / OvR by default when using these algorithms to solve multi-class problems.

A couple of possible downsides of this approach is when it handles very large datasets.

One-vs-One strategy

Similar to the One-vs-All strategy, the one-vs-one strategy is a method for binary classification for multi-class problems by splitting the dataset into binary classification datasets. Unlike the one-vs-all strategy, one-vs-one splits the dataset onto two specific classes, an example is illustrated below.

Visualization of one-vs-one strategy

Because of this split, there are more binary classification models than one-vs-all strategy;

  1. blue vsΒ green
  2. blue vsΒ red
  3. blue vsΒ black
  4. red vsΒ green
  5. red vsΒ black
  6. green vsΒ black

The number of models generated = (n_classes * (n_classes -1))/2

Once all the models are generated, the point will be tested to all models and recorded how many times a class is preferred with the other classes. The class having the majority of votesΒ wins.

Normally, this strategy is suggested for support vector machines and related kernel-based algorithms because the performance of the kernel method does not scale in proportion to the size of the training dataset, and using only subsets of the training data may counter thisΒ effect.

So which strategy isΒ better?

As outlined above, a one-vs-all strategy can be challenging to deal with large datasets as we still use all of the data multiple times. However, the one-vs-one strategy splits the full dataset into a binary classification for each pair of classes (see the visualization on OvA and OvOΒ above).

The one-vs-all strategy trains a fewer number of classifiers, making it a faster option. However, the one-vs-one strategy is less prone to create an imbalance in theΒ dataset.

Conclusion

In this article, I have a walkthrough;

  • review the linear regression;
  • the basic concept of log-odds and why it is beingΒ used;
  • deeper dive into the concept of logistic regression;
  • one-vs-all and one-vs-one strategy for the multi-class classifier.

Hopefully, you gain more knowledge and the background of logistic regression to extend your thoughts and build upon those (instead of just importing the logistic regression library/package and useΒ it).

Reference and ExternalΒ Links


Into the Logistic Regression was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI

Feedback ↓