A quick Introduction to Machine Learning. Part-4 (Final Part)

Last Updated on July 17, 2023 by Editorial Team

Author(s): Abhijith S Babu

Originally published on Towards AI.

In this machine learning series, we have come across various machine learning techniques. In part-3, we saw how decision trees can be used in machine learning. The drawbacks of the ID-3 algorithms were pointed out. In this article, We will go through some more decision tree algorithms to address those drawbacks.

A quick Introduction to Machine Learning. Part-4 (Final Part)

A decision tree algorithm serves the purpose of choosing the best variable to split the dataset and the optimal splitting point. The ID-3 algorithm succeeded in choosing the best variable, but splitting the variable optimally was not possible. If the variable has continuous values, a common approach is to divide the set of values into small groups of equal intervals. Splitting the variables into these large groups is computationally complex. Another problem in machine learning is that the data available to us might not be complete. To handle all these problems, an enhanced form of ID-3 was introduced — the C4.5 algorithm.

C4.5 algorithm

In C4.5, the splitting variable is selected similar to the ID-3 algorithm. In ID-3, we chose the variable with minimum entropy, as it has the least impurity. Once we split the data, the entropy will be less than the total entropy before splitting. That difference is the information that we passed on to the decision tree. This is known as the information gain of that variable.
The total split information of the data can be calculated using the formula.

A variable is chosen in C4.5 based on the gain ratio, that is the ratio of information gain of the variable to the split info of the variable. The variable with the maximum gain ratio is selected as the splitting attribute.
Now that we have chosen the splitting variable, our next task is to split the variable. Categorical variables can be split according to the categories, but how to split continuous variables?
Consider our trip example with a small modification. Instead of whether there is an exam, the data shows the number of days left for the upcoming exam.

Now, find all the distinct values in the variable. Here the variable contains the values 6, 7, 8, and 10. We can now split according to each of these values and find the gain ratio of each split. Choose the value that gave the maximum gain ratio to make the split.

The CART algorithm is another technique used to train decision trees. The algorithm uses the Gini index as a measure of impurity. The Gini index of a variable can be calculated using the formula.

The CART algorithm usually creates a binary tree. The values of categorical values are divided into two groups for splitting. This division is done based on the Gini index of the data. Creating binary decision trees is helpful in easy interpretation and reduces the complexity of testing the data.
By closely examining all the values in a variable, CART can efficiently identify the outliers and imbalances and segregate them into a new sub-tree. Thus this algorithm is useful in dealing with improper data.

Until now, we have seen various techniques in machine learning. As we discussed earlier, the predictions in machine learning will not be 100% accurate. To prevent underfitting, we have to make sure that we use enough data for training. But using a large amount of data for training might lead to overfitting. To address this issue, let us come up with a technique known as regularization. Regularization reduces the overfitting of training data by adding small constraints in the learning process. This makes the machine learning model more generalized.

So far in this series, we have seen a drop in the ocean of machine learning. There are a lot more techniques and theories in machine learning. This series covered the basic topics that will introduce you to the machine learning world. Follow me to read more interesting articles in the field of artificial intelligence. Give your doubts and suggestions in the response and they will be considered in the upcoming parts. Happy reading!!!

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

A quick Introduction to Machine Learning. Part-4 (Final Part)

Author(s): Abhijith S Babu

C4.5 algorithm

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

Why Knowledge Graphs Are the Missing Piece in AI Agent API Discovery

The Complexity of Self-Driving Cars Explained Simply

Bridging Symbolic AI and Deep Learning: How Knowledge Graphs are Revolutionizing ResNets

LAI #93: Smarter Model Choices, Multi-Agent Systems, and Cutting Through AI Noise

Who Wins Purview vs Rogue AI in Data Control

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

A quick Introduction to Machine Learning. Part-4 (Final Part)

Author(s): Abhijith S Babu

C4.5 algorithm

Related posts

Popular posts

Updates

Recent Posts

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

GDPR CCPA Statement