Decision Trees Explained With a Practical Example
Author(s): Davuluri Hemanth Chowdary
A decision tree is one of the supervised machine learning algorithms, this algorithm can be used for regression and classification problems — yet, is mostly used for classification problems. A decision tree follows a set of if-else conditions to visualize the data and classify it according to the conditions.
Before we dive deep into the working principle of the decision tree’s algorithm you need to know a few keywords related to it.
- Root Node: This attribute is used for dividing the data into two or more sets. The feature attribute in this node is selected based on Attribute Selection Techniques.
- Branch or Sub-Tree: A part of the entire decision tree is called branch or sub-tree.
- Splitting: Dividing a node into two or more sub-nodes based on if-else conditions.
- Decision Node: After splitting the sub-nodes into further sub-nodes, then it is called as the decision node.
- Leaf or Terminal Node: This is the end of the decision tree where it cannot be split into further sub-nodes.
- Pruning: Removing a sub-node from the tree is called pruning.
Working of Decision Tree
- The root node feature is selected based on the results from the Attribute Selection Measure(ASM).
- The ASM is repeated until there is a leaf node or a terminal node where it cannot be split into sub-nodes.
What is Attribute Selective Measure(ASM)?
Attribute Subset Selection Measure is a technique used in the data mining process for data reduction. The data reduction is necessary to make better analysis and prediction of the target variable.
The two main ASM techniques are
- Gini index
- Information Gain(ID3)
The measure of the degree of probability of a particular variable being wrongly classified when it is randomly chosen is called the Gini index or Gini impurity. The data is equally distributed based on the Gini index.
Mathematical Formula :
Pi= probability of an object being classified into a particular class.
When you use the Gini index as the criterion for the algorithm to select the feature for the root node.,The feature with the least Gini index is selected.
Entropy is the main concept of this algorithm which helps in determining a feature or attribute that gives maximum information about a class is called Information gain or ID3 algorithm. By using this method we can reduce the level of entropy from the root node to the leaf node.
Mathematical Formula :
‘p’, denotes the probability of E(S) which denotes the entropy. The feature or attribute with the highest ID3 gain is used as the root for the splitting.
Well I know the ASM techniques are not clearly explained in the above context
Let me explain the whole process with an example.
IT’S TIME FOR PRACTICAL EXAMPLE
Predict the loan eligibility process from given data.
The above problem statement is taken from Analytics Vidhya Hackathon.
You can find the dataset and more information about the variables in the dataset on Analytics Vidhya.
The reason I took a classification problem is that we can visualize the decision tree after training and that is not possible with regression models.
Step1: Load the data and finish the cleaning the process
Well, there are two possible ways here either fill the null values with some value or drop all the missing values(I dropped all the missing values).
Now if you look at the shape of the original data-set it is (614,13) and the new data-set after dropping the null values is (480,13).
Step 2: Take a Look at the data-set
We found there are many categorical values in the data-set.
NOTE: Decision tree does not support categorical data as features.
So the optimal step to take at this point is you can use feature engineering techniques like label encoding and one hot label encoding.
Step 3: Split the data-set into train and test sets
Why should we split the data before training a machine learning algorithm?
For detailed reasoning, please visit Sanjeev’s article regarding training, development, test, and splitting of the data.
Step 4: Build the model and fit the train set
Before we visualize the tree let us do some calculations and find out the root node by using Entropy.
Calculation 1: Find the Entropy of total dataset
p = no of positive cases(Loan_Status accepted)
n = number of negative cases(Loan_Status not accepted)
In the data set we have
p = 332 , n=148, p+n=480
here log is with base 2
Entropy = E(s) = 0.89
Calculation 2: Now find the Entropy and gain for every column
1) Gender Column
There are two types in this male(1) and female(0)
Condition 1: Male
data-set with all male’s in it and then,
p = 278, n=116 , p+n=489
Entropy(G=Male) = 0.87
Condition 2: Female
data-set with all female’s in it and then,
p = 54 , n = 32 , p+n = 86
Entropy(G=Female) = 0.95
Average Information in Gender column is
Gain(Gender) = E(s) — I(Gender)
Gain = 0.89–0.88
Gain = 0.001
2) Married Column
In this column we have yes and no values
Condition 1: Married = Yes(1)
In this split the whole data-set with Married status yes
p = 227 , n = 84 , p+n = 311
E(Married = Yes) = 0.84
Condition 2: Married = No(0)
In this split the whole data-set with Married status no
p = 105 , n = 64 , p+n = 169
E(Married = No) = 0.957
Average Information in Married column is
Gain = 0.89–0.88 = 0.001
3) Education Column
In this column, we have graduate and not graduate values.
Condition 1: Education = Graduate(1)
p = 271 , n = 112 , p+n = 383
E(Education = Graduate) = 0.87
Condition 2: Education = Not Graduate(0)
p = 61 , n = 36 , p+n = 97
E(Education = Not Graduate) = 0.95
Average Information of Education column= 0.886
Gain = 0.01
4) Self-Employed Column
In this column we have Yes or No values
Condition 1: Self-Employed = Yes(1)
p = 43 , n = 23 , p+n = 66
E(Self-Employed=Yes) = 0.93
Condition 2: Self-Employed = No(0)
p = 289 , n = 125 , p+n = 414
E(Self-Employed=No) = 0.88
Average Information in Self-Employed in Education Column = 0.886
Gain = 0.01
5) Credit Score Column
In this column we have 1 and 0 values
Condition 1: Credit Score = 1
p = 325 , n = 85 , p+n = 410
E(Credit Score = 1) = 0.73
Condition 2: Credit Score = 0
p = 63 , n = 7 , p+n = 70
E(Credit Score = 0) = 0.46
Average Information in Credit Score column = 0.69
Gain = 0.2
Compare all the Gain values
Credit Score has the highest gain so that will be used in the root node
Now let’s verify with the decision tree of the model.
Step 4: Visualize the Decision Tree
Well, it’s like we got the calculations right!
So the same procedure repeats until there is no possibility for further splitting.
Step 5: Check the score of the model
We almost got 80% percent accuracy. Which is a decent score for this type of problem statement?
Let’s get back with the theoretical information about decision tree
Assumptions made while creating the decision tree:
- While starting the training the whole data-set is considered as the root.
- The input values are preferred to be categorical.
- Records are distributed based on attribute values.
- The attributes are placed as the root node of the tree is based on statistical results.
Optimizing the performance of the trees
- max_depth: The maximum depth of the tree is defined here.
- criterion: This parameter takes the criterion method as the value. The default value is Gini.
- splitter: This parameter allows us to choose the split strategy. Best and random are available types of the split. The default value is best.
Try changing the values of those variables and you will find some changes in the model’s accuracy.
How to avoid Over-Fitting?
Use Random Forest trees.
- Decision trees are easy to visualize.
- Non-linear patterns in the data can be captures easily.
- It can be used for predicting missing values, suitable for feature engineering techniques.
- Over-fitting of the data is possible.
- The small variation in the input data can result in a different decision tree. This can be reduced by using feature engineering techniques.
- We have to balance the data-set before training the model.
I hope everything is clear about decision trees. I tried to explain it easily. A clap and follow is appreciated.
Stay tuned for more.
Decision Trees Explained With a Practical Example was originally published in Towards AI — Multidisciplinary Science Journal on Medium, where people are continuing the conversation by highlighting and responding to this story.
Published via Towards AI