
Mastering the Basics: How Decision Trees Simplify Complex Choices

Last Updated on March 4, 2025 by Editorial Team

Author(s): Miguel Cardona Polo

Originally published on Towards AI.

"Trees playing Baseball" by author using DALL·E 3.

Decision trees form the backbone of some of the most popular machine learning models in industry today, such as Random Forests, Gradient Boosted Trees, and XGBoost.

Large Language Models (LLMs) are an exciting and very useful tool, but most real-world industry problems are not solved with LLMs. Instead, the majority of machine learning applications deal with structured, tabular data, such as large CSVs, Excel files, and databases. It is estimated that 70-80% of these tabular data tasks are solved using gradient boosting techniques like XGBoost, which rely on simple yet incredibly powerful decision trees.

One of the biggest advantages of decision trees is their interpretability. Unlike modern black-box models, decision trees provide clear, step-by-step reasoning behind predictions. This transparency helps businesses understand their data better, make smarter decisions, and move beyond just predictions.

In this article, you'll gain a deep understanding of how decision trees work, including:

  • The math behind decision trees (optional for those interested).
  • Python code to build your own decision tree from scratch.
  • Two hands-on examples (regression & classification) with step-by-step calculations, showing exactly how a decision tree learns.

Don't miss these detailed walkthroughs to solidify your understanding!

Concept of Decision Trees

A decision tree is like a flowchart used to make decisions. It starts at a single point (called the root node) and splits into branches based on questions about the data. At each step, the tree asks a question like "is the value greater than X?" or "does it belong to category Y?". Based on the answer, it moves down a branch to the next question (these intermediate points are called decision nodes).

This process continues until the data reaches a final point (called a leaf) which gives the decision or prediction — this could be a "Yes/No", a specific class, or a continuous number.

Take a look at this decision tree used to predict which students will pass an exam. It is based on the number of hours the student studied, their hours of sleep the day before the exam, and their previous grade.

Flow chart of decision tree used to predict which students will pass an exam. Image by author.

Each leaf node represents a group of data points that have similar characteristics and are therefore given the same prediction (Pass or Fail). For example, students who studied between 2 and 6 hours and slept more than 6 hours form a similar group (based on what was seen in the training data), and therefore the decision tree predicts they'll pass the exam.

Note that decisions can be made both on numerical data, like the hours slept, and on categorical data, like the previous grade achieved by the student. This is why decision trees are so popular for tabular data, such as spreadsheets and databases, which can contain both types.
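
To make the flowchart concrete, here is a rough sketch of it as plain Python conditionals. The thresholds and branch order are illustrative guesses based on the description above, not the exact ones in the figure:

def predict_pass(hours_studied, hours_slept, passed_previous_exam):
    # too little sleep -> predicted to fail
    if hours_slept < 2:
        return "Fail"
    # well rested and studied a reasonable amount -> predicted to pass
    if 2 <= hours_studied <= 6 and hours_slept > 6:
        return "Pass"
    # remaining students: fall back on their previous exam result
    return "Pass" if passed_previous_exam else "Fail"

print(predict_pass(hours_studied=4, hours_slept=7, passed_previous_exam=False))  # Pass

A real decision tree learns these thresholds from data; here they are hard-coded only to show the structure.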

If you are wondering how this flowchart translates into data, we can plot it on a graph. The hours of sleep and study are represented as axes, and the previous grade as a cross (failed previous exam) or a circle (passed previous exam). We can place on the graph some example students for whom we want to predict their next exam grade; the position of each cross or circle indicates that student's hours of sleep and study.

Partitioned graph of decision tree to predict which students will pass an exam. Image by author.

You can check that this graph represents the same decision tree as the flowchart, where the blue dashed lines are the decision boundaries (thresholds) and each highlighted section represents a leaf node of the decision tree.

There's an area left unhighlighted as the prediction under those conditions is based on the student's last exam, so only those students who passed their last exam are predicted to pass.

Now let's look into how decision trees choose the questions and the numbers (thresholds on features) that make them an accurate prediction model.

How Decision Trees Learn

As mentioned earlier, the goal is to split the data into smaller groups, so that similar data points are grouped together. Decision trees do this by asking questions and using thresholds (numbers or categories) on the training data.

A split in a decision tree is a point where the data is divided based on a specific feature and threshold, creating branches. For example, in the case discussed earlier, one feature was the number of hours a student slept, with a threshold of 'less than 2 hours'. This split created a branch grouping students who slept less than two hours. These are predicted to fail their next exam.

To choose the best split, decision trees try all possible splits (features and thresholds) and pick the one with the lowest impurity, a value that indicates how mixed or diverse the data in a group is. Lower impurity means the group contains similar data, which is the aim of the learning process.

Impurity Measure

It's called an impurity measure because it captures the diversity within a group. For example, if you have a basket of fruit that only contains apples, there is no diversity; the basket is pure, and therefore the impurity is low. On the other hand, if the basket has a mix of apples, oranges, and bananas, it has high diversity and therefore high impurity.

There are impurity measures specific to regression tasks, where we predict a continuous number, and to classification tasks, where the target is a class. Here is one example of each.

Formula for the classification impurity measure Entropy, and for the regression impurity measure Variance. Image by author.
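
Written out, these are the standard definitions of entropy and variance (which the figure above shows):

H(S) = -\sum_{i=1}^{C} p_i \log_2 p_i          (Entropy, classification)

\mathrm{Var}(S) = \frac{1}{n} \sum_{j=1}^{n} (y_j - \bar{y})^2          (Variance, regression)

where p_i is the proportion of samples in class i, C is the number of classes, y_j are the target values in the group, and \bar{y} is their mean.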

If you are interested in the intuition behind these formulas and fancy some example calculations, the section below is for you; otherwise, feel free to skip it.

Impurity Measure — A Deeper Look

First, let's build intuition for the Entropy formula by understanding how different splits yield higher or lower Entropy values. Consider two splits: the first, Split A, has 3 red and 2 purple balls; the second, Split B, has 4 red and 1 purple. Which one has the lower impurity?

Comparison of how two splits yield different entropies and how they are calculated. Image by author.

The winner is Split B. The 5 balls in Split B are more similar to each other, as one colour dominates the group more heavily.

We can also visualise these two splits by graphing the Entropy formula and checking where they lie on the graph.

Graph showing how Entropy (impurity) changes with different proportions. Image by author.

You can check that Split B has a lower impurity than Split A, as the latter sits higher on the impurity axis. Splits with low impurity have one large proportion and one very small one, so the terms being summed are small and the resulting impurity is low. Splits with more equal proportions, around the 0.5 mark, sit closer to the peak of the curve (higher up the impurity scale).

The example below shows how the impurity would be calculated for a candidate split when training a decision tree.

Example of entropy calculation for a hypothetical split. Image by author.

Note that there are some steps that are not obvious from the formula alone. Since the decision boundary (split) creates two groups (left and right), each requires its own entropy calculation. And because the two groups rarely contain the same number of samples, we combine them with a weighted sum: groups with more examples are more representative of the data, so they receive a higher weight in the impurity calculation.
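
As a quick illustration, the weighted entropy of a candidate split can be computed like this (the group labels below are made up, not the figure's exact example):

import numpy as np

def entropy(y):
    # proportions of each class present in the group
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return -np.sum(p * np.log2(p))

# hypothetical split: left and right groups created by a decision boundary
left = np.array([1, 1, 1, 0])        # 3 of class 1, 1 of class 0
right = np.array([0, 0, 0, 0, 1])    # 4 of class 0, 1 of class 1

n = len(left) + len(right)
weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
print(round(weighted, 3))  # weighted impurity of this candidate split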

Advanced readers might be wondering why we need the logarithm when a similar effect could be achieved by squaring the proportion. This version of the impurity measure exists; it's called the Gini Impurity. Here is a graph comparing the two measures at different proportions.

Graph comparing Entropy against Gini Impurity. X-axis is the proportion, Y-axis is the impurity. Image by author.

Gini Impurity uses squared probabilities, which gives a smoother curve and often prefers larger, more dominant classes.

Entropy has a logarithmic curve, so it reacts more strongly to changes in smaller class probabilities (notice the earlier bump), potentially leading to different, sometimes more balanced, splits.
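
If you want to reproduce this comparison yourself, a small script along these lines plots both curves (my own sketch, not the code used for the article's figure):

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)  # proportion of one class in a binary group
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 1 - (p ** 2 + (1 - p) ** 2)

plt.plot(p, entropy, label="Entropy")
plt.plot(p, gini, label="Gini Impurity")
plt.xlabel("Proportion of one class")
plt.ylabel("Impurity")
plt.legend()
plt.show()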

Now, let's look at the Variance formula for decision trees in regression tasks. This time, since there are no classes, the impurity is not based on proportions but on how much each target value in the group differs from the group's mean.

The following example shows the calculations for a potential split in a regression task, this time on house prices. As in the classification task, the split generates two groups, and a calculation is performed on each group to get the total impurity.
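
The same weighted-sum idea applies. Here is a minimal sketch with made-up house prices (not the exact figures from the example):

import numpy as np

def variance(y):
    # mean squared difference from the group mean
    return np.var(y)

# hypothetical split on house size: cheaper houses left, pricier houses right
left_prices = np.array([150.0, 180.0, 200.0])
right_prices = np.array([260.0, 300.0, 330.0])

n = len(left_prices) + len(right_prices)
weighted = (len(left_prices) / n) * variance(left_prices) \
         + (len(right_prices) / n) * variance(right_prices)
print(weighted)  # total impurity of this candidate split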

But how do you prevent overfitting?

You might have noticed that if you keep making splits to minimise impurity, you will end up splitting the data so much that you isolate every single data point. This creates a massive tree whose leaf nodes each represent just one sample from the training data, which defeats the whole purpose of the decision tree: it wouldn't be able to generalise to unseen data. To prevent this sort of overfitting, we can introduce stopping criteria that stop the decision tree from growing under certain conditions.

Stopping Criteria

The stopping criteria are a set of rules that prevent the tree from growing too large and overfitting the data. There are many criteria that can be used in decision trees, and most can be combined during training. The following is a non-exhaustive list.

Maximum Depth Reached
The tree stops growing when it reaches a set maximum depth (number of splits from root to leaf). This prevents overly complex trees.

Minimum Samples to Split
A node must have at least a certain number of samples to be split further. This prevents splitting small, unreliable groups.

Minimum Gain in Information
A split must reduce impurity by at least a certain amount to be accepted; otherwise, that branch stops splitting and becomes a leaf node.

Maximum Number of Nodes/Leaves
Limits how many total nodes or leaves the tree can have. It prevents excessive growth and memory usage.

The values we set for these rules are hyper-parameters, meaning they are values we declare before training that dictate how the decision tree learns. Modifying these values has an impact on the performance of the decision tree, so they must be tuned to achieve the desired performance.
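
In practice, libraries expose these stopping criteria directly as constructor arguments. As a sketch, scikit-learn's DecisionTreeClassifier accepts max_depth, min_samples_split, min_impurity_decrease, and max_leaf_nodes (the values below are arbitrary examples, not recommendations):

from sklearn.tree import DecisionTreeClassifier

# each argument corresponds to one of the stopping criteria above
clf = DecisionTreeClassifier(
    max_depth=5,                 # maximum depth reached
    min_samples_split=2,         # minimum samples to split
    min_impurity_decrease=1e-7,  # minimum gain in information
    max_leaf_nodes=16,           # maximum number of leaves
)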

Hyper-parameter tuning is outside the scope of this article (I will publish an article on it soon), but if you're keen to apply it to decision trees, you can read this article by GeeksForGeeks on how to perform hyper-parameter tuning on decision trees using Python.

Having covered how decision trees work and learn, we may now look into some worked examples with code.

Worked example for Classification

Let's start with the age-old question: when should I bring an umbrella? Consider the following data of days on which an umbrella was brought or not.

Table of data for the task of deciding when to take an umbrella. Image by author.

A quick glance at this data reveals that it's always good to bring an umbrella when it's raining, except when wind speeds are extreme enough to make you take off into infinity.

You can choose any classification impurity measure, but for now let's use Gini Impurity.

Formula of Gini Impurity. Image by author

The first thing we need in our code is the ability to read the data. We are using words to describe the weather conditions, but to operate on these we need to convert them into numbers. This is what the function load_data_from_csv does.

import numpy as np
import csv

def load_data_from_csv(path):
    """
    Required to turn our worded data into usable numbers for the decision tree.
    """
    # X is the feature matrix (the upper-case letter suggests it's a matrix, not a vector)
    # y is the target variable vector (what we want to predict)
    X, y = [], []
    condition_mapping = {"sunny": 0, "cloudy": 1, "rainy": 2}
    with open(path, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            X.append([condition_mapping[row['conditions']], float(row['wind'])])
            y.append(1 if row['umbrella'] == "yes" else 0)

    return np.array(X), np.array(y)
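
A quick sanity check, assuming the table above is saved as a CSV with columns conditions, wind, and umbrella (the filename here is just a placeholder):

X, y = load_data_from_csv("umbrella_data.csv")  # hypothetical filename
print(X)  # encoded conditions and wind speeds
print(y)  # 1 = umbrella was brought, 0 = it wasn't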

Now we need a representation of our tree, as well as our measure of impurity. The TreeNode class represents the decision tree. When the tree expands into two branches (child nodes), these are also subtrees and are therefore also instances of the TreeNode class.

The weighted_impurity function does what we explained earlier in the impurity measure deep dive: it computes the weighted impurity, so that if one side has more samples, it gets more importance than the other, less populated side.

class TreeNode:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # Feature to split on
        self.threshold = threshold  # Threshold for the split
        self.left = left            # Left child node
        self.right = right          # Right child node
        self.value = value          # Value if this is a leaf node

def gini_impurity(y):
    classes, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    return 1 - np.sum(probabilities ** 2)

def weighted_impurity(left_y, right_y, impurity_function):
    n = len(left_y) + len(right_y)
    left_weight = len(left_y) / n
    right_weight = len(right_y) / n
    return (
        left_weight * impurity_function(left_y) + right_weight * impurity_function(right_y)
    )
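
A quick check of these helpers on toy labels (my own example values, not from the article):

pure = np.array([1, 1, 1, 1])   # only one class -> Gini impurity of 0
mixed = np.array([1, 1, 0, 0])  # evenly mixed -> maximum Gini impurity of 0.5

print(gini_impurity(pure))    # 0.0
print(gini_impurity(mixed))   # 0.5
print(weighted_impurity(pure, mixed, gini_impurity))  # 0.25, weighted by group sizes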

We can now represent a tree and compute the impurity of a split, but we still need to find the best split. For this, we iterate over all features and candidate values to find the threshold that yields the lowest impurity and therefore gives the best split.

The first for-loop iterates over the features, and the second for-loop iterates over the values of that feature. The midpoint between two consecutive values is chosen as a threshold, and the impurity is calculated. This is repeated for all values and all features. The best threshold of the best feature becomes the first split.

FEATURES = {0: "conditions", 1: "wind_speed"}

def find_best_split(X, y, impurity_function):
    best_feature = None
    best_threshold = None
    best_impurity = float('inf')

    # iterate over features
    for feature_idx in range(X.shape[1]):
        sorted_indices = np.argsort(X[:, feature_idx])
        X_sorted = X[sorted_indices, feature_idx]
        y_sorted = y[sorted_indices]

        # iterate over values
        for i in range(1, len(X_sorted)):
            if X_sorted[i] == X_sorted[i - 1]:
                continue
            threshold = (X_sorted[i] + X_sorted[i - 1]) / 2
            left_y = y_sorted[:i]
            right_y = y_sorted[i:]
            split_impurity = weighted_impurity(left_y, right_y, impurity_function)

            if split_impurity < best_impurity:
                best_feature = feature_idx
                best_threshold = threshold
                best_impurity = split_impurity

    best_feature_word = FEATURES[best_feature]
    print(f"Best Feature: {best_feature_word}")
    print(f"Best Threshold: {best_threshold}")
    print(f"Best Impurity: {best_impurity}\n")
    return best_feature, best_threshold, best_impurity

To build the tree, the process of finding the best split is repeated until one of the stopping criteria is met. For each split, a node is added to the tree. Here we also choose the stopping criteria, in this case:

  • Maximum depth = 5
  • Minimum samples split = 2
  • Minimum impurity decrease = 1e-7

def build_tree(X, y, impurity_function, depth=0, max_depth=5, min_samples_split=2, min_impurity_decrease=1e-7):
    if len(y) < min_samples_split or depth >= max_depth or impurity_function(y) < min_impurity_decrease:
        leaf_value = np.bincount(y).argmax()
        return TreeNode(value=leaf_value)

    best_feature, best_threshold, best_impurity = find_best_split(X, y, impurity_function)
    if best_feature is None:
        leaf_value = np.bincount(y).argmax()
        return TreeNode(value=leaf_value)

    left_indices = X[:, best_feature] <= best_threshold
    right_indices = X[:, best_feature] > best_threshold

    left_subtree = build_tree(X[left_indices], y[left_indices], impurity_function, depth + 1, max_depth, min_samples_split, min_impurity_decrease)
    right_subtree = build_tree(X[right_indices], y[right_indices], impurity_function, depth + 1, max_depth, min_samples_split, min_impurity_decrease)

    return TreeNode(feature=best_feature, threshold=best_threshold, left=left_subtree, right=right_subtree)
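
The article stops at building the tree, but for completeness here is a minimal traversal sketch that walks a sample down the tree to produce a prediction (my own addition, not part of the original listing):

def predict_one(node, x):
    # walk down the tree until a leaf node (which stores a value) is reached
    if node.value is not None:
        return node.value
    if x[node.feature] <= node.threshold:
        return predict_one(node.left, x)
    return predict_one(node.right, x)

# hypothetical usage, assuming X and y were loaded as above:
# rainy (2) with 20 km/h wind -> should predict 1 (take the umbrella)
tree = build_tree(X, y, gini_impurity)
print(predict_one(tree, np.array([2, 20.0])))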

With the chosen stopping criteria, the decision tree finishes after two iterations. These are the best splits it finds:

# 1st Iteration
Best Feature: conditions
Best Threshold: 1.5
Best Impurity: 0.1875

# 2nd Iteration
Best Feature: wind_speed
Best Threshold: 40.0
Best Impurity: 0.0

It immediately found that the best first split is to only consider rainy conditions when taking an umbrella. This is why the best feature is the conditions, and the best threshold is 1.5: conditions above 1.5 mean only rainy days (rainy = 2), since sunny = 0 and cloudy = 1. This yielded the lowest impurity at 0.1875.

The next best decision is to stop taking an umbrella at high wind speeds, in this case at 40 km/h. This finished the learning process as it achieved an impurity of 0.

Decision tree on when to take an umbrella. Image by author

Worked example for Regression

Following from the example on house price predictions, let's code the regression decision tree using an extended version of that data.

Table of data for house prices. Image by author.

We will slightly modify the data loading function to ingest the housing data.

def load_data_from_csv(filename):
    X, y = [], []
    with open(filename, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            X.append([float(row['size']), float(row['num_rooms'])])
            y.append(float(row['price']))
    return np.array(X), np.array(y)

The TreeNode class remains exactly the same, but since we are now looking at a regression task instead of classification, the impurity measure is different — this time, Variance.

def variance(y):
    if len(y) == 0:
        return 0
    return np.var(y)

def weighted_variance(left_y, right_y):
    n = len(left_y) + len(right_y)
    return (len(left_y) / n) * variance(left_y) + (len(right_y) / n) * variance(right_y)

We'll still use the same find_best_split algorithm for regression, this time passing variance as the impurity function (and updating the FEATURES mapping to {0: "size", 1: "num_rooms"} so the printouts show the right feature names).

The difference between building a regression tree and a classification tree lies in the leaf nodes, and it's really subtle. When you reach a leaf node in a classification tree, the output should be one of the possible classes; this is why we use np.bincount(y).argmax(), which returns the class that appears most often in that final group. That way, when we have reached the end of the tree, where we must make a prediction, and we are left with an impure group of several classes, we choose the most frequent one.

This is different from regression trees because the output is a continuous number. So, instead of taking the most frequent class, we take the mean of all the numbers we have in the remaining group. Hence the use of np.mean(y).

def build_regression_tree(X, y, impurity_function, depth=0, max_depth=5, min_samples_split=2, min_variance_decrease=1e-7):
    if len(y) < min_samples_split or depth >= max_depth or impurity_function(y) < min_variance_decrease:
        return TreeNode(value=np.mean(y))

    # pass the variance-based impurity function to the same split-finding routine
    best_feature, best_threshold, best_variance = find_best_split(X, y, impurity_function)
    if best_feature is None:
        return TreeNode(value=np.mean(y))

    left_indices = X[:, best_feature] <= best_threshold
    right_indices = X[:, best_feature] > best_threshold

    left_subtree = build_regression_tree(X[left_indices], y[left_indices], impurity_function, depth + 1, max_depth, min_samples_split, min_variance_decrease)
    right_subtree = build_regression_tree(X[right_indices], y[right_indices], impurity_function, depth + 1, max_depth, min_samples_split, min_variance_decrease)

    return TreeNode(feature=best_feature, threshold=best_threshold, left=left_subtree, right=right_subtree)
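
Putting it together, an end-to-end run might look like this (the CSV filename is a placeholder, and predict_one is the traversal sketch from the classification section):

X, y = load_data_from_csv("house_prices.csv")  # hypothetical filename
reg_tree = build_regression_tree(X, y, variance)

# predict the price of a hypothetical 100 m^2 house with 3 rooms
print(predict_one(reg_tree, np.array([100.0, 3.0])))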

This time we have three iterations before we meet the stopping criteria.

Note that in regression tasks, the impurity measure is heavily influenced by the range of the target variable. In this example, house prices range from 150 (lowest) to 330 (highest). A dataset with a larger range will naturally have a higher impurity value compared to one with a smaller range, simply because variance scales with the spread of the data. However, this does not mean that a dataset with a higher impurity produces better splits than one with a lower impurity. Since they represent different distributions, each dataset should be evaluated independently based on how well the feature splits reduce impurity relative to its own scale.

Best Feature: size
Best Threshold: 95.0
Best Impurity: 768.0

Best Feature: size
Best Threshold: 75.0
Best Impurity: 213.33333333333334

Best Feature: size
Best Threshold: 115.0
Best Impurity: 200.0

An interesting finding is that the number of rooms turns out not to be a good feature to split on. If you look at the data again, you will notice a high correlation between the price and the size of the house; they almost seem to increase in step. This is why size yielded the lowest impurity and was chosen as the best feature in every iteration.
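
You can verify this quickly with a correlation check, again assuming X and y hold the housing data loaded above:

size = X[:, 0]
num_rooms = X[:, 1]

print(np.corrcoef(size, y)[0, 1])       # correlation between size and price
print(np.corrcoef(num_rooms, y)[0, 1])  # correlation between number of rooms and price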

Decision tree on house prices. Image by author.

Conclusions

Decision trees are a fundamental component of powerful machine learning models like XGBoost, as they offer high predictive performance and excellent interpretability.

As seen in the examples, the ability to recursively split data based on the most informative features makes them highly effective at capturing complex patterns, while their structured decision-making process provides clear insights into model behaviour.

Unlike black-box models, decision trees allow us to understand why a prediction was made, which is crucial in domains like finance and healthcare. This balance of power, efficiency, and explainability makes decision trees and their boosted ensembles essential tools in modern machine learning.

