Exploring the Popularity of Random Forest Among GIS Data Experts
Last Updated on June 3, 2024 by Editorial Team
Author(s): Stephen Chege-Tierra Insights
Originally published on Towards AI.
I was thinking of a unique way to introduce this article, since it is about a sort of healthy obsession between two, well, how should I refer to them? Features, elements, or things? I decided to open with a poem.
Now, let me make it clear: I could never compose a poem even if my life depended on it. The only time I ever wrote one was in the 8th grade or so, to impress a girl, and that ended in disaster. Anyway, I prompted ChatGPT to pen a poem describing the fascination between data scientists and Random Forest, and here is what it came up with. Please do not zone out; I am going somewhere with this.
This was written by AI.
In forests deep, where data gleams, Lies a tale of scientist's dreams. Random Forest, its name divine, Where data's secrets intertwine.
With branches strong and leaves arrayed, A model's prowess is displayed. In unity, predictions soar, A symphony of data's lore.
Obsession stirs, in minds, and it grows, As Random's magic brightly glows. In its embrace, the quest unfolds, Why Data Scientists, its allure beholds.
So let us journey, and delve deep within, Where data reigns, and minds begin. To understand the mystique untold Of why Random Forest, we so behold.
Not bad, right? I want to delve into why Random Forest has become so popular among data scientists, what the secret sauce is that makes it so adored, and how long it will stay relevant, especially during the AI boom.
What is Random Forest?
According to IBM, Random Forest is a popular machine learning algorithm that can be used for a variety of tasks, including regression and classification. It is an ensemble method, meaning that a random forest model is made up of a large number of small decision trees, each of which produces its own prediction. The random forest model pools the predictions of the trees to produce a more accurate estimate.
As we all know, a forest contains many trees, and the more trees it has, the more robust it is. Likewise, a Random Forest algorithm's accuracy and ability to address problems rise with the number of trees in the model.
To put it in non-data-scientist terms, using Random Forest is similar to asking your friends for advice. Each friend (tree) makes a decision based on several factors, and the final decision is whichever choice is most popular among them. Random Forest combines many such trees to produce accurate classifications or predictions.
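To make the analogy concrete, here is a minimal sketch of the voting idea. The synthetic dataset and the tree count are my own illustrative choices; it hand-rolls a few bootstrapped decision trees, pools their votes, and then shows that scikit-learn's RandomForestClassifier bundles the same bootstrap-and-vote recipe into one estimator:
# A minimal sketch of the "friends voting" idea (synthetic data and
# tree count are illustrative assumptions, not from the article).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each "friend" is a decision tree trained on a bootstrap sample of the data.
rng = np.random.default_rng(0)
votes = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx])
    votes.append(tree.predict(X))

# The ensemble's answer is the most popular vote across trees.
majority = np.round(np.mean(votes, axis=0)).astype(int)

# RandomForestClassifier packages the same idea into a single estimator.
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
print("hand-rolled majority vote accuracy:", (majority == y).mean())
print("RandomForestClassifier accuracy:", (rf.predict(X) == y).mean())
Both accuracies here are measured on the training data purely to show the two routes agree; in practice you would evaluate on a held-out set, as the snippets later in this article do.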
Leo Breiman and Adele Cutler developed Random Forest as an ensemble learning technique for regression and classification in 2001. It expands on the idea of decision trees, simple models that make predictions by asking a sequence of if-then questions about a data point's features.
So, why the sudden rise in popularity?
Researchers and developers could adopt Random Forest quickly because implementations were readily available in major machine learning tools such as R and scikit-learn (Python), the latter having become a popular library among data scientists. These libraries lowered the adoption hurdle by offering efficient, thoroughly documented Random Forest implementations.
It has proven to be very effective, especially for regression and classification, and its ability to handle large data sets quickly and with high accuracy has made the data scientist's life easier compared to many other machine learning techniques.
The machine learning community's ongoing research and development has produced improvements and extensions to the Random Forest method. This constant innovation has kept Random Forest applicable and flexible in the face of changing challenges and problems.
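One concrete example of such an extension, for illustration: Extremely Randomized Trees (ExtraTrees), which push the randomization further by also randomizing the split thresholds, and which ship alongside Random Forest in scikit-learn. A minimal sketch comparing the two on the iris data (the dataset and parameters are my own choices):
# A minimal sketch comparing Random Forest with one of its extensions,
# Extremely Randomized Trees; dataset and parameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for model in (RandomForestClassifier(n_estimators=100, random_state=42),
              ExtraTreesClassifier(n_estimators=100, random_state=42)):
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(type(model).__name__, round(scores.mean(), 3))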
As I mentioned earlier, Random Forest is available in both R and Python, which have become very popular among GIS data scientists, statisticians, and software developers. Random Forest algorithms are well documented, making it easy to find instructions on how to implement, deploy, and, most importantly, debug your code if you encounter an error.
How to load Random Forest with Python.
from sklearn.ensemble import RandomForestClassifier
Loading Random Forest with R.
install.packages("randomForest")
library(randomForest)
# Load the pre-trained Random Forest model from the file
loaded_rf_model <- readRDS("random_forest_model.rds")

# Now you can use the loaded model for predictions or other tasks
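For parity on the Python side, a trained scikit-learn forest is usually persisted and reloaded with joblib; a minimal sketch, where the file name is an illustrative assumption:
# A minimal sketch of saving and reloading a scikit-learn Random Forest.
# The file name is an illustrative assumption.
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

joblib.dump(rf, "random_forest_model.joblib")          # save to disk
loaded_rf = joblib.load("random_forest_model.joblib")  # load it back
print(loaded_rf.predict(X[:5]))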
Random Forest and GIS
Random Forest is utilized in several sectors of society, such as environmental science, earth observation, hospitality, urban planning, video game design, and robotics. In the geospatial space, random forest algorithms are essential for tasks including species distribution modelling, land cover classification, vegetation mapping, and predicting urban growth. They efficiently extract information on environmental patterns, land use dynamics, and habitat suitability from remote sensing data, satellite imagery, and GIS layers.
The applications of Random Forest models in environmental management, urban planning, natural resource conservation, and disaster response are critical due to their resilience to noise and capacity to handle high-dimensional data.
A simple Random Forest code snippet for Python.
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
predictions = rf_classifier.predict(X_test)

# Evaluate the accuracy of the classifier
accuracy = (predictions == y_test).mean()
print("Accuracy:", accuracy)
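Part of Random Forest's appeal, especially for GIS work with many input bands, is that a trained forest also reports which features mattered most. A quick self-contained sketch of the feature_importances_ attribute (the printout format is my own):
# A minimal sketch of Random Forest feature importances on the iris data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(iris.data, iris.target)

# Higher scores mean the forest leaned on that feature more when splitting.
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")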
Random Forest code snippet for R.
# Install and load the randomForest package
install.packages("randomForest")
library(randomForest)
# Load your dataset (replace 'your_dataset.csv' with the actual file path)
data <- read.csv("your_dataset.csv")

# Split the data into predictors (X) and target variable (y)
X <- data[, -target_column_index]           # Exclude the target variable column
y <- as.factor(data[[target_column_index]]) # Extract the target column by index; a factor target makes randomForest run classification

# Split the data into training and testing sets
set.seed(42) # Set seed for reproducibility
train_indices <- sample(1:nrow(data), 0.8 * nrow(data)) # 80% training data
X_train <- X[train_indices, ]
y_train <- y[train_indices]
X_test <- X[-train_indices, ]
y_test <- y[-train_indices]

# Train the Random Forest model
rf_model <- randomForest(X_train, y_train, ntree = 100)

# Make predictions on the test data
predictions <- predict(rf_model, X_test)

# Evaluate the accuracy of the model
accuracy <- mean(predictions == y_test)
print(paste("Accuracy:", accuracy))
Python Random Forest script for GIS.
import numpy as np
import pandas as pd
import rasterio
from rasterio.plot import show
import geopandas as gpd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load your raster data
raster_path = 'path/to/your/raster.tif'
raster = rasterio.open(raster_path)
image = raster.read()
show(raster)

# Assume the raster has multiple bands
bands, height, width = image.shape
image = image.reshape((bands, height * width)).transpose()  # one row per pixel

# Load your training data
# Training data should be a shapefile with labeled points
training_data_path = 'path/to/your/training_data.shp'
training_data = gpd.read_file(training_data_path)

# Sample features and labels from the training data
# Make sure the training data has a label column and point geometries
features = []
labels = []
for index, point in training_data.iterrows():
    x, y = point['geometry'].x, point['geometry'].y
    r, c = raster.index(x, y)  # row/column of the pixel containing this point
    features.append(image[r * width + c])
    labels.append(point['label'])  # Replace 'label' with your actual label column name

features = np.array(features)
labels = np.array(labels)  # labels are assumed to be numeric class codes

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=42)

# Initialize and train the Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')

# Predict on the entire raster (cast to int32 for GeoTIFF compatibility)
predictions = clf.predict(image).astype(np.int32)
predictions = predictions.reshape((height, width))

# Save the predictions as a new raster
output_raster_path = 'path/to/output_raster.tif'
with rasterio.open(
    output_raster_path,
    'w',
    driver='GTiff',
    height=height,
    width=width,
    count=1,
    dtype=predictions.dtype,
    crs=raster.crs,
    transform=raster.transform,
) as dst:
    dst.write(predictions, 1)
For Google Earth Engine
// In the Earth Engine Code Editor, the ee object is available by default;
// no import is needed. Define a region of interest.
var geometry = ee.Geometry.Polygon(
    [[[-122.092, 37.42],
      [-122.086, 37.42],
      [-122.086, 37.426],
      [-122.092, 37.426]]]);

// Load a Landsat 8 image collection.
var landsat = ee.ImageCollection('LANDSAT/LC08/C01/T1_SR')
.filterDate('2020-01-01', '2020-12-31')
.filterBounds(geometry)
.median()
.clip(geometry);

// Load training data (an Earth Engine table asset with labeled points or polygons).
var trainingData = ee.FeatureCollection('path/to/your/training_data_shapefile');

// Select bands to use for classification.
var bands = ['B2', 'B3', 'B4', 'B5', 'B6', 'B7'];

// Sample the input imagery to get a FeatureCollection of training data.
var training = landsat.select(bands).sampleRegions({
collection: trainingData,
properties: ['class'], // Replace 'class' with the actual label column name in your shapefile
scale: 30
});

// Train a Random Forest classifier with default parameters.
var classifier = ee.Classifier.smileRandomForest(100).train({
features: training,
classProperty: 'class', // Replace 'class' with the actual label column name in your shapefile
inputProperties: bands
});

// Classify the image.
var classified = landsat.select(bands).classify(classifier);

// Display the results.
Map.centerObject(geometry, 10);
Map.addLayer(landsat, {bands: ['B4', 'B3', 'B2'], min: 0, max: 3000}, 'Landsat 8');
Map.addLayer(classified, {min: 0, max: 3, palette: ['red', 'green', 'blue', 'yellow']}, 'Classification');

// Print an accuracy assessment (note: this resamples the training data,
// so it is a resubstitution accuracy rather than an independent validation).
var validation = classified.sampleRegions({
collection: trainingData,
properties: ['class'],
scale: 30
});

var testAccuracy = validation.errorMatrix('class', 'classification');
print('Validation error matrix: ', testAccuracy);
print('Validation overall accuracy: ', testAccuracy.accuracy());
What does the future look like?
Well, it is difficult to predict the future, especially with the rapid pace of change in the tech world, but the AI spring is currently in full force, and the need for more effective random forest algorithms will grow to keep up with the demand for AI and deep learning.
Random Forest algorithms will remain crucial in my field of expertise, the geospatial and environmental sciences, for evaluating remote sensing data, forecasting changes in land cover, determining habitat suitability, and tackling climate change-related issues, as data scientists continue to refine the method through R&D.
It is possible that Random Forest could fade into obscurity as more sophisticated algorithms, such as a well-developed K-nearest Neighbor tailored for deep learning and artificial intelligence requirements, emerge as dominant. However, only time will reveal the trajectory of such developments.
Published via Towards AI