Hands-on Random Forest with Python
Last Updated on July 6, 2022 by Editorial Team
Author(s): Tirendaz Academy
Originally published on Towards AI, the World's Leading AI and Technology News and Media Company.
A practical guide on how to implement random forests with grid search technique using scikit-learn.
A single model may make a wrong prediction, but if you combine the predictions of several models into one, you can make better predictions. This concept is called ensemble learning. Ensembles are methods that combine multiple models to build more powerful models, and they have gained huge popularity during the last decade. Two essential ensemble models are based on decision trees: random forests and gradient boosted decision trees. In this post, I'll talk about the following topics:
- What is a random forest?
- Some advantages and disadvantages of random forests
- How to implement a random forest with a real-world dataset?
Let's dive in!
What is Random Forest?
Random forest is a supervised machine learning algorithm that is widely used in classification and regression problems. You can think of a random forest as an ensemble of decision trees. Individual decision tree models tend to overfit the training data; a random forest mitigates this overfitting problem.
To implement a random forest, you need to build many decision trees. A random forest consists of a collection of decision trees, and each tree in the forest is slightly different from the others. Each tree considers a different, randomly selected subset of the features when splitting. When making the final prediction, the predictions of all trees are combined and averaged. Since you use many trees, you can reduce the amount of overfitting.
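This averaging is visible directly in scikit-learn: a fitted forest exposes its individual trees through the estimators_ attribute, and its probability estimate is the mean of the trees' estimates. A minimal sketch on a synthetic toy dataset (the dataset and all sizes below are illustrative assumptions, not part of this article's example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A toy classification dataset, assumed purely for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)
print(len(forest.estimators_))  # 10 individual decision trees

# Averaging the per-tree class probabilities reproduces the forest's
# own probability estimate.
avg = np.mean([tree.predict_proba(X_toy) for tree in forest.estimators_], axis=0)
print(np.allclose(avg, forest.predict_proba(X_toy)))  # True
```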
Some Advantages of Random Forests
Let's take a look at some advantages of random forests.
- You can use random forests for both classification and regression tasks.
- Random forests often work well without heavy tuning of the hyperparameters.
- You don't need to scale the data.
- Random forests may provide better accuracy than single decision trees since they mitigate the overfitting problem.
Some Disadvantages of Random Forests
Random forests also have some disadvantages. Let's take a look at them.
- Random forests tend not to perform well on very high-dimensional, sparse data such as text data.
- Random forests are not as simple to interpret as a single decision tree, since they combine many deeper trees.
Now that you have seen some advantages and disadvantages of random forests, let's go ahead and take a look at how to implement one with scikit-learn.
How to Implement Random Forest with Scikit-Learn?
To show how to implement a random forest, I'm going to use the Breast Cancer Wisconsin dataset. Before loading the dataset, let me import pandas.
import pandas as pd
Let's load the dataset.
df = pd.read_csv("breast_cancer_wisconsin.csv")
You can find the notebook and dataset here. Let's take a look at the first five rows of the dataset.
df.head()
This dataset consists of samples of malignant and benign tumor cells. The first column in the dataset shows the unique ID numbers, and the second column shows the diagnosis: M indicates malignant and B indicates benign. The rest of the columns are our features. Let's take a look at the shape of the dataset.
df.shape
#Output:
(569, 33)
Data Preprocessing
Data preprocessing is one of the most important stages of data analysis. Now, let's create the input and output variables. To do this, I'm going to use the loc method. First, let me create our target variable.
y = df.loc[:,"diagnosis"].values
Let's create our feature variable and remove unnecessary columns. To do this, I'm going to use the drop method.
X = df.drop(["diagnosis","id","Unnamed: 32"],axis=1).values
Note that our target variable has two categories, M and B. Let's encode the target variable with a label encoder. First, I'm going to import this class.
from sklearn.preprocessing import LabelEncoder
Now, I'm going to create an object from this class.
le = LabelEncoder()
Let's fit and transform our target variable.
y = le.fit_transform(y)
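As a quick sanity check, here is what LabelEncoder does on a handful of toy labels (the labels below are illustrative, not the dataset's actual column): classes are stored in sorted order, so B becomes 0 and M becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["M", "B", "B", "M"])

print(le.classes_.tolist())  # ['B', 'M'] -- classes are stored sorted
print(encoded.tolist())      # [1, 0, 0, 1]
```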
Before building the model, let's split the dataset into training and test sets. To do this, I'm going to use the train_test_split function. First, let me import this function.
from sklearn.model_selection import train_test_split
Let's split our dataset using this function.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    random_state=0)
Cool. Our datasets are ready to analyze.
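The stratify=y argument above is worth a closer look: it keeps the class proportions identical in the training and test sets. A small sketch with toy labels (the sizes and class ratio are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A toy target vector with an 80/20 class balance.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          stratify=y_toy,
                                          random_state=0)

# Both splits keep the original 20% minority-class share.
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```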
Building a Random Forest Model
To use a random forest in scikit-learn, we need to import RandomForestClassifier from the ensemble module. First, let's import this class.
from sklearn.ensemble import RandomForestClassifier
Now, I'm going to create an object from this class. Here, I'm only going to use default values.
rf = RandomForestClassifier(random_state=0)
Next, let's build our model. To do this, I'm going to use the fit method with the training set.
rf.fit(X_train, y_train)
Awesome. Our model is ready to predict. Let's evaluate our model using the training and test sets.
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
Now, let's take a look at the performance of our model on both datasets. To do this, I'm going to use the accuracy_score function. First, let me import this function.
from sklearn.metrics import accuracy_score
After that, let's take a look at the accuracy scores for the training and test sets.
rf_train = accuracy_score(y_train, y_train_pred)
rf_test = accuracy_score(y_test, y_test_pred)
Now, let's print these scores.
print(f"Random forest train/test accuracies: {rf_train:.3f}/{rf_test:.3f}")
#Output:
Random forest train/test accuracies: 1.000/0.958
Awesome, the scores were printed. As you can see, the score on the training set is 100%, while the score on the test set is about 96%. This gap means the model has an overfitting problem: it learned the training set so well that it essentially memorized the outcomes, but it does not generalize as well to unseen data. To overcome the overfitting problem, we can control the complexity of the model.
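Before tuning, here is a small illustration of what controlling complexity means (the synthetic noisy dataset and the specific max_depth value are assumptions for illustration): limiting tree depth keeps the forest from memorizing the training labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# flip_y adds label noise, which fully grown trees will happily memorize.
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   flip_y=0.2, random_state=0)

deep = RandomForestClassifier(random_state=0).fit(X_toy, y_toy)           # unconstrained depth
shallow = RandomForestClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# The unconstrained forest scores (near-)perfectly on its own training
# data; the depth-limited forest cannot memorize the noisy labels.
print(deep.score(X_toy, y_toy))
print(shallow.score(X_toy, y_toy))
```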
Hyperparameter Tuning with Grid Search
To control model complexity, we need to tune the model's hyperparameters. To do this, I'm going to use the grid search technique. Grid search is a model hyperparameter optimization technique. In scikit-learn, it is provided by the GridSearchCV class. Let's import this class.
from sklearn.model_selection import GridSearchCV
Now, I'm going to create an object from RandomForestClassifier for use in grid search.
rf = RandomForestClassifier(random_state=42)
When constructing the GridSearchCV object, you need to provide a dictionary of hyperparameters as the param_grid argument. This is a map from each parameter name to an array of values to try. Now, let me create a parameters variable that contains the values of the parameters.
parameters = {'max_depth': [5, 10, 20],                          (1)
              'n_estimators': [i for i in range(10, 100, 10)],   (2)
              'min_samples_leaf': [i for i in range(1, 10)],     (3)
              'criterion': ['gini', 'entropy'],                  (4)
              'max_features': ['auto', 'sqrt', 'log2']}          (5)
(1) The maximum depth of the trees. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
(2) To build a random forest model, you need to decide on the number of trees. n_estimators specifies the number of trees in the forest. For this parameter, I used a list comprehension.
(3) The min_samples_leaf parameter specifies the minimum number of samples required at a leaf node. For this parameter, I used a list comprehension again.
(4) I tried two values for the criterion parameter: gini and entropy.
(5) Lastly, I set how the features are selected. Note that max_features is a critical parameter in the random forest technique; it controls how many features are considered when looking for the best split.
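Before running an exhaustive grid search, it is worth counting how many models it will train. A minimal sketch (GridSearchCV's default 5-fold cross-validation is assumed):

```python
# Count the candidate combinations in the grid. GridSearchCV evaluates
# every combination with cross-validation (5-fold by default), so the
# total number of fits is the product of the list lengths times 5.
parameters = {'max_depth': [5, 10, 20],
              'n_estimators': [i for i in range(10, 100, 10)],
              'min_samples_leaf': [i for i in range(1, 10)],
              'criterion': ['gini', 'entropy'],
              'max_features': ['auto', 'sqrt', 'log2']}

n_candidates = 1
for values in parameters.values():
    n_candidates *= len(values)

print(n_candidates)      # 1458 candidate combinations
print(n_candidates * 5)  # 7290 fits with the default 5-fold CV
```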
Thus, we specified the values of the parameters. To find the best parameters, I'm going to create an object from GridSearchCV.
clf = GridSearchCV(rf, parameters, n_jobs=-1)
Next, I'm going to fit our model with the training set.
clf.fit(X_train, y_train)
Finally, to see the best parameters, I'm going to use the best_params_ attribute.
print(clf.best_params_)
#Output:
{'criterion': 'entropy', 'max_depth': 5, 'max_features': 'log2', 'min_samples_leaf': 3, 'n_estimators': 10}
When we execute this cell, we can see the best parameters.
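Besides best_params_, the fitted GridSearchCV object also stores the winner's mean cross-validated score in best_score_ and the refit model itself in best_estimator_. A compact, self-contained illustration (the toy data and deliberately tiny grid are both assumptions, not the article's search):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=150, random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {'n_estimators': [5, 20], 'max_depth': [2, 5]})
search.fit(X_toy, y_toy)

print(search.best_params_)           # the winning combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
print(search.best_estimator_)        # the refit model used by search.predict
```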
Evaluating the Random Forest Model
Now, I'm going to predict the values of the training and test sets. Note that we don't need to train our model again: after the best parameters are found, GridSearchCV refits the model with these parameters on the whole training set. So you can directly use the clf object for prediction. Let's predict the values with this model.
y_train_pred=clf.predict(X_train)
y_test_pred=clf.predict(X_test)
rf_train = accuracy_score(y_train, y_train_pred)
rf_test = accuracy_score(y_test, y_test_pred)
print(f"Random forest train/test accuracies: {rf_train:.3f}/{rf_test:.3f}")
#Output:
Random forest train/test accuracies: 0.993/0.965
The accuracy scores were printed for the best parameters. The test accuracy improved, and the score on the training set is now close to the score on the test set, which indicates much less overfitting. In addition, both accuracy scores are close to 1. So, we found the best parameters and used them to predict the labels of the training and test sets.
Conclusion
In this post, I talked about random forests and how to implement this technique with scikit-learn. A random forest consists of multiple decision trees, and it combines the results of all the trees to make its prediction. With this approach, you can reduce overfitting, and you can perform both classification and regression tasks. That's it. Thanks for reading. I hope you enjoyed it.