Random forests are ensemble models that aggregate many decision trees to reduce variance and improve generalization.1 This document walks through training and interpreting a random forest classifier in Python, with a mix of narrative, math, and visuals.
Mathematical Model
Each tree \(T_b\) is trained on a bootstrap sample \(\mathcal{D}_b\) and considers a random subset of features at each split. The forest prediction for a classification task with \(B\) trees is the majority vote:

\[
\hat{y}(x) = \arg\max_{c} \sum_{b=1}^{B} \mathbb{1}\{T_b(x) = c\}
\]
The randomization across bootstrapped data and feature subsampling drives decorrelation between trees, delivering lower variance than single-tree models.
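As a toy illustration (not part of the original notebook), the majority vote can be computed by hand for three trees and two classes:

```python
import numpy as np

# Toy illustration of the majority vote: rows = trees, columns = samples.
tree_preds = np.array([
    [0, 1, 1],   # tree 1
    [1, 1, 0],   # tree 2
    [0, 1, 1],   # tree 3
])

# For each sample (column), count votes per class and take the argmax.
votes = np.apply_along_axis(np.bincount, 0, tree_preds, minlength=2)
forest_pred = votes.argmax(axis=0)
print(forest_pred)  # [0 1 1]
```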
```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay, classification_report
from sklearn.decomposition import PCA
from sklearn.inspection import DecisionBoundaryDisplay

sns.set_theme(style="whitegrid")
```
Data Loading and Preparation
We will use the Breast Cancer Wisconsin dataset bundled with scikit-learn, which contains 30 features computed from digitized fine needle aspirate images.2
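The loading and splitting cell is not shown in this export, so here is a minimal sketch of that step; the split ratio, the stratification, and the names X_train, X_test, y_train, y_test (which the later cells rely on) are assumptions:

```python
# Sketch of the loading/splitting step; test_size and stratify are assumed.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```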
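The cell that builds and fits the forest is also not shown; judging from the parameter listing below, it was presumably something like the following sketch (only the non-default settings matter, and the name rf is ours):

```python
# Sketch reconstructed from the parameter listing below; not the author's exact cell.
rf = RandomForestClassifier(
    n_estimators=400,
    min_samples_leaf=2,
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)
```

Fitting produces the estimator whose configuration is summarized below.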
Parameters of the fitted RandomForestClassifier:

| Parameter | Value |
| --- | --- |
| n_estimators | 400 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 2 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |
| oob_score | False |
| n_jobs | -1 |
| random_state | 42 |
| verbose | 0 |
| warm_start | False |
| class_weight | None |
| ccp_alpha | 0.0 |
| max_samples | None |
| monotonic_cst | None |
Evaluate cross-validated training performance to estimate generalization ability.
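A minimal sketch of that evaluation, assuming the rf estimator and training split defined above:

```python
# 5-fold cross-validated accuracy on the training split (cv=5 is an assumption).
cv_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="accuracy", n_jobs=-1)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```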
To visualize the decision boundaries of the high-dimensional Random Forest model, we project the data onto its first two Principal Components (PCA). This allows us to see how the model separates classes in a 2D latent space.
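The projection-and-plot cell is not included here; one common way to produce such a figure is to refit a forest on the two principal components and draw its decision regions with DecisionBoundaryDisplay. The sketch below follows that approach and is not the author's exact code:

```python
# Sketch: project the training data to 2 PCA components and refit a forest on
# that 2-D space so the decision regions can be drawn.
pca = PCA(n_components=2, random_state=42)
X_train_2d = pca.fit_transform(X_train)

rf_2d = RandomForestClassifier(n_estimators=400, random_state=42, n_jobs=-1)
rf_2d.fit(X_train_2d, y_train)

fig, ax = plt.subplots(figsize=(8, 6))
DecisionBoundaryDisplay.from_estimator(
    rf_2d, X_train_2d, response_method="predict", alpha=0.4, ax=ax
)
ax.scatter(X_train_2d[:, 0], X_train_2d[:, 1], c=y_train, s=15, edgecolor="k")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("Random Forest decision regions in PCA space")
plt.show()
```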
This graph illustrates how the model’s Out-of-Bag (OOB) error rate stabilizes as the number of trees in the forest increases, demonstrating the ensemble effect.
```python
n_estimators_range = range(15, 300, 10)
oob_errors = []

# Grow the same forest incrementally: warm_start=True reuses the already-fitted
# trees, and oob_score=True recomputes the out-of-bag score after each fit.
rf_oob = RandomForestClassifier(warm_start=True, oob_score=True,
                                random_state=42, n_jobs=-1)
for n in n_estimators_range:
    rf_oob.set_params(n_estimators=n)
    rf_oob.fit(X_train, y_train)
    oob_errors.append(1 - rf_oob.oob_score_)

plt.figure(figsize=(10, 6))
plt.plot(n_estimators_range, oob_errors, marker='o', linestyle='-', color='purple')
plt.title("OOB Error Rate vs. Number of Trees")
plt.xlabel("Number of Trees (n_estimators)")
plt.ylabel("OOB Error Rate")
plt.grid(True)
plt.tight_layout()
plt.show()
```
Hyperparameter Considerations
n_estimators: Increasing trees generally improves stability until diminishing returns set in.
max_depth or min_samples_leaf: Control tree complexity, mitigating overfitting.
max_features: Governs the degree of feature randomness; sqrt is typical for classification.
class_weight: Useful for imbalanced datasets to penalize misclassification of minority classes.
Grid search or Bayesian optimization can systematically explore these settings.3
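For instance, a small grid search might look like the following sketch (the grid values are illustrative, not the author's):

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid over the knobs discussed above.
param_grid = {
    "n_estimators": [200, 400],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```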
Practical Tips
Feature scaling: Not required because trees are invariant to monotonic transformations.
Missing values: scikit-learn’s implementation does not handle NaNs; impute beforehand.
Interpretability: Use SHAP values or permutation importance for richer explanations (a permutation-importance sketch follows this list).
Out-of-bag (OOB) estimates: Enable oob_score=True to get a built-in validation metric without a separate hold-out set.
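A minimal permutation-importance sketch, assuming the fitted rf and the held-out X_test, y_test from the split above:

```python
from sklearn.inspection import permutation_importance

# Permutation importance: the drop in test accuracy when each feature is shuffled.
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=-1)
importances = pd.Series(result.importances_mean, index=X_test.columns).sort_values()
importances.tail(10).plot.barh(figsize=(8, 5), title="Permutation importance (top 10)")
plt.tight_layout()
plt.show()
```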