
Random Forests in Practice

Author

Data Science Lab

Published

November 21, 2025

Overview

Random forests are ensemble models that aggregate many decision trees to reduce variance and improve generalization.1 This document walks through training and interpreting a random forest classifier in Python, with a mix of narrative, math, and visuals.

Mathematical Model

Each tree \(T_b\) is trained on a bootstrap sample \(\mathcal{D}_b\), and each split within a tree considers only a random subset of the features. The forest prediction for a classification task with \(B\) trees is the majority vote:

\[ \hat{y} = \mathrm{mode}\left(\{T_b(\mathbf{x})\}_{b=1}^{B}\right) \]

For regression, the trees are averaged:

\[ \hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(\mathbf{x}) \]

Randomization from both the bootstrap sampling and the per-split feature subsampling decorrelates the trees, so the aggregated ensemble has lower variance than any single tree.
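
To make the aggregation concrete, the short sketch below (a standalone illustration on synthetic data, separate from the analysis that follows) queries each fitted tree individually and takes the column-wise mode. Note that scikit-learn's RandomForestClassifier.predict averages class probabilities across trees rather than taking a hard vote, so the two rules can occasionally disagree on borderline points.

Code
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem purely to illustrate the aggregation rule.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Shape (B, n_samples): one row of predictions per tree T_b.
# Individual trees return encoded class indices, so map back through classes_.
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_]).astype(int)
votes = stats.mode(per_tree, axis=0, keepdims=False).mode
hard_vote = forest.classes_[votes]

# Agreement with the built-in (probability-averaging) prediction is typically near 100%.
print("agreement with forest.predict:", np.mean(hard_vote == forest.predict(X_demo)))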

Environment Setup

Code
import importlib
import subprocess
import sys

def ensure(module, package=None):
    """Import a module, installing its pip distribution if it is missing."""
    try:
        importlib.import_module(module)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package or module])

# The import name and the pip package name differ for scikit-learn ("sklearn").
packages = {"numpy": None, "pandas": None, "seaborn": None, "matplotlib": None, "sklearn": "scikit-learn"}
for module, package in packages.items():
    ensure(module, package)
Code
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay, ConfusionMatrixDisplay, classification_report
from sklearn.decomposition import PCA
from sklearn.inspection import DecisionBoundaryDisplay

sns.set_theme(style="whitegrid")

Data Loading and Preparation

We will use the Breast Cancer Wisconsin dataset bundled with scikit-learn, which contains 30 features computed from digitized fine needle aspirate images.2

Code
dataset = load_breast_cancer(as_frame=True)
df = dataset.frame
df.head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

Split the data into training and testing sets (stratified to maintain label balance).

Code
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"),
    df["target"],
    test_size=0.25,
    random_state=42,
    stratify=df["target"]
)

X_train.shape, X_test.shape
((426, 30), (143, 30))

Model Training

Code
rf = RandomForestClassifier(
    n_estimators=400,
    max_features="sqrt",
    min_samples_leaf=2,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
RandomForestClassifier(min_samples_leaf=2, n_estimators=400, n_jobs=-1,
                       random_state=42)

Evaluate cross-validated training performance to estimate generalization ability.

Code
cv_scores = cross_val_score(rf, X_train, y_train, cv=5)
cv_scores.mean(), cv_scores.std()
(np.float64(0.9600547195622434), np.float64(0.020556647327639923))

Diagnostics

Code
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=dataset.target_names))
              precision    recall  f1-score   support

   malignant       0.96      0.92      0.94        53
      benign       0.96      0.98      0.97        90

    accuracy                           0.96       143
   macro avg       0.96      0.95      0.95       143
weighted avg       0.96      0.96      0.96       143

Receiver Operating Characteristic

Code
fig, ax = plt.subplots(figsize=(8, 6))
RocCurveDisplay.from_estimator(rf, X_test, y_test, ax=ax)
ax.set_title("Random Forest ROC Curve")
plt.tight_layout()
plt.show()

Receiver Operating Characteristic curve showing the true positive rate vs false positive rate for the Random Forest classifier.

Confusion Matrix

Code
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay.from_estimator(rf, X_test, y_test, display_labels=dataset.target_names, ax=ax, cmap="Blues")
ax.set_title("Confusion Matrix")
plt.tight_layout()
plt.show()

Confusion Matrix heatmap showing true labels vs predicted labels for the Random Forest model.

Feature Importance Visualization

Code
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
top_features = importances.head(15)

plt.figure(figsize=(10, 8))
sns.barplot(x=top_features.values, y=top_features.index, palette="viridis")
plt.title("Top 15 Feature Importances (Gini)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()

Bar chart displaying the top 15 most important features according to the Random Forest model.

Complex Visualization: Decision Boundaries

The forest above is trained on 30 features, so its decision surface cannot be plotted directly. Instead, we project the training data onto its first two principal components (PCA) and fit a separate forest on those two components, which shows how a random forest separates the classes in the projected 2D space.

Code
# PCA Projection
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train)

# Train a new RF on PCA components for visualization
rf_pca = RandomForestClassifier(n_estimators=100, random_state=42)
rf_pca.fit(X_pca, y_train)

# Plot Decision Boundary
fig, ax = plt.subplots(figsize=(10, 8))
DecisionBoundaryDisplay.from_estimator(
    rf_pca,
    X_pca,
    response_method="predict",
    cmap="RdBu",
    alpha=0.8,
    ax=ax,
    xlabel="Principal Component 1",
    ylabel="Principal Component 2",
)
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], c=y_train, cmap="RdBu", edgecolors="k", s=30)
ax.set_title("Random Forest Decision Boundaries (PCA Projected)")
plt.legend(*scatter.legend_elements(), title="Classes")
plt.tight_layout()
plt.show()

Scatter plot with decision boundaries visualized in 2D PCA space, showing how the model separates classes.

Error Rate vs. Number of Trees

This graph illustrates how the model’s Out-of-Bag (OOB) error rate, computed on the training samples left out of each tree’s bootstrap, stabilizes as the number of trees in the forest increases, demonstrating the ensemble effect.

Code
n_estimators_range = range(15, 300, 10)
oob_errors = []

# Reuse a single forest with warm_start=True so each step only adds new trees
# instead of refitting the whole ensemble from scratch.
rf_oob = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42, n_jobs=-1)

for n in n_estimators_range:
    rf_oob.set_params(n_estimators=n)
    rf_oob.fit(X_train, y_train)
    oob_errors.append(1 - rf_oob.oob_score_)

plt.figure(figsize=(10, 6))
plt.plot(n_estimators_range, oob_errors, marker='o', linestyle='-', color='purple')
plt.title("OOB Error Rate vs. Number of Trees")
plt.xlabel("Number of Trees (n_estimators)")
plt.ylabel("OOB Error Rate")
plt.grid(True)
plt.tight_layout()
plt.show()

Line plot showing the Out-of-Bag error rate decreasing and stabilizing as the number of trees in the forest increases.

Hyperparameter Considerations

  • n_estimators: Increasing trees generally improves stability until diminishing returns set in.
  • max_depth or min_samples_leaf: Control tree complexity, mitigating overfitting.
  • max_features: Governs the degree of feature randomness; sqrt is typical for classification.
  • class_weight: Useful for imbalanced datasets to penalize misclassification of minority classes.

Grid search, randomized search, or Bayesian optimization can systematically explore these settings.3 One possible randomized-search setup is sketched below.
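
As a sketch only (the parameter ranges are illustrative rather than tuned for this dataset, and the search reuses the X_train/y_train split defined above), a randomized search over the knobs listed above could look like this:

Code
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space over the hyperparameters discussed above.
param_distributions = {
    "n_estimators": randint(100, 600),
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": randint(1, 10),
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=param_distributions,
    n_iter=25,          # number of sampled configurations
    scoring="roc_auc",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))

Randomized search evaluates a fixed budget of sampled configurations, which usually scales better than an exhaustive grid when several hyperparameters interact.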

Practical Tips

  • Feature scaling: Not required because trees are invariant to monotonic transformations.
  • Missing values: scikit-learn’s implementation does not handle NaNs; impute beforehand.
  • Interpretability: Use SHAP values or permutation importance for richer explanations.
  • Out-of-bag (OOB) estimates: Enable oob_score=True to get a built-in validation metric without a separate hold-out set. The last two tips are sketched in code after this list.
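
As a minimal sketch (reusing the rf model and the train/test split defined above; the n_repeats value is arbitrary), permutation importance and the OOB score can be obtained as follows:

Code
from sklearn.base import clone
from sklearn.inspection import permutation_importance

# Permutation importance: mean drop in test score when one feature's values
# are shuffled, averaged over n_repeats shuffles.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
perm_ranked = pd.Series(perm.importances_mean, index=X_test.columns).sort_values(ascending=False)
print(perm_ranked.head(10))

# OOB estimate: each tree is scored on the training samples left out of its
# bootstrap, giving a validation metric without touching the test set.
rf_oob_check = clone(rf).set_params(oob_score=True)
rf_oob_check.fit(X_train, y_train)
print("OOB accuracy:", round(rf_oob_check.oob_score_, 3))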

Footnotes

  1. Breiman, L. (2001). “Random Forests.” Machine Learning 45(1): 5–32. Introduced the random forest algorithm with theoretical justification and empirical benchmarks.↩︎

  2. scikit-learn documentation: Breast Cancer Wisconsin (Diagnostic) dataset. Official description of the dataset, feature definitions, and usage considerations.↩︎

  3. Bergstra, J., & Bengio, Y. (2012). “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research 13: 281–305. Demonstrated the efficiency gains of random search over grid search for hyperparameter tuning.↩︎
