2025, Nov 02 11:00

How to Rebuild a RandomForestRegressor Estimator with DecisionTreeRegressor: Split vs Leaf Weighting

Learn why naive cloning fails and how to replicate a RandomForestRegressor tree in scikit-learn: unique samples for splits, weighted duplicates for leaf values.

Replicating a single decision tree from a trained RandomForestRegressor with a stand‑alone DecisionTreeRegressor often looks trivial: take the exact bootstrap sample the forest used, pass the same hyperparameters and seed, and fit. In practice you end up with a different tree and different predictions. The key is how the forest handles duplicate rows in the bootstrap and how that affects splits versus value and error calculations.

Problem setup: why a naïve clone diverges

The following snippet trains a forest, pulls the first estimator’s bootstrap indices, and tries to rebuild that tree as a standalone regressor using those exact rows. Despite identical data, hyperparameters, and random_state, the cloned tree typically differs from the forest’s first estimator, both structurally and in predictions.

# Imports
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Data dimensions
n_obs = 160
n_dim = 20
# Training set
X_train, y_target = make_regression(
    n_samples=n_obs,
    n_features=n_dim,
    random_state=0,
    shuffle=False
)
# Test set
np.random.seed(0)
X_eval = np.random.standard_normal((40, n_dim))
# Random forest config
n_trees = 10
max_depth_cfg = 4
min_leaf_cfg = 17
min_split_cfg = 3
forest_model = RandomForestRegressor(
    random_state=0,
    oob_score=False,
    max_features=None,
    n_estimators=n_trees,
    max_depth=max_depth_cfg,
    min_samples_leaf=min_leaf_cfg,
    min_samples_split=min_split_cfg,
)
forest_model.fit(X_train, y_target)
rf_pred = forest_model.predict(X_eval)
# Inspect the first tree of the forest
first_est = forest_model.estimators_[0]
plt.figure(figsize=(20, 10))
plot_tree(first_est, filled=True, rounded=True, node_ids=True, fontsize=16)
plt.show()
# Attempt to replicate the first tree using its bootstrap sample
boot_idx_full = forest_model.estimators_samples_[0]
X_boot_full = X_train[boot_idx_full]
y_boot_full = y_target[boot_idx_full]
clone_tree = DecisionTreeRegressor(
    random_state=first_est.random_state,
    max_features=None,
    max_depth=max_depth_cfg,
    min_samples_leaf=min_leaf_cfg,
    min_samples_split=min_split_cfg,
)
clone_tree.fit(X_boot_full, y_boot_full)
clone_pred = clone_tree.predict(X_eval)
# Visualize the clone
plt.figure(figsize=(20, 10))
plot_tree(clone_tree, filled=True, rounded=True, node_ids=True, fontsize=16)
plt.show()
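
To see the divergence concretely, compare the naive clone with the forest’s first estimator directly. The check below reuses the objects defined above; if the clone really matched, the maximum prediction difference would be zero and the node counts would be equal.

# Compare the naive clone against the forest's first estimator
est_pred = first_est.predict(X_eval)
print("max |clone - forest tree| on X_eval:", np.abs(clone_pred - est_pred).max())
print("node counts:", clone_tree.tree_.node_count, "vs", first_est.tree_.node_count)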

What actually happens inside the forest

The crux is how bootstrap duplicates are treated. In a random forest, samples can appear multiple times in a tree’s bootstrap. For choosing features and thresholds, the training logic uses each unique sample once. In other words, duplicates are not allowed to tilt impurity-based split selection. However, when computing the leaf “value” and related scores, duplicates do count; a sample that appears k times contributes with weight k. That asymmetry is enough to make a DecisionTreeRegressor trained on a duplicated bootstrap produce different split decisions, thresholds, and thus a different structure.
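You can verify the duplication directly by counting how often each training row was drawn into the first tree’s bootstrap. This small check continues from the snippet above; the per-row multiplicity computed here is exactly the weight k mentioned above.

# Per-row multiplicity in the first tree's bootstrap sample
boot_idx = forest_model.estimators_samples_[0]
multiplicity = np.bincount(boot_idx, minlength=n_obs)
print("rows drawn more than once:", np.sum(multiplicity > 1))
print("rows never drawn (out-of-bag):", np.sum(multiplicity == 0))
print("largest multiplicity:", multiplicity.max())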

Even with identical data and parameters, tree construction is only almost deterministic. Ties in impurity can lead to different yet equivalent splits, and any threshold placed between two consecutive observed feature values yields the same partition of the training set. That is another source of small divergences, even once the training set is fixed.
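The threshold ambiguity is easy to demonstrate: any cut placed strictly between two consecutive observed values of a feature partitions the rows identically. The sketch below uses feature 0 and an arbitrary pair of neighbouring values purely for illustration.

# Two different thresholds between the same pair of consecutive feature values
# split the bootstrap rows in exactly the same way
col_vals = np.unique(X_boot_full[:, 0])   # sorted unique values of feature 0
lo, hi = col_vals[10], col_vals[11]       # two consecutive observed values
t_mid = (lo + hi) / 2
t_off = lo + 0.9 * (hi - lo)
print(np.array_equal(X_boot_full[:, 0] <= t_mid, X_boot_full[:, 0] <= t_off))  # True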

Fix: uniquify for split training, then weight for values and error

To get closer to the forest’s internal tree, train the stand‑alone tree on unique bootstrap rows (so duplicates don’t bias split selection), and only then account for duplicate counts when you want to reproduce leaf values and error metrics.

# Rebuild using unique bootstrap rows for split selection
uniq_idx, dup_counts = np.unique(forest_model.estimators_samples_[0], return_counts=True)
X_boot_uniq = X_train[uniq_idx]
y_boot_uniq = y_target[uniq_idx]
rebuilt_tree = DecisionTreeRegressor(
    random_state=first_est.random_state,
    max_features=None,
    max_depth=max_depth_cfg,
    min_samples_leaf=min_leaf_cfg,
    min_samples_split=min_split_cfg,
)
rebuilt_tree.fit(X_boot_uniq, y_boot_uniq)
rebuilt_pred = rebuilt_tree.predict(X_eval)
plt.figure(figsize=(20, 10))
plot_tree(rebuilt_tree, filled=True, rounded=True, node_ids=True, fontsize=16)
plt.show()
# Recompute a node's value and squared error using duplicate weights
# Example: the left child of the root (the same logic applies to any node, leaves included)
root_feat = rebuilt_tree.tree_.feature[0]
root_thr = rebuilt_tree.tree_.threshold[0]
left_mask = X_boot_uniq[:, root_feat] <= root_thr
# Unweighted mean (what the stand-alone tree reports as this node's value)
node_mean_unweighted = y_boot_uniq[left_mask].mean()
# Weighted mean, matching the forest's node "value"
node_mean_weighted = (
    (y_boot_uniq[left_mask] * dup_counts[left_mask]).sum() / dup_counts[left_mask].sum()
)
# Unweighted MSE reported by the stand-alone tree for this node
node_mse_unweighted = ((y_boot_uniq[left_mask] - node_mean_unweighted) ** 2).mean()
# Weighted MSE, matching the forest's squared error for this node
n_eff = dup_counts[left_mask].sum()
node_mse_weighted = (
    ((y_boot_uniq[left_mask] - node_mean_weighted) ** 2) * dup_counts[left_mask]
).sum() / n_eff
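
As a sanity check, you can compare the rebuilt tree’s root split and the weighted statistics against what the forest’s estimator actually stores. The snippet below reads scikit-learn’s low-level tree arrays (tree_.feature, tree_.threshold, tree_.children_left, tree_.value, tree_.impurity); the comparison is only meaningful where the two structures agree.

# Cross-check against the forest's own first estimator
left_id = first_est.tree_.children_left[0]   # left child of the forest tree's root
print("root split:", root_feat, root_thr,
      "vs", first_est.tree_.feature[0], first_est.tree_.threshold[0])
print("weighted mean:", node_mean_weighted,
      "vs stored value:", first_est.tree_.value[left_id].ravel()[0])
print("weighted MSE:", node_mse_weighted,
      "vs stored impurity:", first_est.tree_.impurity[left_id])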

This procedure reflects two distinct phases. First, split selection mirrors the forest by using each unique training row once. Second, leaf statistics mirror the forest by weighting each unique row by how many times it appeared in the bootstrap. That explains why a tree trained on duplicated rows diverges: the duplicates change impurity calculations and therefore the chosen splits.
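The same weighting generalizes beyond a single node. A short sketch, reusing the arrays from the previous snippet: route each unique bootstrap row to its leaf with apply(), then aggregate the targets with the duplicate counts to obtain forest-style leaf values.

# Weighted value of every leaf of the rebuilt tree
leaf_of_row = rebuilt_tree.apply(X_boot_uniq)   # leaf id reached by each unique row
weighted_leaf_values = {}
for leaf_id in np.unique(leaf_of_row):
    in_leaf = leaf_of_row == leaf_id
    weighted_leaf_values[leaf_id] = np.average(y_boot_uniq[in_leaf], weights=dup_counts[in_leaf])
print(weighted_leaf_values)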

How close is “close enough”?

This approach tends to align features and thresholds with the forest’s corresponding estimator, and values and errors then match once you apply the duplicate counts as weights for the node aggregates. It still may not produce a byte‑for‑byte identical tree: equal‑impurity ties and ranges of equivalent thresholds can yield different but functionally similar splits. In addition, the forest’s treatment of duplicates as weights interacts with pre‑pruning hyperparameters such as min_samples_leaf and min_samples_split, which is another reason a training set with raw duplicated rows does not behave the same as the forest’s internal logic.
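One practical way to quantify “close enough” is to check whether both trees route test points through the same node ids. The check below is meaningful only when the rebuilt tree and the forest’s first estimator share the same structure, since node ids are assigned during construction.

# Fraction of test points sent to the same leaf id by both trees
same_leaf = rebuilt_tree.apply(X_eval) == first_est.apply(X_eval)
print(f"test points routed to the same leaf id: {same_leaf.mean():.0%}")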

Why you want to understand this

Being able to trace what a forest’s estimator is doing helps with debugging, interpretability, and reproducibility. If you expect one tree of a forest to be exactly a DecisionTreeRegressor trained on its bootstrap with duplicates, you’ll be confused by the structural and numerical differences. Knowing that duplicates are collapsed for split selection but restored as weights for values and error resolves that discrepancy and lets you validate what you see inside a forest.

Takeaways

To approximate a RandomForestRegressor’s internal tree, train your clone on unique bootstrap samples to reproduce split decisions, and compute leaf statistics with duplicate counts to reproduce values and errors. Expect near‑equality rather than perfect identity because impurity ties and equivalent thresholds leave some room for arbitrary yet valid choices. When your goal is to inspect how a forest builds its trees, this perspective is sufficient to reconcile the observed differences and to reason clearly about what the model is doing.

The article is based on a question from StackOverflow by NOnaMe and an answer by chrslg.