2025, Sep 29 23:00
Stop Repeating Undersampling: Precompute Fold Indices for Faster HalvingRandomSearchCV on Imbalanced Data
Speed up hyperparameter tuning on imbalanced data: precompute undersampled CV folds and pass them to HalvingRandomSearchCV to avoid redundant resampling.
Cross-validation with imbalanced data often pushes you toward using a resampling strategy like RandomUnderSampler from imblearn. The catch is that putting a resampler inside a pipeline and then launching a hyperparameter search means the resampling happens over and over again for every split and every candidate. When the undersampling is deterministic for a given fold, that repeated work is just waste.
Problem setup
Consider a setup where an imblearn-compatible Pipeline combines scaling, undersampling, and a gradient boosting model, with hyperparameters tuned via HalvingRandomSearchCV. It looks tidy, but it re-runs the undersampling on every fit the search performs.
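For reference, the snippets below assume imports along these lines. HalvingRandomSearchCV is still experimental in scikit-learn and needs its enabling import, and the Pipeline must come from imblearn, since scikit-learn's pipeline cannot host a resampler:

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV, KFold
from sklearn.preprocessing import MinMaxScaler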
def fit_with_resampling(sampler_obj, estimator_obj, do_scale, search_space, X_tr, y_tr):
    # resampling is part of the pipeline to keep the test fold left out
    if do_scale:
        pipe = Pipeline([
            ("scale", MinMaxScaler()),
            ("balance", sampler_obj),
            ("algo", estimator_obj),
        ])
    else:
        pipe = Pipeline([
            ("balance", sampler_obj),
            ("algo", estimator_obj),
        ])
    searcher = HalvingRandomSearchCV(
        estimator=pipe,
        param_distributions=search_space,
        n_candidates="exhaust",
        factor=3,
        resource="algo__n_estimators",
        max_resources=500,
        min_resources=10,
        scoring="roc_auc",
        cv=3,
        random_state=10,
        refit=True,
        n_jobs=-1,
    )
    searcher.fit(X_tr, y_tr)
    return searcher

This setup undersamples anew on every CV fit of every candidate in the search. When the undersampled subset for a given fold is identical from fit to fit, all of that repeated work is unnecessary.
What actually causes the inefficiency
Resampling inside the pipeline is re-executed at every fit the search performs. HalvingRandomSearchCV explores multiple candidates and performs several fits per split as part of its successive halving schedule. Because the pipeline includes the sampler, every one of those fits recomputes the same undersampled subset for a given fold, even when the sampler and data are unchanged.
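To make the cost visible, here is a minimal sketch, not part of the original post, using a hypothetical CountingUnderSampler that tallies how often resampling actually runs:

from imblearn.under_sampling import RandomUnderSampler

class CountingUnderSampler(RandomUnderSampler):
    """Hypothetical sampler that counts how often it actually resamples."""

    calls = 0  # class attribute, so the count survives the clones the search creates

    def fit_resample(self, X, y):
        type(self).calls += 1
        return super().fit_resample(X, y)

Passing CountingUnderSampler() as sampler_obj to fit_with_resampling above, with n_jobs=1 so the count is not scattered across worker processes, shows fit_resample firing once per fit in the halving schedule, plus once more for the final refit, rather than once per fold.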
Solution: precompute undersampled folds and pass them to the search
The practical way around this is to build folds once, undersample each training fold exactly once, store only the indices, and then hand those precomputed splits to HalvingRandomSearchCV via its cv argument. This keeps the evaluation protocol intact, preserves a clean left-out test fold, and stops repeating the same resampling work.
def build_sampled_folds(X_arr, y_arr, splitter, sampler):
    cached = []
    for tr_idx, te_idx in splitter.split(X_arr, y_arr):
        # Resample the training fold once; only the selected indices are needed.
        sampler.fit_resample(X_arr[tr_idx], y_arr[tr_idx])
        # Map the sampler's positions back to indices into the full array.
        tr_idx_sampled = tr_idx[sampler.sample_indices_]
        cached.append((tr_idx_sampled, te_idx))
    return cached
fold_splits = build_sampled_folds(
    X_arr=X_train, y_arr=y_train,
    splitter=KFold(3),
    sampler=RandomUnderSampler()
)

With these folds in hand, wire them into the search and keep the pipeline minimal. The model step name in the pipeline ("algo" below) must match the prefix of the parameter names you pass to the search.
learners = [
    (GradientBoostingClassifier(), {"algo__max_depth": [1, 3]}),
    (RandomForestClassifier(), {"algo__max_depth": [1, 3]})
]
for clf, grid in learners:
    print(clf)
    pipe = Pipeline([
        ("algo", clf),  # single-step pipeline keeps the algo__ parameter prefix
    ])
    tuner = HalvingRandomSearchCV(
        estimator=pipe,
        param_distributions=grid,
        n_candidates="exhaust",
        factor=3,
        resource="algo__n_estimators",
        max_resources=500,
        min_resources=10,
        scoring="roc_auc",
        cv=fold_splits,  # precomputed undersampled folds
        random_state=10,
        refit=True,
        n_jobs=-1,
        verbose=True
    )
    tuner.fit(X_train, y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits
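Two quick sanity checks are possible at this point, a small sketch assuming y_train is a numpy array of integer class labels: each cached training fold should now be class-balanced, and the fitted search exposes the usual result attributes.

import numpy as np

# Every undersampled training fold should contain equal class counts.
for tr_idx, te_idx in fold_splits:
    print(np.bincount(y_train[tr_idx]))

print(tuner.best_params_)  # should also include algo__n_estimators, the resource
print(tuner.best_score_)   # mean roc_auc over the precomputed folds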
Details that matter when refitting
When refit=True, HalvingRandomSearchCV refits the best estimator on the whole dataset. If the goal is to keep training on an undersampled version of the training set, set refit=False and then fit the best configuration separately on the undersampled training data you want to use. This avoids re-training on the full input after the search.
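A minimal sketch of that pattern, reusing fold_splits and the names from the code above; the final RandomUnderSampler and its random_state are illustrative assumptions, not part of the original answer:

from sklearn.base import clone

tuner = HalvingRandomSearchCV(
    estimator=Pipeline([("algo", GradientBoostingClassifier())]),
    param_distributions={"algo__max_depth": [1, 3]},
    resource="algo__n_estimators",
    max_resources=500,
    min_resources=10,
    scoring="roc_auc",
    cv=fold_splits,
    random_state=10,
    refit=False,  # skip the automatic refit on the full, imbalanced data
)
tuner.fit(X_train, y_train)

# Undersample the full training set once, then fit the winning configuration.
X_res, y_res = RandomUnderSampler(random_state=10).fit_resample(X_train, y_train)
final_model = clone(tuner.estimator).set_params(**tuner.best_params_)
final_model.fit(X_res, y_res)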
Why this approach is worth it
Precomputing undersampled folds eliminates redundant resampling during hyperparameter optimization without changing the evaluation protocol. It lets you reuse the very same train/test split logic across different models such as XGBoost, CatBoost, GradientBoostingClassifier or RandomForestClassifier, while keeping the cost of undersampling down to a single pass per fold. It also avoids brittle workarounds that concatenate original and undersampled data and then try to coordinate which pieces belong to which fold, a path that is easy to get wrong and prone to leakage.
Takeaways
Keep the resampling step out of the pipeline used by the search when it is deterministic and fold-specific. Build folds once, extract undersampled training indices via the sampler’s sample_indices_, and pass those precomputed splits directly to HalvingRandomSearchCV. If you need the final model to be trained on an undersampled set, turn off automatic refitting and perform the final fit yourself on the data you intend the model to learn from. This way you keep cross-validation clean, efficient, and consistent across all the models you are tuning.
The article is based on a question from StackOverflow by Sole Galli and an answer by SLebedev777.