2025, Sep 29 17:00

Brier Skill Score returning NaN in scikit-learn CV? Here’s the fix: set response_method='predict_proba'

Getting NaN from the Brier Skill Score in scikit-learn CV? See why needs_proba breaks and how to fix it with response_method='predict_proba' for reliable probability-based metrics.

Getting NaN when scoring models is frustrating, especially when you’re trying to compare probabilistic classifiers on an imbalanced dataset. Here’s a concise walkthrough of why this happens with Brier Skill Score (BSS) in cross-validation and how to fix it without changing your modeling setup.

Context: evaluating BSS on an imbalanced fraud dataset

The target is highly imbalanced: Counter({0: 2067, 1: 66}), i.e. 66 positives in 2133 rows. With 10-fold cross-validation that leaves only about 6-7 positives per test fold, which naturally raises suspicion of instability. Yet the NaN scores here had a different cause: scorer configuration.

Repro: how NaN BSS appears

The Brier Skill Score compares your model's Brier score to a reference that always predicts the observed positive rate: BSS = 1 - BS_model / BS_reference. Values greater than zero indicate improvement over the baseline; negative values indicate worse-than-baseline performance. For example, with a 10% positive rate the baseline's Brier score is 0.09, so a model with a Brier score of 0.05 gets a BSS of 1 - 0.05/0.09 ≈ 0.44.

import pandas as pd
import numpy as np
from numpy import mean, std
from collections import Counter

from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline

from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.metrics import brier_score_loss, make_scorer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# --------------------------------------------------
# 1) Brier Skill Score
# --------------------------------------------------
def skill_brier(y_obs, y_pred_prob):
    # Reference model: always predict the observed positive rate
    pos_share = np.count_nonzero(y_obs) / len(y_obs)
    ref_pred = [pos_share for _ in range(len(y_obs))]

    ref_bs = brier_score_loss(y_obs, ref_pred)
    mdl_bs = brier_score_loss(y_obs, y_pred_prob)

    # Degenerate fold (no positives): reference Brier score is 0, avoid division by zero
    if ref_bs == 0:
        return 0.0
    # BSS = 1 - BS_model / BS_reference; values > 0 beat the baseline
    return 1.0 - (mdl_bs / ref_bs)

# --------------------------------------------------
# 2) Cross-validated evaluation (problematic setup)
# --------------------------------------------------
def run_eval(X, y, estimator, folds=10, reps=3):
    kfold = RepeatedStratifiedKFold(n_splits=folds, n_repeats=reps, random_state=42)
    # This is where NaN may appear in the scores
    scorer = make_scorer(skill_brier, needs_proba=True)
    out = cross_val_score(estimator, X, y, scoring=scorer, cv=kfold, n_jobs=-1)
    print("Mean BSS: %.3f (%.3f)" % (mean(out), std(out)))
    return out

# --------------------------------------------------
# 3) Preprocessing + model pipeline
# --------------------------------------------------
def make_flow(X, base_estimator=None):
    num_feats = X.select_dtypes(include=["int64", "float64"]).columns
    cat_feats = X.select_dtypes(include=["object", "category"]).columns

    num_block = Pipeline(steps=[
        ("num_impute", SimpleImputer(strategy="mean")),
        ("num_scale", StandardScaler())
    ])

    cat_block = Pipeline(steps=[
        ("cat_impute", SimpleImputer(strategy="most_frequent")),
        ("cat_ohe", OneHotEncoder(handle_unknown="ignore"))
    ])

    features = ColumnTransformer(
        transformers=[
            ("num_blk", num_block, num_feats),
            ("cat_blk", cat_block, cat_feats)
        ]
    )

    final_est = base_estimator if base_estimator is not None else RandomForestClassifier(random_state=42)

    pipe = ImbPipeline(steps=[
        ("prep", features),
        ("clf", final_est)
    ])
    return pipe

# --------------------------------------------------
# 4) Example use
# --------------------------------------------------
# Load the data (adjust the path and target column to your dataset)
df = pd.read_csv("credit_card.csv")
X = df.drop("Fraud_Flag", axis=1)
y = LabelEncoder().fit_transform(df["Fraud_Flag"])

print(X.shape, y.shape, Counter(y))

base = DummyClassifier(strategy="prior")
base_pipe = make_flow(X, base)
print("\nBaseline (DummyClassifier):")
run_eval(X, y, base_pipe)

lr_pipe = make_flow(X, LogisticRegression(max_iter=1000))
print("\nLogistic Regression:")
run_eval(X, y, lr_pipe)

rf_pipe = make_flow(X, RandomForestClassifier(random_state=42))
print("\nRandom Forest:")
run_eval(X, y, rf_pipe)

gb_pipe = make_flow(X, GradientBoostingClassifier(random_state=42))
print("\nGradient Boosting:")
run_eval(X, y, gb_pipe)

What’s actually going wrong

The NaN values come from how the scorer is defined. make_scorer's needs_proba flag was deprecated in scikit-learn 1.4 in favor of response_method and has since been removed. In versions where the flag no longer exists, needs_proba=True falls into make_scorer's catch-all **kwargs and is forwarded to the metric function, so skill_brier fails on every fold with an unexpected-keyword error. cross_val_score then substitutes its default error_score=np.nan for each failed fold (the accompanying warning is easy to miss with n_jobs=-1), which is exactly where the NaN comes from. The metric expects predicted probabilities, so the scorer must be told explicitly to call predict_proba.
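
If you want to see the underlying exception instead of silently collecting NaN folds, re-run the evaluation with error_score="raise". A minimal sketch, reusing the pipelines and skill_brier defined above (error_score is a standard cross_val_score parameter; its default of np.nan is what converts per-fold failures into NaN):

# Illustrative: surface the real per-fold error instead of a NaN score
from sklearn.metrics import make_scorer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

kfold = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
broken_scorer = make_scorer(skill_brier, needs_proba=True)  # the old-style flag

# With error_score="raise", the first failing fold raises the actual exception
# instead of being replaced by the default error_score=np.nan
cross_val_score(lr_pipe, X, y, scoring=broken_scorer, cv=kfold, error_score="raise")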

The fix: force predict_proba via response_method

Specify the response method directly when building the scorer. The metric then always receives positive-class probabilities from predict_proba, and the NaN scores disappear.

def run_eval_fixed(X, y, estimator, folds=10, reps=3):
    kfold = RepeatedStratifiedKFold(n_splits=folds, n_repeats=reps, random_state=42)
    # Explicitly request predict_proba for probabilistic metrics
    scorer = make_scorer(skill_brier, response_method="predict_proba")
    out = cross_val_score(estimator, X, y, scoring=scorer, cv=kfold, n_jobs=-1)
    print("Mean BSS: %.3f (%.3f)" % (mean(out), std(out)))
    return out

# Example usage (same pipelines as above)
print("\nBaseline (DummyClassifier) with fixed scorer:")
run_eval_fixed(X, y, base_pipe)

print("\nLogistic Regression with fixed scorer:")
run_eval_fixed(X, y, lr_pipe)

print("\nRandom Forest with fixed scorer:")
run_eval_fixed(X, y, rf_pipe)

print("\nGradient Boosting with fixed scorer:")
run_eval_fixed(X, y, gb_pipe)

Why this matters

When you evaluate probabilistic models with the Brier Skill Score, the scorer must receive class probabilities, not class labels or decision scores. Relying on deprecated flags can silently break that contract and produce NaN, hiding true model performance. An explicit response_method="predict_proba" removes the ambiguity: it is always clear which estimator method feeds the metric, regardless of scikit-learn version.
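
To make that contract concrete, here is a minimal, self-contained sketch on synthetic data (it reuses the skill_brier function defined above; names such as X_demo are illustrative). With make_scorer's default response method the metric receives hard 0/1 labels from predict(), whereas response_method="predict_proba" hands it positive-class probabilities:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer

X_demo, y_demo = make_classification(n_samples=500, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

label_scorer = make_scorer(skill_brier)                                    # metric sees predict() labels
proba_scorer = make_scorer(skill_brier, response_method="predict_proba")  # metric sees probabilities

print("BSS computed on hard labels:    %.3f" % label_scorer(clf, X_demo, y_demo))
print("BSS computed on probabilities:  %.3f" % proba_scorer(clf, X_demo, y_demo))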

In highly imbalanced datasets like the one described here—66 positives out of 2133—each fold may contain only about 6–7 positives with 10-fold CV. That context can make scores feel noisy. In this case, though, the NaN issue was resolved by fixing the scorer configuration.
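
A quick, illustrative way to check that fold composition, reusing X and y from the example above:

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

kfold = RepeatedStratifiedKFold(n_splits=10, n_repeats=1, random_state=42)
pos_per_fold = [int(np.sum(y[test_idx])) for _, test_idx in kfold.split(X, y)]
print("Positives per test fold:", pos_per_fold)  # with 66 positives, roughly 6-7 per fold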

Takeaways

If a probabilistic metric returns NaN under cross-validation, check how the scorer passes predictions to your metric. For Brier Skill Score, set response_method="predict_proba" in make_scorer to ensure it operates on probabilities. With imbalanced targets, keep an eye on fold composition and stick to stratified cross-validation as shown above.
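
For quick reference, the one-line change (illustrative, reusing make_scorer and skill_brier from above):

# Before: old-style flag, deprecated and later removed
scorer = make_scorer(skill_brier, needs_proba=True)
# After: explicit and version-proof
scorer = make_scorer(skill_brier, response_method="predict_proba")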

The article is based on a question from StackOverflow by Br0k3nS0u1 and an answer by Br0k3nS0u1.