2025, Oct 07 17:00
Generalized Cross-Validation (GCV) in SciPy make_smoothing_spline: stabilize lambda via unit-spaced x
Fix inconsistent smoothing in SciPy make_smoothing_spline: diagnose GCV scale issues, choose a stable lambda via unit-spaced x, then rescale for reliable fits.
When you rely on make_smoothing_spline to pick its own smoothing strength via GCV, you expect consistent behavior across comparable time series. Yet even after normalizing different series to the same scale, it is possible to get radically different smoothing: one curve looks reasonable, while another collapses to an almost flat line. The example below reproduces that discrepancy and then walks through why it happens and how to make the GCV step more numerically stable without changing model assumptions.
Reproducing the inconsistent smoothing
The following snippet loads two normalized series that are visually very similar. Both are smoothed with make_smoothing_spline using the default GCV-based selection of lam. It also prints the lambda values returned by the private helper scipy.interpolate._bsplines._compute_optimal_gcv_parameter for the two precomputed design matrices and penalty terms.
import pickle
import matplotlib.pyplot as plt
from scipy.interpolate import make_smoothing_spline
from scipy.interpolate._bsplines import _compute_optimal_gcv_parameter  # private helper, used here for diagnostics
# Load prebuilt arrays: x values, banded design matrix, banded penalty term, y values, weights
with open('good_lam.pkl', 'rb') as fh_a:
    t_ok, matX_ok, wE_ok, series_ok, w_ok = pickle.load(fh_a)
with open('bad_lam.pkl', 'rb') as fh_b:
    t_ko, matX_ko, wE_ko, series_ko, w_ko = pickle.load(fh_b)
# GCV-picked smoothing
spline_ok = make_smoothing_spline(t_ok, series_ok, w=None, lam=None)
yhat_ok = spline_ok(t_ok)
spline_ko = make_smoothing_spline(t_ko, series_ko, w=None, lam=None)
yhat_ko = spline_ko(t_ko)
# Compare lambdas used internally by SciPy's GCV routine
print('lam_ok:', _compute_optimal_gcv_parameter(matX_ok, wE_ok, series_ok, w_ok))
print('lam_ko:', _compute_optimal_gcv_parameter(matX_ko, wE_ko, series_ko, w_ko))
# Visualize
plt.plot(series_ok, label='ok', color='green')
plt.plot(yhat_ok, label='ok_smooth', color='palegreen')
plt.plot(series_ko, label='ko', color='red')
plt.plot(yhat_ko, label='ko_smooth', color='lightpink')
plt.legend()
plt.show()
What’s really going on
The root of the inconsistency is scale sensitivity inside the GCV objective as implemented by _compute_optimal_gcv_parameter. For the problematic dataset, the Frobenius norms of the matrices entering the linear system differ by many orders of magnitude: the norm of XtWX is on the order of 7.5, while that of XtE is about 1.9e+08. The routine forms the left-hand side
lhs = XtWX + lam * XtE
When XtE dwarfs XtWX by many orders of magnitude, XtWX is effectively ignored for a wide range of lam, so the minimizer is driven by a numerically ill-conditioned trade-off. That trade-off is highly sensitive to how x is spaced: rescaling x can change the shape of the GCV curve, and in one dataset multiple local minima disappeared after multiplying x by 100 and recomputing the intermediate arrays. There is also evidence that for some series the noise isn't random, and GCV can then justify an extremely small lam, effectively choosing not to smooth at all.
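To get a feel for why that norm gap matters, here is a small toy calculation with random symmetric matrices standing in for XtWX and XtE (they are not SciPy's actual matrices). It shows that the fidelity block contributes meaningfully to the left-hand side only once lam becomes very small:
import numpy as np
# Toy stand-ins: XtWX with norm of order 1-10, XtE with norm of order 1e8-1e9
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
XtWX = A @ A.T
XtE = 1e8 * (B @ B.T)
for lam in (1e-3, 1e-6, 1e-9, 1e-12):
    lhs = XtWX + lam * XtE
    share = np.linalg.norm(XtWX) / np.linalg.norm(lhs)
    print(f'lam={lam:g}  relative weight of XtWX in lhs: {share:.2e}')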
A practical fix: fit GCV on unit-spaced x and rescale lambda back
A straightforward way to make the computation more numerically stable is to reparameterize the problem to unit spacing in x before calling the GCV routine, then convert the resulting lam back to the original scale. Working with unit-spaced x reduces the gap between the norms of XtWX and XtE to a much saner range (e.g., roughly 16 and 40 in one tested case), which makes the optimization behave more predictably.
The conversion is simple: divide x by its average spacing, run the GCV computation on that rescaled grid, and multiply the resulting lambda by x_spacing_avg ** 3 to map it back to the original x. The cube comes from how the roughness penalty transforms under the substitution x -> x/s: the squared second derivative picks up a factor s**4 and the integration measure a factor 1/s, so the unit-scale penalty is s**3 times the original one, and matching the two objectives means multiplying the unit-scale lambda by s**3. With this approach, for the challenging dataset both a refined GCV search and the rescaled method settle on approximately 2.63e-08.
import numpy as np
from scipy.interpolate._bsplines import _compute_optimal_gcv_parameter
from scipy._lib._array_api import array_namespace
from scipy.interpolate._bsplines import _coeff_of_divided_diff
from scipy.interpolate import BSpline
def pick_gcv_lambda_stable(t_vals, y_vals):
    # Reparameterize x to unit average spacing, run GCV there, then map lambda back.
    d = np.diff(t_vals)
    assert (d >= 0).all(), 'x must be sorted'
    mean_step = d.mean()
    assert mean_step != 0, 'div by zero'
    t_unit = t_vals / mean_step
    Phi, WEmat, y_arr, w_vec = assemble_spline_arrays(t_unit, y_vals)
    lam_unit = _compute_optimal_gcv_parameter(Phi, WEmat, y_arr, w_vec)
    # The roughness penalty scales with the cube of the step, so map lambda back accordingly.
    return lam_unit * mean_step ** 3
def assemble_spline_arrays(x_in, y_in, w_in=None):
    # Mirrors the array assembly done inside scipy.interpolate.make_smoothing_spline:
    # a banded design matrix and a banded penalty term, ready for the GCV routine.
    ax = 0
    _ = array_namespace(x_in, y_in)  # input validation; raises for non-array inputs
    x = np.ascontiguousarray(x_in, dtype=float)
    y = np.ascontiguousarray(y_in, dtype=float)
    if any(x[1:] - x[:-1] <= 0):
        raise ValueError('``x`` should be an ascending array')
    if x.ndim != 1 or x.shape[0] != y.shape[ax]:
        raise ValueError(f'``x`` should be 1D and {x.shape = } == {y.shape = }')
    if w_in is None:
        w = np.ones(len(x))
    else:
        w = np.ascontiguousarray(w_in)
        if any(w <= 0):
            raise ValueError('Invalid vector of weights')
    t_knots = np.r_[[x[0]] * 3, x, [x[-1]] * 3]
    n = x.shape[0]
    if n <= 4:
        raise ValueError('``x`` and ``y`` length must be at least 5')
    y = np.moveaxis(y, ax, 0)
    y_tail = y.shape[1:]
    if y_tail != ():
        y = y.reshape((n, -1))
    # Design matrix in the B-spline basis (sparse)
    bx = BSpline.design_matrix(x, t_knots, 3)
    # Repack into banded (5 x n) storage in the natural-spline basis
    Phi = np.zeros((5, n))
    for k in range(1, 4):
        Phi[k, 2:-2] = bx[k: k - 4, 3:-3][np.diag_indices(n - 4)]
    Phi[1, 1] = bx[0, 0]
    Phi[2, :2] = ((x[2] + x[1] - 2 * x[0]) * bx[0, 0], bx[1, 1] + bx[1, 2])
    Phi[3, :2] = ((x[2] - x[0]) * bx[1, 1], bx[2, 2])
    Phi[1, -2:] = (bx[-3, -3], (x[-1] - x[-3]) * bx[-2, -2])
    Phi[2, -2:] = (bx[-2, -3] + bx[-2, -2], (2 * x[-1] - x[-2] - x[-3]) * bx[-1, -1])
    Phi[3, -2] = bx[-1, -1]
    # Banded penalty term built from divided differences
    WEmat = np.zeros((5, n))
    WEmat[2:, 0] = _coeff_of_divided_diff(x[:3]) / w[:3]
    WEmat[1:, 1] = _coeff_of_divided_diff(x[:4]) / w[:4]
    for j in range(2, n - 2):
        WEmat[:, j] = (x[j+2] - x[j-2]) * _coeff_of_divided_diff(x[j-2:j+3]) / w[j-2:j+3]
    WEmat[:-1, -2] = -_coeff_of_divided_diff(x[-4:]) / w[-4:]
    WEmat[:-2, -1] = _coeff_of_divided_diff(x[-3:]) / w[-3:]
    WEmat *= 6
    return Phi, WEmat, y, w
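As a quick check of the norm imbalance described earlier, you can compare the Frobenius norms of the banded arrays that assemble_spline_arrays produces for the original and for the unit-spaced x. These banded arrays are not literally XtWX and XtE, but the same scale gap shows up in them, and rescaling narrows it. A rough diagnostic, assuming t_ko and series_ko from the reproduction snippet above:
import numpy as np
Phi_raw, WE_raw, _, _ = assemble_spline_arrays(t_ko, series_ko)
step = np.diff(t_ko).mean()
Phi_unit, WE_unit, _, _ = assemble_spline_arrays(t_ko / step, series_ko)
print('original x :', np.linalg.norm(Phi_raw), np.linalg.norm(WE_raw))
print('unit-spaced:', np.linalg.norm(Phi_unit), np.linalg.norm(WE_unit))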
Once you have the rescaled lam, use it directly when building the smoothing spline on the original x. Nothing else in your pipeline needs to change.
from scipy.interpolate import make_smoothing_spline
lam_fixed = pick_gcv_lambda_stable(t_ko, series_ko)
spline_fixed = make_smoothing_spline(t_ko, series_ko, w=None, lam=lam_fixed)
yhat_fixed = spline_fixed(t_ko)
Why this matters
GCV aims to balance fidelity and smoothness, but the numerical profile of the matrices involved can overwhelm that balance. When x spacing pushes XtE to dominate XtWX by many orders of magnitude, the optimizer can fall into poor local minima or implicitly ignore part of the objective. That’s why seemingly similar, normalized series can produce wildly different lambdas and downstream fits. Rescaling x to unit spacing regularizes the computation without changing the underlying model or assumptions, and then scaling lam back by the cube of the average step keeps the solution consistent with your original data.
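If you want to convince yourself that the cube-law rescaling really preserves the fit, a quick sanity check (assuming t_ko, series_ko, and pick_gcv_lambda_stable from above) is to fit once on the original x with the rescaled lam and once on the unit-spaced x with the unit-scale lam; the fitted values should agree up to numerical noise:
import numpy as np
from scipy.interpolate import make_smoothing_spline
step = np.diff(t_ko).mean()
lam_fixed = pick_gcv_lambda_stable(t_ko, series_ko)
lam_unit = lam_fixed / step ** 3  # undo the cube-law mapping
fit_orig = make_smoothing_spline(t_ko, series_ko, lam=lam_fixed)(t_ko)
fit_unit = make_smoothing_spline(t_ko / step, series_ko, lam=lam_unit)(t_ko / step)
print('max abs difference:', abs(fit_orig - fit_unit).max())  # should be tiny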
There are cases where GCV legitimately returns a very small lambda and does little to no smoothing. That can happen when the variability in the series is not random noise; in such settings, GCV will prefer to retain structure rather than attenuate it. Separately, rescaling x has been observed to eliminate extra local minima in the GCV curve. This behavior has been reported and discussed upstream; see the SciPy issue tracker at github.com/scipy/scipy/issues and specifically https://github.com/scipy/scipy/issues/23472.
Takeaways and guidance
If automatic lam selection via GCV gives inconsistent results across comparable time series, look first at x spacing. Compute lam on a unit-spaced reparameterization of x, then multiply by the average spacing cubed to map it back. This simple change can stabilize the norm balance in the system matrix and reduce the chance of pathological minima. If you encounter cases where the smoothed output looks identical to the input, remember that, depending on the structure of your series, an extremely small lam can be the justified outcome of GCV. Finally, if you continue to see sharp inconsistencies or evidence of multiple local minima, it is worth tracking ongoing discussions and fixes in the SciPy issue tracker.
With these adjustments, you can keep using make_smoothing_spline to select lam automatically while improving numerical behavior across datasets with different sampling rates and scales.
The article is based on a question from StackOverflow by David Pagnon and an answer by Nick ODell.