2025, Sep 30 03:00

Compare Feature and Target Transformations in Scikit-learn Regression Using GridSearchCV

Learn how to evaluate feature and target transformations in scikit-learn regression with GridSearchCV, pipelines, and cross-validation to pick the best model.

Evaluating multiple transformations of features and the target in the same regression setup is a common need. The straightforward way is to handcraft a few pipelines, fit them one by one, and compare metrics. It works, but as soon as the space of options grows, you want something closer to GridSearchCV: a single, consistent way to enumerate valid combinations and pick the best model.

Baseline: manual loop over predefined pipelines

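The examples below assume an existing train/evaluation split (X_train, y_train, X_eval, y_eval). For a self-contained run, a hypothetical synthetic split like the one sketched here can stand in; features and target are kept strictly positive so np.log and np.sqrt are valid:

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: strictly positive features and target
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 10.0, size=(200, 3))
y = 2.0 * X[:, 0] + np.sqrt(X[:, 1]) + rng.normal(0.0, 0.1, size=200) + 1.0
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0
)
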
The following pattern builds several pipelines, fits each on the same training data, and records MSE and R2 on a held-out evaluation split to decide which variant performs best.

import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.compose import TransformedTargetRegressor
# Four variants: raw features, power-transformed features, log-transformed
# features, and log-transformed features with a sqrt-transformed target
model_space = {
    'linear': LinearRegression(),
    'power': make_pipeline(PowerTransformer(), LinearRegression()),
    'log': make_pipeline(FunctionTransformer(np.log, np.exp), LinearRegression()),
    'log_sqrt': TransformedTargetRegressor(
        regressor=make_pipeline(FunctionTransformer(np.log, np.exp), LinearRegression()),
        func=np.sqrt,
        inverse_func=np.square
    )
}
scores_df = pd.DataFrame()
for label, algo in model_space.items():
    algo.fit(X_train, y_train)
    # Score on the held-out split rather than the training data,
    # so the comparison reflects generalization
    y_pred_eval = algo.predict(X_eval)
    scores_df.at[label, 'MSE'] = mean_squared_error(y_eval, y_pred_eval)
    scores_df.at[label, 'R2'] = algo.score(X_eval, y_eval)
best_key = scores_df['R2'].idxmax()

This approach is simple and transparent. You know exactly which pipelines you tried, you can access each fitted object by its key, and you can store whatever diagnostics you need.
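
For example, every fitted variant stays reachable under its dictionary key after the loop, so diagnostics beyond the scores table are one lookup away. A minimal sketch, assuming the pipelines defined above:

# The winner and every other fitted variant remain accessible by key
best_model = model_space[best_key]

# For a plain pipeline, the final step is the fitted LinearRegression
log_model = model_space['log']
print(log_model[-1].coef_, log_model[-1].intercept_)

# For the target-transformed variant, the fitted inner pipeline
# lives on the regressor_ attribute after fitting
print(model_space['log_sqrt'].regressor_[-1].coef_)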

What’s the limitation?

Once you want a more systematic sweep of transformations, the manual loop becomes clunky. You may want to exclude specific combinations, include a target transformation only with certain feature transforms, and keep evaluation settings consistent. That’s precisely the use case for scikit-learn’s GridSearchCV, which lets you enumerate valid options and select the best estimator from that space.

Moving to GridSearchCV with multiple parameter grids

A flexible way to structure the search is to define a generic Pipeline with two steps and then pass a list of parameter grids. Using more than one grid makes it easy to forbid “unnecessary” or invalid mixes by simply not listing them. It also accommodates a TransformedTargetRegressor when you need to transform the target.

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
import numpy as np
base_flow = Pipeline([
    ('transformer', 'passthrough'),  # placeholder; each grid swaps in a transformer
    ('estimator', LinearRegression())
])
param_maps = [
    # Grid 1: feature transformations feeding directly into LinearRegression
    # (a 'passthrough' entry could be added here to also test raw features)
    {'transformer': [FunctionTransformer(np.log, np.exp), PowerTransformer()]},
    # Grid 2: log-transformed features paired with a sqrt-transformed target
    {
        'transformer': [FunctionTransformer(np.log, np.exp)],
        'estimator': [
            TransformedTargetRegressor(
                regressor=LinearRegression(),
                func=np.sqrt,
                inverse_func=np.square
            )
        ]
    }
]
searcher = GridSearchCV(base_flow, param_maps, cv=5)
searcher.fit(X_mat, y_vec)  # X_mat, y_vec: the full feature matrix and target
chosen = searcher.best_estimator_

This setup evaluates only the combinations you explicitly allow. The first grid explores feature transformations feeding directly into LinearRegression. The second grid adds a path where features are log-transformed and the target is transformed via TransformedTargetRegressor. This structure avoids unwanted pairings by design and keeps the search space readable.
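
To see how each allowed combination fared, cv_results_ can be loaded into a DataFrame. The column names below follow scikit-learn's cv_results_ conventions:

import pandas as pd

# One row per evaluated candidate; 'params' holds the exact grid entry
results = pd.DataFrame(searcher.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
      .sort_values('rank_test_score'))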

One trade-off compared to the handcrafted dictionary is access to the individual fitted variants. In the manual loop, each fitted model is at your fingertips. With GridSearchCV you get best_estimator_ (the winner, refit on the full data) and the per-candidate metrics in cv_results_, but the models fitted during cross-validation are discarded rather than kept around.
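
If you do need one specific variant fitted on the full data, you can reconstruct it from its grid entry by cloning the base pipeline and applying the stored parameters. A sketch, using a hypothetical candidate index:

from sklearn.base import clone

# Rebuild one candidate exactly as the search configured it
params = searcher.cv_results_['params'][1]
variant = clone(base_flow).set_params(**params)
variant.fit(X_mat, y_vec)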

Why this matters

When you compare transformations consistently, you reduce ad-hoc decisions and keep configuration under control. A single search object consolidates evaluation logic, enables uniform cross-validation settings, and integrates target transformation alongside feature pipelines.
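
Concretely, pinning the folds and the metric on the search object makes every candidate face identical conditions. A sketch with a shuffled KFold and negative MSE as the metric; both choices are illustrative, not prescribed by the original:

from sklearn.model_selection import GridSearchCV, KFold

# The same folds and the same metric apply to every candidate
cv = KFold(n_splits=5, shuffle=True, random_state=0)
searcher = GridSearchCV(
    base_flow,
    param_maps,
    cv=cv,
    scoring='neg_mean_squared_error',
)
searcher.fit(X_mat, y_vec)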

Conclusion

If you are testing several regression transformations, start from a generic Pipeline with named steps and enumerate only the combinations you truly want in a list of parameter grids. Fit with GridSearchCV and retrieve the winning estimator via best_estimator_. If you need to keep hold of every fitted variant for custom inspection, the simple dictionary-and-loop pattern remains a practical alternative. Choose the path that matches how much control and automation you need at evaluation time.

The article is based on a question and answer on StackOverflow, both by p1xel.