2025, Nov 26 03:00

How to get true in-sample predictions in skforecast using ForecasterAutoreg and properly aligned exogenous features

Learn why shifting dates fails for in-sample predictions in skforecast and how to get fitted values using ForecasterAutoreg with aligned exogenous features.

Generating in-sample predictions with skforecast can be confusing if you approach it the same way as out-of-sample forecasting. It’s tempting to “replay” the past by shifting dates and asking the forecaster to predict over this artificial horizon. That looks like it should work, but it does not produce true in-sample fitted values. Below is a minimal, reproducible example of the pitfall, followed by the correct way to obtain in-sample predictions that match the model’s training setup.

Problem setup

The following example shows a common but incorrect strategy: shifting exogenous dates so historical data is treated as future input, then calling predict. The code runs, yet the output is not in-sample fitted values.

import pandas as pd
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from sklearn.linear_model import LinearRegression

# Sample data
target = pd.Series([1, 2, 3, 4, 5], index=pd.date_range("2023-01-01", periods=5, freq="D"))
aux = pd.DataFrame({"var": [10, 20, 30, 40, 50]}, index=target.index)

# Fit forecaster
predictor = ForecasterAutoreg(regressor=LinearRegression(), lags=2)
predictor.fit(y=target, exog=aux)

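# Attempt to generate in-sample predictions by shifting exogenous dates
def make_insample_guess(model, exog_frame, freq="D"):
    step_count = exog_frame.shape[0]
    last_seen = model.last_window_.index.max()
    # Derive the step from freq instead of hard-coding one day
    one_step = pd.tseries.frequencies.to_offset(freq)
    exog_future = exog_frame.copy()
    # Relabel the historical exog rows so they look like future dates
    exog_future.index = pd.date_range(start=last_seen + one_step, periods=step_count, freq=freq)
    # This is a recursive forecast from the end of training, relabeled with past dates
    return model.predict(steps=step_count, exog=exog_future).set_axis(exog_frame.index)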

# Produces values, but they are not true in-sample fitted values
guess = make_insample_guess(predictor, aux)
print(guess)

What’s actually going wrong

This approach does not generate in-sample predictions. It triggers the standard recursive multi-step forecast starting from the end of the training data, and it feeds that forecast exogenous values that belong to past dates. In-sample predictions, in contrast, must use the same lag structure and exogenous values the model saw during training, producing one-step-ahead fitted values for every index that has a full set of lags. With lags=2 and five observations, that means three fitted values, for the third through fifth dates.

In other words, simply shifting dates and forecasting forward creates a new, synthetic out-of-sample scenario. It does not reconstruct how the model would have predicted each point during training with the proper features constructed from historical lags and the original exogenous data.
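To make the distinction concrete, here is a minimal sketch that does not use skforecast at all. It fits a hand-rolled AR(2) linear model with NumPy on made-up data (exogenous features are left out for brevity) and contrasts one-step-ahead fitted values, which use the observed lags, with a recursive forecast, which feeds its own predictions back in. The data, variable names, and the least-squares fit are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy series with a little noise (values are made up for illustration).
idx = pd.date_range("2023-01-01", periods=8, freq="D")
y = pd.Series([1.0, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9], index=idx)

# Hand-built AR(2) design matrix: intercept, lag_1, lag_2.
frame = pd.DataFrame({"lag_1": y.shift(1), "lag_2": y.shift(2)}).dropna()
X = np.column_stack([np.ones(len(frame)), frame.to_numpy()])
coef, *_ = np.linalg.lstsq(X, y.loc[frame.index].to_numpy(), rcond=None)

# One-step-ahead fitted values: every row uses the *observed* lags,
# so the result is indexed on training dates (from the third date on).
fitted = pd.Series(X @ coef, index=frame.index, name="fitted")

# Recursive forecast: starts after the last observation and feeds each
# prediction back in as a lag for the next step.
window = list(y.iloc[-2:])
steps_ahead = []
for _ in range(3):
    nxt = coef[0] + coef[1] * window[-1] + coef[2] * window[-2]
    steps_ahead.append(nxt)
    window.append(nxt)
future_idx = pd.date_range(idx[-1] + pd.Timedelta(days=1), periods=3, freq="D")
recursive = pd.Series(steps_ahead, index=future_idx, name="recursive")

print(fitted)
print(recursive)
```

The two series share no dates: the fitted values describe the training period, while the recursive forecast describes the days after it. Relabeling the second with past dates does not turn it into the first.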

The right approach

To obtain in-sample predictions, use the forecaster’s feature construction on the training series and exogenous data, then run the already-fitted regressor on those features. This yields the fitted values aligned to the original, predictable indices.

import pandas as pd
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from sklearn.linear_model import LinearRegression

# Sample data
ts = pd.Series([1, 2, 3, 4, 5], index=pd.date_range("2023-01-01", periods=5, freq="D"))
exo = pd.DataFrame({"var": [10, 20, 30, 40, 50]}, index=ts.index)

# Fit forecaster
autoreg = ForecasterAutoreg(regressor=LinearRegression(), lags=2)
autoreg.fit(y=ts, exog=exo)

# Proper in-sample prediction using the training design matrix
def fitted_insample(model, y_series, exog_frame=None):
    X_mat, y_obs = model.create_train_X_y(y=y_series, exog=exog_frame)
    y_hat = model.regressor.predict(X_mat)
    return pd.Series(y_hat, index=y_obs.index, name="in_sample_pred")

insample = fitted_insample(autoreg, y_series=ts, exog_frame=exo)
print(insample)
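For intuition about which dates receive a fitted value, the following sketch rebuilds a comparable design matrix by hand with pandas for the same toy data. It approximates what a lags=2 training matrix contains; the column names and layout are assumptions and may differ from what skforecast actually produces.

```python
import pandas as pd

ts = pd.Series([1, 2, 3, 4, 5], index=pd.date_range("2023-01-01", periods=5, freq="D"))
exo = pd.DataFrame({"var": [10, 20, 30, 40, 50]}, index=ts.index)

n_lags = 2

# One column per lag: lag_k at time t holds the target value at t - k.
features = pd.DataFrame({f"lag_{k}": ts.shift(k) for k in range(1, n_lags + 1)})

# Exogenous values are taken at the same timestamp as the target they explain.
features = features.join(exo)

# The first n_lags rows have incomplete lags and cannot be predicted.
features = features.dropna()
target = ts.loc[features.index]

print(features)
print(target)
```

Only 2023-01-03 through 2023-01-05 survive, which is why in-sample predictions are shorter than the training series by the maximum lag. If the forecaster is configured with a target transformer or differencing, it is worth checking whether the fitted values come back on the transformed scale and need to be inverted before comparing them with the raw series.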

Why this matters

In-sample predictions serve a different purpose than future forecasting. They let you evaluate how the fitted model reproduces the training data under the exact lagged features and exogenous inputs used during fitting. This is essential for diagnostics, residual analysis, and sanity checks. Treating historical data as “future” via date shifting blurs that line and can mask issues in feature alignment or model fit.
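As a sketch of the diagnostics this enables, the helper below aligns observed and fitted values on their shared index and computes a few residual summaries. The function name residual_report and the fitted values are hypothetical; in practice the fitted series would come from the in-sample prediction step.

```python
import numpy as np
import pandas as pd

def residual_report(observed: pd.Series, fitted: pd.Series) -> dict:
    """Hypothetical helper: summarize residuals on the dates both series share."""
    # Inner alignment drops the first max-lag observations, which have no fitted value.
    obs, fit = observed.align(fitted, join="inner")
    resid = obs - fit
    return {
        "n": int(resid.size),
        "mae": float(resid.abs().mean()),
        "rmse": float(np.sqrt((resid ** 2).mean())),
        "bias": float(resid.mean()),
    }

idx = pd.date_range("2023-01-01", periods=5, freq="D")
observed = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)
fitted = pd.Series([3.5, 4.0, 4.5], index=idx[2:])  # made-up fitted values

report = residual_report(observed, fitted)
print(report)
```

Aligning by index rather than by position is the point of the design: if the fitted values were mislabeled, as with shifted dates, the residuals would be computed against the wrong observations.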

Takeaway

If you need in-sample predictions, build the feature matrix from the original training window and exogenous data, then call the regressor on that matrix. If you need out-of-sample forecasts, ask the forecaster to step forward from the last training index with appropriately aligned future exogenous values. Keeping these two paths separate helps you evaluate fit correctly and forecast confidently.
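For the out-of-sample path, the future exogenous frame should be indexed on the periods immediately after the last training date and should contain genuinely new values. A minimal sketch of that alignment, using a hypothetical helper named build_future_exog and made-up future values:

```python
import pandas as pd

def build_future_exog(train_index: pd.DatetimeIndex, values: dict, steps: int) -> pd.DataFrame:
    """Hypothetical helper: index new exog values on the `steps` periods after training."""
    freq = train_index.freq if train_index.freq is not None else pd.infer_freq(train_index)
    if freq is None:
        raise ValueError("Training index has no regular frequency")
    offset = pd.tseries.frequencies.to_offset(freq)
    future_index = pd.date_range(start=train_index[-1] + offset, periods=steps, freq=offset)
    # pandas raises if any column does not have exactly `steps` values.
    return pd.DataFrame(values, index=future_index)

train_index = pd.date_range("2023-01-01", periods=5, freq="D")
future_exog = build_future_exog(train_index, {"var": [60, 70, 80]}, steps=3)
print(future_exog)
```

A frame built this way could then be passed to the forecaster in the same manner as the earlier predict(steps=..., exog=...) call, with the difference that the values describe the future rather than relabeled history.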
