2025, Nov 07 03:00

Average MSE Correctly in PyTorch: Divide by Batches and Know When to Use Train vs Eval

Learn the correct way to average MSE in PyTorch: divide summed batch means by batches, not samples. Understand train vs eval mode for accurate training metrics.

How to correctly average MSE across batches and when to use train vs eval mode

When training a regression model, it is common to summarize performance per epoch with Mean Squared Error (MSE). A subtle implementation detail can easily skew this metric: whether you divide the accumulated batch losses by the number of samples in the dataset or by the number of batches. Another recurring question is which mode the model should be in when you compute a training metric.

Problem example

The following training loop accumulates MSE from each mini-batch and then divides by the dataset length. It also tracks a correlation metric for both splits.

# Assumes device, robust_pearsonr, train_data, and val_data are defined elsewhere in the script.
import torch
import torch.nn as nn
import torch.optim as optim

def fit_and_validate(net, loader_tr, loader_va, epochs=40, lr_rate=1e-3, wd=1e-5, patience_n=5):
    net = net.to(device)
    mse_fn = nn.MSELoss()
    opt = optim.Adam(net.parameters(), lr=lr_rate, weight_decay=wd)
    best_mse_val = float('inf')
    stagnation = 0
    for ep in range(epochs):
        net.train()
        agg_train_loss, preds_tr, gts_tr = 0, [], []
        for feats, y in loader_tr:
            feats, y = feats.to(device), y.to(device)
            out = net(feats)
            loss_val = mse_fn(out, y)
            opt.zero_grad()
            loss_val.backward()
            opt.step()
            agg_train_loss += loss_val.item()
            preds_tr.extend(out.detach().cpu().numpy())
            gts_tr.extend(y.cpu().numpy())
        net.eval()
        agg_val_loss, preds_va, gts_va = 0, [], []
        with torch.no_grad():
            for feats, y in loader_va:
                feats, y = feats.to(device), y.to(device)
                out = net(feats)
                loss_val = mse_fn(out, y)
                agg_val_loss += loss_val.item()
                preds_va.extend(out.detach().cpu().numpy())
                gts_va.extend(y.cpu().numpy())
        # Bug: agg_train_loss and agg_val_loss are sums of per-batch means,
        # so dividing by the number of samples (len(train_data), len(val_data))
        # shrinks the epoch MSE by roughly a factor of the batch size.
        mse_tr = agg_train_loss / len(train_data)
        pc_tr = robust_pearsonr(preds_tr, gts_tr)
        mse_va = agg_val_loss / len(val_data)
        pc_va = robust_pearsonr(preds_va, gts_va)

What’s going wrong and why

The per-batch loss produced by nn.MSELoss() is already a mean over the samples within that batch. During an epoch you sum one mean value per batch. By the time you finish the loop, you have the sum of batch means, not the sum of per-sample squared errors. To convert that sum of batch means into an epoch mean, the correct normalization is the number of batches in the epoch. That’s exactly the length of the data loader.
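
A small, self-contained sketch (with made-up tensors, not the model from the question) makes the difference concrete: dividing the sum of batch means by the number of batches recovers the dataset-level MSE, while dividing by the number of samples shrinks it by roughly a factor of the batch size.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
preds = torch.randn(100, 1)    # stand-in model outputs
targets = torch.randn(100, 1)  # stand-in ground truth
loader = DataLoader(TensorDataset(preds, targets), batch_size=25)

mse_fn = nn.MSELoss()
agg = sum(mse_fn(p, t).item() for p, t in loader)  # sum of 4 batch means

print(agg / len(loader))              # mean of batch means, matches the full MSE
print(agg / len(loader.dataset))      # wrong: about 25x too small
print(mse_fn(preds, targets).item())  # reference MSE over all 100 samples

With equal batch sizes the mean of batch means equals the full MSE exactly; with a smaller final batch it is a close approximation.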

The fix

Divide the accumulated batch means by the number of batches, i.e., by the loader length for the corresponding split.

# Same imports and assumptions as the snippet above; only the epoch-level averaging changes.
def fit_and_validate(net, loader_tr, loader_va, epochs=40, lr_rate=1e-3, wd=1e-5, patience_n=5):
    net = net.to(device)
    mse_fn = nn.MSELoss()
    opt = optim.Adam(net.parameters(), lr=lr_rate, weight_decay=wd)
    best_mse_val = float('inf')
    stagnation = 0
    for ep in range(epochs):
        net.train()
        agg_train_loss, preds_tr, gts_tr = 0, [], []
        for feats, y in loader_tr:
            feats, y = feats.to(device), y.to(device)
            out = net(feats)
            loss_val = mse_fn(out, y)
            opt.zero_grad()
            loss_val.backward()
            opt.step()
            agg_train_loss += loss_val.item()
            preds_tr.extend(out.detach().cpu().numpy())
            gts_tr.extend(y.cpu().numpy())
        net.eval()
        agg_val_loss, preds_va, gts_va = 0, [], []
        with torch.no_grad():
            for feats, y in loader_va:
                feats, y = feats.to(device), y.to(device)
                out = net(feats)
                loss_val = mse_fn(out, y)
                agg_val_loss += loss_val.item()
                preds_va.extend(out.detach().cpu().numpy())
                gts_va.extend(y.cpu().numpy())
        # Fix: divide the sums of batch means by the number of batches,
        # which is exactly what len(loader_tr) and len(loader_va) return.
        mse_tr = agg_train_loss / len(loader_tr)
        pc_tr = robust_pearsonr(preds_tr, gts_tr)
        mse_va = agg_val_loss / len(loader_va)
        pc_va = robust_pearsonr(preds_va, gts_va)
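
One caveat: when the final batch is smaller (drop_last=False), the mean of batch means only approximates the per-sample MSE. If you want the exact per-sample value, a minimal sketch, reusing the names from the snippet above, is to weight each batch mean by its batch size and divide by the total sample count:

# Sketch: exact per-sample validation MSE even with an unequal final batch.
sse, n_samples = 0.0, 0
with torch.no_grad():
    for feats, y in loader_va:
        feats, y = feats.to(device), y.to(device)
        out = net(feats)
        sse += mse_fn(out, y).item() * y.size(0)  # undo the per-batch averaging
        n_samples += y.size(0)
mse_va_exact = sse / n_samples  # agrees with the len(loader_va) version when all batches are equal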

Which mode to use when tracking train MSE

If your loss function is nn.MSELoss, the per-batch training loss already is the train MSE, and it is naturally computed in training mode as part of the backward pass. If you instead want MSE as a standalone metric rather than as the loss, compute it in eval mode: layers such as dropout and batch normalization behave differently in the two modes, so an eval-mode pass reflects how the model actually behaves at inference time, as sketched below.
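
A minimal sketch of such a pass, reusing the variable names from the loops above, is an extra no-grad sweep over the training loader in eval mode at the end of each epoch:

# Optional: measure train MSE in eval mode, separately from the training loss.
net.eval()                    # dropout off, batch norm uses running statistics
agg_eval_tr = 0.0
with torch.no_grad():         # metric only, no gradients needed
    for feats, y in loader_tr:
        feats, y = feats.to(device), y.to(device)
        agg_eval_tr += mse_fn(net(feats), y).item()
mse_tr_eval = agg_eval_tr / len(loader_tr)  # again, divide by the number of batches
net.train()                   # switch back before the next training epoch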

Why this detail matters

Normalizing a sum of batch means by the dataset size understates the epoch metric by roughly a factor of the batch size, which can mask real changes from one epoch to the next. Dividing by the number of batches aligns the epoch average with what was actually aggregated. That makes train–validation comparisons more reliable and gives you consistent numbers for judging whether the model is starting to overfit.

Bottom line

When you sum one mean-per-batch loss across an epoch, divide by the number of batches, not by the dataset size. If your loss is MSE, you already have train MSE from the training loop; if you compute it separately, compute it in eval mode. Keep these two points straight and your monitoring will reflect the model’s behavior accurately.

The article is based on a question from StackOverflow by mansi and an answer by nicod.