2025, Oct 27 17:00

Batch Size at Inference in PyTorch: Will Full-Batch vs Mini-Batch Change Your Model's Outputs?

Learn whether batch size affects inference outputs in PyTorch. Compare full-batch vs mini-batch, see code patterns, plus a sanity check for matching predictions.

Batch size is one of those knobs that matter a lot during training, so it’s natural to wonder whether it also constrains how you should run inference. If you already have a trained PyTorch model and want to score new data, is it acceptable to pass the entire dataset in a single batch instead of iterating through smaller chunks? And will choosing a different batch size at prediction time change the outputs?

Example of the inference pattern in question

The pattern below loads all samples into one batch and runs a single forward pass:

import torch
from torch.utils.data import DataLoader


def infer_once(self, inputs):
    # Wrap the inputs in a DataLoader whose batch size equals the dataset size
    ds_loader = DataLoader(
        inputs,
        batch_size=inputs.shape[0],
        shuffle=False,
    )

    # Single forward pass over the whole dataset, with gradient tracking disabled
    minibatch = next(iter(ds_loader))
    with torch.no_grad():
        outputs = self.net(minibatch)

    return outputs

What actually changes with batch size

Training and inference use batches for different reasons. During training, batches approximate the overall data distribution so that the gradient computed by backprop aligns well with the true gradient. That’s why batch size interacts with other optimization choices and often gets tuned. During inference, the goal is simply to apply the already learned parameters to new samples. Batching then is mainly about parallelism and throughput: grouping samples lets you use the accelerator more efficiently, but it does not change the function the model computes on each item.
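
To see the throughput side in isolation, you can time a per-sample loop against a single batched forward pass. The toy linear model and tensor sizes below are illustrative assumptions, not taken from the original code:

# Rough throughput illustration (toy model and sizes are illustrative assumptions)
import time
import torch

net = torch.nn.Linear(128, 10)   # stand-in for a trained model
net.eval()
X = torch.randn(4096, 128)

with torch.no_grad():
    t0 = time.perf_counter()
    for i in range(X.shape[0]):   # one sample per forward pass
        net(X[i : i + 1])
    t1 = time.perf_counter()
    net(X)                        # all samples in a single forward pass
    t2 = time.perf_counter()

print(f"per-sample loop: {t1 - t0:.4f}s, single full batch: {t2 - t1:.4f}s")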

In normal practice, a trained model run in evaluation mode (model.eval(), so that layers such as batch normalization and dropout rely on their stored statistics rather than per-batch ones) should produce the same result for a given input sample regardless of which other samples share its batch. The batch size you choose at inference time is generally independent of what you used during training. The outputs for the same inputs should match for any reasonable batch size, aside from the tiny numerical differences that come with floating-point arithmetic.
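
A quick way to convince yourself is a toy model containing a batch-dependent layer such as BatchNorm1d. The model and data below are made up purely for this demonstration; in eval mode the layer uses stored running statistics, so splitting the batch changes nothing:

# Toy demonstration (model and data are assumptions for the demo)
import torch

net = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.BatchNorm1d(8))
X = torch.randn(32, 8)

net.eval()                        # use stored statistics, not batch statistics
with torch.no_grad():
    full = net(X)                                     # all samples in one batch
    split = torch.cat([net(X[:16]), net(X[16:])])     # same samples in two batches

print(torch.allclose(full, split))   # True in eval mode; may be False in train mode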

Practical takeaway and how to structure inference

If your hardware can handle it, running full-batch inference can be faster because it maximizes parallel work, and since no gradients are computed the memory footprint is lower than during training. If it cannot, use a smaller batch size that fits your memory or latency constraints. Either way, the predicted values for identical inputs should not depend on how many items are grouped together.

Here is a compact pattern for inference that lets you choose any batch size while preserving the per-sample results. It accumulates outputs across batches when you don’t pass the entire dataset at once.

def infer_batched(self, inputs, bsz=None):
    # Default to a single full batch when no batch size is given
    size = inputs.shape[0] if bsz is None else bsz
    ds_loader = DataLoader(
        inputs,
        batch_size=size,
        shuffle=False,
    )

    # Accumulate per-batch outputs and stitch them back together in order
    preds = []
    with torch.no_grad():
        for part in ds_loader:
            preds.append(self.net(part))

    return torch.cat(preds, dim=0)
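
If you would rather attempt the full batch first and only drop to smaller chunks when the GPU runs out of memory, one possible pattern is sketched below. It assumes a recent PyTorch version that exposes torch.cuda.OutOfMemoryError, and the function name and chunk size are illustrative, not part of the original answer:

# Sketch: try one full-batch pass, fall back to chunked inference on CUDA OOM
import torch

def infer_with_fallback(net, inputs, fallback_bsz=256):
    net.eval()
    with torch.no_grad():
        try:
            return net(inputs)  # everything in a single forward pass
        except torch.cuda.OutOfMemoryError:
            # Split along the batch dimension and stitch the results back together
            parts = [net(chunk) for chunk in torch.split(inputs, fallback_bsz)]
            return torch.cat(parts, dim=0)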

Does changing batch size affect results?

It shouldn’t. A simple sanity check is to run the same inputs through the same model parameters with two different batch sizes and compare the outputs for equality within floating-point tolerance. Identical inputs should yield the same predictions item-wise, independent of batching.

# Example sanity check: same inputs, same weights, two different batch sizes
scores_1 = model.infer_batched(X, bsz=1)
scores_2 = model.infer_batched(X, bsz=X.shape[0])

# True if the outputs agree within floating-point tolerance
ok = torch.allclose(scores_1, scores_2)
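
If you prefer a check that raises with a detailed mismatch report instead of returning a boolean, torch.testing.assert_close performs the same comparison; the tolerances below are illustrative:

# Raises an AssertionError describing any element-wise mismatch
torch.testing.assert_close(scores_1, scores_2, rtol=1e-5, atol=1e-6)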

Why this matters

Separating the roles of batch size helps you optimize the right thing at the right time. During training, tune batch size to interact well with learning rate and optimization. During inference, choose batch size for efficiency and resource constraints, confident that predictions for each sample remain the same across batching strategies.

Conclusion

You are free to pick any reasonable batch size at inference time. Using a full dataset as one batch is fine if it fits in memory and meets your throughput or latency goals. If you prefer, process data in smaller batches without worrying about changes in predicted outputs. When in doubt, perform a quick equality check across two batch sizes to validate that the model treats samples independently during prediction.

The article is based on a question from StackOverflow by pgaluzio and an answer by simon.