Nov 18, 2025, 11:00
Fixing PyTorch Lightning Shutdown Hangs on macOS (MPS): Training Ends but the Process Never Exits
PyTorch Lightning hangs on macOS MPS: training ends but the process won't exit. Cause: DataLoader worker teardown with num_workers > 0. Fix: update to a PyTorch nightly build and keep multi-worker data loading.
When training a basic CIFAR‑10 model with PyTorch Lightning on a Mac Studio with an M4 Max, the training loop can finish successfully, yet the process refuses to exit. The console reports that training stopped because max_epochs was reached, but the program then stalls indefinitely until you interrupt it manually. Reducing num_workers to zero avoids the hang at the cost of slower data loading. Removing the validation loader does not change the behavior, and the issue also reproduces with num_workers set to one.
Minimal example that reproduces the hang
The following Lightning script mirrors a standard CIFAR‑10 training setup and exhibits the termination issue after a successful run. The core logic remains unchanged while identifiers are renamed for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class TinyCifarNet(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Three conv blocks, each followed by 2x2 max pooling: 32x32 -> 4x4.
        self.conv_a = nn.Conv2d(3, 32, 3, padding=1)
        self.conv_b = nn.Conv2d(32, 64, 3, padding=1)
        self.conv_c = nn.Conv2d(64, 64, 3, padding=1)
        self.down = nn.MaxPool2d(2, 2)
        self.dense_a = nn.Linear(64 * 4 * 4, 512)
        self.dense_b = nn.Linear(512, 10)

    def forward(self, data):
        data = self.down(F.relu(self.conv_a(data)))
        data = self.down(F.relu(self.conv_b(data)))
        data = self.down(F.relu(self.conv_c(data)))
        data = data.view(-1, 64 * 4 * 4)
        data = F.relu(self.dense_a(data))
        data = self.dense_b(data)
        return data

    def training_step(self, pack, step_idx):
        inputs, targets = pack
        scores = self(inputs)
        loss = F.cross_entropy(scores, targets)
        top1 = (scores.argmax(1) == targets).float().mean()
        self.log("train_loss", loss)
        self.log("train_acc", top1)
        return loss

    def validation_step(self, pack, step_idx):
        inputs, targets = pack
        scores = self(inputs)
        loss = F.cross_entropy(scores, targets)
        top1 = (scores.argmax(1) == targets).float().mean()
        self.log("val_loss", loss)
        self.log("val_acc", top1)

    def test_step(self, pack, step_idx):
        inputs, targets = pack
        scores = self(inputs)
        loss = F.cross_entropy(scores, targets)
        top1 = (scores.argmax(1) == targets).float().mean()
        self.log("test_loss", loss)
        self.log("test_acc", top1)

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=1e-3)
        return opt


if __name__ == "__main__":
    aug_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    aug_eval = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])

    ds_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=aug_train)
    ds_eval = datasets.CIFAR10(root="./data", train=False, download=True, transform=aug_eval)

    # Multi-worker loaders; the hang appears whenever num_workers > 0.
    ldr_train = DataLoader(ds_train, batch_size=64, shuffle=True, num_workers=14, persistent_workers=True)
    ldr_eval = DataLoader(ds_eval, batch_size=64, shuffle=False, num_workers=14, persistent_workers=True)

    net = TinyCifarNet()
    runner = L.Trainer(max_epochs=5, accelerator="mps", devices="auto")
    runner.fit(net, ldr_train, ldr_eval)
What’s actually happening
The symptom looks like a Lightning problem, but it is not specific to Lightning: a plain PyTorch training loop shows the same behavior on this setup and likewise refuses to terminate cleanly after training ends. Lowering num_workers to zero allows the script to exit, which strongly suggests the issue lies in how worker processes are torn down. The environment where this was observed used PyTorch 2.7.1, PyTorch Lightning 2.5.1, macOS 15.5, and MPS acceleration on an M4 Max Mac Studio, and the hang reproduced even with num_workers set to one. A very basic Lightning tutorial script can appear unaffected, which is why the issue is confusing at first glance.
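One quick way to confirm this outside Lightning is a bare training loop over the same kind of multi-worker DataLoader. The sketch below is an illustration of that check rather than code from the original report, and it uses a deliberately tiny stand-in model; on an affected setup the final print appears but the interpreter never returns to the shell:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

if __name__ == "__main__":
    ds = datasets.CIFAR10(root="./data", train=True, download=True, transform=transforms.ToTensor())
    # Same loader settings that trigger the hang under Lightning.
    ldr = DataLoader(ds, batch_size=64, shuffle=True, num_workers=14, persistent_workers=True)

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One epoch is enough to reach the worker teardown path.
    for inputs, targets in ldr:
        inputs, targets = inputs.to(device), targets.to(device)
        loss = F.cross_entropy(model(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print("training loop finished")  # on an affected setup, the process hangs after this line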
Fix that worked reliably
Two approaches helped in practice. A quick exit can be forced by calling os._exit(0) right after fit() returns, but that simply terminates the interpreter and bypasses normal cleanup such as atexit handlers and buffered writes, so treat it as a stopgap. The durable fix was to update PyTorch to a nightly build, which resolved the termination hang in this environment.
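As a placement illustration only, here is a minimal sketch of that stopgap; everything before runner.fit stays exactly as in the example above, and only the tail of the __main__ block changes:

if __name__ == "__main__":
    # ... transforms, datasets, loaders, and net built exactly as in the example above ...
    runner = L.Trainer(max_epochs=5, accelerator="mps", devices="auto")
    runner.fit(net, ldr_train, ldr_eval)

    # Terminate the interpreter even if worker teardown hangs. This skips atexit
    # handlers and any pending cleanup, so make sure checkpoints and logs are
    # already flushed to disk before relying on it.
    import os
    os._exit(0)

The nightly build, in turn, is installed with: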
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
At the time of writing, this installs a development build of the upcoming 2.8 series and eliminates the issue described above.
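Before re-running training, it can be worth confirming which build is active and that the MPS backend is still detected; a quick check using standard PyTorch APIs:

import torch

print(torch.__version__)                  # should report a 2.8 dev/nightly version string
print(torch.backends.mps.is_available())  # True when the MPS backend is usable on this machine
print(torch.backends.mps.is_built())      # True when this build ships MPS support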
Why you should care
Silent hangs on shutdown are easy to miss in CI pipelines, notebooks, or long-running jobs scheduled on shared hardware. They waste GPU or MPS time, hold on to file handles, and complicate automation around training jobs. Knowing that the behavior is not Lightning-specific and that an updated PyTorch build fixes it can save hours of debugging loader settings and callbacks that are not at fault.
Practical wrap‑up
If your PyTorch Lightning training completes but the process never exits on macOS with MPS, do not immediately rework the training loop or strip out functionality. First confirm that the same behavior appears in a minimal plain-PyTorch script. If it does, consider updating to a recent PyTorch nightly build, which in this case resolved the issue cleanly. Setting num_workers to zero is a functional workaround but slows data loading, so it is best treated as a temporary measure. Keeping the framework stack current is often the simplest way to avoid edge cases in multiprocessing teardown.
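If the zero-worker workaround is needed temporarily, one low-friction option is to make the worker count configurable so it can be flipped off again without editing the script. This is an illustration, not part of the original report; it assumes ds_train from the example above and a hypothetical CIFAR_WORKERS environment variable:

import os
from torch.utils.data import DataLoader

# Hypothetical toggle: CIFAR_WORKERS=0 enables the slow-but-safe workaround,
# any positive value restores multi-worker loading once the fix is in place.
workers = int(os.environ.get("CIFAR_WORKERS", "14"))

ldr_train = DataLoader(
    ds_train,
    batch_size=64,
    shuffle=True,
    num_workers=workers,
    # persistent_workers is only valid when at least one worker is used.
    persistent_workers=workers > 0,
)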