Nov 18, 2025, 11:00
Fixing PyTorch Lightning Shutdown Hangs on macOS (MPS): Training Ends but the Process Never Exits
PyTorch Lightning hangs on macOS MPS: training ends but the process won't exit. Cause: DataLoader worker teardown with num_workers > 0. Fix: update to a PyTorch nightly build and keep multi-worker data loading.
When training a basic CIFAR‑10 model with PyTorch Lightning on a Mac Studio with an M4 Max, the training loop can finish successfully, yet the process refuses to exit. The console reports that training stopped because max_epochs was reached, but the program then stalls indefinitely until you interrupt it manually. Reducing num_workers to zero avoids the hang at the cost of slower data loading. Removing the validation loader does not change the behavior, and the issue also reproduces with num_workers set to one.
Minimal example that reproduces the hang
The following Lightning script mirrors a standard CIFAR‑10 training setup and exhibits the termination issue after a successful run. The core logic remains unchanged while identifiers are renamed for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class TinyCifarNet(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Three conv blocks, each followed by 2x2 max pooling: 32x32 -> 4x4.
        self.conv_a = nn.Conv2d(3, 32, 3, padding=1)
        self.conv_b = nn.Conv2d(32, 64, 3, padding=1)
        self.conv_c = nn.Conv2d(64, 64, 3, padding=1)
        self.down = nn.MaxPool2d(2, 2)
        self.dense_a = nn.Linear(64 * 4 * 4, 512)
        self.dense_b = nn.Linear(512, 10)

    def forward(self, data):
        data = self.down(F.relu(self.conv_a(data)))
        data = self.down(F.relu(self.conv_b(data)))
        data = self.down(F.relu(self.conv_c(data)))
        data = data.view(-1, 64 * 4 * 4)
        data = F.relu(self.dense_a(data))
        data = self.dense_b(data)
        return data

    def training_step(self, pack, step_idx):
        inputs, targets = pack
        scores = self(inputs)
        loss = F.cross_entropy(scores, targets)
        top1 = (scores.argmax(1) == targets).float().mean()
        self.log("train_loss", loss)
        self.log("train_acc", top1)
        return loss

    def validation_step(self, pack, step_idx):
        inputs, targets = pack
        scores = self(inputs)
        loss = F.cross_entropy(scores, targets)
        top1 = (scores.argmax(1) == targets).float().mean()
        self.log("val_loss", loss)
        self.log("val_acc", top1)

    def test_step(self, pack, step_idx):
        inputs, targets = pack
        scores = self(inputs)
        loss = F.cross_entropy(scores, targets)
        top1 = (scores.argmax(1) == targets).float().mean()
        self.log("test_loss", loss)
        self.log("test_acc", top1)

    def configure_optimizers(self):
        opt = torch.optim.Adam(self.parameters(), lr=1e-3)
        return opt


if __name__ == "__main__":
    aug_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])
    aug_eval = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
    ])

    ds_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=aug_train)
    ds_eval = datasets.CIFAR10(root="./data", train=False, download=True, transform=aug_eval)

    # Multi-worker loaders; the hang appears whenever num_workers > 0.
    ldr_train = DataLoader(ds_train, batch_size=64, shuffle=True, num_workers=14, persistent_workers=True)
    ldr_eval = DataLoader(ds_eval, batch_size=64, shuffle=False, num_workers=14, persistent_workers=True)

    net = TinyCifarNet()
    runner = L.Trainer(max_epochs=5, accelerator="mps", devices="auto")
    runner.fit(net, ldr_train, ldr_eval)
What’s actually happening
The symptom looks like a Lightning problem, but it is not specific to Lightning: a plain PyTorch training loop shows the same behavior on this setup and likewise refuses to terminate cleanly after training ends. Lowering num_workers to zero allows the script to exit, which strongly suggests the issue lies in how worker processes are torn down. The environment where this was observed used PyTorch 2.7.1, PyTorch Lightning 2.5.1, macOS 15.5, and MPS acceleration on an M4 Max Mac Studio, and the hang reproduced even with num_workers set to one. A very basic Lightning tutorial script can appear unaffected, which is why the issue is confusing at first glance.
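One quick way to confirm this outside Lightning is a bare training loop over the same kind of multi-worker DataLoader. The sketch below is an illustration of that check rather than code from the original report, and it uses a deliberately tiny stand-in model; on an affected setup the final print appears but the interpreter never returns to the shell:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

if __name__ == "__main__":
    ds = datasets.CIFAR10(root="./data", train=True, download=True, transform=transforms.ToTensor())
    # Same loader settings that trigger the hang under Lightning.
    ldr = DataLoader(ds, batch_size=64, shuffle=True, num_workers=14, persistent_workers=True)

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One epoch is enough to reach the worker teardown path.
    for inputs, targets in ldr:
        inputs, targets = inputs.to(device), targets.to(device)
        loss = F.cross_entropy(model(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()

    print("training loop finished")  # on an affected setup, the process hangs after this line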
Fix that worked reliably
Two approaches helped in practice. A quick exit can be forced by calling os._exit(0) right after fit() returns, but that simply terminates the interpreter and bypasses normal cleanup such as atexit handlers and buffered writes, so treat it as a stopgap. The durable fix was to update PyTorch to a nightly build, which resolved the termination hang in this environment.
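As a placement illustration only, here is a minimal sketch of that stopgap; everything before runner.fit stays exactly as in the example above, and only the tail of the __main__ block changes:

if __name__ == "__main__":
    # ... transforms, datasets, loaders, and net built exactly as in the example above ...
    runner = L.Trainer(max_epochs=5, accelerator="mps", devices="auto")
    runner.fit(net, ldr_train, ldr_eval)

    # Terminate the interpreter even if worker teardown hangs. This skips atexit
    # handlers and any pending cleanup, so make sure checkpoints and logs are
    # already flushed to disk before relying on it.
    import os
    os._exit(0)

The nightly build, in turn, is installed with: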
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
At the time of writing, this installs a development build of the upcoming 2.8 series and eliminates the issue described above.
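Before re-running training, it can be worth confirming which build is active and that the MPS backend is still detected; a quick check using standard PyTorch APIs:

import torch

print(torch.__version__)                  # should report a 2.8 dev/nightly version string
print(torch.backends.mps.is_available())  # True when the MPS backend is usable on this machine
print(torch.backends.mps.is_built())      # True when this build ships MPS support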
Why you should care
Silent hangs on shutdown are easy to miss in CI pipelines, notebooks, or long-running jobs scheduled on shared hardware. They waste GPU or MPS time, hold on to file handles, and complicate automation around training jobs. Knowing that the behavior is not Lightning-specific and that an updated PyTorch build fixes it can save hours of debugging loader settings and callbacks that are not at fault.
Practical wrap‑up
If your PyTorch Lightning training completes but the process never exits on macOS with MPS, do not immediately rework the training loop or strip out functionality. First confirm that the same behavior appears in a minimal plain-PyTorch script. If it does, consider updating to a recent PyTorch nightly build, which in this case resolved the issue cleanly. Setting num_workers to zero is a functional workaround but slows data loading, so it is best treated as a temporary measure. Keeping the framework stack current is often the simplest way to avoid edge cases in multiprocessing teardown.
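If the zero-worker workaround is needed temporarily, one low-friction option is to make the worker count configurable so it can be flipped off again without editing the script. This is an illustration, not part of the original report; it assumes ds_train from the example above and a hypothetical CIFAR_WORKERS environment variable:

import os
from torch.utils.data import DataLoader

# Hypothetical toggle: CIFAR_WORKERS=0 enables the slow-but-safe workaround,
# any positive value restores multi-worker loading once the fix is in place.
workers = int(os.environ.get("CIFAR_WORKERS", "14"))

ldr_train = DataLoader(
    ds_train,
    batch_size=64,
    shuffle=True,
    num_workers=workers,
    # persistent_workers is only valid when at least one worker is used.
    persistent_workers=workers > 0,
)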