2025, Nov 10 11:00

Why TensorFlow Logs 'Local rendezvous is aborting: CANCELLED' During Distributed Training Are Harmless

Learn why TensorFlow logs like 'Local rendezvous is aborting: CANCELLED' appear during distributed training on Vertex AI. It's normal prefetch cleanup, not an error.

Seeing a burst of TensorFlow logs during distributed training can be unsettling, especially when they mention cancellations. A common example is the Local rendezvous is aborting with status: CANCELLED message, which typically appears around the first epoch when running a Vertex AI pipeline with a distributed strategy. The model still trains fine, so the natural question is whether this indicates a real problem.

What you might see

Local rendezvous is aborting with status: CANCELLED: GetNextFromShard was cancelled [[{{node MultiDeviceIteratorGetNextFromShard}}]] [[RemoteCall]]

Minimal setup that reproduces it

The pattern typically involves a tf.distribute strategy, a tf.data.Dataset built from a generator, and model.fit with explicit steps per epoch.

import tensorflow as tf

# cli_args, BATCH_SZ, win_bundle, model and n_epochs are defined elsewhere in the script.

# Pick a distribution strategy from a CLI flag.
if cli_args.dist_mode == 'single':
    # Modern replacement for the deprecated tf.test.is_gpu_available().
    if tf.config.list_physical_devices('GPU'):
        dist_plan = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    else:
        dist_plan = tf.distribute.OneDeviceStrategy(device="/cpu:0")
elif cli_args.dist_mode == 'mirror':
    dist_plan = tf.distribute.MirroredStrategy()
elif cli_args.dist_mode == 'multi':
    dist_plan = tf.distribute.experimental.MultiWorkerMirroredStrategy()


def build_dataset(make_iterable) -> tuple[tf.data.Dataset, int]:
    # Probe one element for its shapes, then count the rest of the stream.
    tmp_stream = make_iterable()
    for x_probe, y_probe in tmp_stream:
        break
    n_items = sum(1 for _ in tmp_stream) + 1

    pipe = tf.data.Dataset.from_generator(
        make_iterable,
        output_signature=(
            tf.TensorSpec(shape=x_probe.shape, dtype=tf.float16),
            tf.TensorSpec(shape=y_probe.shape, dtype=tf.float16),
        ),
    )
    # repeat() makes the pipeline infinite, so model.fit needs explicit step counts.
    pipe = pipe.batch(BATCH_SZ).repeat()
    return pipe, n_items


with dist_plan.scope():
    train_feed, n_train = build_dataset(lambda: win_bundle.train)
    val_feed, n_val = build_dataset(lambda: win_bundle.val)
    print(n_train, n_val)
    steps_train = n_train // BATCH_SZ
    steps_val = n_val // BATCH_SZ

model.fit(
    train_feed,
    validation_data=val_feed,
    epochs=n_epochs,
    validation_steps=steps_val,
    steps_per_epoch=steps_train,
)

What actually happens and why

This message is an informational side-effect of how tf.data prefetching interacts with the Keras training loop in a distributed environment. During an epoch, each worker requests and processes batches while the input pipeline prefetches in the background to keep the devices busy. When model.fit completes the configured steps_per_epoch and closes the epoch, the iterator for that epoch is torn down. If any worker still has an outstanding prefetch request for the next batch, that pending request is cancelled because the epoch has ended, and TensorFlow logs the cancellation: Local rendezvous is aborting with status: CANCELLED: GetNextFromShard was cancelled. This is the expected cleanup at an epoch boundary and does not indicate a failure.
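To make the lifecycle concrete, here is a minimal, self-contained sketch (with made-up sizes and step counts) of what model.fit effectively does with a repeated, prefetching dataset: it draws a fixed number of batches from an iterator and then discards that iterator at the epoch boundary. The exact CANCELLED line comes from the multi-device iterator in a distributed run, so this toy version illustrates the mechanism rather than reproducing the same log output.

import tensorflow as tf

# Toy illustration of the epoch lifecycle (sizes and step count are made up).
dataset = tf.data.Dataset.range(1_000).batch(32).repeat().prefetch(tf.data.AUTOTUNE)

iterator = iter(dataset)
for _ in range(10):           # roughly what model.fit does for steps_per_epoch steps
    batch = next(iterator)    # prefetching keeps requesting batches in the background

# At the epoch boundary the iterator is discarded; any prefetch request still in
# flight is cancelled, which is what the rendezvous log line reports.
del iterator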

What to change

No change is required. The cancellation message simply reflects the end-of-epoch shutdown of in-flight prefetch requests. Wrapping your dataset with strategy.experimental_distribute_dataset is not a remedy for this specific message because it is not an error condition.
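For context, this is roughly what that wrapping looks like. strategy.experimental_distribute_dataset is mainly useful for custom training loops driven by strategy.run; model.fit already distributes a plain tf.data.Dataset on its own, and the end-of-epoch cancellation log can still appear either way. A minimal sketch, reusing names (dist_plan, build_dataset, win_bundle) from the snippet above:

# Sketch only: explicit distribution of the dataset, reusing names from the
# earlier snippet (dist_plan, build_dataset, win_bundle are assumed to exist).
with dist_plan.scope():
    train_feed, n_train = build_dataset(lambda: win_bundle.train)

# Shards batches across replicas; mainly relevant for custom training loops
# with strategy.run, not required for model.fit, and it does not remove the
# CANCELLED log at epoch boundaries.
dist_train_feed = dist_plan.experimental_distribute_dataset(train_feed)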

Why this matters

Understanding this behavior helps separate noisy but benign runtime logs from genuine issues. In distributed training setups such as MirroredStrategy or MultiWorkerMirroredStrategy running on Vertex AI, tf.data and model.fit intentionally overlap work; the resulting cancellations at epoch boundaries are part of normal, efficient execution.

Takeaways

If your training proceeds and metrics update as expected, treat this log line as informational. Use explicit step counts only if they are part of your training design; there is no need to alter the input pipeline or distribution strategy just to silence this message.
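For reference, if explicit step counts are not part of the design to begin with, Keras can infer the epoch length from a finite dataset. A toy, self-contained sketch with synthetic data (not the article's generator pipeline):

import tensorflow as tf

# Toy sketch with synthetic data: a finite dataset defines the epoch length,
# so neither .repeat() nor steps_per_epoch is needed.
xs = tf.random.normal([256, 8])
ys = tf.random.uniform([256, 1])
finite_pipe = tf.data.Dataset.from_tensor_slices((xs, ys)).batch(32)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(finite_pipe, epochs=2)   # each full pass over finite_pipe is one epoch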

The article is based on a question from StackOverflow by alberto sansegundo and an answer by jggp1094.