2025, Oct 17 13:00
Resolve the TFDS 'BuilderConfig subwords8k not found' error for imdb_reviews: why it happens and how to fix it
Hit the TFDS imdb_reviews/subwords8k error? See why 'BuilderConfig subwords8k not found' occurs, and how to load the base dataset, build your own tokenizer, or pin TFDS to 4.4.0.
When you follow older NLP tutorials that rely on TensorFlow Datasets (TFDS), you may run into a puzzling failure loading IMDB data with subword tokenization. The call succeeds in archived notebooks, yet on a modern setup it throws a configuration error and stops you before you even start modeling.
What goes wrong
ValueError: Failed to construct dataset "imdb_reviews", builder_kwargs "{'config': 'subwords8k', 'data_dir': None}": BuilderConfig subwords8k not found with version None. Here are 10 BuilderConfigs whose name closely match:
This typically appears in the context of course material where the dataset name includes a configuration suffix. The intent was to get IMDB reviews with a prebuilt subword tokenizer, but TFDS can no longer find that configuration.
Minimal example that reproduces the error
import tensorflow_datasets as tfds
import tensorflow as tf

# Requesting the removed 'subwords8k' config raises ValueError on current TFDS
stream, meta = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
The logic is straightforward: import TFDS and request the IMDB dataset with the subwords8k configuration. On current TFDS releases, this fails.
Why it happens
Newer TensorFlow Datasets no longer include the "subwords8k" config for imdb_reviews. In older course material, that config shipped with TFDS and provided a prebuilt tokenizer. As the dataset evolved, the config was removed, so TFDS cannot resolve the name and raises the error. This often coincides with outdated repositories where code was last updated years ago, while your environment is running a newer TFDS.
The fix
The simplest path is to load the base dataset without the removed configuration. You still receive the IMDB reviews as supervised pairs, just without the built-in subword tokenizer.
import tensorflow_datasets as tfds
import tensorflow as tf

# Load the plain IMDB dataset (no config suffix); yields (text, label) pairs
reviews, details = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
If you specifically need the original behavior from the old tutorial, there are two pragmatic routes. You can use the base dataset and create your own tokenizer in place of the removed prebuilt one. Or you can run the original code unchanged by installing an older TFDS release from before the removal, for example tensorflow-datasets==4.4.0.
Why this matters
Data pipeline code is sensitive to dataset configurations that may change or disappear across library versions. A name like imdb_reviews/subwords8k looks stable, but it is tied to TFDS internals. When that configuration is removed, reproducibility breaks, and tutorials relying on it become brittle. Recognizing that the failure stems from a missing BuilderConfig helps you decide quickly whether to modernize the code or to pin your environment to a known-good version.
Takeaways
If you hit the "BuilderConfig subwords8k not found" error, switch to the base imdb_reviews dataset and proceed without the deprecated configuration, or match your TFDS version to the era of the tutorial, such as tensorflow-datasets==4.4.0. For long-lived projects, avoid depending on removed configs and keep an eye on how dataset names and options evolve over time. This small adjustment keeps your preprocessing stable and your experiments moving.
The article is based on a question from StackOverflow by Anas Limem and an answer by ZAST.AI.