2025, Oct 17 13:00
Resolve the TFDS 'BuilderConfig subwords8k not found' error for imdb_reviews: why it happens and how to fix it
Hit the TFDS imdb_reviews/subwords8k error? See why 'BuilderConfig subwords8k not found' occurs, and how to load the base dataset, build your own tokenizer, or pin TFDS to 4.4.0.
When you follow older NLP tutorials that rely on TensorFlow Datasets (TFDS), you may run into a puzzling failure loading IMDB data with subword tokenization. The call succeeds in archived notebooks, yet on a modern setup it throws a configuration error and stops you before you even start modeling.
What goes wrong
ValueError: Failed to construct dataset "imdb_reviews", builder_kwargs "{'config': 'subwords8k', 'data_dir': None}": BuilderConfig subwords8k not found with version None. Here are 10 BuilderConfigs whose name closely match:
This typically appears in the context of course material where the dataset name includes a configuration suffix. The intent was to get IMDB reviews with a prebuilt subword tokenizer, but TFDS can no longer find that configuration.
Minimal example that reproduces the error
import tensorflow_datasets as tfds
import tensorflow as tf

# Requesting the removed 'subwords8k' config raises ValueError on current TFDS
stream, meta = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
The logic is straightforward: import TFDS and request the IMDB dataset with the subwords8k configuration. On current TFDS releases, this fails.
Why it happens
Newer TensorFlow Datasets no longer include the "subwords8k" config for imdb_reviews. In older course material, that config shipped with TFDS and provided a prebuilt tokenizer. As the dataset evolved, the config was removed, so TFDS cannot resolve the name and raises the error. This often coincides with outdated repositories where code was last updated years ago, while your environment is running a newer TFDS.
The fix
The simplest path is to load the base dataset without the removed configuration. You still receive the IMDB reviews as supervised pairs, just without the built-in subword tokenizer.
import tensorflow_datasets as tfds
import tensorflow as tf

# Load the plain IMDB dataset (no config suffix); yields (text, label) pairs
reviews, details = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
If you specifically need the original behavior from the old tutorial, there are two pragmatic routes. You can use the base dataset and create your own tokenizer in place of the removed prebuilt one. Or you can run the original code unchanged by installing an older TFDS release from before the removal, for example tensorflow-datasets==4.4.0.
Why this matters
Data pipeline code is sensitive to dataset configurations that may change or disappear across library versions. A name like imdb_reviews/subwords8k looks stable, but it is tied to TFDS internals. When that configuration is removed, reproducibility breaks, and tutorials relying on it become brittle. Recognizing that the failure stems from a missing BuilderConfig helps you decide quickly whether to modernize the code or to pin your environment to a known-good version.
Takeaways
If you hit the "BuilderConfig subwords8k not found" error, switch to the base imdb_reviews dataset and proceed without the deprecated configuration, or match your TFDS version to the era of the tutorial, such as tensorflow-datasets==4.4.0. For long-lived projects, avoid depending on removed configs and keep an eye on how dataset names and options evolve over time. This small adjustment keeps your preprocessing stable and your experiments moving.
The article is based on a question from StackOverflow by Anas Limem and an answer by ZAST.AI.