Tensorflow-datasets Changelog




* Fix `tfds.load` when generation code isn't present
  * Improve GCS compatibility.
  Thanks to carlthome for reporting and fixing the issue.


**API changes, new features:**
  * Dataset-as-folder: a dataset can now be a self-contained module in a folder with its checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
  * `tfds.load` can now load a dataset without using the generation class. So `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (see 2493).
  * Add a new TFDS CLI (see the documentation for details)
  * `tfds.testing.mock_data` does not require metadata files anymore!
  * Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation
  * Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`)
  * Add new `DatasetBuilder.RELEASE_NOTES` property
  * `tfds.features.Image` now supports 4-channel PNGs
  * `tfds.ImageFolder` now supports custom shape, dtype
  * Downloaded URLs are available through `MyDataset.url_infos`
  * Add `skip_prefetch` option to `tfds.ReadConfig`
  * `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`
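The percent-slicing that `tfds.even_splits` performs can be approximated in plain Python; this is a sketch only, and the exact rounding TFDS applies at slice boundaries is an assumption:

```python
def even_splits(split, n):
    """Sketch of `tfds.even_splits`: cut a split into n contiguous percent slices.

    Boundary rounding is an assumption; TFDS may differ slightly.
    """
    bounds = [round(100 * i / n) for i in range(n + 1)]
    return [f"{split}[{lo}%:{hi}%]" for lo, hi in zip(bounds, bounds[1:])]
```

For `n=3` this yields `['train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]']`, matching the example above.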
  **Breaking compatibility changes:**
  * `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate, replace `next(ds)` with `next(iter(ds))`.
  * Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`
  * Remove `DatasetBuilder.IN_DEVELOPMENT` property
  * Remove `tfds.core.disallow_positional_args` (should use Py3 `*, ` instead)
  * `tfds.features` can now be saved/loaded; you may have to overwrite `FeatureConnector.from_json_content` and `FeatureConnector.to_json_content` to support this feature.
  * Stop testing against TF 1.15. Requires Python 3.6.8+.
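The `tfds.as_numpy` change above boils down to returning an iterable rather than a one-shot iterator. A minimal sketch of the contract (the class and names are hypothetical, not the actual TFDS implementation):

```python
class ReiterableNumpy:
    """Hypothetical sketch: each iter() starts a fresh pass over the data."""

    def __init__(self, make_iter):
        self._make_iter = make_iter  # factory producing a new iterator per pass

    def __iter__(self):
        return self._make_iter()

ds = ReiterableNumpy(lambda: iter([1, 2, 3]))
first = next(iter(ds))          # the migration pattern: next(iter(ds)), not next(ds)
passes = [list(ds), list(ds)]   # iterating twice now works
```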
  **Other bug fixes:**
  * Better archive extension detection for `dl_manager.download_and_extract`
  * Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
  * Fix crash when GCS not available
  * Add a script to detect dead URLs
  * Improved open-source workflow, contributor guide, documentation
  * Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...
  And of course, new datasets and dataset updates.
  A gigantic thanks to our community, which has helped us debug issues and implement many features, especially vijayphoenix for being a major contributor.


* Fix an issue with GCS on Windows.


**Future breaking change:**
  * The `tfds.features.text` encoding API is deprecated. Please use `tensorflow_text` instead.
  **New features**
  * Add `tfds.ImageFolder` and `tfds.TranslateFolder` to easily create custom datasets from your own data.
  * Add `tfds.ReadConfig(input_context=)` to shard datasets, for better multi-worker compatibility (1426).
  * The default `data_dir` can be controlled by the `TFDS_DATA_DIR` environment variable.
  * Better usability when developing datasets outside TFDS
  * Downloads are always cached
  * Checksums are optional
  * Added `tfds.show_statistics(ds_info)` to display the Facets Overview. Note: this requires the dataset to have been generated with statistics.
  * Open-sourced various scripts to help deployment/documentation (generating catalog documentation, exporting all metadata files,...)
  * The catalog displays example images
  * The catalog shows which datasets have been recently added and are only available in `tfds-nightly`
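The `TFDS_DATA_DIR` lookup can be pictured like this. This is a sketch only; the fallback path `~/tensorflow_datasets` is the conventional default and is an assumption here:

```python
import os

def default_data_dir():
    # An explicit TFDS_DATA_DIR wins; otherwise fall back to the
    # conventional default (assumed): ~/tensorflow_datasets.
    return os.environ.get(
        "TFDS_DATA_DIR",
        os.path.join(os.path.expanduser("~"), "tensorflow_datasets"),
    )
```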
  **Breaking compatibility changes:**
  * Fix deterministic example order on Windows when a path was used as key (this only impacts a few datasets). Example order should now be the same on all platforms.
  * Remove `tfds.load('image_label_folder')` in favor of the more user-friendly `tfds.ImageFolder`
  * Various performance improvements for both generation and reading (e.g. use `__slots__`, fix a parallelisation bug,...)
  * Various fixes (typos, type annotations, better error messages, fixing dead links, better Windows compatibility,...)
  Thanks to all our contributors who help improve the state of datasets for the entire research community!


**Breaking compatibility changes:**
  * Rename `tfds.core.NamedSplit`, `tfds.core.SplitBase` -> `tfds.Split`. Now `tfds.Split.TRAIN`,... are instances of `tfds.Split`
  * Remove deprecated `num_shards` argument from `tfds.core.SplitGenerator`. This argument was ignored as shards are automatically computed.
  **Future breaking compatibility changes:**
  * Rename `interleave_parallel_reads` -> `interleave_cycle_length` for `tfds.ReadConfig`.
  * Invert the `ds`, `ds_info` argument order for `tfds.show_examples`
  **Future breaking change:**
  * The `tfds.features.text` encoding API is deprecated. Please use `tensorflow_text` instead.
  **Other changes:**
  * Testing: Add support for custom decoders in `tfds.testing.mock_data`
  * Documentation: shows which datasets are only present in `tfds-nightly`
  * Documentation: display images for supported datasets
  * API: Add `tfds.builder_cls(name)` to access a DatasetBuilder class by name
  * API: Add `info.splits['train'].filenames` for access to the TFRecord files.
  * API: Add `tfds.core.add_data_dir` to register an additional data dir
  * Remove most `ds.with_options` calls which were applied by TFDS. Defaults are now used.
  * Other bug fixes and improvement (Better error messages, windows compatibility,...)
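The `tfds.builder_cls(name)` accessor above resolves a name to a `DatasetBuilder` class. A hypothetical registry sketch; the registration mechanics and names are illustrative, not the TFDS internals:

```python
_REGISTRY = {}

def register(cls):
    # Register the class under its lowercased name (hypothetical scheme).
    _REGISTRY[cls.__name__.lower()] = cls
    return cls

def builder_cls(name):
    # Look up a previously registered builder class by name.
    return _REGISTRY[name]

@register
class MyDataset:  # stand-in for a DatasetBuilder subclass
    VERSION = "1.0.0"
```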
  Thank you all for your contributions, and helping us make TFDS better for everyone!


**Breaking changes:**
  * Legacy mode `tfds.experiment.S3` has been removed
  * New `image_classification` section. Some datasets have been moved there from `images`.
  * `in_memory` argument has been removed from `as_dataset`/`tfds.load` (small datasets are now auto-cached).
  * `DownloadConfig` no longer appends the dataset name (manual data should be in `<manual_dir>/` instead of `<manual_dir>/<dataset_name>/`)
  * Tests now check that all download URLs have registered checksums. To opt out, add `SKIP_CHECKSUMS = True` to your `DatasetBuilderTestCase`.
  * `tfds.load` now always returns `tf.compat.v2.Dataset`. If you're still using `tf.compat.v1`:
  * Use `tf.compat.v1.data.make_one_shot_iterator(ds)` rather than `ds.make_one_shot_iterator()`
  * Use `isinstance(ds, tf.compat.v2.Dataset)` instead of `isinstance(ds,`
  * `tfds.Split.ALL` has been removed from the API.
  **Future breaking changes:**
  * The `tfds.features.text` encoding API is deprecated. Please use `tensorflow_text` instead.
  * `num_shards` argument of `tfds.core.SplitGenerator` is currently ignored and will be removed in the next version.
  * `DownloadManager` is now picklable (can be used inside Beam pipelines)
  * `tfds.features.Audio`:
  * Support float as returned value
  * Expose sample_rate through `info.features['audio'].sample_rate`
  * Support for encoding audio features from file objects
  * Various bug fixes, better error messages, documentation improvements
  * More datasets
  Thank you to all our contributors for helping us make TFDS better for everyone!


**New features:**
  * Datasets expose `info.dataset_size` and `info.download_size`. Datasets generated with 2.1.0 cannot be loaded with previous versions (previous datasets can still be read with `2.1.0`, however).
  * Small datasets are now auto-cached; the `in_memory` argument is deprecated and will be removed in a future version.
  * Datasets expose their cardinality: `num_examples =` (requires tf-nightly or TF >= 2.2.0)
  * Get the number of examples in a sub-split with: `info.splits['train[70%:]'].num_examples`
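The sub-split cardinality above is just percent arithmetic over the full split; a sketch, where the exact boundary rounding is an assumption:

```python
def subsplit_num_examples(total, start_pct, end_pct):
    # Number of examples landing in split[start%:end%], assuming simple
    # rounding at each percent boundary (TFDS's exact scheme may differ).
    start = round(total * start_pct / 100)
    end = round(total * end_pct / 100)
    return end - start
```

For example, with 1,000 training examples, `'train[70%:]'` would hold 300 of them under this scheme.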


* This is the last version of TFDS that will support Python 2. Going forward, we'll only support and test against Python 3.
  * The default versions of all datasets now use the S3 slicing API. See the guide for details.
  * The previous split API is still available, but is deprecated. If you wrote `DatasetBuilder`s outside the TFDS repository, please make sure they do not use `experiments={tfds.core.Experiment.S3: False}`. This will be removed in the next version, as well as the `num_shards` kwarg from `SplitGenerator`.
  * Several new datasets. Thanks to all the contributors!
  * API changes and new features:
  * `shuffle_files` defaults to False so that dataset iteration is deterministic by default. You can customize the reading pipeline, including shuffling and interleaving, through the new `read_config` parameter in `tfds.load`.
  * `urls` kwarg renamed to `homepage` in `DatasetInfo`
  * Support for nested `tfds.features.Sequence` and `tf.RaggedTensor`
  * Custom `FeatureConnector`s can override the `decode_batch_example` method for efficient decoding when wrapped inside a `tfds.features.Sequence(my_connector)`
  * Declaring a dataset in Colab won't register it, which allows re-running the cell without having to change the name
  * Beam datasets can use a `tfds.core.BeamMetadataDict` to store additional metadata computed as part of the Beam pipeline.
  * Beam datasets' `_split_generators` accepts an additional `pipeline` kwargs to define a pipeline shared between all splits.
  * Various other bug fixes and performance improvements. Thank you for all the reports and fixes!


Bug fixes and performance improvements.


  - Add `shuffle_files` argument to the `tfds.load` function. The semantics are the same as in `builder.as_dataset`, which for now means that by default, files will be shuffled for the `TRAIN` split, and not for other splits. The default behaviour will change to always be False in the next release.
  - Most datasets now support the new S3 API
  - Support for uint16 PNG images
  - Fix crash while shuffling on Windows
  - Various documentation improvements
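The current `shuffle_files` default described above (shuffle `TRAIN` only, until the default flips to always-False) can be sketched as follows; the function name is hypothetical:

```python
def resolve_shuffle_files(split, shuffle_files=None):
    # Current default: shuffle files for the TRAIN split only, unless the
    # caller passes shuffle_files explicitly. (Slated to become always-False.)
    if shuffle_files is None:
        return split == "train"
    return shuffle_files
```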
  New datasets
  - AFLW2000-3D
  - Amazon_US_Reviews
  - binarized_mnist
  - BinaryAlphaDigits
  - Caltech Birds 2010
  - Coil100
  - DeepWeeds
  - Food101
  - MIT Scene Parse 150
  - RockYou leaked password
  - Stanford Dogs
  - Stanford Online Products
  - Visual Domain Decathlon


  *   Add `in_memory` option to cache small datasets in RAM.
  *   Better sharding, shuffling and sub-splitting
  *   It is now possible to add arbitrary metadata to `tfds.core.DatasetInfo`
  which will be stored/restored with the dataset. See `tfds.core.Metadata`.
  *   Better proxy support, and the possibility to add certificates
  *   Add `decoders` kwargs to override the default feature decoding
  New datasets added:
  *  downsampled_imagenet
  *  patch_camelyon
  *  coco 2017 (with and without panoptic annotations)
  * uc_merced
  * trivia_qa
  * super_glue
  * so2sat
  * snli
  * resisc45
  * pet_finder
  * mnist_corrupted
  * kitti
  * eurosat
  * definite_pronoun_resolution
  * curated_breast_imaging_ddsm
  * clevr
  * bigearthnet


* Add Apache Beam support
  * Add direct GCS access for MNIST (with `tfds.load('mnist', try_gcs=True)`)
  * More datasets added
  * Option to turn off the tqdm bar (`tfds.disable_progress_bar()`)
  * Subsplits no longer depend on the number of shards
  * Various bug fixes
  Thanks to all external contributors for raising issues, and for their feedback and pull requests.


* Fixes bug 52 that was putting the process in Eager mode by default
  * New dataset: `celeb_a_hq`


*Note that this release had a bug 52 that was putting the process in Eager mode.*
  `tensorflow-datasets` is ready for use! Please see our `README` and the documentation linked there. We've got 25 datasets currently and are adding more. Please join in and add (or request) a dataset yourself.