
Datasets

4.4.0

**API**:
  
  * Add [`PartialDecoding` support](https://www.tensorflow.org/datasets/decode#only_decode_a_sub-set_of_the_features), to decode only a subset of the features (for performance)
  * The catalog now exposes links to [KnowYourData visualisations](https://knowyourdata-tfds.withgoogle.com/)
  * `tfds.as_numpy` supports datasets with `None`
  * Datasets generated with `disable_shuffling=True` are now read in generation order.
  * Loading datasets from files now supports custom `tfds.features.FeatureConnector`
  * `tfds.testing.mock_data` now supports:
    * non-scalar tensors with dtype `tf.string`
    * `builder_from_files` and path-based community datasets
  * File format automatically restored (for datasets generated with `tfds.builder(..., file_format=)`).
  * Many new reinforcement learning datasets
  * Various bug fixes and internal improvements, like:
    * Dynamically set the number of worker threads during extraction
    * Update the progress bar during download even if downloads are cached
  
  **Dataset creation:**
  
  * Add `tfds.features.LabeledImage` for semantic segmentation (like image but with additional `info.features['image_label'].name` label metadata)
  * Add float32 support for `tfds.features.Image` (e.g. for depth map)
  * All FeatureConnectors can now have a `None` dimension anywhere (previously restricted to the first position).
  * `tfds.features.Tensor()` can have an arbitrary number of dynamic dimensions (`Tensor(..., shape=(None, None, 3, None))`)
  * `tfds.features.Tensor` can now be serialised as bytes, instead of float/int values (to allow better compression): `Tensor(..., encoding='zlib')`
  * Add script to add TFDS metadata files to existing TF-record (see [doc](https://www.tensorflow.org/datasets/external_tfrecord)).
  * New guide on [common implementation gotchas](https://www.tensorflow.org/datasets/common_gotchas)
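The `Tensor(..., encoding='zlib')` option above stores a tensor's raw bytes compressed, instead of as individual float/int values. A minimal pure-Python sketch of the idea (the helper names are illustrative, not the TFDS implementation):

```python
import struct
import zlib

def encode_tensor(values, encoding="none"):
    # Pack float32 values as raw bytes, optionally zlib-compressing them.
    raw = struct.pack(f"{len(values)}f", *values)
    return zlib.compress(raw) if encoding == "zlib" else raw

def decode_tensor(data, n, encoding="none"):
    # Reverse of encode_tensor: decompress if needed, then unpack floats.
    raw = zlib.decompress(data) if encoding == "zlib" else data
    return list(struct.unpack(f"{n}f", raw))

values = [1.0, 2.0, 3.0] * 500          # repetitive data -> compresses well
packed = encode_tensor(values, encoding="zlib")
assert decode_tensor(packed, len(values), encoding="zlib") == values
assert len(packed) < len(encode_tensor(values))  # zlib saves space here
```

Compressed storage trades a little decode-time CPU for smaller record files, which is why it can help with large, redundant tensors.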
  
  Thank you all for your support and contribution!

4.3.0

API:
  • Add `dataset.info.splits['train'].num_shards` to expose the number of shards to the user
  • Add `tfds.features.Dataset` to have a field containing sub-datasets (e.g. used in RL datasets)
  • Add dtype and `tf.uint16` support for `tfds.features.Video`
  • Add `DatasetInfo.license` field to add redistribution information
  • Better `tfds.benchmark(ds)` (compatible with any iterator, not just `tf.data`; better Colab representation)
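A benchmark that works with "any iterator, not just `tf.data`" boils down to timing one pass over the iterable. A rough sketch of the idea (a hypothetical stand-in, not the TFDS code):

```python
import time

def benchmark(iterable, batch_size=1):
    # Iterate once over `iterable`, measuring throughput; works with any
    # iterator (lists, generators, tf.data pipelines, ...).
    start = time.perf_counter()
    num_batches = sum(1 for _ in iterable)
    duration = time.perf_counter() - start
    examples = num_batches * batch_size
    return {
        "examples": examples,
        "duration_s": duration,
        "examples_per_s": examples / duration if duration else float("inf"),
    }

assert benchmark(range(100))["examples"] == 100
```

Because it only needs `for _ in iterable`, the same helper measures a plain Python generator and a `tf.data.Dataset` alike.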
  
  Other:
  • Faster `tfds.as_numpy()` (avoids an extra `tf.Tensor` <-> `np.array` copy)
  • Better `tfds.as_dataframe` visualisation (Sequence, ragged tensor, semantic masks with `use_colormap`)
  • (experimental) Community datasets support, to allow dynamically importing datasets defined outside the TFDS repository.
  • (experimental) Add a Hugging Face compatibility wrapper to use Hugging Face datasets directly in TFDS.
  • (experimental) Riegeli format support
  • (experimental) Add `DatasetInfo.disable_shuffling` to force examples to be read in generation order.
  • Add `.copy`, `.format` methods to GPath objects
  • Many bug fixes
  
  Testing:
  • Supports custom `BuilderConfig` in `DatasetBuilderTest`
  • `DatasetBuilderTest` now has a `dummy_data` class property which can be used in `setUpClass`
  • Add `add_tfds_id` and cardinality support to `tfds.testing.mock_data`
  
  And of course, many new datasets and dataset updates.
  
  We would like to thank all the TFDS contributors!

4.2.0

API:
  
  * Add `tfds build` to the CLI. See [documentation](https://www.tensorflow.org/datasets/cli#tfds_build_download_and_prepare_a_dataset).
  * DownloadManager now returns [Pathlib-like](https://docs.python.org/3/library/pathlib.html#basic-use) objects
  * Datasets returned by `tfds.as_numpy` are compatible with `len(ds)`
  * New `tfds.features.Dataset` to represent nested datasets
  * Add `tfds.ReadConfig(add_tfds_id=True)` to add a unique id to the example `ex['tfds_id']` (e.g. `b'train.tfrecord-00012-of-01024__123'`)
  * Add `num_parallel_calls` option to `tfds.ReadConfig` to override the default `AUTOTUNE` option
  * `tfds.ImageFolder` now supports `tfds.decode.SkipDecoding`
  * Add multichannel audio support to `tfds.features.Audio`
  * Better `tfds.as_dataframe` visualization (ffmpeg video if installed, bounding boxes,...)
  * Add `try_gcs` to `tfds.builder(..., try_gcs=True)`
  * Simpler `BuilderConfig` definition: class `VERSION` and `RELEASE_NOTES` are applied to all `BuilderConfig`. Config description is now optional.
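The `tfds_id` added by `tfds.ReadConfig(add_tfds_id=True)` encodes the shard an example came from and its index within that shard, as in the `b'train.tfrecord-00012-of-01024__123'` example above. A sketch of that format (the helper function is hypothetical):

```python
def make_tfds_id(shard_filename: str, example_index: int) -> bytes:
    # "<shard filename>__<index within the shard>", as bytes.
    return f"{shard_filename}__{example_index}".encode()

assert make_tfds_id("train.tfrecord-00012-of-01024", 123) == b"train.tfrecord-00012-of-01024__123"
```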
  
  Breaking compatibility changes:
  
  * Removed configs for all text datasets. Only the plain text version is kept. For example: `multi_nli/plain_text` -> `multi_nli`.
  * To guarantee better determinism, new validations are performed on the keys when creating a dataset (to avoid using filenames as keys (non-deterministic) and to restrict keys to `str`, `bytes`, and `int`). New errors likely indicate an issue in the dataset implementation.
  * `tfds.core.benchmark` now returns a `pd.DataFrame` (instead of a `dict`)
  * `tfds.units` is not visible anymore from the public API
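The key validation described above can be pictured as a small check run on every key yielded by `_generate_examples` (a simplified sketch, not the actual TFDS code):

```python
def validate_key(key):
    # Only str, bytes and int keys shuffle deterministically across
    # platforms; anything else (e.g. a pathlib.Path) is rejected.
    if not isinstance(key, (str, bytes, int)):
        raise TypeError(
            f"Invalid example key type: {type(key).__name__}. "
            "Keys must be str, bytes or int."
        )
    return key

validate_key("img_0001")  # ok
validate_key(42)          # ok
```

Passing, say, a `pathlib.Path` as a key would raise a `TypeError` under this check, which is the kind of new error the note above refers to.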
  
  Bug fixes:
  
  * Support 0-len sequence with images of dynamic shape (Fix 2616)
  * Progression bar correctly updated when copying files.
  * Many bug fixes (GPath consistency with pathlib, S3 compatibility, TQDM visual artifacts, GCS crash on Windows, re-download when checksums are updated,...)
  * Better debugging and error message (e.g. human readable size,...)
  * Allow `max_examples_per_splits=0` in `tfds build --max_examples_per_splits=0` to test `_split_generators` only (without `_generate_examples`).
  
  And of course, many new datasets and dataset updates.
  
  Thank you to the community for the many valuable contributions and for supporting us in this project!

4.1.0

* When generating a dataset, if the download fails for any reason, it is now possible to manually download the data. See [doc](https://www.tensorflow.org/datasets/overview#manual_download_if_download_fails).
  * Simplification of the dataset creation API:
    * We've made it easier to create datasets outside the TFDS repository (see our updated [dataset creation guide](https://www.tensorflow.org/datasets/add_dataset)).
    * `_split_generators` should now return `{'split_name': self._generate_examples(), ...}` (but current datasets are backward compatible).
    * All datasets inherit from `tfds.core.GeneratorBasedBuilder`. Converting a dataset to Beam now only requires changing `_generate_examples` (see [example and doc](https://www.tensorflow.org/datasets/beam_datasets#instructions)).
    * `tfds.core.SplitGenerator` and `tfds.core.BeamBasedBuilder` are deprecated and will be removed in a future version.
  
  * Better `pathlib.Path`, `os.PathLike` compatibility:
    * `dl_manager.manual_dir` now returns a pathlib-like object. Example:

      ```python
      text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()
      ```

    * Note: Other methods (`dl_manager.download`, `.extract`,...) will return pathlib-like objects in future versions
    * `FeatureConnector`,... and most functions should accept `PathLike` objects. Let us know if some functions you need are missing.
    * Add `tfds.core.as_path` to create `pathlib.Path`-like objects compatible with GCS (e.g. `tfds.core.as_path('gs://my-bucket/labels.csv').read_text()`).
  
  * Other bug fixes and improvements, e.g.:
    * Add a `verify_ssl=` option to `tfds.download.DownloadConfig` to disable SSL certificate validation during download.
    * `BuilderConfig` are now compatible with Beam datasets 2348
    * `--record_checksums` now assumes the new dataset-as-folder model
    * `tfds.features.Image` can accept encoded `bytes` images directly (useful when used with `img_name, img_bytes = dl_manager.iter_archive('images.zip')`).
    * The API docs now show deprecated methods; abstract methods to override are now documented.
    * You can generate `imagenet2012` with only a single split (e.g. only the validation data). Other splits will be skipped if not present.
    * And of course, new datasets
  
  Thank you to all our contributors for improving TFDS!

4.0.1

* Fix `tfds.load` when generation code isn't present
  * Improve GCS compatibility.
  
  Thanks to carlthome for reporting and fixing the issue.

4.0.0

**API changes, new features:**
  
  * Dataset-as-folder: a dataset can now be a self-contained module in a folder with its checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
  * `tfds.load` can now load a dataset without using the generation class. So `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (See 2493).
  * Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details)
  * `tfds.testing.mock_data` does not require metadata files anymore!
  * Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation ([example](https://www.tensorflow.org/datasets/overview#tfdsas_dataframe))
  * Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`)
  * Add new `DatasetBuilder.RELEASE_NOTES` property
  * `tfds.features.Image` now supports 4-channel PNGs
  * `tfds.ImageFolder` now supports custom shape, dtype
  * Downloaded URLs are available through `MyDataset.url_infos`
  * Add `skip_prefetch` option to `tfds.ReadConfig`
  * `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`
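The `tfds.even_splits` behaviour can be approximated with simple percent arithmetic. A sketch reproducing the string form shown above (the real API may return richer split objects; rounding here is illustrative):

```python
def even_splits(split: str, n: int):
    # Cut `split` into n nearly-equal percent sub-splits.
    bounds = [round(i * 100 / n) for i in range(n + 1)]
    return [f"{split}[{lo}%:{hi}%]" for lo, hi in zip(bounds, bounds[1:])]

assert even_splits("train", 3) == ["train[0%:33%]", "train[33%:67%]", "train[67%:100%]"]
```

This is handy for splitting a dataset evenly across workers when the dataset only ships a single `train` split.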
  
  **Breaking compatibility changes:**
  
  * `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate `next(ds)` -> `next(iter(ds))`
  * Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`
  * Remove `DatasetBuilder.IN_DEVELOPMENT` property
  * Remove `tfds.core.disallow_positional_args` (should use Py3 `*, ` instead)
  * tfds.features can now be saved/loaded; you may have to override [FeatureConnector.from_json_content](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeatureConnector?version=nightly#from_json_content) and `FeatureConnector.to_json_content` to support this feature.
  * Stop testing against TF 1.15. Requires Python 3.6.8+.
  
  **Other bug fixes:**
  
  * Better archive extension detection for `dl_manager.download_and_extract`
  * Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
  * Fix crash when GCS not available
  * Script to detect dead-urls
  * Improved open-source workflow, contributor guide, documentation
  * Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...
  
  And of course, new datasets and dataset updates.
  
  A gigantic thanks to our community which has helped us debugging issues and with the implementation of many features, especially vijayphoenix for being a major contributor.

3.2.1

* Fix an issue with GCS on Windows.

3.2.0

**Future breaking change:**
  
  * The `tfds.features.text` encoding API is deprecated. Please use [tensorflow_text](https://www.tensorflow.org/tutorials/tensorflow_text/intro) instead.
  
  **New features**
  
  API:
  
  * Add a `tfds.ImageFolder` and `tfds.TranslateFolder` to easily create custom datasets from your own data.
  * Add a `tfds.ReadConfig(input_context=)` to shard dataset, for better multi-worker compatibility (1426).
  * The default `data_dir` can be controlled by the `TFDS_DATA_DIR` environment variable.
  * Better usability when developing datasets outside TFDS:
    * Downloads are always cached
    * Checksums are optional
  * Added a `tfds.show_statistics(ds_info)` to display the [FACETS OVERVIEW](https://pair-code.github.io/facets/). Note: this requires the dataset to have been generated with the statistics.
  * Open source various scripts to help deployment/documentation (Generate catalog documentation, export all metadata files,...)
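The `TFDS_DATA_DIR` lookup can be sketched as an environment-variable override of the conventional default directory (the helper and default path are illustrative, not the exact TFDS resolution logic):

```python
import os

def default_data_dir() -> str:
    # TFDS_DATA_DIR, when set, overrides the default data directory
    # (conventionally ~/tensorflow_datasets).
    return os.environ.get("TFDS_DATA_DIR") or os.path.join(
        os.path.expanduser("~"), "tensorflow_datasets"
    )

os.environ["TFDS_DATA_DIR"] = "/tmp/my_tfds_data"
assert default_data_dir() == "/tmp/my_tfds_data"
```

Setting the variable once in the shell (`export TFDS_DATA_DIR=...`) then applies to every `tfds.load` call that doesn't pass an explicit `data_dir`.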
  
  Documentation:
  
  * Catalog display images ([example](https://www.tensorflow.org/datasets/catalog/sun397#sun397standard-part2-120k))
  * Catalog shows which datasets have been recently added and are only available in `tfds-nightly`
  
  Breaking compatibility change:
  
  * Fix deterministic example order on Windows when a path was used as key (this only impacts a few datasets). Example order should now be the same on all platforms.
  * Remove `tfds.load('image_label_folder')` in favor of the more user-friendly `tfds.ImageFolder`
  
  Other:
  
  * Various performance improvements for both generation and reading (e.g. use `__slots__`, fix a parallelisation bug in `tf.data.TFRecordReader`,...)
  * Various fixes (typo, types annotations, better error messages, fixing dead links, better windows compatibility,...)
  
  Thanks to all our contributors who help improve the state of datasets for the entire research community!

3.1.0

**Breaking compatibility changes:**
  
  * Rename `tfds.core.NamedSplit`, `tfds.core.SplitBase` -> `tfds.Split`. Now `tfds.Split.TRAIN`,... are instances of `tfds.Split`
  * Remove deprecated `num_shards` argument from `tfds.core.SplitGenerator`. This argument was ignored as shards are automatically computed.
  
  **Future breaking compatibility changes:**
  
  * Rename `interleave_parallel_reads` -> `interleave_cycle_length` for `tfds.ReadConfig`.
  * Invert the ds, ds_info argument order for `tfds.show_examples`
  * The `tfds.features.text` encoding API is deprecated. Please use `tensorflow_text` instead.
  
  Other changes:
  
  * Testing: Add support for custom decoders in `tfds.testing.mock_data`
  * Documentation: shows which datasets are only present in `tfds-nightly`
  * Documentation: display images for supported datasets
  * API: Add `tfds.builder_cls(name)` to access a DatasetBuilder class by name
  • API: Add `info.splits['train'].filenames` for access to the tf-record files.
  * API: Add `tfds.core.add_data_dir` to register an additional data dir
  • Remove most `ds.with_options` which were applied by TFDS. Now uses the tf.data defaults.
  * Other bug fixes and improvement (Better error messages, windows compatibility,...)
  
  Thank you all for your contributions, and helping us make TFDS better for everyone!

3.0.0

Breaking changes:
  
  * Legacy mode `tfds.experiment.S3` has been removed
  * New `image_classification` section. Some datasets have been moved there from `images`.
  * `in_memory` argument has been removed from `as_dataset`/`tfds.load` (small datasets are now auto-cached).
  * `DownloadConfig` does not append the dataset name anymore (manual data should be in `<manual_dir>/` instead of `<manual_dir>/<dataset_name>/`)
  * Tests now check that all `dl_manager.download` URLs have registered checksums. To opt-out, add `SKIP_CHECKSUMS = True` to your `DatasetBuilderTestCase`.
  * `tfds.load` now always returns `tf.compat.v2.Dataset`. If you're still using `tf.compat.v1`:
    * Use `tf.compat.v1.data.make_one_shot_iterator(ds)` rather than `ds.make_one_shot_iterator()`
    * Use `isinstance(ds, tf.compat.v2.Dataset)` instead of `isinstance(ds, tf.data.Dataset)`
  * `tfds.Split.ALL` has been removed from the API.
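The new checksum test amounts to verifying that every URL passed to `dl_manager.download` appears in the registered checksums. A simplified sketch of that check (not the actual test code):

```python
def check_urls_have_checksums(downloaded_urls, registered_checksums):
    # Every downloaded URL must have a registered checksum, unless the
    # test opts out with SKIP_CHECKSUMS = True.
    missing = sorted(set(downloaded_urls) - set(registered_checksums))
    if missing:
        raise AssertionError(f"URLs missing registered checksums: {missing}")

check_urls_have_checksums(
    ["https://example.com/data.zip"],
    {"https://example.com/data.zip": "sha256:..."},
)
```

Registered checksums let TFDS detect corrupted or silently-changed downloads, which is why the test enforces them by default.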
  
  Future breaking change:
  
  * The `tfds.features.text` encoding API is deprecated. Please use [tensorflow_text](https://www.tensorflow.org/tutorials/tensorflow_text/intro) instead.
  * `num_shards` argument of `tfds.core.SplitGenerator` is currently ignored and will be removed in the next version.
  
  Features:
  
  * `DownloadManager` is now picklable (can be used inside Beam pipelines)
  * `tfds.features.Audio`:
    * Supports float as returned value
    * Exposes sample_rate through `info.features['audio'].sample_rate`
    * Supports encoding audio features from file objects
  * Various bug fixes, better error messages, documentation improvements
  * More datasets
  
  Thank you to all our contributors for helping us make TFDS better for everyone!

2.1.0

New features:
  
  * Datasets expose `info.dataset_size` and `info.download_size`. Datasets generated with 2.1.0 cannot be loaded with previous versions (previous datasets can still be read with `2.1.0`, however).
  * [Auto-caching small datasets](https://www.tensorflow.org/datasets/performances#auto-caching). `in_memory` argument is deprecated and will be removed in a future version.
  * Datasets expose their cardinality `num_examples = tf.data.experimental.cardinality(ds)` (Requires tf-nightly or TF >= 2.2.0)
  * Get the number of examples in a sub-split with: `info.splits['train[70%:]'].num_examples`
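Sub-split sizes follow directly from the percent boundaries. A sketch of the arithmetic (the boundary rounding here is illustrative; TFDS defines its own rounding rules):

```python
def subsplit_num_examples(total: int, start_pct: int, end_pct: int) -> int:
    # Number of examples falling in split[start_pct%:end_pct%].
    start = round(total * start_pct / 100)
    end = round(total * end_pct / 100)
    return end - start

# e.g. info.splits['train[70%:]'].num_examples for a 1000-example train split
assert subsplit_num_examples(1000, 70, 100) == 300
```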

2.0.0

* This is the last version of TFDS that will support Python 2. Going forward, we'll only support and test against Python 3.
  * The default versions of all datasets are now using the S3 slicing API. See the [guide](https://www.tensorflow.org/datasets/splits) for details.
  * The previous split API is still available, but is deprecated. If you wrote `DatasetBuilder`s outside the TFDS repository, please make sure they do not use `experiments={tfds.core.Experiment.S3: False}`. This will be removed in the next version, as well as the `num_shards` kwargs from `SplitGenerator`.
  * Several new datasets. Thanks to all the [contributors](https://github.com/tensorflow/datasets/graphs/contributors)!
  * API changes and new features:
  * `shuffle_files` defaults to False so that dataset iteration is deterministic by default. You can customize the reading pipeline, including shuffling and interleaving, through the new `read_config` parameter in [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load).
  * `urls` kwarg renamed to `homepage` in `DatasetInfo`
  * Support for nested `tfds.features.Sequence` and `tf.RaggedTensor`
  * Custom `FeatureConnector`s can override the `decode_batch_example` method for efficient decoding when wrapped inside a `tfds.features.Sequence(my_connector)`
  * Declaring a dataset in Colab won't register it, which allows re-running the cell without having to change the name
  * Beam datasets can use a `tfds.core.BeamMetadataDict` to store additional metadata computed as part of the Beam pipeline.
  * Beam datasets' `_split_generators` accepts an additional `pipeline` kwarg to define a pipeline shared between all splits.
  * Various other bug fixes and performance improvements. Thank you for all the reports and fixes!

1.13.3

Dataset changes
  
  - Fix: Adapt all audio datasets 3081 (patrickvonplaten)
  
  Bug fixes
  
  - Update BibTeX entry 3090 (albertvillanova)
  - Use template column_mapping to transmit_format instead of template features 3088 (mariosasko)
  - Fix Audio feature mp3 resampling 3096 (albertvillanova)

1.13.2

Bug fixes
  
  - Fix error related to huggingface_hub timeout parameter 3082 (albertvillanova)
  - Remove _resampler from Audio fields 3086 (albertvillanova)

1.13.1

Bug fixes
  
  - Fix loading a metric with internal import 3077 (albertvillanova)

1.13.0

Dataset changes
  
  - New: CaSiNo 2867 (kushalchawla)
  - New: Mostly Basic Python Problems 2893 (lvwerra)
  - New: OpenAI's HumanEval 2897 (lvwerra)
  - New: SemEval-2018 Task 1: Affect in Tweets 2745 (maxpel)
  - New: SEDE 2942 (Hazoom)
  - New: Jigsaw unintended Bias 2935 (Iwontbecreative)
  - New: AMI 2853 (cahya-wirawan)
  - New: Math Aptitude Test of Heuristics 2982 3014 (hacobe, albertvillanova)
  - New: SwissJudgmentPrediction 2983 (JoelNiklaus)
  - New: KanHope 2985 (adeepH)
  - New: CommonLanguage 2989 3006 3003 (anton-l, albertvillanova, jimregan)
  - New: SwedMedNER 2940 (bwang482)
  - New: SberQuAD 3039 (Alenush)
  - New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English 3004 (iliaschalkidis)
  - New: Greek Legal Code 2966 (christospi)
  - New: Story Cloze Test 3067 (zaidalyafeai)
  - Update: SUPERB - add IC, SI, ER tasks 2884 3009 (anton-l, albertvillanova)
  - Update: MENYO-20k - repo has moved, updating URL 2939 (cdleong)
  - Update: TriviaQA - add web and wiki config 2949 (shirte)
  - Update: nq_open - Use standard open-domain validation split 3029 (craffel)
  - Update: MeDAL - Add further description and update download URL 3022 (xhlulu)
  - Update: Biosses - fix column names 3054 (bwang482)
  - Fix: scitldr - fix minor URL format 2948 (albertvillanova)
  - Fix: masakhaner - update JSON metadata 2973 (albertvillanova)
  - Fix: TriviaQA - fix unfiltered subset 2995 (lhoestq)
  - Fix: TriviaQA - set writer batch size 2999 (lhoestq)
  - Fix: LJ Speech - fix Windows paths 3016 (albertvillanova)
  - Fix: MedDialog - update metadata JSON 3046 (albertvillanova)
  
  Metric changes
  
  - Update: meteor - update from nltk update 2946 (lhoestq)
  - Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics 3012 3001 (albertvillanova, mariosasko)
  - Fix: f1/precision/recall metrics with None average 3008 2992 (albertvillanova)
  - Fix meteor metric for version >= 3.6.4 3056 (albertvillanova)
  
  Dataset features
  
  - Use with TensorFlow:
    - Adding `to_tf_dataset` method 2731 2931 2951 2974 (Rocketknight1)
  - Better support for ZIP files:
    - Support loading dataset from multiple zipped CSV data files 3021 (albertvillanova)
    - Load private data files + use glob on ZIP archives for json/csv/etc. module inference 3041 (lhoestq)
  - Streaming improvements:
    - Extend support for streaming datasets that use glob.glob 3015 (albertvillanova)
    - Add `remove_columns` to `IterableDataset` 3030 (cccntu)
    - All the above ZIP features also work in streaming mode
  - New utilities:
    - Add `get_dataset_split_names()` to get a dataset config's split names 2906 (severo)
  - Replace script_version with revision 2933 (albertvillanova):
    - The `script_version` parameter in `load_dataset` is now deprecated, in favor of `revision`
  - Experimental - Create Audio feature type 2324 (albertvillanova):
    - It allows automatically decoding audio data (mp3, wav, flac, etc.) when examples are accessed
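`remove_columns` on an `IterableDataset` drops fields on the fly as examples stream by. The core idea can be sketched with a plain generator (illustrative, not the library's implementation):

```python
def remove_columns(examples, columns):
    # Lazily drop the given columns from each streamed example dict.
    columns = set(columns)
    for example in examples:
        yield {k: v for k, v in example.items() if k not in columns}

stream = iter([{"id": 0, "text": "hello", "meta": "x"},
               {"id": 1, "text": "world", "meta": "y"}])
cleaned = list(remove_columns(stream, ["meta"]))
assert cleaned == [{"id": 0, "text": "hello"}, {"id": 1, "text": "world"}]
```

Because the transformation is lazy, nothing is materialized until the stream is consumed, which keeps memory usage flat in streaming mode.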
  
  Dataset cards
  
  - Add arxiv paper in swiss_judgment_prediction dataset card 3026 (JoelNiklaus)
  
  Documentation
  
  - Add tutorial for no-code dataset upload 2925 (stevhliu)
  
  General improvements and bug fixes
  
  - Fix filter leaking 3019 (lhoestq):
    - calling `filter` several times in a row was not returning the right results in 1.12.0 and 1.12.1
  - Update BibTeX entry 2928 (albertvillanova)
  - Fix exception chaining 2911 (albertvillanova)
  - Add regression test for null Sequence 2929 (albertvillanova)
  - Don't use old, incompatible cache for the new `filter` 2947 (lhoestq)
  - Fix fn kwargs in filter 2950 (lhoestq)
  - Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast 2895 (arsarabi)
  - Check that array is not Float as nan != nan 2936 (Iwontbecreative)
  - Fix missing conda deps 2952 (lhoestq)
  - Update legacy Python image for CI tests in Linux 2955 (albertvillanova)
  - Support pandas 1.3 new `read_csv` parameters 2960 (SBrandeis)
  - Fix CI doc build 2961 (albertvillanova)
  - Run tests in parallel 2954 (albertvillanova)
  - Ignore dummy folder and dataset_infos.json 2975 (Ishan-Kumar2)
  - Take namespace into account in caching 2938 (lhoestq)
  - Make Dataset.map accept list of np.array 2990 (albertvillanova)
  - Fix loading compressed CSV without streaming 2994 (albertvillanova)
  - Fix json loader when conversion not implemented 3000 (lhoestq)
  - Remove all query parameters when extracting protocol 2996 (albertvillanova)
  - Correct a typo 3007 (Yann21)
  - Fix Windows test suite 3025 (albertvillanova)
  - Remove unused parameter in xdirname 3017 (albertvillanova)
  - Properly install ruamel-yaml for windows CI 3028 (lhoestq)
  - Fix typo 3023 (qqaatw)
  - Actual "proper" install of ruamel.yaml in the windows CI 3033 (lhoestq)
  - Use cache folder for lockfile 2887 (Dref360)
  - Fix streaming: catch Timeout error 3050 (borisdayma)
  - Refac module factory + avoid etag requests for hub datasets 2986 (lhoestq)
  - Fix task reloading from cache 3059 (lhoestq)
  - Fix test command after refac 3065 (lhoestq)
  - Fix Windows CI with FileNotFoundError when setting up s3_base fixture 3070 (albertvillanova)
  - Update summary on PyPi beyond NLP 3062 (thomwolf)
  - Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset 3002 (mariosasko)
  - feat: increase streaming retry config 3068 (borisdayma)
  - Fix pathlib patches for streaming 3072 (lhoestq)
  
  Breaking changes:
  
  - Due to the big refactoring at 2986, the `prepare_module` function doesn't support the `return_resolved_file_path` and `return_associated_base_path` parameters anymore. As an alternative, you may use the `dataset_module_factory` instead.

1.12.1

Bug fixes
  
  - Fix fsspec AbstractFileSystem access 2915 (pierre-godard)
  - Fix unwanted tqdm bar when accessing examples 2920 (lhoestq)
  - Fix conversion of multidim arrays in list to arrow 2922 (lhoestq):
    - this fixes the `ArrowInvalid: Can only convert 1-dimensional array values` errors

1.12.0

New documentation
  - New documentation structure 2718 (stevhliu):
    - New: Tutorials
    - New: How-to guides
    - New: Conceptual guides
    - Update: Reference
  
  See the new documentation [here](https://huggingface.co/docs/datasets/)!
  
  Datasets changes
  - New: VIVOS dataset for Vietnamese ASR 2780 (binh234)
  - New: The Pile books3 2801 (richarddwang)
  - New: The Pile stack exchange 2803 (richarddwang)
  - New: The Pile openwebtext2 2802 (richarddwang)
  - New: Food-101 2804 (nateraw)
  - New: Beans 2809 (nateraw)
  - New: cedr 2796 (naumov-al)
  - New: cats_vs_dogs 2807 (nateraw)
  - New: MultiEURLEX 2865 (iliaschalkidis)
  - New: BIOSSES 2881 (bwang482)
  - Update: TTC4900 - add download URL 2732 (yavuzKomecoglu)
  - Update: Wikihow - Generate metadata JSON for wikihow dataset 2748 (albertvillanova)
  - Update: lm1b - Generate metadata JSON 2752 (albertvillanova)
  - Update: reclor - Generate metadata JSON 2753 (albertvillanova)
  - Update: telugu_books - Generate metadata JSON 2754 (albertvillanova)
  - Update: SUPERB - Add SD task 2661 (albertvillanova)
  - Update: SUPERB - Add KS task 2783 (anton-l)
  - Update: GooAQ - add train/val/test splits 2792 (bhavitvyamalik)
  - Update: Openwebtext - update size 2857 (lhoestq)
  - Update: timit_asr - make the dataset streamable 2835 (lhoestq)
  - Fix: journalists_questions - fix key by recreating metadata JSON 2744 (albertvillanova)
  - Fix: turkish_movie_sentiment - fix metadata JSON 2755 (albertvillanova)
  - Fix: ubuntu_dialogs_corpus - fix metadata JSON 2756 (albertvillanova)
  - Fix: CNN/DailyMail - typo 2791 (omaralsayed)
  - Fix: linnaeus - fix url 2852 (lhoestq)
  - Fix ToTTo - fix data URL 2864 (albertvillanova)
  - Fix: wikicorpus - fix keys 2844 (lhoestq)
  - Fix: COUNTER - fix bad file name 2894 (albertvillanova)
  - Fix: DocRED - fix data URLs and metadata  2883 (albertvillanova)
  
  Datasets features
  - Load Dataset from the Hub (NO DATASET SCRIPT) 2662 (lhoestq)
  - Preserve dtype for numpy/torch/tf/jax arrays 2361 (bhavitvyamalik)
  - add multi-proc in `to_json` 2747 (bhavitvyamalik)
  - Optimize Dataset.filter to only compute the indices to keep 2836 (lhoestq)
  
  Dataset streaming - better support for compression:
  - Fix streaming zip files 2798 (albertvillanova)
  - Support streaming tar files 2800 (albertvillanova)
  - Support streaming compressed files (gzip, bz2, lz4, xz, zst) 2786 (albertvillanova)
  - Fix streaming zip files from canonical datasets 2805 (albertvillanova)
  - Add url prefix convention for many compression formats 2822 (lhoestq)
  - Support streaming datasets that use pathlib 2874 (albertvillanova)
  - Extend support for streaming datasets that use pathlib.Path stem/suffix 2880 (albertvillanova)
  - Extend support for streaming datasets that use pathlib.Path.glob 2876 (albertvillanova)
  
  Metrics changes
  - Update: BERTScore - Add support for fast tokenizer 2770 (mariosasko)
  - Fix: Sacrebleu - Fix sacrebleu tokenizers 2739 2778 2779 (albertvillanova)
  
  Dataset cards
  - Updated dataset description of DaNE 2789 (KennethEnevoldsen)
  - Update ELI5 README.md 2848 (odellus)
  
  General improvements and bug fixes
  - Update release instructions 2740 (albertvillanova)
  - Raise ManualDownloadError when loading a dataset that requires previous manual download 2758 (albertvillanova)
  - Allow PyArrow from source 2769 (patrickvonplaten)
  - fix typo (ShuffingConfig -> ShufflingConfig) 2766 (daleevans)
  - Fix typo in test_dataset_common 2790 (nateraw)
  - Fix type hint for data_files 2793 (albertvillanova)
  - Bump tqdm version 2814 (mariosasko)
  - Use packaging to handle versions 2777 (albertvillanova)
  - Tiny typo fixes of "fo" -> "of" 2815 (aronszanto)
  - Rename The Pile subsets 2817 (lhoestq)
  - Fix IndexError by ignoring empty RecordBatch 2834 (lhoestq)
  - Fix defaults in cache_dir docstring in load.py 2824 (mariosasko)
  - Fix extraction protocol inference from urls with params 2843 (lhoestq)
  - Fix caching when moving script 2854 (lhoestq)
  - Fix windows CI CondaError 2855 (lhoestq)
  - fix: 🐛 remove URL's query string only if it's ?dl=1 2856 (severo)
  - Update `column_names` showed as `:func:` in exploring.st 2851 (ClementRomac)
  - Fix s3fs version in CI 2858 (lhoestq)
  - Fix three typos in two files for documentation 2870 (leny-mi)
  - Move checks from _map_single to map 2660 (mariosasko)
  - fix regex to accept negative timezone 2847 (jadermcs)
  - Prevent .map from using multiprocessing when loading from cache 2774 (thomasw21)
  - Fix null sequence encoding 2900 (lhoestq)
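Several of the fixes above concern inferring the extraction protocol from a URL while ignoring query parameters (e.g. the `?dl=1` case). A sketch of that inference (the extension-to-protocol mapping is illustrative, not the library's exact table):

```python
from urllib.parse import urlparse

COMPRESSION_EXTENSIONS = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip",
                          ".xz": "xz", ".zst": "zstd", ".lz4": "lz4"}

def infer_extraction_protocol(url):
    # Strip query params first, then match on the path's extension, so
    # "https://host/data.json.gz?dl=1" still resolves to "gzip".
    path = urlparse(url).path
    for ext, protocol in COMPRESSION_EXTENSIONS.items():
        if path.endswith(ext):
            return protocol
    return None

assert infer_extraction_protocol("https://host/data.json.gz?dl=1") == "gzip"
assert infer_extraction_protocol("https://host/data.csv") is None
```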

1.11.0

Datasets Changes
  - New: Add Russian SuperGLUE 2668 (slowwavesleep)
  - New: Add Disfl-QA 2473 (bhavitvyamalik)
  - New: Add TimeDial 2476 (bhavitvyamalik)
  - Fix: Enumerate all ner_tags values in WNUT 17 dataset 2713 (albertvillanova)
  - Fix: Update WikiANN data URL 2710 (albertvillanova)
  - Fix: Update PAN-X data URL in XTREME dataset 2715 (albertvillanova)
  - Fix: C4 - en subset by modifying dataset_info with correct validation infos 2723 (thomasw21)
  
  General improvements and bug fixes
  - fix: 🐛 change string format to allow copy/paste to work in bash 2694 (severo)
  - Update BibTeX entry 2706 (albertvillanova)
  - Print absolute local paths in load_dataset error messages 2684 (mariosasko)
  - Add support for disable_progress_bar on Windows 2696 (mariosasko)
  - Ignore empty batch when writing 2698 (pcuenca)
  - Fix shuffle on IterableDataset that disables batching in case any functions were mapped 2717 (amankhandelia)
  - fix: 🐛 fix two typos 2720 (severo)
  - Docs details 2690 (severo)
  - Deal with the bad check in test_load.py 2721 (mariosasko)
  - Pass use_auth_token to request_etags 2725 (albertvillanova)
  - Typo fix `tokenize_exemple` 2726 (shabie)
  - Fix IndexError while loading Arabic Billion Words dataset 2729 (albertvillanova)
  - Add missing parquet known extension 2733 (lhoestq)

1.10.2

The error message to tell which dataset config name to load was not displayed:
  - Fix pick default config name message 2704 (lhoestq)
  
  Docstrings:
  - Fix download_mode docstrings 2701 (albertvillanova)

1.10.1

- Fix minimum tqdm version and import on Colab 2697 (nateraw)
  - Fix OSCAR Esperanto 2693 (lhoestq)

1.10.0

Datasets Features
  - Support remote data files 2616 (albertvillanova)
  This allows passing URLs of remote data files to any dataset loader:

  ```python
  load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]})
  ```
  
  This works for all these dataset loaders:
  - text
  - csv
  - json
  - parquet
  - pandas
  - Streaming from remote text/json/csv/parquet/pandas files:
  When you pass URLs to a dataset loader, you can enable streaming mode with `streaming=True`. Main contributions:
  - Streaming for the Pandas loader 2636 (lhoestq)
  - Streaming for the CSV loader 2635 (lhoestq)
  - Streaming for the Json loader 2608 (albertvillanova) 2638 (lhoestq)
  - Faster search_batch for ElasticsearchIndex due to threading 2581 (mwrzalik)
  - Delete extracted files when loading dataset 2631 (albertvillanova)
  
  Datasets Changes
  - Fix: C4 - fix expected files list 2682 (lhoestq)
  - Fix: SQuAD - fix misalignment 2586 (albertvillanova)
  - Fix: omp - fix DuplicatedKeysError 2603 (albertvillanova)
  - Fix: wi_locness - potential DuplicatedKeysError 2609 (albertvillanova)
  - Fix: LibriSpeech - potential DuplicatedKeysError 2672 (albertvillanova)
  - Fix: SQuAD - potential DuplicatedKeysError 2673 (albertvillanova)
  - Fix: Blog Authorship Corpus - fix split sizes and text encoding 2685 (albertvillanova)
  
  Dataset Tasks
  - Add speech processing tasks 2620 (lewtun)
  - Update ASR tags 2633 (lewtun)
  - Inject ASR template for lj_speech dataset 2634 (albertvillanova)
  - Add ASR task for SUPERB 2619 (lewtun)
  - add image-classification task template 2632 (nateraw)
  
  Metrics Changes
  - New: wiki_split 2623 (bhadreshpsavani)
  - Update: accuracy,f1,precision,recall - Support multilabel metrics 2589 (albertvillanova)
  - Fix: sacrebleu - fix parameter name 2674 (albertvillanova)
  
  General improvements and bug fixes
  - Fix BibTeX entry 2594 (albertvillanova)
  - Fix test_is_small_dataset 2588 (albertvillanova)
  - Remove import of transformers 2602 (albertvillanova)
  - Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) 2605 (lhoestq)
  - Fix `filter` with multiprocessing in case all samples are discarded 2601 (mxschmdt)
  - Remove redundant prepare_module 2597 (albertvillanova)
  - Create ExtractManager 2295 (albertvillanova)
  - Return Python float instead of numpy.float64 in sklearn metrics 2612 (lewtun)
  - Use ndarray.item instead of ndarray.tolist 2613 (lewtun)
  - Convert numpy scalar to python float in Pearsonr output 2614 (lhoestq)
  - Fix missing EOL issue in to_json for old versions of pandas 2617 (lhoestq)
  - Use correct logger in metrics.py 2626 (mariosasko)
  - Minor fix tests with Windows paths 2627 (albertvillanova)
  - Use ETag of remote data files 2628 (albertvillanova)
  - More consistent naming 2611 (mariosasko)
  - Refactor patching to specific submodule 2639 (albertvillanova)
  - Fix docstrings 2640 (albertvillanova)
  - Fix anchor in README 2647 (mariosasko)
  - Fix logging docstring 2652 (mariosasko)
  - Allow dataset config kwargs to be None 2659 (lhoestq)
  - Use prefix to allow exceeding Windows MAX_PATH 2621 (albertvillanova)
  - Use tqdm from tqdm_utils 2667 (mariosasko)
  - Increase json reader block_size automatically 2676 (lhoestq)
  - Parallelize ETag requests 2675 (lhoestq)
  - Fix bad config ids that name cache directories 2686 (lhoestq)
  - Minor documentation fix 2687 (slowwavesleep)
  
  Dataset Cards
  - Add missing WikiANN language tags 2610 (albertvillanova)
  - feat: 🎸 add paperswithcode id for qasper dataset 2680 (severo)
  
  Docs
  - Update processing.rst with other export formats 2599 (TevenLeScao)

1.9.0

Datasets Changes
  - New: C4 2575 2592 (lhoestq)
  - New: mC4 2576 (lhoestq)
  - New: MasakhaNER 2465 (dadelani)
  - New: Eduge 2492 (enod)
  - Update: xor_tydi_qa - update version 2455 (cccntu)
  - Update: kilt-TriviaQA - original answers  2410 (PaulLerner)
  - Update: udpos - change features structure 2466 (jerryIsHere)
  - Update: WebNLG - update checksums 2558 (lhoestq)
  - Fix: climate fever - adjusting indexing for the labels. 2464 (drugilsberg)
  - Fix: proto_qa - fix download link 2463 (mariosasko)
  - Fix: ProductReviews - fix label parsing 2530 (yavuzKomecoglu)
  - Fix: DROP - fix DuplicatedKeysError 2545 (albertvillanova)
  - Fix: code_search_net - fix keys 2555 (lhoestq)
  - Fix: discofuse - fix link cc 2541 (VictorSanh)
  - Fix: fever - fix keys 2557 (lhoestq)
  
  Datasets Features
  - Dataset Streaming 2375 2582 (lhoestq)
  - Download and process your data on the fly while iterating over your dataset
  - Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
  - JAX integration 2502 (lhoestq)
  - Add Parquet loader + from_parquet and to_parquet 2537 (lhoestq)
  - Implement ClassLabel encoding in JSON loader 2468 (albertvillanova)
  - Set configurable downloaded datasets path 2488 (albertvillanova)
  - Set configurable extracted datasets path 2487 (albertvillanova)
  - Add align_labels_with_mapping function 2457 (lewtun) 2510 (lhoestq)
  - Add interleave_datasets for map-style datasets 2568 (lhoestq)
  - Add load_dataset_builder 2500 (mariosasko)
  - Support Zstandard compressed files 2578 (albertvillanova)
  
  Task templates
  - Add task templates for tydiqa and xquad 2518 (lewtun)
  - Insert text classification template for Emotion dataset 2521 (lewtun)
  - Add summarization template 2529 (lewtun)
  - Add task template for automatic speech recognition 2533 (lewtun)
  - Remove task templates if required features are removed during `Dataset.map` 2540 (lewtun)
  - Inject templates for ASR datasets 2565 (lewtun)
  
  General improvements and bug fixes
  - Allow to use tqdm>=4.50.0 2482 (lhoestq)
  - Use gc.collect only when needed to avoid slow downs 2483 (lhoestq)
  - Allow latest pyarrow version 2490 (albertvillanova)
  - Use default cast for sliced list arrays if pyarrow >= 4 2497 (albertvillanova)
  - Add Zenodo metadata file with license 2501 (albertvillanova)
  - add tensorflow-macos support 2493 (slayerjain)
  - Keep original features order 2453 (albertvillanova)
  - Add course banner 2506 (sgugger)
  - Rearrange JSON field names to match passed features schema field names 2507 (albertvillanova)
  - Fix typo in MatthewsCorrelation class name 2517 (albertvillanova)
  - Use scikit-learn package rather than sklearn in setup.py 2525 (lesteve)
  - Improve performance of pandas arrow extractor 2519 (albertvillanova)
  - Fix fingerprint when moving cache dir 2509 (lhoestq)
  - Replace bad `n>1M` size tag 2527 (lhoestq)
  - Fix dev version 2531 (lhoestq)
  - Sync with transformers disabling NOTSET 2534 (albertvillanova)
  - Fix logging levels 2544 (albertvillanova)
  - Add support for Split.ALL 2259 (mariosasko)
  - Raise FileNotFoundError in WindowsFileLock 2524 (mariosasko)
  - Make numpy arrow extractor faster 2505 (lhoestq)
  - fix Dataset.map when num_procs > num rows 2566 (connor-mccarthy)
  - Add ASR task and new languages to resources 2567 (lewtun)
  - Filter expected warning log from transformers 2571 (albertvillanova)
  - Fix BibTeX entry 2579 (albertvillanova)
  - Fix Counter import 2580 (albertvillanova)
  - Add aiohttp to tests extras require 2587 (albertvillanova)
  - Add language tags 2590 (lewtun)
  - Support pandas 1.3.0 read_csv 2593 (lhoestq)
  
  Dataset cards
  - Updated Dataset Description 2420 (binny-mathew)
  - Update DatasetMetadata and ReadMe 2436 (gchhablani)
  - CRD3 dataset card 2515 (wilsonyhlee)
  - Add license to the Cambridge English Write & Improve + LOCNESS dataset card 2546 (lhoestq)
  - wi_locness: reference latest leaderboard on codalab 2584 (aseifert)
  
  Docs
  - no s at load_datasets  2479 (julien-c)
  - Fix docs custom stable version 2477 (albertvillanova)
  - Improve Features docs 2535 (albertvillanova)
  - Update README.md 2414 (cryoff)
  - Fix FileSystems documentation 2551 (connor-mccarthy)
  - Minor fix in loading metrics docs 2562 (albertvillanova)
  - Minor fix docs format for bertscore 2570 (albertvillanova)
  - Add streaming in load a dataset docs 2574 (lhoestq)

1.8.0

Datasets Changes
  - New: Microsoft CodeXGlue Datasets 2357 (madlag ncoop57)
  - New: KLUE benchmark 2416 (jungwhank)
  - New: HendrycksTest 2370 (andyzoujm)
  - Update: xor_tydi_qa - update url to v1.1 2449 (cccntu)
  - Fix: adversarial_qa - DuplicatedKeysError 2433 (mariosasko)
  - Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs for  2445 (lewtun)
  - Fix: flores - fix download link 2448 (mariosasko)
  
  Datasets Features
  - Add `desc` parameter in `map` for `DatasetDict` object 2423 (bhavitvyamalik)
  - Support sliced list arrays in cast 2461 (lhoestq)
  - `Dataset.cast` can now change the feature types of Sequence fields
  - Revert default in-memory for small datasets 2460 (albertvillanova). Breaking:
  - the default IN_MEMORY_MAX_SIZE used to be 250MB
  - it is now zero: by default datasets are **loaded from the disk** with memory mapping and **not copied in memory**
  - users can still set `keep_in_memory=True` when loading a dataset to load it in memory
  
  Datasets Cards
  - adds license information for DailyDialog. 2419 (aditya2211)
  - add english language tags for ~100 datasets 2442 (VictorSanh)
  - Add copyright info to MLSUM dataset 2427 (PhilipMay)
  - Add copyright info for wiki_lingua dataset 2428 (PhilipMay)
  - Mention that there are no answers in adversarial_qa test set 2451 (lhoestq)
  
  General improvements and bug fixes
  - Add DOI badge to README 2411 (albertvillanova)
  - Make datasets PEP-561 compliant 2417 (SBrandeis)
  - Fix save_to_disk nested features order in dataset_info.json 2422 (lhoestq)
  - Fix CI six installation on linux 2432 (lhoestq)
  - Fix Docstring Mistake: dataset vs. metric 2425 (PhilipMay)
  - Fix NQ features loading: reorder fields of features to match nested fields order in arrow data 2438 (lhoestq)
  - doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES 2421 (borisdayma)
  - add utf-8 while reading README 2418 (bhavitvyamalik)
  - Better error message when trying to access elements of a DatasetDict without specifying the split 2439 (lhoestq)
  - Rename config and environment variable for in memory max size 2454 (albertvillanova)
  - Add version-specific BibTeX 2430 (albertvillanova)
  - Fix cross-reference typos in documentation 2456 (albertvillanova)
  - Better error message when using the wrong load_from_disk 2437 (lhoestq)
  
  Experimental and work in progress: Format a dataset for specific tasks
  - Update text classification template labels in DatasetInfo __post_init__ 2392 (lewtun)
  - Insert task templates for text classification 2389 (lewtun)
  - Rename QuestionAnswering template to QuestionAnsweringExtractive 2429 (lewtun)
  - Insert Extractive QA templates for SQuAD-like datasets 2435 (lewtun)

1.7.0

Dataset Changes
  - New: NLU evaluation data 2238 (dkajtoch)
  - New: Add SLR32, SLR52, SLR53 to OpenSLR 2241, 2311 (cahya-wirawan)
  - New: Bbaw egyptian 2290 (phiwi)
  - New: GooAQ 2260 (bhavitvyamalik)
  - New: SubjQA 2302 (lewtun)
  - New: Ascent KB 2341, 2349 (phongnt570)
  - New: HLGD 2325 (tingofurro)
  - New: Qasper 2346 (cceyda)
  - New: ConvQuestions benchmark 2372 (PhilippChr)
  - Update: Wikihow - Clarify how to load wikihow 2240 (albertvillanova)
  - Update multi_woz_v22 - update checksum 2281 (lhoestq)
  - Update: OSCAR - Set encoding in OSCAR dataset 2321 (albertvillanova)
  - Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME 2326 (lewtun)
  - Update: GEM - the DART file checksums in GEM 2334 (yjernite)
  - Update: web_science - fixed download link 2338 (bhavitvyamalik)
  - Update: SNLI, MNLI- README updated for SNLI, MNLI 2364 (bhavitvyamalik)
  - Update: conll2003 - correct labels 2369 (philschmid)
  - Update: offenseval_dravidian - update citations 2385 (adeepH)
  - Update: ai2_arc - Add dataset tags 2405 (OyvindTafjord)
  - Fix: newsph_nli - test data added, dataset_infos updated 2263 (bhavitvyamalik)
  - Fix: hyperpartisan news detection - Remove getchildren 2367 (ghomasHudson)
  - Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset 2397 (albertvillanova)
  - Fix: head_qa - Fix keys 2408 (lhoestq)
  
  Dataset Features
  - Implement Dataset add_item 1870 (albertvillanova)
  - Implement Dataset add_column 2145 (albertvillanova)
  - Implement Dataset to JSON 2248, 2352 (albertvillanova)
  - Add `rename_columns` method 2312 (SBrandeis)
  - add `desc` to `tqdm` in `Dataset.map()` 2374 (bhavitvyamalik)
  - Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES 2399, 2409 (albertvillanova)
  
  Metric Changes
  - New: CUAD metrics 2273 (bhavitvyamalik)
  - New: Matthews/Pearson/Spearman correlation metrics 2328 (lhoestq)
  - Update: CER - Docs, CER above 1 2342 (borisdayma)
  
  General improvements and bug fixes
  - Update black 2265 (lhoestq)
  - Fix incorrect update_metadata_with_features calls in ArrowDataset 2258 (mariosasko)
  - Faster map w/ input_columns & faster slicing w/ Iterable keys 2246 (norabelrose)
  - Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers 2268 (lhoestq)
  - Fix query table with iterable 2269 (lhoestq)
  - Perform minor refactoring: use config 2253 (albertvillanova)
  - Update format, fingerprint and indices after add_item 2254 (lhoestq)
  - Always update metadata in arrow schema 2274 (lhoestq)
  - Make tests run faster 2266 (lhoestq)
  - Fix metadata validation with config names 2286 (lhoestq)
  - Fixed typo seperate->separate 2292 (laksh9950)
  - Allow collaborators to self-assign issues 2289 (albertvillanova)
  - Mapping in the distributed setting 2298 (TevenLeScao)
  - Fix conda release 2309 (lhoestq)
  - Fix incorrect version specification for the pyarrow package 2317 (cemilcengiz)
  - Set default name in init_dynamic_modules 2320 (albertvillanova)
  - Fix duplicate keys 2333 (lhoestq)
  - Add note about indices mapping in save_to_disk docstring 2332 (lhoestq)
  - Metadata validation 2107 (theo-m)
  - Add Validation For README 2121 (gchhablani)
  - Fix overflow issue in interpolation search 2336 (mariosasko)
  - Datasets cli improvements 2315 (mariosasko)
  - Add `key` type and duplicates verification with hashing 2245 (NikhilBartwal)
  - More consistent copy logic 2340 (mariosasko)
  - Update README validation rules 2353 (gchhablani)
  - normalized TOCs and titles in data cards 2355 (yjernite)
  - simplify faiss index save 2351 (Guitaricet)
  - Allow "other-X" in licenses 2368 (gchhablani)
  - Improve ReadInstruction logic and update docs 2261 (mariosasko)
  - Disallow duplicate keys in yaml tags 2379 (lhoestq)
  - maintain YAML structure reading from README 2380 (bhavitvyamalik)
  - add dataset card title 2381 (bhavitvyamalik)
  - Add tests for dataset cards 2348 (gchhablani)
  - Improve example in rounding docs 2383 (mariosasko)
  - Paperswithcode dataset mapping 2404 (julien-c)
  - Free datasets with cache file in temp dir on exit 2403 (mariosasko)
  
  Experimental and work in progress: Format a dataset for specific tasks
  - Task formatting for text classification & question answering 2255 (SBrandeis)
  - Add check for task templates on dataset load 2390 (lewtun)
  - Add args description to DatasetInfo 2384 (lewtun)
  - Improve task api code quality 2376 (mariosasko)

1.6.2

Fix memory issue: don't copy recordbatches in memory during a table deepcopy 2291 (lhoestq)
  This affected methods like `concatenate_datasets`, multiprocessed `map` and `load_from_disk`.
  
  Breaking change:
  - when using `Dataset.map` with the `input_columns` parameter, the resulting dataset will only have the columns from `input_columns` and the columns added by the map functions. The other columns are discarded.

1.6.1

Fix memory issue in multiprocessing: Don't pickle table index 2264 (lhoestq)

1.6.0

Dataset changes
  - New: MOROCO 2002 (MihaelaGaman)
  - New: CBT dataset 2044 (gchhablani)
  - New: MDD Dataset 2051 (gchhablani)
  - New: Multilingual dIalogAct benchMark (miam) 2047 (eusip)
  - New: bAbI QA tasks 2053 (gchhablani)
  - New: machine translated multilingual STS benchmark dataset 2090 (PhilipMay)
  - New: EURLEX legal NLP dataset 2114 (iliaschalkidis)
  - New: ECtHR legal NLP dataset 2114 (iliaschalkidis)
  - New: EU-REG-IR legal NLP dataset 2114 (iliaschalkidis)
  - New: NorNE dataset for Norwegian POS and NER 2154 (versae)
  - New: banking77 2140 (dkajtoch)
  - New: OpenSLR 2173 2215 2221 (cahya-wirawan)
  - New: CUAD dataset 2219 (bhavitvyamalik)
  - Update: Gem V1.1 + new challenge sets 2142 2186 (yjernite)
  - Update: Wikiann - added spans field 2141 (rabeehk)
  - Update: XTREME - Add tel to xtreme tatoeba 2180 (lhoestq)
  - Update: GLUE MRPC - added real label to test set 2216 (philschmid)
  - Fix: MultiWoz22 - fix dialogue action slot name and value 2136 (adamlin120)
  - Fix: wikiauto - fix link 2171 (mounicam)
  - Fix: wino_bias - use right splits 1930 (JieyuZhao)
  - Fix: lc_quad - update download checksum 2213 (mariosasko)
  - Fix: newsgroup - fix one instance of 'train' to 'test' 2225 (alexwdong)
  - Fix: xnli - fix tuple key 2233 (NikhilBartwal)
  
  Dataset features
  - Allow stateful function in dataset.map 1960 (mariosasko)
  - MIAM dataset - new citation details 2101 (eusip)
  - [Refactor] Use in-memory/memory-mapped/concatenation tables in Dataset 2025 (lhoestq)
  - Allow pickling of big in-memory tables 2150 (lhoestq)
  - updated user permissions based on umask 2086 2157 (bhavitvyamalik)
  - Fast table queries with interpolation search 2122 (lhoestq)
  - Concat only unique fields in DatasetInfo.from_merge 2163 (mariosasko)
  - Implementation of class_encode_column 2184 2227 (SBrandeis)
  - Add support for axis in concatenate datasets 2151 (albertvillanova)
  - Set default in-memory value depending on the dataset size 2182 (albertvillanova)
  
  Metrics changes
  - New: CER metric 2138 (chutaklee)
  - Update: WER - Compute metric iteratively 2111 (albertvillanova)
  - Update: seqeval - configurable options to `seqeval` metric 2204 (marrodion)
  
  Dataset cards
  - REFreSD: Updated card using information from data statement and datasheet 2082 (mcmillanmajora)
  - Winobiais: fix split infos 2152 (JieyuZhao)
  - all: Fix size categories in YAML Tags 2074 (gchhablani)
  - LinCE: Updating citation information on LinCE readme 2205 (gaguilar)
  - Swda: Update README.md 2235 (PierreColombo)
  
  General improvements and bug fixes
  - Refactorize Metric.compute signature to force keyword arguments only 2079 (albertvillanova)
  - Fix max_wait_time in requests 2085 (lhoestq)
  - Fix copy snippet in docs 2091 (mariosasko)
  - Fix deprecated warning message and docstring 2100 (albertvillanova)
  - Move Dataset.to_csv to csv module 2102 (albertvillanova)
  - Fix: Allows a feature to be named "_type" 2093 (dcfidalgo)
  - copy.deepcopy os.environ instead of copy 2119 (NihalHarish)
  - Replace legacy torch.Tensor constructor with torch.tensor 2126 (mariosasko)
  - Implement Dataset as context manager 2113 (albertvillanova)
  - Fix missing infos from concurrent dataset loading 2137 (lhoestq)
  - Pin fsspec lower than 0.9.0 2172 (lhoestq)
  - Replace assertTrue(isinstance with assertIsInstance in tests 2164 (mariosasko)
  - add social thumbnail 2177 (philschmid)
  - Fix s3fs tests for py36 and py37+ 2183 (lhoestq)
  - Fix typo in huggingface hub 2192 (LysandreJik)
  - Update metadata if dataset features are modified 2087 (mariosasko)
  - fix missing indices_files in load_from_disk 2197 (lhoestq)
  - Fix backward compatibility in Dataset.load_from_disk 2199 (albertvillanova)
  - Fix ArrowWriter overwriting features in ArrowBasedBuilder 2201 (lhoestq)
  - Fix incorrect assertion in builder.py 2110 (dreamgonfly)
  - Remove Python2 leftovers 2208 (mariosasko)
  - Revert breaking change in cache_files property 2217 (lhoestq)
  - Set test cache config 2223 (albertvillanova)
  - Fix map when removing columns on a formatted dataset 2231 (lhoestq)
  - Refactorize tests to use Dataset as context manager 2191 (albertvillanova)
  - Preserve split type when reloading dataset 2168 (mariosasko)
  
  
  Docs
  - make documentation more clear to use different cloud storage 2127 (philschmid)
  - Render docstring return type as inline 2147 (albertvillanova)
  - Add table classes to the documentation 2155 (lhoestq)
  - Pin docutils for better doc 2174 (sgugger)
  - Fix docstrings issues 2081 (albertvillanova)
  - Add code of conduct to the project 2209 (albertvillanova)
  - Add classes GenerateMode, DownloadConfig and Version to the documentation 2202 (albertvillanova)
  - Fix bash snippet formatting in ADD_NEW_DATASET.md 2234 (mariosasko)

1.5.0

Datasets changes
  - New: Europarl Bilingual 1874 (lucadiliello)
  - New: Stanford Sentiment Treebank 1961 (patpizio)
  - New: RO-STS 1978 (lorinczb)
  - New: newspop 1871 (frankier)
  - New: FashionMNIST 1999 (gchhablani)
  - New: Common voice 1886 (BirgerMoell), 2063 (patrickvonplaten)
  - New: Cryptonite 2013 (theo-m)
  - New: RoSent 2011 (gchhablani)
  - New: PersiNLU reading-comprehension 2028 (danyaljj)
  - New: conllpp 1991 (ZihanWangKi)
  - New: LaRoSeDa 2004 (MihaelaGaman)
  - Update: unnecessary docstart check in conll-like datasets 2020 (mariosasko)
  - Update: semeval 2020 task 11 - add article_id and process test set template 1979 (hemildesai)
  - Update: Md gender - card update 2018 (mcmillanmajora)
  - Update: XQuAD - add Romanian 2023 (M-Salti)
  - Update: DROP -  all answers 1980 (KaijuML)
  - Fix: TIMIT ASR - Make sure not only the first sample is used 1995 (patrickvonplaten)
  - Fix: Wikipedia - save memory by replacing root.clear with elem.clear 2037 (miyamonz)
  - Fix: Doc2dial update data_infos and data_loaders 2041 (songfeng)
  - Fix: ZEST - update download link 2057 (matt-peters)
  - Fix: ted_talks_iwslt - fix version error 2064 (mariosasko)
  
  Datasets Features
  - Implement Dataset from CSV 1946 (albertvillanova)
  - Implement Dataset from JSON and JSON Lines 1943 (albertvillanova)
  - Implement Dataset from text 2030 (albertvillanova)
  - Optimize int precision for tokenization 1985 (albertvillanova)
  - This allows saving 75%+ of space when tokenizing a dataset
  
  General Bug fixes and improvements
  - Fix ArrowWriter closes stream at exit 1971 (albertvillanova)
  - feat(docs): navigate with left/right arrow keys 1974 (ydcjeff)
  - Fix various typos/grammar in the docs 2008 (mariosasko)
  - Update format columns in Dataset.rename_columns 2027 (mariosasko)
  - Replace print with logging in dataset scripts 2019 (mariosasko)
  - Raise an error for outdated sacrebleu versions 2033 (lhoestq)
  - Not all languages have 2 digit codes. 2016 (asiddhant)
  - Fix arrow memory checks issue in tests 2042 (lhoestq)
  - Support pickle protocol for dataset splits defined as ReadInstruction 2043 (mariosasko)
  - Preserve column ordering in Dataset.rename_column 2045 (mariosasko)
  - Fix text-classification tags 2049 (gchhablani)
  - Fix docstring rendering of Dataset/DatasetDict.from_csv args 2066 (albertvillanova)
  - Fixes check of TF_AVAILABLE and TORCH_AVAILABLE 2073 (philschmid)
  - Add and fix docstring for NamedSplit 2069 (albertvillanova)
  - Bump huggingface_hub version 2077 (SBrandeis)
  - Fix docstring issues 2072 (albertvillanova)

1.4.1

Fix an issue 1981 with WMT downloads 1982 (albertvillanova)

1.4.0

Datasets Changes
  - New: iapp_wiki_qa_squad 1873 (cstorm125)
  - New: Financial PhraseBank 1866 (frankier)
  - New: CoVoST2 1935 (patil-suraj)
  - New: TIMIT 1903 (vrindaprabhu)
  - New: Mlama (multilingual lama) 1931 (pdufter)
  - New: FewRel 1823 (gchhablani)
  - New: CCAligned Multilingual Dataset 1815 (gchhablani)
  - New: Turkish News Category Lite 1967 (yavuzKomecoglu)
  - Update: WMT - use mirror links 1912 for better download speed (lhoestq)
  - Update: multi_nli - add missing fields 1950 (bhavitvyamalik)
  - Fix: ALT - fix duplicated examples in alt-parallel 1899 (lhoestq)
  - Fix: WMT datasets - fix download errors 1901 (YangWang92), 1902 (lhoestq)
  - Fix: QA4MRE - fix download URLs 1918 (M-Salti)
  - Fix: Wiki_dpr - fix when with_embeddings is False or index_name is "no_index" 1925 (lhoestq)
  - Fix: Wiki_dpr - add missing scalar quantizer 1926 (lhoestq)
  - Fix: GEM - fix the URL filtering for bad MLSUM examples in GEM 1970 (yjernite)
  
  Datasets Features
  - Add to_dict and to_pandas for Dataset 1889 (SBrandeis)
  - Add to_csv for Dataset 1887 (SBrandeis)
  - Add keep_linebreaks parameter to text loader 1913 (lhoestq)
  - Add not-in-place implementations for several dataset transforms 1883 (SBrandeis):
  - This introduces new methods for Dataset objects: rename_column, remove_columns, flatten and cast.
  - The old in-place methods rename_column_, remove_columns_, flatten_ and cast_ are now deprecated.
  - Make DownloadManager downloaded/extracted paths accessible 1846 (albertvillanova)
  - Add cross-platform support for datasets-cli 1951 (mariosasko)
  
  Metrics Changes
  - New: sari metric 1875 (ddhruvkr)
  
  Offline loading
  - Handle timeouts 1952 (lhoestq)
  - Add datasets full offline mode with HF_DATASETS_OFFLINE 1976 (lhoestq)
  
  General improvements and bugfixes
  - Replace flatten_nested 1879 (albertvillanova)
  - add missing info on how to add large files 1885 (stas00)
  - Docs for adding new column on formatted dataset 1888 (lhoestq)
  - Fix PandasArrayExtensionArray conversion to native type 1897 (lhoestq)
  - Bugfix for string_to_arrow timestamp[ns] support 1900 (justin-yan)
  - Fix to_pandas for boolean ArrayXD 1904 (lhoestq)
  - Fix logging imports and make all datasets use library logger 1914 (albertvillanova)
  - Standardizing datasets dtypes 1921 (justin-yan)
  - Remove unused py_utils objects 1916 (albertvillanova)
  - Fix save_to_disk with relative path 1923 (lhoestq)
  - Updating old cards 1928 (mcmillanmajora)
  - Improve typing and style and fix some inconsistencies 1929 (mariosasko)
  - Fix builder config creation with data_dir 1932 (lhoestq)
  - Disallow ClassLabel with no names 1938 (lhoestq)
  - Update documentation with not in place transforms and update DatasetDict 1947 (lhoestq)
  - Documentation for to_csv, to_pandas and to_dict 1953 (lhoestq)
  - typos + grammar 1955 (stas00)
  - Fix unused arguments 1962 (mariosasko)
  - Fix metrics collision in separate multiprocessed experiments 1966 (lhoestq)

1.3.0

Bug fixes and performance improvements.

1.2.1

New Features
  - Fast start up (1690): Importing `datasets` is now significantly faster.
  
  Datasets Changes
  - New: MNIST (1730)
  - New: Korean intonation-aided intention identification dataset (1715)
  - New: Switchboard Dialog Act Corpus (1678)
  - Update: Wiki-Auto - Added unfiltered versions of the training data for the GEM simplification task. (1722)
  - Update: Scientific papers - Mirror datasets zip (1721)
  - Update: Update DBRD dataset card and download URL (1699)
  - Fix: Thainer - fix ner_tag bugs (1695)
  - Fix: reuters21578 - metadata parsing errors (1693)
  - Fix: ade_corpus_v2 - fix config names (1689)
  - Fix: DaNE - fix last example (1688)
  
  Datasets tagging
  - rename "part-of-speech-tagging" tag in some dataset cards (1645)
  
  Bug Fixes
  - Fix column list comparison in transmit format (1719)
  - Fix windows path scheme in cached path (1711)
  
  Docs
  - Add information about caching and verifications in "Load a Dataset" docs (1705)
  
  Moreover, many dataset cards of datasets added during the sprint were updated! Thanks to all the contributors :)

1.2.0

Features
  - Add `shuffle_files` argument to `tfds.load` function. The semantic is the same as in `builder.as_dataset` function, which for now means that by default, files will be shuffled for `TRAIN` split, and not for other splits. Default behaviour will change to always be False at next release.
  - Most datasets now support the new S3 API ([documentation](https://github.com/tensorflow/datasets/blob/master/docs/splits.md#two-apis-s3-and-legacy))
  - Support for uint16 PNG images
  
  Misc
  - Fix crash while shuffling on Windows
  - Various documentation improvements
  
  New datasets
  - AFLW2000-3D
  - Amazon_US_Reviews
  - binarized_mnist
  - BinaryAlphaDigits
  - Caltech Birds 2010
  - Coil100
  - DeepWeeds
  - Food101
  - MIT Scene Parse 150
  - RockYou leaked password
  - Stanford Dogs
  - Stanford Online Products
  - Visual Domain Decathlon

1.1.3

Datasets changes
  - New: NLI-Tr (787)
  - New: Amazon Reviews (791)(844)(845)(799)
  - New: ASNQ - answer sentence selection (780)
  - New: OpenBookCorpus (856)
  - New: ASLG-PC12 - sign language translation (731)
  - New: Quail - question answering dataset (747)
  - Update: SNLI: Created dataset card snli.md (663)
  - Update: csv - Use pandas reader in csv (857)
  - Better memory management
  - Breaking: the previous `read_options`, `parse_options` and `convert_options` are replaced with plain parameters, as in `pandas.read_csv`
  - Update: conll2000, conll2003, germeval_14, wnut_17, XTREME PAN-X - Create ClassLabel for labelling tasks datasets (850)
  - Breaking: use of ClassLabel features instead of string features + naming of columns updated for consistency
  - Update: XNLI - Add XNLI train set (781)
  - Update: XSUM - Use full released xsum dataset (754)
  - Update: CompGuessWhat - New version of CompGuessWhat?! with refined annotations (748)
  - Update: CLUE - add OCNLI, a new CLUE dataset (742)
  - Fix: KOR-NLI - Fix csv reader (855)
  - Fix: Discofuse - fix discofuse urls (793)
  - Fix: Emotion - fix description (745)
  - Fix: TREC - update urls (740)
  
  Metrics changes
  - New: accuracy, precision, recall and F1 metrics (825)
  - Fix: squad_v2 (840)
  - Fix: seqeval (810)(738)
  - Fix: Rouge - fix description (774)
  - Fix: GLUE - fix description (734)
  - Fix: BertScore - fix custom baseline (763)
  
  Command line tools
  - add clear_cache parameter in the test command (863)
  
  Dependencies
  - Integrate file_lock inside the lib for better logging control (859)
  
  Dataset features
  - Add writer_batch_size attribute to GeneratorBasedBuilder (828)
  - pretty print dataset objects (725)
  - allow custom split names in text dataset (776)
  
  Tests
  - Testing all configs is now a slow test
  
  Bug fixes
  - Make save function use deterministic global vars order (819)
  - fix type hints pickling in python 3.6 (818)
  - fix metric deletion when attributes are missing (782)
  - Fix custom builder caching (770)
  - Fix metric with cache dir (772)
  - Fix train_test_split output format (719)

1.1.2

Dataset changes
  - Fix: text - use python read instead of pandas reader (715):
  - fix delimiter/overflow issues
  - better memory handling
  
  Bug fixes
  - Fix dataset configuration creation using `data_files` per splits using NamedSplit (706)
  - Fix permission issue on windows - don't use tqdm 4.50.0 (718)

1.1.0

Features
  
  *   Add `in_memory` option to cache small dataset in RAM.
  *   Better sharding, shuffling and sub-splitting
  *   It is now possible to add arbitrary metadata to `tfds.core.DatasetInfo`
  which will be stored/restored with the dataset. See `tfds.core.Metadata`.
  *   Better proxy support, with the possibility to add certificates
  *   Add `decoders` kwargs to override the default feature decoding
  ([guide](https://github.com/tensorflow/datasets/tree/master/docs/decode.md)).
  
  New datasets
  
  More datasets added:
  
  *  [downsampled_imagenet](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md#downsampled_imagenet)
  *  [patch_camelyon](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md#patch_camelyon)
  *  [coco](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md#coco) 2017 (with and without panoptic annotations)
  * uc_merced
  * trivia_qa
  * super_glue
  * so2sat
  * snli
  * resisc45
  * pet_finder
  * mnist_corrupted
  * kitti
  * eurosat
  * definite_pronoun_resolution
  * curated_breast_imaging_ddsm
  * clevr
  * bigearthnet

1.0.2

* Add [Apache Beam support](https://www.tensorflow.org/datasets/beam_datasets)
  * Add direct GCS access for MNIST (with `tfds.load('mnist', try_gcs=True)`)
  * More datasets added
  * Option to turn off tqdm bar (`tfds.disable_progress_bar()`)
  * Subsplits no longer depend on the number of shards (https://github.com/tensorflow/datasets/issues/292)
  * Various bug fixes
  
  Thanks to all external contributors for raising issues, their feedback and their pull requests.

1.0.1

* Fixes bug 52 that was putting the process in Eager mode by default
  * New dataset [`celeb_a_hq`](https://github.com/tensorflow/datasets/blob/master/docs/datasets.md#celeb_a_hq)

1.0.0

*Note that this release had a bug 52 that was putting the process in Eager mode.*
  
  `tensorflow-datasets` is ready-for-use! Please see our [`README`](https://github.com/tensorflow/datasets) and documentation linked there. We've got [25 datasets](https://github.com/tensorflow/datasets/blob/master/docs/datasets.md) currently and are adding more. Please join in and [add](https://github.com/tensorflow/datasets/blob/master/docs/add_dataset.md) (or [request](https://github.com/tensorflow/datasets/issues?q=is%3Aissue+is%3Aopen+label%3A%22dataset+request%22)) a dataset yourself.

0.4.0

Datasets Features
  
  - add from_pandas and from_dict
  - add shard method
  - add rename/remove/cast columns methods
  - faster select method
  - add concatenate datasets
  - add support for taking samples using numpy arrays
  - add export to TFRecords
  - add features parameter when loading from text/json/pandas/csv or when using the map transform
  - add support for nested features for json
  - add DatasetDict object with map/filter/sort/shuffle, that is useful when loading several splits of a dataset
  - add support for post-processing Dataset objects in dataset scripts. This is used in Wiki DPR to attach a FAISS index to the dataset, e.g. to query passages for open-domain QA
  - add indexing using FAISS or ElasticSearch:
  - add add_faiss_index and add_elasticsearch_index methods
  - add get_nearest_examples and get_nearest_examples_batch to query the index and return examples
  - add search and search_batch to query the index and return examples ids
  - add save_faiss_index/load_faiss_index to save/load a serialized faiss index
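
  The index methods can be pictured as nearest-neighbour search over a vector column; a FAISS or ElasticSearch index just does this faster. A minimal brute-force sketch in plain Python (illustrative only; the real API operates on a `Dataset` column):

```python
def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def get_nearest_examples(vectors, examples, query, k=2):
    """Return distances and the k examples whose vectors are closest to the query."""
    order = sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))
    top = order[:k]
    return [l2(vectors[i], query) for i in top], [examples[i] for i in top]

vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
passages = ["origin", "east", "far"]
scores, hits = get_nearest_examples(vectors, passages, [0.9, 0.1], k=2)
# hits are the two passages nearest the query vector
```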
  
  Datasets changes
  
  - new: PG19
  - new: ANLI
  - new: WikiSQL
  - new: qa_zre
  - new: MWSC
  - new: AG news
  - new: SQuADShifts
  - new: doc red
  - new: Wiki DPR
  - new: fever
  - new: hyperpartisan news detection
  - new: pandas
  - new: text
  - new: emotion
  - new: quora
  - new: BioMRC
  - new: web questions
  - new: search QA
  - new: LinCE
  - new: TREC
  - new: Style Change Detection
  - new: 20newsgroup
  - new: social bias frames
  - new: Emo
  - new: web of science
  - new: sogou news
  - new: crd3
  - update: xtreme - PAN-X features changed format. Previously each sample was a word/tag pair, and now each sample is a sentence with word/tag pairs.
  - update: xtreme - add PAWS-X.es
  - update: xsum - manual download is no longer required.
  - new processed: Natural Questions
  
  Metrics Features
  
  - add a seed parameter for metrics that do sampling, like rouge
  - better installation messages
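
  The seed matters because sampling-based metrics (e.g. bootstrap resampling, as used for confidence intervals in rouge) are otherwise noisy across runs. A toy stand-in for such a metric:

```python
import random

def bootstrap_metric(values, n_resamples=100, seed=None):
    """Mean of bootstrap-resampled means; a stand-in for a sampling metric."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)

a = bootstrap_metric([1, 2, 3, 4], seed=0)
b = bootstrap_metric([1, 2, 3, 4], seed=0)
# same seed, same score: the metric is reproducible
```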
  
  Metrics changes
  
  - new: bleurt
  - update seqeval: fix entities extraction (more info [here](https://github.com/huggingface/nlp/pull/352))
  
  Bug fixes
  
  - fix bug in map and select that was causing memory issues
  - fix pyarrow version check
  - fix text/json/pandas/csv caching when loading different files in a row
  - fix metrics caching when they have different config names
  - fix cache not being discarded when there's a KeyboardInterrupt during .map
  - fix sacrebleu tokenizer's parameter
  - fix docstrings of metrics when multiple instances are created
  
  More Tests
  
  - add tests for features handling in dataset transforms
  - add tests for dataset builders
  - add tests for metrics loading
  
  Backward compatibility
  
  - because the dataset_info.json file format changed, older versions of the lib (<0.4.0) won't be able to load datasets that have a post-processing field in dataset_info.json

0.3.0

New methods to transform a dataset:
  - `dataset.shuffle`: create a shuffled dataset
  - `dataset.train_test_split`: create a train and a test split (similar to sklearn)
  - `dataset.sort`: create a dataset sorted according to a certain column
  - `dataset.select`: create a dataset with rows selected following the given list of indices
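
  A sketch of the split semantics in plain Python (not the library code): shuffle the row indices with a fixed seed, then cut off the test fraction, much like sklearn's `train_test_split`.

```python
import random

def train_test_split(rows, test_size=0.25, seed=42):
    """Shuffle rows reproducibly and split off the test fraction."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(rows) * test_size)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

train, test = train_test_split(list(range(8)), test_size=0.25)
# the two parts cover all rows with no overlap
```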
  
  Other features:
  - Better instructions for datasets that require manual download
  > Important: if you load datasets that require manual downloads with an older version of `nlp`, instructions won't be shown and an error will be raised
  - Better access to dataset information (for instance `dataset.features['label']` or `dataset.dataset_size`)
  
  Datasets:
  - New: cos_e v1.0
  - New: rotten_tomatoes
  - New: german and italian wikipedia
  
  New docs:
  - documentation about splitting a dataset
  
  Bug fixes:
  - fix metric.compute that couldn't write to file
  - fix squad_v2 imports