Transformers


2.6.0

New Model: BART (added by sshleifer)
BART is one of the first seq2seq models in the library, and achieves state-of-the-art results on text generation tasks such as abstractive summarization.
Three sets of pretrained weights are released:
- `bart-large`: the pretrained base model
- `bart-large-cnn`: the base model finetuned on the CNN/Daily Mail Abstractive Summarization Task
- `bart-large-mnli`: the base model finetuned on the MNLI classification task.


Related:
- [paper](https://arxiv.org/abs/1910.13461)
- model pages are at [https://huggingface.co/facebook](https://huggingface.co/facebook)
- [docs](https://huggingface.co/transformers/model_doc/bart.html)
- [blogpost](https://sshleifer.github.io/blog_v2/jupyter/2020/03/12/bart.html)

Big thanks to the original authors, especially Mike Lewis, Yinhan Liu, and Naman Goyal, who helped answer our questions.

Model sharing CLI: support for organizations

The Hugging Face API for model upload now supports [organizations](https://huggingface.co/organizations).

Notebooks (mfuntowicz)

A few beginner-oriented notebooks were added to the library, aiming to demystify the huggingface/transformers and huggingface/tokenizers libraries. Contributors are welcome to contribute links to their own notebooks as well.

[pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning) examples (srush)

Examples leveraging pytorch-lightning were added, led by srush.
The first example that was added is the [NER example](https://github.com/huggingface/transformers/tree/master/examples/ner).
The second example is a lightning GLUE example, added by nateraw.

New model architectures: CamembertForQuestionAnswering, AlbertForTokenClassification

- `CamembertForQuestionAnswering` was added to the library and to the SQuAD script (maximeilluin)
- `AlbertForTokenClassification` was added to the library and to the NER example (marma)

Multiple fixes were done on the fast tokenizers to make them entirely compatible with the python tokenizers (mfuntowicz)

Most of these fixes were done in the patch 2.5.1. Fast tokenizers should now have the exact same API as the python ones, with some additional functionalities.

Docker images (mfuntowicz)

Docker images for transformers were added.

Generation overhaul (patrickvonplaten)

- Special token ID logic was improved in run_generation and in the corresponding tests.
- Slow tests for generation were added for pre-trained LM models
- Greedy generation is now supported when doing beam search
- Sampling is now supported when doing beam search
- Generation functionality was added to TF2, with beam search, greedy generation and sampling.
- Integration tests were added
- `no_repeat_ngram_size` kwarg to avoid redundant generations (sshleifer)
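A minimal sketch of the overhauled generation API, assuming a GPT-2 checkpoint; the keyword names (e.g. `num_beams`, `no_repeat_ngram_size`) follow the notes above, but exact defaults may differ between releases:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The library now supports", return_tensors="pt")

# Beam search combined with the new n-gram blocker to avoid redundant generations.
output = model.generate(
    input_ids,
    max_length=40,
    num_beams=4,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```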

Encoding methods now output only model-specific inputs

Models such as DistilBERT and RoBERTa do not make use of token type IDs. These inputs are no longer returned by the encoding methods, unless explicitly requested at tokenizer initialization.
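A hedged illustration of the change (model identifiers are the usual shortcut names; the exact set of returned keys may vary by version):

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
distil_tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# BERT makes use of token type IDs, so they are part of the encoding...
print(bert_tok.encode_plus("Hello world").keys())

# ...whereas DistilBERT does not, so they are no longer returned.
print(distil_tok.encode_plus("Hello world").keys())
```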

Pipelines support summarization (sshleifer)
- The default architecture is `bart-large-cnn`, with the generation parameters published in the paper.
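A minimal sketch of the new pipeline; the generation keyword arguments and the shape of the output are assumptions based on the note above:

```python
from transformers import pipeline

# Defaults to the bart-large-cnn checkpoint, per the note above.
summarizer = pipeline("summarization")

article = (
    "The Transformers library added BART, a seq2seq model that reaches "
    "state-of-the-art results on abstractive summarization benchmarks "
    "such as CNN/Daily Mail."
)

# max_length / min_length are forwarded to the underlying generate call.
print(summarizer(article, max_length=40, min_length=10))
# Expected: a list with one dict containing a 'summary_text' entry.
```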

Models may now re-use the cache every time without pinging S3 (BramVanroy)

Previously, every attempt to load a model from a pre-trained checkpoint would check that the S3 ETag matches that of the locally cached file. A new `local_files_only` argument prevents this check, which can be useful when a firewall is involved.
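A hedged example of the new behaviour (the checkpoint name is illustrative):

```python
from transformers import AutoModel, AutoTokenizer

# A first call downloads the files and fills the local cache as usual.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Later calls can skip the S3 ETag check entirely, e.g. behind a firewall.
model = AutoModel.from_pretrained("bert-base-uncased", local_files_only=True)
```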

Usage examples for common tasks (LysandreJik)

In a continuing effort to onboard new users (new to the lib or new to NLP in general), some usage examples were added to the documentation. These usage examples showcase how to do inference on several tasks:

- NER
- Sequence classification
- Question Answering
- Causal Language Modeling
- Masked Language Modeling

Test suite on GPU (julien-c)

CI now runs on GPU, for both PyTorch and TensorFlow.

Padding token ID needs to be set in order to pad (patrickvonplaten)

Previously, tokenizers could pad inputs even when no padding token was defined. This version aligns their behavior with that of the fast tokenizers: a padding token must now be defined, and an error is raised when trying to pad a batch without one.
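A short sketch of the new requirement, assuming a GPT-2 tokenizer (which ships without a padding token); the pad token string and the `pad_to_max_length` flag follow the conventions of this release series:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # no pad token by default

# Padding a batch without a pad token now raises an error, so define one first
# (the exact token string is up to you).
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

batch = tokenizer.batch_encode_plus(
    ["a short sentence", "a slightly longer example sentence"],
    pad_to_max_length=True,
)
print(batch["input_ids"])
```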

Python >= 3.6

We're now dropping Python 3.5 support.

Community additions/bug-fixes/improvements

- Added a warning when using `add_special_tokens` with the fast tokenizer methods of encoding (LysandreJik)
- `encode_plus` was modified and tested to have the exact same behaviour as `encode`, but batches input
- Cleanup DistilBERT code (guillaume-be)
- Only use `F.gelu` for torch >= 1.4.0 (sshleifer)
- Added a `get_vocab` method to tokenizers, which can be used to retrieve all the vocabulary from the tokenizers. (joeddav)
- Correct behaviour of `special_tokens_mask` when `add_special_tokens=False` (LysandreJik)
- Removed the untested `Model2LSTM` and `Model2Model`, which were not working
- kwargs were passed to both model and configuration in AutoModels, which made the model crash (LysandreJik)
- Correct transfo-xl tokenization regarding punctuation (patrickvonplaten)
- Better docstrings for XLNet (patrickvonplaten)
- Better operations for TPU support (srush)
- XLM-R tokenizer is now tested and bug-free (LysandreJik)
- XLM-R model and tokenizer now have integration tests (patrickvonplaten)
- Better documentation for tokenizers and pipelines (LysandreJik)
- All tests (slow and non-slow) now pass (julien-c, LysandreJik, patrickvonplaten, sshleifer, thomwolf)
- Correct attention mask with GPT-2 when using past (patrickvonplaten)
- Fixed the n_gpu count when the no_cuda flag is activated in all examples (VictorSanh)
- Test TF GPT2 for correct behaviour regarding the past and attn mask variable (patrickvonplaten)
- Fixed bug where some missing keys would not be identified (LysandreJik)
- Correct `num_labels` initialization (LysandreJik)
- Model special tokens were added to the pretrained configurations (patrickvonplaten)
- QA models for XLNet, XLM and FlauBERT are now set to their "simple" architectures when using the pipeline.
- GPT-2 XL was added to TensorFlow (patrickvonplaten)
- NER PL example updated (shubhamagarwal92)
- Improved error message when loading config/model with `.from_pretrained()` (patrickvonplaten, julien-c)
- Cleaner special token initialization in modeling_xxx.py (patrickvonplaten)
- Fixed the learning rate scheduler placement in the `run_ner.py` script (erip)
- Use AutoModels in examples (julien-c, lifefeel)

2.5.1

AutoTokenizer

The fast-tokenizer option of AutoTokenizer has been put back to `False` by default, so as not to introduce a breaking change between 2.4.x and 2.5.x.

Fast tokenizers

Bug fixes

Slow tokenizers

Bug fixes related to `batch_encode_plus`

2.5.0

Rust tokenizers (mfuntowicz, n1t0 )

- Tokenizers for Bert, Roberta, OpenAI GPT, OpenAI GPT2, TransformerXL are now leveraging [tokenizers](https://github.com/huggingface/tokenizers) library for fast tokenization :rocket:
- AutoTokenizer now defaults to fast tokenizers implementation when available
- Calling batch_encode_plus on the fast version of the tokenizers makes better use of CPU cores.
- Tokenizers leveraging the native implementation will use all the CPU cores by default when calling batch_encode_plus. You can change this behavior by setting the environment variable `RAYON_NUM_THREADS=N` (see the sketch after this list).
- An exception is raised when tokenizing an input with `pad_to_max_length=True` but no padding token is defined.
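A hedged sketch of driving the Rust-backed tokenizers across CPU cores; the fast tokenizer class name and padding flag follow the conventions of this release series, and the thread count is illustrative:

```python
import os

# Limit the Rust tokenizers' thread pool (all CPU cores are used by default).
os.environ["RAYON_NUM_THREADS"] = "4"

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
texts = ["first example", "second example", "third example"] * 1000

# batch_encode_plus on the fast tokenizers parallelizes over the batch.
encodings = tokenizer.batch_encode_plus(texts, pad_to_max_length=True)
print(len(encodings["input_ids"]))
```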

Known Issues:
- The RoBERTa fast tokenizer implementation has slightly different output compared to the original Python tokenizer (< 1%).
- The SQuAD example is not currently compatible with the new fast tokenizers, so it defaults to the plain Python ones.

DistilBERT base cased (VictorSanh)

The distilled version of the bert-base-cased BERT checkpoint has been released.

Model cards (julien-c)

Model cards are now stored directly in the repository

CLI script for environment information (BramVanroy)

We now host a CLI script that gathers all the environment information when reporting an issue. The issue templates have been updated accordingly.

Contributors visible on repository (clmnt)

The main contributors as identified by Sourcerer are now visible directly on the repository.

From fine-tuning to pre-training (julien-c )

The language fine-tuning script has been renamed from `run_lm_finetuning` to `run_language_modeling` as it is now also able to train language models from scratch.

Extracting archives now available from `cached_path` (thomwolf )

Slight modification to cached_path so that zip and tar archives can be automatically extracted.

- Archives are extracted in the same directory as the (possibly downloaded) archive, inside a newly created extraction directory named after the archive.
- Automatic extraction is activated by setting `extract_compressed_file=True` when calling `cached_path`.
- The extraction directory is re-used to avoid extracting the archive again, unless `force_extract=True` is set, in which case the cached extraction directory is removed and the archive is extracted again.
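A hedged sketch of the extraction flow; the import path is assumed for this release series and the URL is a placeholder:

```python
from transformers.file_utils import cached_path

archive_url = "https://example.com/some-dataset.tar.gz"  # placeholder

# Download (if needed) and extract next to the cached archive.
extracted_dir = cached_path(archive_url, extract_compressed_file=True)

# Re-running reuses the cached extraction directory unless re-extraction is forced.
extracted_dir = cached_path(archive_url, extract_compressed_file=True, force_extract=True)
print(extracted_dir)
```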

New activations file (sshleifer )

Several activation functions (relu, swish, gelu, tanh and gelu_new) can now be accessed from the `activations.py` file and be used in the different PyTorch models.
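A hedged illustration of using one of these activations directly; importing it by name from `transformers.activations` is an assumption based on the description above:

```python
import torch
from transformers.activations import gelu_new  # assumed to be importable by name

x = torch.randn(2, 4)
y = gelu_new(x)  # applied element-wise, as a model's feed-forward block would
print(y.shape)   # torch.Size([2, 4])
```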

Community additions/bug-fixes/improvements

- Remove redundant hidden states that broke encoder-decoder architectures (LysandreJik )
- Cleaner and more readable code in `test_attention_weights` (sshleifer)
- XLM can be trained on SQuAD in different languages (yuvalpinter)
- Improve test coverage on several models that were ill-tested (LysandreJik)
- Fix issue where TFGPT2 could not be saved (neonbjb )
- Multi-GPU evaluation on run_glue now behaves correctly (peteriz )
- Fix issue with TransfoXL tokenizer that couldn't be saved (dchurchwell)
- More Robust conversion from ALBERT/BERT original checkpoints to huggingface/transformers models (monologg )
- FlauBERT bug fix; only add langs embeddings when there is more than one language handled by the model (LysandreJik )
- Fix CircleCI error with TensorFlow 2.1.0 (mfuntowicz )
- More specific testing advice in contributing (sshleifer )
- BERT decoder: Fix failure with the default attention mask (asivokon )
- Fix a few issues regarding the data preprocessing in `run_language_modeling` (LysandreJik )
- Fix an issue with leading spaces and the RobertaTokenizer (joeddav )
- Added pipeline: `TokenClassificationPipeline`, which is an alias over `NerPipeline` (julien-c )

2.4.1

Patched an issue where FlauBERT couldn't be loaded with AutoModel and AutoTokenizer classes.

2.4.0

FlauBERT, MMBT, UmBERTo

- MMBT was added to the list of available models, as the first multi-modal model in the library. It combines a transformer model with a computer vision model in order to classify image and text. The MMBT model is from [Supervised Multimodal Bitransformers for Classifying Images and Text](https://arxiv.org/abs/1909.02950) by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine (https://github.com/facebookresearch/mmbt/). Added by suvrat96.
- A new Dutch BERT model was added under the `wietsedv/bert-base-dutch-cased` identifier. Added by wietsedv. **[Model page](https://huggingface.co/wietsedv/bert-base-dutch-cased)**
- UmBERTo, a RoBERTa-based language model trained on large Italian corpora. **[Model page](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1)**
- A new French model was added, FlauBERT, based on XLM. The FlauBERT model is from [FlauBERT: Unsupervised Language Model Pre-training for French](https://arxiv.org/abs/1912.05372) (https://github.com/getalp/Flaubert). Four checkpoints are added: small size, base uncased, base cased and large. **[Model page](https://huggingface.co/flaubert/flaubert_large_cased)**


New TF architectures (jplu)

- TensorFlow XLM-RoBERTa was added (jplu )
- TensorFlow CamemBERT was added (jplu )

Python best practices (aaugustin)

- Greatly improved the quality of the source code by leveraging `black`, `isort` and `flake8`. A test was added, `check_code_quality`, which checks that the contributions respect the contribution guidelines related to those tools.
- Similarly, optional imports are better handled and raise more precise errors.
- Cleaned up several requirements files, updated the contribution guidelines and rely on `setup.py` for the necessary dev dependencies.
- You can clean up your code for a PR with (more details in [CONTRIBUTING.md](https://github.com/huggingface/transformers/blob/master/CONTRIBUTING.md)):

```bash
make style
make quality
```


Documentation (LysandreJik)

The [documentation](https://huggingface.co/transformers) was uniformized and some better guidelines have been defined. This work is part of an ongoing effort of making `transformers` accessible to a larger audience. A [glossary](https://huggingface.co/transformers/glossary.html) has been added, adding definitions for most frequently used inputs.

Furthermore, some tips are given concerning each model in their documentation pages.

The code samples are now tested on a weekly basis alongside other slow tests.

Improved repository structure (aaugustin)

The source code was moved from `./transformers` to `./src/transformers`. Since it changes the location of the source code, contributors must update their local development environment by uninstalling and re-installing the library.

Python 2 is not supported anymore (aaugustin )

Version 2.3.0 was the last version to support Python 2. As we begin the year 2020, official Python 2 support has been dropped.

Parallel testing (aaugustin)

Tests can now be run in parallel

Sampling sequence generator (rlouf, thomwolf )

An abstract method, `generate`, was added to `PreTrainedModel` and implemented in all models trained with CLM. It offers an API for text generation:

- with/without a prompt
- with/without beam search
- with/without greedy decoding/sampling
- with any (and combination) of top-k/top-p/penalized repetitions
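A minimal sketch of this `generate` API, assuming a GPT-2 checkpoint; the keyword names mirror the feature list above and may differ slightly between releases:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt_ids = tokenizer.encode("In a shocking finding,", return_tensors="pt")

# Sampling from a prompt, combining top-k, top-p and a repetition penalty.
output = model.generate(
    prompt_ids,
    max_length=50,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```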

Resuming training when interrupted (bkkaggle )

Previously, when a training run was stopped, the only saved values were the model weights and configuration. The different scripts now also save several other values: the global step, the current epoch, and the number of steps trained in the current epoch. When resuming a run, all those values are leveraged to correctly resume the training.

This applies to the following scripts: `run_glue`, `run_squad`, `run_ner`, `run_xnli`.

CLI (julien-c , mfuntowicz )

Model upload

- The CLI now has better documentation.
- Files can now be removed.

Pipelines

- Expose the number of underlying FastAPI workers
- Async forward methods
- Fixed the environment variables so that they don't fight each other anymore (USE_TF, USE_TORCH)

Training from scratch (julien-c )

The `run_lm_finetuning.py` script now handles training from scratch.

Changes in the configuration (julien-c )

The configuration files now contain the architecture they refer to, so there is no longer any need to encode the architecture in the file name, as was previously necessary. This should ease the naming of community models.

New Auto models (thomwolf )

A new type of AutoModel was added: `AutoModelForPreTraining`. This model returns the architecture that was used during pre-training. For most models this is the base model alongside a language modeling head, whereas for others it is a dedicated model, e.g. `BertForPreTraining` for BERT.
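A hedged illustration (checkpoint names are the usual shortcuts; the BERT mapping is stated above, the GPT-2 mapping is an assumption):

```python
from transformers import AutoModelForPreTraining

# For BERT this resolves to BertForPreTraining (MLM + next-sentence heads)...
bert = AutoModelForPreTraining.from_pretrained("bert-base-uncased")

# ...while for GPT-2 it resolves to the language-modeling head model.
gpt2 = AutoModelForPreTraining.from_pretrained("gpt2")

print(type(bert).__name__, type(gpt2).__name__)
```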

HANS dataset (ns-moosavi)

The HANS dataset was added to the examples. It allows testing a model with adversarial evaluation of natural language inference.

[BREAKING CHANGES]

Ignored indices in PyTorch loss computing (LysandreJik)

When using PyTorch, certain values can be ignored when computing the loss. In order for the loss function to understand which indices must be ignored, those have to be set to a certain value. Most of our models required those indices to be set to `-1`. We decided to set this value to `-100` instead as it is PyTorch's default value. This removes the discrepancy between user-implemented losses and the losses integrated in the models.

Further help from r0mainK.
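A short sketch of what the switch to `-100` means in plain PyTorch (tensors are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 4, 10)               # (batch, sequence, vocab)
labels = torch.tensor([[2, 5, -100, -100]])  # ignored positions set to -100

# -100 is PyTorch's default ignore_index for cross entropy, so the masked
# positions simply do not contribute to the loss.
loss = F.cross_entropy(logits.view(-1, 10), labels.view(-1))
print(loss)
```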

Community additions/bug-fixes/improvements

- Can now save and load PreTrainedEncoderDecoder objects (TheEdoardo93)
- RoBERTa now bears more similarity to the FairSeq implementation (DomHudson, thomwolf)
- Examples now better reflect the defaults of the encoding methods (enzoampil)
- TFXLNet now has a correct input mask (thomwolf)
- run_squad was fixed to allow better training for XLNet (importpandas )
- tokenization performance improvement (3-8x) (mandubian)
- RoBERTa was added to the run_squad script (erenup)
- Fixed the special and added tokens tokenization (vitaliyradchenko)
- Fixed an issue with language generation for XLM when having a batch size greater than 1 (patrickvonplaten)
- Fixed an issue with the `generate` method which did not correctly handle the repetition penalty (patrickvonplaten)
- Completed the documentation for `repeating_words_penalty_for_language_generation` (patrickvonplaten)
- `run_generation` now leverages cached past input for models that have access to it (patrickvonplaten)
- Finally manage to patch a rarely occurring bug with DistilBERT, eventually named `DistilHeisenBug` or `HeisenDistilBug` (LysandreJik, with the help of julien-c and thomwolf).
- Fixed an import error in `run_tf_ner` (karajan1001).
- Feature conversion for GLUE now has improved logging messages (simonepri)
- Patched an issue with GPUs and `run_generation` (alberduris)
- Added support for ALBERT and XLMRoBERTa to `run_glue`
- Fixed an issue with the DistilBERT tokenizer not loading correct configurations (LysandreJik)
- Updated the SQuAD for distillation script to leverage the new SQuAD API (LysandreJik)
- Fixed an issue with T5 related to its `rp_bucket` (mschrimpf)
- PPLM now supports repetition penalties (IWillPull)
- Modified the QA pipeline to consider all features for each example (Perseus14)
- Patched an issue with a file lock (dimagalat, aaugustin)
- The bias should be resized with the weights when resizing a vocabulary projection layer with a new vocabulary size (LysandreJik)
- Fixed misleading token type IDs for RoBERTa: it doesn't leverage token type IDs, and this has been clarified in the documentation (LysandreJik). Same for XLM-R (maksym-del).
- Fixed the `prepare_for_model` when tensorizing and returning token type IDs (LysandreJik).
- Fixed the XLNet model which wouldn't work with torch 1.4 (julien-c)
- Fetch all possible files remotely (julien-c )
- BERT's BasicTokenizer respects `never_split` parameters (DeNeutoy)
- Added a lower bound to the tqdm dependency (brendan-ai2)
- Fixed glue processors failing on tensorflow datasets (neonbjb)
- XLMRobertaTokenizer can now be serialized (brandenchan)
- A classifier dropout was added to ALBERT (peteriz)
- The ALBERT configurations for v2 models were fixed to be identical to those output by Google (LysandreJik)

2.3.0

New class `Pipeline` (beta): easily run and use models on down-stream NLP tasks

We have added a new class called `Pipeline` to simply run and use models for several down-stream NLP tasks.

A `Pipeline` is just a tokenizer + model wrapped so they can take human-readable inputs and output human-readable results.

The `Pipeline` will take care of:
tokenizing input strings => converting them to tensors => running the model => post-processing the output

Currently, we have added the following pipelines with a default model for each:
- feature extraction (can be used with any pretrained and finetuned models)
inputs: strings/list of strings – output: list of floats (last hidden-states of the model for each token)
- sentiment classification (DistilBert model fine-tuned on SST-2)
inputs: strings/list of strings – output: list of dict with label/score of the top class
- Named Entity Recognition (XLM-R finetuned on CoNLL2003 by the awesome stefan-it)
inputs: strings/list of strings – output: list of dict with label/entities/position of the named-entities
- Question Answering (Bert Large whole-word version fine-tuned on SQuAD 1.0)
inputs: dict of strings/list of dict of strings – output: list of dict with text/position of the answers

There are three ways to use pipelines:
- in python:

```python
from transformers import pipeline

# Test the default model for QA (Bert large finetuned on SQuAD 1.0)
nlp = pipeline('question-answering')
nlp(question="Where does Amy live ?", context="Amy lives in Amsterdam.")
>>> {'answer': 'Amsterdam', 'score': 0.9657156007786263, 'start': 13, 'end': 21}

# Test a specific model for NER (XLM-R finetuned by stefan-it on CoNLL03 English)
nlp = pipeline('ner', model='xlm-roberta-large-finetuned-conll03-english')
nlp("My name is Amy. I live in Paris.")
>>> [{'word': 'Amy', 'score': 0.9999586939811707, 'entity': 'I-PER'},
     {'word': 'Paris', 'score': 0.9999983310699463, 'entity': 'I-LOC'}]
```

- in bash (using the command-line interface):

```bash
$ echo -e "Where does Amy live?\tAmy lives in Amsterdam" | transformers-cli run --task question-answering
{'score': 0.9657156007786263, 'start': 13, 'end': 22, 'answer': 'Amsterdam'}
```

- as a REST API:

```bash
transformers-cli serve --task question-answering
```


This new feature is currently in `beta` and will evolve in the coming weeks.

CLI tool to upload and share community models

Users can now create accounts on the [huggingface.co](https://huggingface.co/welcome) website and then login using the transformers CLI. Doing so allows users to upload their models to our S3 in their respective directories, so that other users may download said models and use them in their tasks.

Users may upload files or directories.

It's been tested by stefan-it for a German BERT and by singletongue for a Japanese BERT.

New model architectures: T5, Japanese BERT, PPLM, XLM-RoBERTa, Finnish BERT

- T5 (Pytorch & TF) (from Google) released with the paper [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683), by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
- Japanese BERT (Pytorch & TF) from CL-tohoku, implemented by singletongue
- PPLM (Pytorch) (from Uber AI) released with the paper [Plug and Play Language Models: a Simple Approach to Controlled Text Generation](https://arxiv.org/abs/1912.02164) by Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu.
- XLM-RoBERTa (Pytorch & TF) (from FAIR, implemented by stefan-it) released with the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
- Finnish BERT (Pytorch & TF) (from TurkuNLP) released with the paper [Multilingual is not enough: BERT for Finnish](https://arxiv.org/abs/1912.07076) by Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, Sampo Pyysalo

Refactoring the SQuAD example

The run_squad script has been massively refactored. The reasons are the following:
- it was made to work with only a few models (BERT, XLNet, XLM and DistilBERT), which had three different ways of encoding sequences. The script had to be individually modified in order to train different models, which would not scale as other models are added to the library.
- the utilities did not rely on the QOL adjustments that were made to the encoding methods these past months.

It now leverages the full capacity of `encode_plus`, easing the addition of new models to the script. A new method, `squad_convert_examples_to_features`, encapsulates all of the tokenization (see the sketch below).
This method can handle `tensorflow_datasets` as well as SQuAD v1 and SQuAD v2 json files.

- ALBERT was added to the SQuAD script
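A hedged sketch of the new helper; the import paths and argument names are assumptions based on the description above, and the data directory is a placeholder:

```python
from transformers import AutoTokenizer
from transformers.data.processors.squad import (
    SquadV2Processor,
    squad_convert_examples_to_features,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Read SQuAD v2 json files from a local directory (placeholder path).
processor = SquadV2Processor()
examples = processor.get_train_examples("path/to/squad_v2")

# All model-specific tokenization details are handled via encode_plus internally.
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
)
print(len(features))
```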

BertAbs summarization

A contribution by rlouf building on the encoder-decoder mechanism to do abstractive summarization.
- Utilities to load the CNN/DailyMail dataset
- BertAbs now usable as a traditional library model (using `from_pretrained()`)
- ROUGE evaluation

New Models

Additional architectures

alexzubiaga added `XLNetForTokenClassification` and `TFXLNetForTokenClassification`

New model cards

Community additions/bug-fixes/improvements

- Added mish activation function digantamisra98
- `run_bertology.py` was updated with correct imports and the ability to overwrite the cache
- Training can be exited and relaunched safely, while keeping the epochs, global steps, scheduler steps and other variables in `run_lm_finetuning.py` bkkaggle
- Tests now run on cuda aaugustin julien-c
- Cleaned up the pytorch to tf conversion script thomwolf
- Progress indicator improvements when downloading pre-trained models leopd
- `from_pretrained()` can now load from urls directly.
- New tests to check that all files are accessible on HuggingFace's S3 rlouf
- Updated tf.shape and tensor.shape to all use shape_list thomwolf
- Valohai integration thomwolf
- Always use SequentialSampler in `run_squad.py` ethanjperez
- Stop using GPU when importing transformers ondewo
- Fixed the XLNet attention output roskoN
- Several QOL adjustments: removed dead code, deep cleaned tests and removed pytest dependency aaugustin
- Fixed an issue with the Camembert tokenization thomwolf
- Correctly create an encoder attention mask from the shape of the hidden states rlouf
- Fixed a non-deterministic behavior when encoding and decoding empty strings pglock
- Fixing tensor creation in encode_plus LysandreJik
- Remove usage of tf.mean which does not exist in TF2 LysandreJik
- A segmentation fault error was fixed (due to scipy 1.4.0) LysandreJik
- Start sunsetting support of Python 2
- An example usage of Model2Model was added to the quickstart.
