PyTorch-Transformers

Latest version: v1.2.0

1.2.0

New model architecture: DistilBERT

Adding Hugging Face's new transformer architecture, **DistilBERT**, described in [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5) by Victor Sanh, Lysandre Debut and Thomas Wolf.

This new model architecture comes with two pretrained checkpoints:

- `distilbert-base-uncased`: the base DistilBERT model
- `distilbert-base-uncased-distilled-squad`: DistilBERT model fine-tuned with distillation on SQuAD
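
For reference, the new checkpoints load through the standard API. A minimal sketch, assuming only the classes and shortcut names announced in this release (the example sentence is illustrative):

```python
import torch
from pytorch_transformers import DistilBertModel, DistilBertTokenizer

# Load the base DistilBERT checkpoint by its shortcut name
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

input_ids = torch.tensor([tokenizer.encode("Hello, DistilBERT!")])
last_hidden_states = model(input_ids)[0]  # models return tuples; the first element is the hidden states
```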

An awaited new pretrained checkpoint: GPT-2 large (774M parameters)

The third OpenAI GPT-2 checkpoint (GPT-2 large) is available in the library under the shortcut name `gpt2-large`: 774M parameters, 36 layers, and 20 heads.
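
A minimal usage sketch for the new checkpoint, assuming nothing beyond the shortcut name above and the standard API (the prompt text is illustrative):

```python
import torch
from pytorch_transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the 774M-parameter GPT-2 large checkpoint by its shortcut name
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-large')
model = GPT2LMHeadModel.from_pretrained('gpt2-large')

input_ids = torch.tensor([tokenizer.encode("PyTorch-Transformers is")])
logits = model(input_ids)[0]  # language modeling logits of shape (batch, sequence, vocab)
```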

New XLM multilingual pretrained checkpoints in 17 and 100 languages

We have added two new [XLM models in 17 and 100 languages](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.
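
The new checkpoints are used like the existing XLM ones. A minimal sketch, assuming the 17-language checkpoint is published under the shortcut name `xlm-mlm-17-1280` (check the pretrained models page for the exact identifiers of the 17- and 100-language models):

```python
import torch
from pytorch_transformers import XLMModel, XLMTokenizer

# Assumed shortcut name for the 17-language XLM checkpoint; see the pretrained
# models documentation for the exact 17- and 100-language identifiers.
tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-17-1280')
model = XLMModel.from_pretrained('xlm-mlm-17-1280')

input_ids = torch.tensor([tokenizer.encode("Bonjour, le monde !")])
last_hidden_states = model(input_ids)[0]
```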

New dependency: `sacremoses`

Support for XLM is improved by carefully reproducing the original tokenization workflow (work by shijie-wu in #1092). We now rely on [`sacremoses`](https://github.com/alvations/sacremoses), a Python port of the Moses tokenizer, truecaser and normalizer by alvations, for XLM word tokenization.

For a few languages (Thai, Japanese and Chinese), the XLM tokenizer requires additional dependencies. These dependencies are optional at the library level; using the XLM tokenizer for these languages without the relevant dependency installed will raise an error message with installation instructions. The additional optional dependencies are:
- `pythainlp`: Thai tokenizer
- `kytea`: Japanese tokenizer, a wrapper around KyTea (requires external C++ compilation), used by the newly released XLM-17 & XLM-100
- `jieba`: Chinese tokenizer *

\* XLM used the Stanford Segmenter. However, the wrapper (`nltk.tokenize.stanford_segmenter`) is slow due to JVM overhead and will be deprecated. jieba is a lot faster and pip-installable, but there is some mismatch with the Stanford Segmenter. A workaround could be an argument that lets users segment the sentence themselves and bypass the segmenter. As a reference, `nltk.tokenize.stanford_segmenter` is also included in this PR.



Bug fixes and improvements to the library modules

- The Bertology script has seen major improvements (tuvuumass)
- Iterative tokenization is now faster and accepts arbitrary numbers of added tokens (samvelyan)
- Added RoBERTa to AutoModels and AutoTokenizers (LysandreJik)
- Added GPT-2 Large 774M model (thomwolf)
- Added language model fine-tuning with GPT/GPT-2 (CLM) and BERT/RoBERTa (MLM) (LysandreJik, thomwolf)
- Multi-GPU training has been patched (FeiWang96)
- Scripts are updated to reflect PyTorch 1.1.0 changes (scheduler, optimizer) (Morizeyao, adai183)
- Updated the in-depth BERT fine-tuning scripts to `pytorch-transformers` (Morizeyao)
- Models with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (LysandreJik, thomwolf)
- Add `proxies` and `force_download` options to the `from_pretrained()` method, so proxies can be used and cached models/tokenizers can be refreshed (thomwolf)
- Add a shortcut for each special token via `_id` properties (e.g. `tokenizer.cls_token_id` for the vocabulary id of `tokenizer.cls_token`); see the sketch after this list (thomwolf)
- Fix the GPT-2 and RoBERTa tokenizers so that sentences to be tokenized always begin with at least one space (see the note by the [fairseq authors](https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py#L38-L56)) (thomwolf)
- Fix and clean up byte-level BPE tests (thomwolf)
- Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against the common tests (LysandreJik)
- Fix a warning raised when the decode method is called for a model with no `sep_token`, like GPT-2 (LysandreJik)
- Updated the tokenizers saving method (boy2000-007man)
- SpaCy tokenizers have been updated in the tokenizers (GuillemGSubies)
- Stable `EnvironmentErrors` have been added to utility files (abhishekraok)
- Fixed distributed barrier hang (VictorSanh)
- Encoding functions now return the input tokens instead of throwing an error when not implemented in a child class (LysandreJik)
- Change layer norm code to PyTorch's native layer norm (dhpollack)
- Improve tokenization of XLM for multilingual inputs (shijie-wu)
- Add language input and access to language-to-id conversion in the XLM tokenizer (thomwolf)
- Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (thomwolf)
- Added new AutoModels: `AutoModelWithLMHead`, `AutoModelForSequenceClassification`, `AutoModelForQuestionAnswering` (LysandreJik)
- Torch.hub is now based on AutoModels (LysandreJik thomwolf)
- Fix Transformer-XL attention mask dtype to be bool (CrafterKolyan)
- Adding the DistilBERT model architecture and checkpoints (VictorSanh, LysandreJik, thomwolf)
- Fixes to the DistilBERT configuration and training script (stefan-it)
- Fix XLNet attention mask for fp16 (ziliwang)
- Documentation auto-deploy (LysandreJik)
- Fix to add a tuple of tokens (epwalsh)
- Update the fp16 Apex implementation in scripts (anhnt170489)
- Fix XLNet bias resizing when adding/removing tokens (LysandreJik)
- Fix tokenizer reloading in example scripts (rabeehk)
- Fix byte-level decoding error when using added tokens (thomwolf LysandreJik)
- Fix epsilon value in RoBERTa pretrained checkpoints (julien-c)
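
As referenced in the list above, a minimal sketch of the new `_id` shortcuts on a standard BERT tokenizer (the choice of checkpoint is illustrative):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Each special token now has a matching *_id property giving its vocabulary id
print(tokenizer.cls_token, tokenizer.cls_token_id)  # "[CLS]" and its id
print(tokenizer.sep_token, tokenizer.sep_token_id)
print(tokenizer.pad_token, tokenizer.pad_token_id)
```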

1.1.0

New model: RoBERTa

**[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.

Thanks to Myle Ott from Facebook for his help.

Tokenizer sequence pair handling

Tokenizers get two new methods:


```python
tokenizer.add_special_tokens_single_sentence(token_ids)
```

and

```python
tokenizer.add_special_tokens_sentences_pair(token_ids_0, token_ids_1)
```


These methods add the model-specific special tokens to sequences. The sentence-pair method builds a list of tokens with the `cls` and `sep` tokens arranged according to the way the model was trained.

Sequence pair examples:

For BERT:


```
[CLS] SEQUENCE_0 [SEP] SEQUENCE_1 [SEP]
```


For RoBERTa:


```
[CLS] SEQUENCE_0 [SEP] [SEP] SEQUENCE_1 [SEP]
```
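
A minimal sketch combining the two methods with a BERT tokenizer (sentences and variable names are illustrative):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

ids_0 = tokenizer.encode("How old are you?")
ids_1 = tokenizer.encode("I am six years old.")

single = tokenizer.add_special_tokens_single_sentence(ids_0)      # [CLS] ... [SEP]
pair = tokenizer.add_special_tokens_sentences_pair(ids_0, ids_1)  # [CLS] ... [SEP] ... [SEP]
print(tokenizer.convert_ids_to_tokens(pair))
```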


Tokenizer encoding function

The tokenizer `encode` function gets two new arguments:


```python
tokenizer.encode(text, text_pair=None, add_special_tokens=False)
```


If `text_pair` is specified, `encode` will return a tuple of encoded sequences. If `add_special_tokens` is set to `True`, the sequences will be built with the model's respective special tokens using the previously described methods.
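
A minimal sketch of the two call styles with a BERT tokenizer (sentences are illustrative):

```python
from pytorch_transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Single sequence, no model-specific special tokens added
ids = tokenizer.encode("How old are you?")

# Sequence pair, built with BERT's special tokens via the methods described above
pair_ids = tokenizer.encode("How old are you?",
                            text_pair="I am six years old.",
                            add_special_tokens=True)
```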

AutoConfig, AutoModel and AutoTokenizer

There are three new classes with this release that instantiate one of the base model classes of the library from a pre-trained model configuration: `AutoConfig`, `AutoModel`, and `AutoTokenizer`.

Those classes take as input a pre-trained model name or path and instantiate one of the corresponding classes. The input string indicates to the class which architecture should be instantiated. If the string contains "bert", `AutoConfig` instantiates a `BertConfig`, `AutoModel` instantiates a `BertModel` and `AutoTokenizer` instantiates a `BertTokenizer`.

The same can be done for all the library's base models. The Auto classes check for the associated strings: "openai-gpt", "gpt2", "transfo-xl", "xlnet", "xlm" and "roberta". The documentation associated with this change can be found [here](https://huggingface.co/pytorch-transformers/model_doc/auto.html).
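
For example, a minimal sketch with a BERT shortcut name (any shortcut string listed in the documentation dispatches to its corresponding architecture the same way):

```python
import torch
from pytorch_transformers import AutoConfig, AutoModel, AutoTokenizer

# "bert-base-uncased" contains "bert", so the Auto classes dispatch to the BERT classes
config = AutoConfig.from_pretrained('bert-base-uncased')        # -> BertConfig
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # -> BertTokenizer
model = AutoModel.from_pretrained('bert-base-uncased')          # -> BertModel

input_ids = torch.tensor([tokenizer.encode("Hello world")])
last_hidden_states = model(input_ids)[0]
```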

Examples

Some examples have been refactored to better reflect the current library. These are `simple_lm_finetuning.py`, `finetune_on_pregenerated.py`, and `run_glue.py`, which has been adapted to the RoBERTa model. The `run_squad.py` and `run_glue.py` examples now have better dataset processing with caching.

Bug fixes and improvements to the library modules

- Fixed multi-GPU training when using FP16 (zijunsun)
- Re-added the possibility to import BertPretrainedModel (thomwolf)
- Improvements to TensorFlow -> PyTorch checkpoint conversion (dhpollack)
- Fixed `save_pretrained` to save the correct added tokens (joelgrus)
- Fixed version issues in `run_openai_gpt` (rabeehk)
- Fixed an issue with line returns with Chinese BERT (Yiqing-Zhou)
- Added more flexibility to `PretrainedModel.from_pretrained` (xanlsh)
- Fixed issues regarding backward compatibility with PyTorch 1.0.0 (thomwolf)
- Added the unknown token to GPT-2 (thomwolf)

1.0.0

Name change: welcome PyTorch-Transformers 👾

`pytorch-pretrained-bert` => `pytorch-transformers`

Install with `pip install pytorch-transformers`

New models

- **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
- **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.

New pretrained weights

We went from ten (in `pytorch-pretrained-bert` 0.6.2) to twenty-seven (in `pytorch-transformers` 1.0) pretrained model weights.

The newly added model weights are, in summary:
- Two `Whole-Word-Masking` weights for BERT (cased and uncased)
- Three fine-tuned models for BERT (on SQuAD and MRPC)
- One German BERT model provided and trained by Deepset.ai (tholor and Timoeller), as detailed in their nice [blogpost](https://deepset.ai/german-bert)
- One OpenAI GPT-2 model (medium-sized model)
- Two models (base and large) for the newly added XLNet model
- Eight models for the newly added XLM model

The [documentation lists all the models with the shortcut names](https://huggingface.co/pytorch-transformers/pretrained_models.html) and we are currently adding full details of the associated pretraining/fine-tuning parameters.

New documentation

New documentation is currently being created at https://huggingface.co/pytorch-transformers/ and should be finalized over the coming days.

Standard API across models

See the [readme](https://github.com/huggingface/pytorch-transformers#quick-tour) for a quick tour of the API.

Main points:

- All models now return `tuples` with various elements depending on the model and the configuration. The docstrings and [documentation](https://huggingface.co/pytorch-transformers/model_doc/bert.html#bertmodel) list all the expected outputs in order.
- All models can now return the full list of hidden-states (embeddings output + the output hidden-states of each layer)
- All models can now return the full list of attention weights (one tensor of attention weights for each layer)

```python
import torch
from pytorch_transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  output_attentions=True)
input_ids = torch.tensor([tokenizer.encode("Let's see all hidden-states and attentions on this text")])
all_hidden_states, all_attentions = model(input_ids)[-2:]
```


Standard API to add tokens to the vocabulary and the model

Using `tokenizer.add_tokens()` and `tokenizer.add_special_tokens()`, one can now easily add tokens to each model's vocabulary. The model's input embeddings can be resized accordingly to add the associated word embeddings (to be trained) using `model.resize_token_embeddings(len(tokenizer))`.

```python
tokenizer.add_tokens(['[SPECIAL_TOKEN_1]', '[SPECIAL_TOKEN_2]'])
model.resize_token_embeddings(len(tokenizer))
```


Serialization

The serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.

```python
model.save_pretrained('./my_saved_model_directory/')
tokenizer.save_pretrained('./my_saved_model_directory/')

# Reload the model and the tokenizer
model = BertForSequenceClassification.from_pretrained('./my_saved_model_directory/')
tokenizer = BertTokenizer.from_pretrained('./my_saved_model_directory/')
```


Torchscript

All models are now compatible with Torchscript.

```python
model = model_class.from_pretrained(pretrained_weights, torchscript=True)
traced_model = torch.jit.trace(model, (input_ids,))
```


Examples scripts

The examples scripts have been refactored and gathered in three main examples (`run_glue.py`, `run_squad.py` and `run_generation.py`) which are common to several models and are designed to offer SOTA performance on the respective tasks while remaining clean starting points from which to design your own scripts.

Other examples scripts (like `run_bertology.py`) will be added in the coming weeks.

Breaking-changes

The [migration section](https://github.com/huggingface/pytorch-transformers#migrating-from-pytorch-pretrained-bert-to-pytorch-transformers) of the readme lists the breaking changes when switching from `pytorch-pretrained-bert` to `pytorch-transformers`.

The main breaking change is that all models now return a `tuple` of results.

0.6.2

General updates:
- Better serialization for all models and tokenizers (BERT, GPT, GPT-2 and Transformer-XL) with [best practices for saving/loading](https://github.com/huggingface/pytorch-pretrained-BERT#serialization-best-practices) in the readme and examples.
- Relaxing network connection requirements (fallback on the last downloaded model in the cache when we can't reach AWS to check eTag)

Breaking changes:
- The `warmup_linear` method in `OpenAIAdam` and `BertAdam` is now replaced by flexible [schedule classes](https://github.com/huggingface/pytorch-pretrained-BERT#learning-rate-schedules) for linear, cosine and multi-cycle schedules (see the sketch below).
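
A minimal sketch of the new scheme, assuming the schedule classes live in `pytorch_pretrained_bert.optimization` and include a `WarmupLinearSchedule` that can be handed to the optimizer (see the linked readme section for the exact class names and arguments):

```python
import torch
from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule

model = torch.nn.Linear(10, 2)  # stand-in for a BERT model, just to provide parameters
num_train_steps = 1000

# Assumed usage: build a schedule object instead of calling warmup_linear directly,
# then pass it to the optimizer. Treat this as a sketch, not a verbatim API reference.
schedule = WarmupLinearSchedule(warmup=0.1, t_total=num_train_steps)
optimizer = BertAdam(model.parameters(), lr=5e-5, schedule=schedule)
```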

Bug fixes and improvements to the library modules:
- Add a flag in BertTokenizer to skip basic tokenization (john-hewitt)
- Allow tokenization of sequences > 512 (CatalinVoss)
- Clean up and extend learning rate schedules in BertAdam and OpenAIAdam (lukovnikov)
- Update GPT/GPT-2 loss computation (CatalinVoss, thomwolf)
- Make the TensorFlow conversion tool more robust (marpaia)
- Fixed BertForMultipleChoice model init and forward pass (dhpollack)
- Fix gradient overflow in GPT-2 FP16 training (SudoSharma)
- Catch exception if pathlib is not installed (potatochip)
- Use Dropout layer in OpenAIGPTMultipleChoiceHead (pglock)

New scripts and improvements to the examples scripts:
- Add BERT language model fine-tuning scripts (Rocketknight1)
- Added SST-2 task and remaining GLUE tasks to `run_classifier.py` (ananyahjha93, jplehmann)
- GPT-2 generation fixes (CatalinVoss, spolu, dhanajitb, 8enmann, SudoSharma, cynthia)

0.6.1

Add `regex` to the requirements for OpenAI GPT-2 tokenizer.

0.6.0

Add OpenAI small GPT-2 pretrained model
