Transformers

PyTorch 1.10

The last version to support PyTorch 1.10 was 4.36.x. Since more than two years have passed and we're looking forward to using features available in PyTorch 1.11 and up, we no longer support PyTorch 1.10 as of v4.37 (i.e. we don't run the tests against torch 1.10).

* Byebye torch 1.10 by ydshieh in 28207

Model tagging

You can now add custom tags to your model before pushing it to the Hub! This enables you to filter models that contain a given tag with a simple URL filter. For example, to filter models that have the `trl` tag, you can search: https://huggingface.co/models?other=trl&sort=created

* [`core`/ FEAT] Add the possibility to push custom tags using `PreTrainedModel` itself by younesbelkada in 28405 - e.g.

```python
from transformers import AutoModelForCausalLM

model_name = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_name)

model.add_model_tags(["tag-test"])
model.push_to_hub("llama-tagged")
```


Bugfixes and improvements

* Fix PatchTSMixer Docstrings by vijaye12 in 27943
* use logger.warning_once to avoid massive outputs by ranchlai in 27428
* Docs for AutoBackbone & Backbone by merveenoyan in 27456
* Fix test for auto_find_batch_size on multi-GPU by muellerzr in 27947
* Update import message by NielsRogge in 27946
* Fix parameter count in readme for mixtral 45b by CyberTimon in 27945
* In PreTrainedTokenizerBase add missing word in error message by petergtz in 27949
* Fix AMD scheduled CI not triggered by ydshieh in 27951
* Add deepspeed test to amd scheduled CI by echarlaix in 27633
* Fix a couple of typos and add an illustrative test by rjenc29 in 26941
* fix bug in mask2former: cost matrix is infeasible by xuchenhao001 in 27897
* Fix for stochastic depth decay rule in the TimeSformer implementation by atawari in 27875
* fix no sequence length models error by AdamLouly in 27522
* [`Mixtral`] Change mistral op order by younesbelkada in 27955
* Update bounding box format everywhere by NielsRogge in 27944
* Support PeftModel signature inspect by dancingpipi in 27865
* fixed typos (issue 27919) by asusevski in 27920
* Hot-fix-mixstral-loss by ArthurZucker in 27948
* Fix link in README.md of Image Captioning by saswatmeher in 27969
* Better key error for AutoConfig by Rocketknight1 in 27976
* [doc] fix typo by stas00 in 27981
* fix typo in dvclive callback by dberenbaum in 27983
* [`Tokenizer Serialization`] Fix the broken serialisation by ArthurZucker in 27099
* [`Whisper`] raise better errors by ArthurZucker in 27971
* Fix PatchTSMixer slow tests by ajati in 27997
* [`CI slow`] Fix expected values by ArthurZucker in 27999
* Fix bug with rotating checkpoints by muellerzr in 28009
* [Doc] Spanish translation of glossary.md by aaronjimv in 27958
* Add model_docs from cpmant.md to derformable_detr.md by rajveer43 in 27884
* well well well by ArthurZucker in 28011
* [`SeamlessM4TTokenizer`] Safe import by ArthurZucker in 28026
* [`core` / `modeling`] Fix training bug with PEFT + GC by younesbelkada in 28031
* Fix AMD push CI not triggered by ydshieh in 28029
* SeamlessM4T: `test_retain_grad_hidden_states_attentions` is flaky by gante in 28035
* Fix languages covered by M4Tv2 by ylacombe in 28019
* Fixed spelling error in T5 tokenizer warning message (s/thouroughly/t… by jeddobson in 28014
* Generate: Mistral/Mixtral FA2 cache fix when going beyond the context window by gante in 28037
* [Seamless] Fix links in docs by sanchit-gandhi in 27905
* Remove warning when Annotion enum is created by amyeroberts in 28048
* [`FA-2`] Fix fa-2 issue when passing `config` to `from_pretrained` by younesbelkada in 28043
* [`Modeling` / `Mixtral`] Fix GC + PEFT issues with Mixtral by younesbelkada in 28061
* [Flax BERT] Update deprecated 'split' method by sanchit-gandhi in 28012
* [Flax LLaMA] Fix attn dropout by sanchit-gandhi in 28059
* Remove SpeechT5 deprecated argument by ylacombe in 28062
* doc: Correct spelling mistake by caiyili in 28064
* [`Mixtral`] update conversion script to reflect new changes by younesbelkada in 28068
* Skip M4T `test_retain_grad_hidden_states_attentions` by ylacombe in 28060
* [LLaVa] Add past_key_values to _skip_keys_device_placement to fix multi-GPU dispatch by aismlv in 28051
* Make GPT2 traceable in meta state by kwen2501 in 28054
* Fix bug for checkpoint saving on multi node training setting by dumpmemory in 28078
* Update fixtures-image-utils by lhoestq in 28080
* Fix `low_cpu_mem_usage` Flag Conflict with DeepSpeed Zero 3 in `from_pretrained` for Models with `keep_in_fp32_modules` by kotarotanahashi in 27762
* Fix wrong examples in llava usage. by Lyken17 in 28020
* [docs] Trainer by stevhliu in 27986
* [docs] MPS by stevhliu in 28016
* fix resuming from ckpt when using FSDP with FULL_STATE_DICT by pacman100 in 27891
* Fix the deprecation warning of _torch_pytree._register_pytree_node by cyyever in 27803
* Spelling correction by saeneas in 28110
* in peft finetune, only the trainable parameters need to be saved by sywangyi in 27825
* fix ConversationalPipeline docstring by not-lain in 28091
* Disable jitter noise during evaluation in SwitchTransformers by DaizeDong in 28077
* Remove warning if `DISABLE_TELEMETRY` is used by Wauplin in 28113
* Fix indentation error - semantic_segmentation.md by rajveer43 in 28117
* [docs] General doc fixes by stevhliu in 28087
* Fix a typo in tokenizer documentation by mssalvatore in 28118
* [Doc] Fix token link in What 🤗 Transformers can do by aaronjimv in 28123
* When save a model on TPU, make a copy to be moved to CPU by qihqi in 27993
* Update split string in doctest to reflect 28087 by amyeroberts in 28135
* [`Mixtral`] Fix loss + nits by ArthurZucker in 28115
* Update modeling_utils.py by mzelling in 28127
* [docs] Fix mistral link in mixtral.md by aaronjimv in 28143
* Remove deprecated CPU dockerfiles by ashahba in 28149
* Fix FA2 integration by pacman100 in 28142
* [gpt-neox] Add attention_bias config to support model trained without attention biases by dalgarak in 28126
* move code to Trainer.evaluate to enable use of that function with multiple datasets by peter-sk in 27844
* Fix weights not properly initialized due to shape mismatch by ydshieh in 28122
* Avoid unnecessary warnings when loading `CLIPConfig` by ydshieh in 28108
* Update FA2 exception msg to point to hub discussions by amyeroberts in 28161
* Align backbone stage selection with out_indices & out_features by amyeroberts in 27606
* [docs] Trainer docs by stevhliu in 28145
* Fix yolos resizing by amyeroberts in 27663
* disable test_retain_grad_hidden_states_attentions on SeamlessM4TModelWithTextInputTest by dwyatte in 28169
* Fix `input_embeds` docstring in encoder-decoder architectures by gante in 28168
* [Whisper] Use torch for stft if available by sanchit-gandhi in 26119
* Fix slow backbone tests - out_indices must match stage name ordering by amyeroberts in 28186
* Update YOLOS slow test values by amyeroberts in 28187
* Update `docs/source/en/perf_infer_gpu_one.md` by ydshieh in 28198
* Fix ONNX export for causal LM sequence classifiers by removing reverse indexing by dwyatte in 28144
* Add Swinv2 backbone by NielsRogge in 27742
* Fix: [SeamlessM4T - S2TT] Bug in batch loading of audio in torch.Tensor format in the SeamlessM4TFeatureExtractor class by nicholasneo78 in 27914
* Bug: `training_args.py` fix missing import with accelerate with version `accelerate==0.20.1` by michaelfeil in 28171
* Fix the check of models supporting FA/SDPA not run by ydshieh in 28202
* Drop `feature_extractor_type` when loading an image processor file by ydshieh in 28195
* [Whisper] Fix word-level timestamps with bs>1 or num_beams>1 by ylacombe in 28114
* Fixing visualization code for object detection to support both types of bounding box. by Anindyadeep in 27842
* update the logger message with accordant weights_file_name by izyForever in 28181
* [`Llava`] Fix llava index errors by younesbelkada in 28032
* fix FA2 when using quantization by pacman100 in 28203
* small typo by stas00 in 28229
* Update docs around mixing hf scheduler with deepspeed optimizer by dwyatte in 28223
* Fix trainer saving safetensors: metadata is None by hiyouga in 28219
* fix bug:divide by zero in _maybe_log_save_evaluate() by frankenliu in 28251
* [Whisper] Fix errors with MPS backend introduced by new code on word-level timestamps computation by ercaronte in 28288
* Remove fast tokenization warning in Data Collators by dbuos in 28213
* fix documentation for zero_shot_object_detection by not-lain in 28267
* Remove token_type_ids from model_input_names (like 24788) by Apsod in 28325
* Translate contributing.md into Chinese by Mayfsz in 28243
* [docs] Sort es/toctree.yml | Translate performance.md by aaronjimv in 28262
* Fix error in M4T feature extractor by ylacombe in 28340
* README: install transformers from conda-forge channel by kevherro in 28313
* Don't check the device when device_map=auto by yuanwu2017 in 28351
* Fix pos_mask application and update tests accordingly by ferjorosa in 27892
* fix FA2 when using quantization for remaining models by susnato in 28341
* Update VITS modeling to enable ONNX export by echarlaix in 28141
* chore: Fix typo s/exclusivelly/exclusively/ by hugo-syn in 28361
* Enhancing Code Readability and Maintainability with Simplified Activation Function Selection. by hi-sushanta in 28349
* Fix building alibi tensor when num_heads is not a power of 2 by abuelnasr0 in 28380
* remove two deprecated function by statelesshz in 28220
* Bugfix / ffmpeg input device (mic) not working on Windows by Teapack1 in 27051
* [AttentionMaskConverter] fix sdpa unmask unattended by zspo in 28369
* Remove shell=True from subprocess.Popen to Mitigate Security Risk by avimanyu786 in 28299
* Add segmentation map processing to SAM Image Processor by rwood-97 in 27463
* update warning for image processor loading by ydshieh in 28209
* Fix initialization for missing parameters in `from_pretrained` under ZeRO-3 by XuehaiPan in 28245
* Fix `_merge_input_ids_with_image_features` for llava model by VictorSanh in 28333
* Use mmap option to load_state_dict by weimingzha0 in 28331
* [BUG] BarkEosPrioritizerLogitsProcessor eos_token_id use list, tensor size mismatch by inkinworld in 28201
* Skip now failing test in the Trainer tests by muellerzr in 28421
* Support `DeepSpeed` when using auto find batch size by muellerzr in 28088
* Fix number of models in README.md by prasatee in 28430
* CI: limit natten version by gante in 28432
* Fix for checkpoint rename race condition by tblattner in 28364
* Fix load correct tokenizer in Mixtral model documentation by JuanFKurucz in 28437
* [docstring] Fix docstring for ErnieConfig, ErnieMConfig by Sparty in 27029
* [Whisper] Fix slow test by patrickvonplaten in 28407
* Assitant model may on a different device by jiqing-feng in 27995
* Enable multi-label image classification in pipeline by amyeroberts in 28433
* Optimize the speed of the truncate_sequences function. by ikkvix in 28263
* Use python 3.10 for docbuild by ydshieh in 28399
* Fix docker file by ydshieh in 28452
* Set `cache_dir` for `evaluate.load()` in example scripts by aphedges in 28422
* Optionally preprocess segmentation maps for MobileViT by harisankar95 in 28420
* Correctly resolve trust_remote_code=None for AutoTokenizer by Rocketknight1 in 28419
* Fix load balancing loss func for mixtral by liangxuZhang in 28256
* Doc by jiqing-feng in 28431
* Fix docstring checker issues with PIL enums by Rocketknight1 in 28450
* Fix broken link on page by keenranger in 28451
* Mark two logger tests as flaky by amyeroberts in 28458
* Update metadata loading for oneformer by amyeroberts in 28398
* Fix torch.ones usage in xlnet by sungho-ham in 28471
* Generate: deprecate old public functions by gante in 28478
* Docs: add model paths by gante in 28475
* Generate: refuse to save bad generation config files by gante in 28477
* TF: purge `TFTrainer` by gante in 28483
* Fix docstrings and update docstring checker error message by Rocketknight1 in 28460
* Change progress logging to once across all nodes by siddartha-RE in 28373
* Generate: fix candidate device placement by gante in 28493
* Fix paths to AI Sweden Models reference and model loading by JuanFKurucz in 28423
* [`chore`] Update warning text, a word was missing by tomaarsen in 28017
* Don't set `finetuned_from` if it is a local path by ydshieh in 28482
* Add the XPU device check for pipeline mode by yuanwu2017 in 28326
* Tokenizer kwargs in textgeneration pipe by thedamnedrhino in 28362
* [GPTQ] Fix test by SunMarc in 28018
* Fixed minor typos by rishit5 in 28489
* Add a use_safetensors arg to TFPreTrainedModel.from_pretrained() by Rocketknight1 in 28511
* Generate: consolidate output classes by gante in 28494
* fix: sampling in flax keeps EOS by borisdayma in 28378
* improve dev setup comments and hints by 4imothy in 28495
* SiLU activation wrapper for safe importing by amyeroberts in 28509
* Remove `task` arg in `load_dataset` in image-classification example by regisss in 28408
* Improving Training Performance and Scalability Documentation by HamzaFB in 28497
* Fix mismatching loading in from_pretrained with/without accelerate by fxmarty in 28414
* Fix/speecht5 bug by NimaYaqmuri in 28481
* [ `TokenizationUtils`] Fix `add_special_tokens` when the token is already there by ArthurZucker in 28520
* [`TokenizationRoformerFast`] Fix the save and loading by ArthurZucker in 28527
* [`SpeechT5Tokenization`] Add copied from and fix the `convert_tokens_to_string` to match the fast decoding scheme by ArthurZucker in 28522
* Clearer error for SDPA when explicitely requested by fxmarty in 28006
* Add is_model_supported for fx by inisis in 28521
* Config: warning when saving generation kwargs in the model config by gante in 28514
* [Makefile] Exclude research projects from format by patrickvonplaten in 28551
* symbolic_trace: add past_key_values, llama, sdpa support by fxmarty in 28447
* Allow to train dinov2 with different dtypes like bf16 by StarCycle in 28504
* Fix Switch Transformers When sparse_step = 1 by agemagician in 28564
* Save `Processor` by ydshieh in 27761
* Use `weights_only` only if torch >= 1.13 by ydshieh in 28506
* [`Core Tokenization`] Support a fix for spm fast models by ArthurZucker in 26678
* Use `LoggingLevel` context manager in 3 tests by ydshieh in 28575
* Fix the documentation checkpoint for xlm-roberta-xl by jeremyfowers in 28567
* [ASR Pipe] Update init to set model type and subsequently call parent init method by sanchit-gandhi in 28486
* [Whisper Tok] Move token ids to CPU when computing offsets by sanchit-gandhi in 28485
* [Whisper] Fix audio classification with weighted layer sum by sanchit-gandhi in 28563
* Making CTC training example more general by ylacombe in 28582
* Don't save `processor_config.json` if a processor has no extra attribute by ydshieh in 28584
* Fix wrong xpu device in DistributedType.MULTI_XPU mode by faaany in 28386
* [GPTNeoX] Fix BC issue with 4.36 by ArthurZucker in 28602

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* aaronjimv
* [Doc] Spanish translation of glossary.md (27958)
* [Doc] Fix token link in What 🤗 Transformers can do (28123)
* [docs] Fix mistral link in mixtral.md (28143)
* [docs] Sort es/toctree.yml | Translate performance.md (28262)
* rajveer43
* Add model_docs from cpmant.md to derformable_detr.md (27884)
* Fix indentation error - semantic_segmentation.md (28117)
* poedator
* 4D `attention_mask` support (27539)
* [bnb] Let's make serialization of 4bit models possible (26037)
* connor-henderson
* Add FastSpeech2Conformer (23439)
* JustinLin610
* Add qwen2 (28436)
* SangbumChoi
* enable training mask2former and maskformer for transformers trainer by SangbumChoi in 28277
* [DETA] Improvement and Sync from DETA especially for training by SangbumChoi in 27990
* fix auxiliary loss training in DetrSegmentation by SangbumChoi in 28354

PyTorch 1.9

The last version to support PyTorch 1.9 was 4.30.x. Since more than two years have passed and we're looking forward to using features available in PyTorch 1.10 and up, we no longer support PyTorch 1.9 for v4.31 and up.

* Byebye pytorch 1.9 by ydshieh in 24080

RoPE scaling

This PR adds RoPE scaling to the LLaMa and GPTNeoX families of models. It allows us to extrapolate beyond the original maximum sequence length (e.g. 2048 tokens on LLaMA), without fine-tuning. It offers two strategies:
- Linear scaling
- Dynamic NTK scaling

* Llama/GPTNeoX: add RoPE scaling by gante in 24653
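
A minimal sketch of selecting a strategy by overriding the config at load time; the checkpoint name and scaling factor below are illustrative:

```python
from transformers import AutoModelForCausalLM

# Minimal sketch: enable linear RoPE scaling at load time ("dynamic" would
# select Dynamic NTK scaling instead). Checkpoint and factor are illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",
    rope_scaling={"type": "linear", "factor": 2.0},  # 2x the original context length
)
```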

Agents

Tools now return a type that is specific to agents. This type can return a serialized version of itself (a string) that either points to a file on disk or to the object's content. This should make interaction with text-based systems much simpler.

* Tool types by LysandreJik in 24032

Tied weights load

Models with potentially tied weights dropped some keys from the state dict even when the weights were not tied. This has now been fixed, and more generally, the whole experience of loading a model with a state dict that doesn't match exactly should be improved in this release.

* Tied weights load by sgugger in 24310
* Clean load keys by sgugger in 24505

Whisper word-level timestamps

This PR adds a method of predicting timestamps at the word (or even token) level, by analyzing the cross-attentions and applying dynamic time warping.

* add word-level timestamps to Whisper by hollance in 23205
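
A minimal sketch of requesting word-level timestamps through the ASR pipeline; the checkpoint and audio file below are illustrative assumptions:

```python
from transformers import pipeline

# Minimal sketch: ask the pipeline for word-level timestamps.
# "openai/whisper-tiny" and "sample.flac" are illustrative placeholders.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
result = asr("sample.flac", return_timestamps="word")
print(result["chunks"])  # each chunk holds a word and its (start, end) timestamps
```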

Auto model addition

A new auto model is added, `AutoModelForTextEncoding`. It is to be used when you want to extract the text encoder from an encoder-decoder architecture.

* [AutoModel] Add AutoModelForTextEncoding by sanchit-gandhi in 24305
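
A minimal sketch, assuming a T5 checkpoint (`t5-small` is an illustrative choice):

```python
from transformers import AutoModelForTextEncoding, AutoTokenizer

# Minimal sketch; loads only the text encoder of an encoder-decoder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = AutoModelForTextEncoding.from_pretrained("t5-small")

inputs = tokenizer("Studies have shown that owning a dog is good for you", return_tensors="pt")
hidden_states = encoder(**inputs).last_hidden_state
```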

Model deprecation

Transformers is growing a lot, and to ease the burden of maintenance on our side, we have taken the decision to deprecate models that are not used much. Those models will never actually disappear from the library, but we will stop testing them or accepting PRs modifying them.
The criterion to identify models to deprecate was fewer than 1,000 unique downloads in the last 30 days for models that are at least one year old. The list of deprecated models is:

- BORT
- M-CTC-T
- MMBT
- RetriBERT
- TAPEX
- Trajectory Transformer
- VAN

* Deprecate models by sgugger in 24787

Breaking changes

Fixes an issue with stripped spaces for the T5 family tokenizers. If this negatively impacts inference/training with your models, please let us know by opening an issue.

* ⚠️⚠️[`T5Tokenize`] Fix T5 family tokenizers⚠️⚠️ by ArthurZucker in 24565

Bugfixes and improvements

* add trust_remote_code option to CLI download cmd by radames in 24097
* Fix typo in Llama docstrings by Kh4L in 24020
* Avoid `GPT-2` daily CI job OOM (in TF tests) by ydshieh in 24106
* [Lllama] Update tokenization code to ensure parsing of the special tokens [core] by ArthurZucker in 24042
* PLAM => PaLM by xingener in 24129
* [`bnb`] Fix bnb config json serialization by younesbelkada in 24137
* Correctly build models and import call_context for older TF versions by Rocketknight1 in 24138
* Generate: PT's `top_p` enforces `min_tokens_to_keep` when it is `1` by gante in 24111
* fix bugs with trainer by pacman100 in 24134
* Fix TF Rag OOM issue by ydshieh in 24122
* Fix SAM OOM issue on CI by ydshieh in 24125
* Fix XGLM OOM on CI by ydshieh in 24123
* [`SAM`] Fix sam slow test by younesbelkada in 24140
* [lamaTokenizerFast] Update documentation by ArthurZucker in 24132
* [BlenderBotSmall] Update doc example by ArthurZucker in 24092
* Fix Pipeline CI OOM issue by ydshieh in 24124
* [documentation] grammatical fixes in image_classification.mdx by LiamSwayne in 24141
* Fix typo in streamers.py by freddiev4 in 24144
* [tests] fix bitsandbytes import issue by stas00 in 24151
* Avoid OOM in doctest CI by ydshieh in 24139
* Fix `Wav2Vec2` CI OOM by ydshieh in 24190
* Fix push to hub by NielsRogge in 24187
* Change ProgressCallback to use dynamic_ncols=True by gmlwns2000 in 24101
* [i18n]Translated "attention.mdx" to korean by kihoon71 in 23878
* Generate: force caching on the main model, in assisted generation by gante in 24177
* Fix device issue in `OpenLlamaModelTest::test_model_parallelism` by ydshieh in 24195
* Update `GPTNeoXLanguageGenerationTest` by ydshieh in 24193
* typo: fix typos in CONTRIBUTING.md and deepspeed.mdx by zsj9509 in 24184
* Generate: detect special architectures when loaded from PEFT by gante in 24198
* 🌐 [i18n-KO] Translated tasks_summary.mdx to Korean by kihoon71 in 23977
* 🚨🚨🚨 Replace DataLoader logic for Accelerate in Trainer, remove unneeded tests 🚨🚨🚨 by muellerzr in 24028
* Fix `_load_pretrained_model` by SunMarc in 24200
* Fix steps bugs in no trainer examples by Ethan-yt in 24197
* Skip RWKV test in past CI by ydshieh in 24204
* Remove unnecessary aten::to overhead in llama by fxmarty in 24203
* Update `WhisperForAudioClassification` doc example by ydshieh in 24188
* Finish dataloader integration by muellerzr in 24201
* Add the number of `model` test failures to slack CI report by ydshieh in 24207
* fix: TextIteratorStreamer cannot work with pipeline by yuanwu2017 in 23641
* Update `(TF)SamModelIntegrationTest` by ydshieh in 24199
* Improving error message when using `use_safetensors=True`. by Narsil in 24232
* Safely import pytest in testing_utils.py by amyeroberts in 24241
* fix overflow when training mDeberta in fp16 by sjrl in 24116
* deprecate `use_mps_device` by pacman100 in 24239
* Tied params cleanup by sgugger in 24211
* [Time Series] use mean scaler when scaling is a boolean True by kashif in 24237
* TF: standardize `test_model_common_attributes` for language models by gante in 23457
* Generate: GenerationConfig can overwrite attributes at from_pretrained time by gante in 24238
* Add `torch >=1.12` requirement for `Tapas` by ydshieh in 24251
* Update urls in warnings for rich rendering by IvanReznikov in 24136
* Fix how we detect the TF package by Rocketknight1 in 24255
* Stop storing references to bound methods via tf.function by Rocketknight1 in 24146
* Skip `GPT-J` fx tests for torch < 1.12 by ydshieh in 24256
* docs wrt using accelerate launcher with trainer by pacman100 in 24250
* update FSDP save and load logic by pacman100 in 24249
* Fix URL in comment for contrastive loss function by taepd in 24271
* QA doc: import torch before it is used by ByronHsu in 24228
* Skip some `TQAPipelineTests` tests in past CI by ydshieh in 24267
* TF: CTRL with native embedding layers by gante in 23456
* Adapt Wav2Vec2 conversion for MMS lang identification by patrickvonplaten in 24234
* Update check of core deps by sgugger in 24277
* `Pix2StructImageProcessor` requires `torch>=1.11.0` by ydshieh in 24270
* Fix Debertav2 embed_proj by WissamAntoun in 24205
* Clean up old Accelerate checks by sgugger in 24279
* Fix bug in slow tokenizer conversion, make it a lot faster by stephantul in 24266
* Fix `check_config_attributes`: check all configuration classes by ydshieh in 24231
* Fix LLaMa beam search when using parallelize by FeiWang96 in 24224
* remove unused is_decoder parameter in DetrAttention by JayL0321 in 24226
* Split common test from core tests by sgugger in 24284
* [fix] bug in BatchEncoding.__getitem__ by flybird1111 in 24293
* Fix image segmentation tool bug by amyeroberts in 23897
* [Docs] Improve docs for MMS loading of other languages by patrickvonplaten in 24292
* Update README_zh-hans.md by CooperFu in 24181
* deepspeed init during eval fix by pacman100 in 24298
* [EnCodec] Changes for 32kHz ckpt by sanchit-gandhi in 24296
* [Docs] Fix the paper URL for MMS model by hitchhicker in 24302
* Update tokenizer_summary.mdx (grammar) by belladoreai in 24286
* Beam search type by jprivera44 in 24288
* Make `can_generate` as class method by ydshieh in 24299
* Update test versions on README.md by sqali in 24307
* [`SwitchTransformers`] Fix return values by ArthurZucker in 24300
* Fix functional TF Whisper and modernize tests by Rocketknight1 in 24301
* Big TF test cleanup by Rocketknight1 in 24282
* Fix ner average grouping with no groups by Narsil in 24319
* Fix ImageGPT doc example by amyeroberts in 24317
* Add test for proper TF input signatures by Rocketknight1 in 24320
* Adding ddp_broadcast_buffers argument to Trainer by TevenLeScao in 24326
* error bug on saving distributed optim state when using data parallel by xshaun in 24108
* 🌐 [i18n-KO] Fixed `tutorial/preprocessing.mdx` by sim-so in 24156
* pin `apex` to a speicifc commit (for DeepSpeed CI docker image) by ydshieh in 24351
* byebye Hub connection timeout by ydshieh in 24350
* Clean up disk sapce during docker image build for `transformers-pytorch-gpu` by ydshieh in 24346
* Fix `KerasMetricCallback`: pass `generate_kwargs` even if `use_xla_generation` is False by Kripner in 24333
* Fix device issue in `SwitchTransformers` by ydshieh in 24352
* Update MMS integration docs by vineelpratap in 24311
* Make `AutoFormer` work with previous torch version by ydshieh in 24357
* Fix ImageGPT doctest by amyeroberts in 24353
* Fix link to documentation in Install from Source by SoyGema in 24336
* docs: add BentoML to awesome-transformers by aarnphm in 24344
* [Doc Fix] Fix model name path in the transformers doc for AutoClasses by riteshghorse in 24329
* Fix the order in `GPTNeo`'s docstring by qgallouedec in 24358
* Respect explicitly set framework parameter in pipeline by denis-ismailaj in 24322
* Allow passing kwargs through to TFBertTokenizer by Rocketknight1 in 24324
* Fix resuming PeftModel checkpoints in Trainer by llohann-speranca in 24274
* TensorFlow CI fixes by Rocketknight1 in 24360
* Update tiny models for pipeline testing. by ydshieh in 24364
* [modelcard] add audio classification to task list by sanchit-gandhi in 24363
* [Whisper] Make tests faster by sanchit-gandhi in 24105
* Rename test to be more accurate by sgugger in 24374
* Add a check in `ImageToTextPipeline._forward` by ydshieh in 24373
* [Tokenizer doc] Clarification about `add_prefix_space` by ArthurZucker in 24368
* style: add BitsAndBytesConfig __repr__ function by aarnphm in 24331
* Better test name and enable pipeline test for `pix2struct` by ydshieh in 24377
* Skip a tapas (tokenization) test in past CI by ydshieh in 24378
* [Whisper Docs] Nits by ArthurZucker in 24367
* [GPTNeoX] Nit in config by ArthurZucker in 24349
* [Wav2Vec2 - MMS] Correct directly loading adapters weights by patrickvonplaten in 24335
* Migrate doc files to Markdown. by sgugger in 24376
* Update deprecated torch.ger by kit1980 in 24387
* [docs] Fix NLLB-MoE links by stevhliu in 24388
* Add `ffmpeg` for `doc_test_job` on CircleCI by ydshieh in 24397
* byebye Hub connection timeout - Recast by ydshieh in 24399
* fix type annotation for debug arg by Bearnardd in 24033
* [Trainer] Fix optimizer step on PyTorch TPU by cowanmeg in 24389
* Fix gradient checkpointing + fp16 autocast for most models by younesbelkada in 24247
* Clean up dist import by muellerzr in 24402
* Check auto mappings could be imported via `from transformers` by ydshieh in 24400
* Remove redundant code from TrainingArgs by muellerzr in 24401
* Explicit arguments in `from_pretrained` by ydshieh in 24306
* [ASR pipeline] Check for torchaudio by sanchit-gandhi in 23953
* TF safetensors reduced mem usage by Rocketknight1 in 24404
* Skip `test_conditional_generation_pt_pix2struct` in Past CI (torch < 1.11) by ydshieh in 24417
* [`bnb`] Fix bnb serialization issue with new release by younesbelkada in 24416
* Revert "Fix gradient checkpointing + fp16 autocast for most models" by younesbelkada in 24420
* Fix `save_cache` version in `config.yml` by ydshieh in 24419
* Update RayTune doc link for Hyperparameter tuning by JoshuaEPSamuel in 24422
* TF CI fix for Segformer by Rocketknight1 in 24426
* Refactor hyperparameter search backends by alexmojaki in 24384
* Clarify batch size displayed when using DataParallel by sgugger in 24430
* Save `site-packages` as cache in CircleCI job by ydshieh in 24424
* [llama] Fix comments in weights converter by weimingzha0 in 24436
* [`Trainer`] Fix `.to` call on 4bit models by younesbelkada in 24444
* fix the grad_acc issue at epoch boundaries by pacman100 in 24415
* Replace python random with torch.rand to enable dynamo.export by BowenBao in 24434
* Fix typo by siryuon in 24440
* Fix some `TFWhisperModelIntegrationTests` by ydshieh in 24428
* fixes issue when saving fsdp via accelerate's FSDP plugin by pacman100 in 24446
* Allow dict input for audio classification pipeline by sanchit-gandhi in 23445
* Update `JukeboxConfig.from_pretrained` by ydshieh in 24443
* Improved keras imports by Rocketknight1 in 24448
* add missing alignment_heads to Whisper integration test by hollance in 24487
* Fix tpu_metrics_debug by cowanmeg in 24452
* Update AlbertModel type annotation by amyeroberts in 24450
* [`pipeline`] Fix str device issue by younesbelkada in 24396
* when resume from peft checkpoint, the model should be trainable by sywangyi in 24463
* deepspeed z1/z2 state dict fix by pacman100 in 24489
* Update `InstructBlipModelIntegrationTest` by ydshieh in 24490
* Update token_classification.md by condor-cp in 24484
* Add support for for loops in python interpreter by sgugger in 24429
* [`InstructBlip`] Add accelerate support for instructblip by younesbelkada in 24488
* Compute `dropout_probability` only in training mode by ydshieh in 24486
* Fix 'local_rank' AttiributeError in Trainer class by mocobeta in 24297
* Compute `dropout_probability` only in training mode (SpeechT5) by ydshieh in 24498
* Fix link in utils by SoyGema in 24501
* 🚨🚨 Fix group beam search by hukuda222 in 24407
* Generate: `group_beam_search` requires `diversity_penalty>0.0` by gante in 24456
* Generate: `min_tokens_to_keep` has to be `>= 1` by gante in 24453
* Fix TypeError: Object of type int64 is not JSON serializable by xiaoli in 24340
* Fix poor past ci by ydshieh in 24485
* 🌐 [i18n-KO] Translated `tflite.mdx` to Korean by 0525hhgus in 24435
* use accelerate autocast in jit eval path, since mix precision logic is… by sywangyi in 24460
* Update hyperparameter_search.py by pacman100 in 24515
* [`T5`] Add T5ForQuestionAnswering and MT5ForQuestionAnswering by sjrl in 24481
* set model to training mode before accelerate.prepare by sywangyi in 24520
* Update `huggingface_hub` commit sha by ydshieh in 24527
* Find module name in an OS-agnostic fashion by sgugger in 24526
* Fix LR scheduler based on bs from auto bs finder by muellerzr in 24521
* [Mask2Former] Remove SwinConfig by NielsRogge in 24259
* Allow backbones not in backbones_supported - Maskformer Mask2Former by amyeroberts in 24532
* Fix Typo by tony9402 in 24530
* Finishing tidying keys to ignore on load by sgugger in 24535
* Add bitsandbytes support for gpt2 models by DarioSucic in 24504
* ⚠️ Time to say goodbye to py37 by ydshieh in 24091
* Unpin DeepSpeed and require DS >= 0.9.3 by ydshieh in 24541
* Allow for warn_only selection in enable_full_determinism by Frank995 in 24496
* Fix typing annotations for FSDP and DeepSpeed in TrainingArguments by mryab in 24549
* Update PT/TF weight conversion after 24030 by ydshieh in 24547
* Update `EncodecIntegrationTest` by ydshieh in 24553
* [`gpt2-int8`] Add gpt2-xl int8 test by younesbelkada in 24543
* Fix processor __init__ bug if image processor undefined by amyeroberts in 24554
* [`InstructBlip`] Add instruct blip int8 test by younesbelkada in 24555
* Update PT/Flax weight conversion after 24030 by ydshieh in 24556
* Make PT/Flax tests could be run on GPU by ydshieh in 24557
* Update masked_language_modeling.md by condor-cp in 24560
* Fixed OwlViTModel inplace operations by pasqualedem in 24529
* Update old existing feature extractor references by amyeroberts in 24552
* Fix Typo by tony9402 in 24559
* Fix annotations by tony9402 in 24571
* Docs: 4 bit doc corrections by gante in 24572
* Revert "Fix typing annotations for FSDP and DeepSpeed in TrainingArguments" by sgugger in 24574
* Update some torchscript tests after 24505 by ydshieh in 24566
* Removal of deprecated vision methods and specify deprecation versions by amyeroberts in 24570
* Fix ESM models buffers by sgugger in 24576
* Check all objects are equally in the main `__init__` file by ydshieh in 24573
* Fix annotations by tony9402 in 24582
* fix peft ckpts not being pushed to hub by pacman100 in 24578
* Udate link to RunHouse hardware setup documentation. by BioGeek in 24590
* Show a warning for missing attention masks when pad_token_id is not None by hackyon in 24510
* Make (TF) CI faster (test only a subset of model classes) by ydshieh in 24592
* Speed up TF tests by reducing hidden layer counts by Rocketknight1 in 24595
* [several models] improve readability by stas00 in 24585
* Use protobuf 4 by ydshieh in 24599
* Limit Pydantic to V1 in dependencies by lig in 24596
* 🌐 [i18n-KO] Translated `perplexity.mdx` to Korean by HanNayeoniee in 23850
* [Time-Series] Added blog-post to tips by elisim in 24482
* Pin `Pillow` for now by ydshieh in 24633
* Fix loading dataset docs link in run_translation.py example by SoyGema in 24594
* Generate: multi-device support for contrastive search by gante in 24635
* Generate: force cache with `inputs_embeds` forwarding by gante in 24639
* precompiled_charsmap checking before adding to the normalizers' list for XLNetTokenizerFast conversion. by shahad-mahmud in 24618
* Fix audio feature extractor deps by sanchit-gandhi in 24636
* llama fp16 torch.max bug fix by prathikr in 24561
* documentation_tests.txt - sort filenames alphabetically by amyeroberts in 24647
* Update warning messages reffering to post_process_object_detection by rafaelpadilla in 24649
* Add `finetuned_from` property in the autogenerated model card by sgugger in 24528
* Make warning disappear for remote code in pipelines by sgugger in 24603
* Fix `EncodecModelTest::test_multi_gpu_data_parallel_forward` by ydshieh in 24663
* Fix `VisionTextDualEncoderIntegrationTest` by ydshieh in 24661
* Add `is_torch_mps_available` function to utils by NripeshN in 24660
* Unpin `huggingface_hub` by ydshieh in 24667
* Fix model referenced and results in documentation. Model mentioned was inaccessible by rafaelpadilla in 24609
* Add Nucleotide Transformer notebooks and restructure notebook list by Rocketknight1 in 24669
* LlamaTokenizer should be picklable by icyblade in 24681
* Add dropouts to GPT-NeoX by ZHAOTING in 24680
* DeepSpeed/FSDP ckpt saving utils fixes and FSDP training args fixes by pacman100 in 24591
* Avoid import `sentencepiece_model_pb2` in `utils.__init__.py` by ydshieh in 24689
* Fix integration with Accelerate and failing test by muellerzr in 24691
* [`MT5`] Fix CONFIG_MAPPING issue leading it to load umt5 class by ArthurZucker in 24678
* Fix flaky `test_for_warning_if_padding_and_no_attention_mask` by ydshieh in 24706
* Whisper: fix prompted max length by gante in 24666
* Enable `conversational` pipeline for `GPTSw3Tokenizer` by saattrupdan in 24648
* [`T5`] Adding model_parallel = False to `T5ForQuestionAnswering` and `MT5ForQuestionAnswering` by sjrl in 24684
* Docs: change some `input_ids` doc reference from `BertTokenizer` to `AutoTokenizer` by gante in 24730
* add link to accelerate doc by SunMarc in 24601
* [Patch-t5-tokenizer] Patches the changes on T5 to make sure previous behaviour is still valide for beginning of words by ArthurZucker in 24622
* Fix typo in LocalAgent by jamartin9 in 24736
* fix: Text splitting in the BasicTokenizer by connor-henderson in 22280
* Docs: add `kwargs` type to fix formatting by gante in 24733
* add gradient checkpointing for distilbert by jordane95 in 24719
* Skip keys not in the state dict when finding mismatched weights by sgugger in 24749
* Fix non-deterministic Megatron-LM checkpoint name by janEbert in 24674
* [InstructBLIP] Fix bos token of LLaMa checkpoints by NielsRogge in 24492
* Skip some slow tests for doctesting in PRs (Circle)CI by ydshieh in 24753
* Fix lr scheduler not being reset on reruns by muellerzr in 24758
* :bug: Handle empty gen_kwargs for seq2seq trainer prediction_step function by gkumbhat in 24759
* Allow existing configs to be registered by sgugger in 24760
* Unpin protobuf in docker file (for daily CI) by ydshieh in 24761
* Fix eval_accumulation_steps leading to incorrect metrics by muellerzr in 24756
* Add MobileVitV2 to doctests by amyeroberts in 24771
* Docs: Update logit processors __call__ docs by gante in 24729
* Replacement of 20 asserts with exceptions by Baukebrenninkmeijer in 24757
* Update default values of bos/eos token ids in `CLIPTextConfig` by ydshieh in 24773
* Fix pad across processes dim in trainer and not being able to set the timeout by muellerzr in 24775
* gpt-bigcode: avoid `zero_` to support Core ML by pcuenca in 24755
* Remove WWT from README by LysandreJik in 24672
* Rm duplicate pad_across_processes by muellerzr in 24780
* Revert "Unpin protobuf in docker file (for daily CI)" by ydshieh in 24800
* Removing unnecessary `device=device` in modeling_llama.py by Liyang90 in 24696
* [fix] Change the condition of ValueError in "convert_checkpoint_from_transformers_to_megatron" by SeongBeomLEE in 24769
* [DOC] Clarify relationshi load_best_model_at_end and save_total_limit by BramVanroy in 24614
* Upgrade jax/jaxlib/flax pin versions by ydshieh in 24791
* Fix MobileVitV2 doctest checkpoint by amyeroberts in 24805
* Skip torchscript tests for `MusicgenForConditionalGeneration` by ydshieh in 24782
* Generate: add SequenceBiasLogitsProcessor by gante in 24334
* Add accelerate version in transformers-cli env by amyeroberts in 24806
* Fix typo 'submosules' by dymil in 24809
* Remove Falcon docs for the release until TGI is ready by Rocketknight1 in 24808
* Update setup.py to be compatible with pipenv by georgiemathews in 24789
* Use _BaseAutoModelClass's register method by fadynakhla in 24810
* Run hub tests by sgugger in 24807
* Copy code when using local trust remote code by sgugger in 24785
* Fixing double `use_auth_token.pop` (preventing private models from being visible). by Narsil in 24812
* set correct model input names for gptsw3tokenizer by DarioSucic in 24788
* Check models used for common tests are small by sgugger in 24824
* [🔗 Docs] Fixed Incorrect Migration Link by kadirnar in 24793
* deprecate `sharded_ddp` training argument by statelesshz in 24825
* 🌐 [i18n-KO] Translated `custom_tools.mdx` to Korean by sim-so in 24580
* Remove unused code in GPT-Neo by namespace-Pt in 24826
* Add Multimodal heading and Document question answering in task_summary.mdx by y3sar in 23318
* Fix `is_vision_available` by ydshieh in 24853
* Fix comments for `_merge_heads` by bofenghuang in 24855
* fix broken links in READMEs by younesbelkada in 24861
* Add TAPEX to the list of deprecated models by sgugger in 24859
* Fix token pass by sgugger in 24862

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* hollance
* [WIP] add EnCodec model (23655)
* add word-level timestamps to Whisper (23205)
* add missing alignment_heads to Whisper integration test (24487)
* sim-so
* 🌐 [i18n-KO] Fixed `tutorial/preprocessing.mdx` (24156)
* 🌐 [i18n-KO] Translated `custom_tools.mdx` to Korean (24580)
* novice03
* Add Multi Resolution Analysis (MRA) (New PR) (24513)
* jegork
* Add ViViT (22518)

LLaVA-NeXT

LLaVA-NeXT is the next version of LLaVA; it includes better support for non-padded images, improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.

Compared with LLaVA-1.5, LLaVA-NeXT has several improvements:
- Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, 1344x336 resolution.
- Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
- Better visual conversation for more scenarios, covering different applications.
- Better world knowledge and logical reasoning.
- Along with performance improvements, LLaVA-NeXT maintains the minimalist design and data efficiency of LLaVA-1.5. It re-uses the pretrained connector of LLaVA-1.5, and still uses less than 1M visual instruction tuning samples. The largest 34B variant finishes training in ~1 day with 32 A100s.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/llava_next_overview.png"
alt="drawing" width="600"/>

<small> LLaVa-NeXT incorporates a higher input resolution by encoding various patches of the input image. Taken from the <a href="https://arxiv.org/abs/2310.03744">original paper.</a> </small>
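
A minimal usage sketch; the checkpoint, image URL, and prompt format follow the model-card conventions and are assumptions here:

```python
import torch
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Minimal sketch; checkpoint, image, and prompt template are illustrative.
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```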

MusicGen Melody

The MusicGen Melody model was proposed in [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.

MusicGen Melody is a single stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations. MusicGen is then trained to predict discrete audio tokens, or audio codes, conditioned on these hidden-states. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform.

Through an efficient token interleaving pattern, MusicGen does not require a self-supervised semantic representation of the text/audio prompts, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. hierarchically or upsampling). Instead, it is able to generate all the codebooks in a single forward pass.

* Add MusicGen Melody by ylacombe in 28819
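
A minimal text-to-music sketch; the prompt and generation parameters are illustrative:

```python
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

# Minimal sketch; "facebook/musicgen-melody" is the released checkpoint.
processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody")

inputs = processor(text=["80s pop track with bassy drums and synth"], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
```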

PvT-v2

The PVTv2 model was proposed in [PVT v2: Improved Baselines with Pyramid Vision Transformer](https://arxiv.org/abs/2106.13797) by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. As an improved variant of PVT, it eschews position embeddings, relying instead on positional information encoded through zero-padding and overlapping patch embeddings. This lack of reliance on position embeddings simplifies the architecture, and enables running inference at any resolution without needing to interpolate them.

* Add PvT-v2 Model by FoamoftheSea in 26812
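
A minimal image-classification sketch; the checkpoint name and image file are assumptions:

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Minimal sketch; "OpenGVLab/pvt_v2_b0" is assumed to be one of the released checkpoints.
processor = AutoImageProcessor.from_pretrained("OpenGVLab/pvt_v2_b0")
model = AutoModelForImageClassification.from_pretrained("OpenGVLab/pvt_v2_b0")

image = Image.open("cat.jpg")
inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```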

UDOP

The UDOP model was proposed in [Unifying Vision, Text, and Layout for Universal Document Processing](https://arxiv.org/abs/2212.02623) by Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal. UDOP adopts an encoder-decoder Transformer architecture based on [T5](https://huggingface.co/docs/transformers/main/en/model_doc/t5) for document AI tasks like document image classification, document parsing and document visual question answering.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/udop_architecture.jpg"
alt="drawing" width="600"/>

<small> UDOP architecture. Taken from the <a href="https://arxiv.org/abs/2212.02623">original paper.</a> </small>

* Add UDOP by NielsRogge in 22940
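
A minimal document-VQA sketch; the checkpoint and file name are assumptions, and by default the processor runs OCR (via pytesseract) to extract words and boxes:

```python
from PIL import Image
from transformers import UdopProcessor, UdopForConditionalGeneration

# Minimal sketch; "microsoft/udop-large" and "document.png" are assumptions.
processor = UdopProcessor.from_pretrained("microsoft/udop-large")
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

image = Image.open("document.png").convert("RGB")
encoding = processor(images=image, text="Question answering. What is the date?", return_tensors="pt")

predicted_ids = model.generate(**encoding, max_new_tokens=20)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```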

Mamba

This model is a new paradigm of architecture, based on state-space models rather than the attention mechanism of Transformer models.
The checkpoints are compatible with the original ones.

* [`Add Mamba`] Adds support for the `Mamba` models by ArthurZucker in 28094
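
A minimal generation sketch; the checkpoint name is an assumption:

```python
from transformers import AutoTokenizer, MambaForCausalLM

# Minimal sketch; "state-spaces/mamba-130m-hf" is one of the converted checkpoints.
tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")
model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")

inputs = tokenizer("Hey, how are you doing?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.batch_decode(out)[0])
```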

StarCoder2

StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective.

* Starcoder2 model - bis by RaymondLi0 in 29215
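
A minimal code-generation sketch; the checkpoint and prompt are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch; "bigcode/starcoder2-7b" is one of the three released sizes.
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder2-7b")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```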

SegGPT

The SegGPT model was proposed in [SegGPT: Segmenting Everything In Context](https://arxiv.org/abs/2304.03284) by Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang. SegGPT employs a decoder-only Transformer that can generate a segmentation mask given an input image, a prompt image and its corresponding prompt mask. The model achieves remarkable one-shot results with 56.1 mIoU on COCO-20 and 85.6 mIoU on FSS-1000.

* Adding SegGPT by EduardoPach in 27735
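
A minimal in-context segmentation sketch; the checkpoint and the three image files are assumptions:

```python
import torch
from PIL import Image
from transformers import SegGptImageProcessor, SegGptForImageSegmentation

# Minimal sketch; "BAAI/seggpt-vit-large" is the released checkpoint, and the
# input image, prompt image, and prompt mask are illustrative placeholders.
processor = SegGptImageProcessor.from_pretrained("BAAI/seggpt-vit-large")
model = SegGptForImageSegmentation.from_pretrained("BAAI/seggpt-vit-large")

image = Image.open("input.jpg")
prompt_image = Image.open("prompt.jpg")
prompt_mask = Image.open("prompt_mask.png")

inputs = processor(images=image, prompt_images=prompt_image, prompt_masks=prompt_mask, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # predicts a mask for `image` from the prompt pair
```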

Galore optimizer

![image](https://cdn-uploads.huggingface.co/production/uploads/61f4d468587c793cdf55b4dd/RPcpdcYkoUR8PwkTvjYJ0.png)

With [Galore](https://huggingface.co/papers/2403.03507), you can pre-train large models on consumer-grade hardware, making LLM pre-training much more accessible to anyone in the community.

> Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.

Galore is based on a low-rank approximation of the gradients and can be used out of the box for any model.

Below is a simple snippet that demonstrates how to pre-train `mistralai/Mistral-7B-v0.1` on imdb:

```python
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"]
)

model_id = "mistralai/Mistral-7B-v0.1"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()
```


Quantization

Quanto integration

Quanto has been integrated with transformers! You can apply simple quantization algorithms with a few lines of code and tiny changes. Quanto is also compatible with `torch.compile`.

Check out [the announcement blogpost](https://huggingface.co/blog/quanto-introduction) for more details

* [Quantization] Quanto quantizer by SunMarc in 29023
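
A minimal sketch of applying 8-bit weight quantization with quanto at load time; the checkpoint is illustrative:

```python
from transformers import AutoModelForCausalLM, QuantoConfig

# Minimal sketch: quantize the weights to int8 when loading the model.
quantization_config = QuantoConfig(weights="int8")
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m", quantization_config=quantization_config
)
```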

Exllama 🤝 AWQ

Exllama and AWQ are now combined for faster AWQ inference; check out the relevant documentation section for more details on how to use Exllama + AWQ.

* Exllama kernels support for AWQ models by IlyasMoutawwakil in 28634
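
A minimal sketch of enabling the Exllama kernels for an AWQ-quantized checkpoint; the model name is illustrative:

```python
from transformers import AutoModelForCausalLM, AwqConfig

# Minimal sketch: select the Exllama kernels for an AWQ-quantized model.
quantization_config = AwqConfig(version="exllama")
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)
```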

MLX Support

Allow models saved or fine-tuned with Apple’s [MLX framework](https://github.com/ml-explore/mlx) to be loaded in transformers (as long as the model parameters use the same names), and improve tensor interoperability. This leverages MLX's adoption of [safetensors](https://huggingface.co/docs/safetensors/en/index) as their checkpoint format.

* Add mlx support to BatchEncoding.convert_to_tensors by Y4hL in 29406
* Add support for metadata format MLX by alexweberk in 29335
* Typo in mlx tensor support by pcuenca in 29509
* Experimental loading of MLX files by pcuenca in 29511

Highlighted improvements

Notable memory reduction in Gemma/LLaMa by changing the causal mask buffer type from int64 to boolean.

* Use `torch.bool` instead of `torch.int64` for non-persistant causal mask buffer by fxmarty in 29241

Remote code improvements

* Allow remote code repo names to contain "." by Rocketknight1 in 29175
* simplify get_class_in_module and fix for paths containing a dot by cebtenzzre in 29262

Breaking changes

The PRs below introduced slightly breaking changes that we believed were necessary for the repository; if these seem to impact your usage of transformers, we recommend checking out the PR descriptions to get more insight into how to leverage the new behavior.

* 🚨🚨[Whisper Tok] Update integration test by sanchit-gandhi in 29368
* 🚨 Fully revert atomic checkpointing 🚨 by muellerzr in 29370
* [BC 4.37 -> 4.38] for Llama family, memory and speed 29753 (causal mask is no longer a registered buffer)

Fixes and improvements

* FIX [`Gemma`] Fix bad rebase with transformers main by younesbelkada in 29170
* Add training version check for AQLM quantizer. by BlackSamorez in 29142
* [Gemma] Fix eager attention by sanchit-gandhi in 29187
* [Mistral, Mixtral] Improve docs by NielsRogge in 29084
* Fix `torch.compile` with `fullgraph=True` when `attention_mask` input is used by fxmarty in 29211
* fix(mlflow): check mlflow version to use the synchronous flag by cchen-dialpad in 29195
* Fix missing translation in README_ru by strikoder in 29054
* Improve _update_causal_mask performance by alessandropalla in 29210
* [`Doc`] update model doc qwen2 by ArthurZucker in 29238
* Use torch 2.2 for daily CI (model tests) by ydshieh in 29208
* Cache `is_vision_available` result by bmuskalla in 29280
* Use `DS_DISABLE_NINJA=1` by ydshieh in 29290
* Add `non_device_test` pytest mark to filter out non-device tests by fxmarty in 29213
* Add feature extraction mapping for automatic metadata update by merveenoyan in 28944
* Generate: v4.38 removals and related updates by gante in 29171
* Track each row separately for stopping criteria by zucchini-nlp in 29116
* [docs] Spanish translation of tasks_explained.md by aaronjimv in 29224
* [i18n-zh] Translated torchscript.md into Chinese by windsonsea in 29234
* 🌐 [i18n-ZH] Translate chat_templating.md into Chinese by shibing624 in 28790
* [i18n-vi] Translate README.md to Vietnamese by hoangsvit in 29229
* [i18n-zh] Translated task/asr.md into Chinese by windsonsea in 29233
* Fixed Deformable Detr typo when loading cuda kernels for MSDA by EduardoPach in 29294
* GenerationConfig validate both constraints and force_words_ids by FredericOdermatt in 29163
* Add generate kwargs to VQA pipeline by regisss in 29134
* Cleaner Cache `dtype` and `device` extraction for CUDA graph generation for quantizers compatibility by BlackSamorez in 29079
* Image Feature Extraction docs by merveenoyan in 28973
* Fix `attn_implementation` documentation by fxmarty in 29295
* [tests] enable benchmark unit tests on XPU by faaany in 29284
* Use torch 2.2 for deepspeed CI by ydshieh in 29246
* Add compatibility with skip_memory_metrics for mps device by SunMarc in 29264
* Token level timestamps for long-form generation in Whisper by zucchini-nlp in 29148
* Fix a few typos in `GenerationMixin`'s docstring by sadra-barikbin in 29277
* [i18n-zh] Translate fsdp.md into Chinese by windsonsea in 29305
* FIX [`Gemma` / `CI`] Make sure our runners have access to the model by younesbelkada in 29242
* Remove numpy usage from owlvit by fxmarty in 29326
* [`require_read_token`] fix typo by ArthurZucker in 29345
* [`T5 and Llama Tokenizer`] remove warning by ArthurZucker in 29346
* [`Llama ROPE`] Fix torch export but also slow downs in forward by ArthurZucker in 29198
* Disable Mixtral `output_router_logits` during inference by LeonardoEmili in 29249
* Idefics: generate fix by gante in 29320
* RoPE loses precision for Llama / Gemma + Gemma logits.float() by danielhanchen in 29285
* check if position_ids exists before using it by jiqing-feng in 29306
* [CI] Quantization workflow by SunMarc in 29046
* Better SDPA unmasking implementation by fxmarty in 29318
* [i18n-zh] Sync source/zh/index.md by windsonsea in 29331
* FIX [`CI` / `starcoder2`] Change starcoder2 path to correct one for slow tests by younesbelkada in 29359
* FIX [`CI`]: Fix failing tests for peft integration by younesbelkada in 29330
* FIX [`CI`] `require_read_token` in the llama FA2 test by younesbelkada in 29361
* Avoid using uncessary `get_values(MODEL_MAPPING)` by ydshieh in 29362
* Patch YOLOS and others by NielsRogge in 29353
* Fix require_read_token in tests by Wauplin in 29367
* Expose `offload_buffers` parameter of `accelerate` to `PreTrainedModel.from_pretrained` method by notsyncing in 28755
* Fix Base Model Name of LlamaForQuestionAnswering by lenglaender in 29258
* FIX [`quantization` / `ESM`] Fix ESM 8bit / 4bit with bitsandbytes by younesbelkada in 29329
* [`Llama + AWQ`] fix `prepare_inputs_for_generation` 🫠 by ArthurZucker in 29381
* [`YOLOS`] Fix - return padded annotations by amyeroberts in 29300
* Support subfolder with `AutoProcessor` by JingyaHuang in 29169
* Fix llama + gemma accelete tests by SunMarc in 29380
* Fix deprecated arg issue by muellerzr in 29372
* Correct zero division error in inverse sqrt scheduler by DavidAfonsoValente in 28982
* [tests] enable automatic speech recognition pipeline tests on XPU by faaany in 29308
* update path to hub files in the error message by poedator in 29369
* [Mixtral] Fixes attention masking in the loss by DesmonDay in 29363
* Workaround for 27758 to avoid ZeroDivisionError by tleyden in 28756
* Convert SlimSAM checkpoints by NielsRogge in 28379
* Fix: Fixed the previous tracking URI setting logic to prevent clashes with original MLflow code. by seanswyi in 29096
* Fix OneFormer `post_process_instance_segmentation` for panoptic tasks by nickthegroot in 29304
* Fix grad_norm unserializable tensor log failure by svenschultze in 29212
* Avoid edge case in audio utils by ylacombe in 28836
* DeformableDETR support bfloat16 by DonggeunYu in 29232
* [Docs] Spanish Translation -Torchscript md & Trainer md by njackman-2344 in 29310
* FIX [`Generation`] Fix some issues when running the MaxLength criteria on CPU by younesbelkada in 29317
* Fix max length for BLIP generation by zucchini-nlp in 29296
* [docs] Update starcoder2 paper link by xenova in 29418
* [tests] enable test_pipeline_accelerate_top_p on XPU by faaany in 29309
* [`UdopTokenizer`] Fix post merge imports by ArthurZucker in 29451
* more fix by ArthurZucker (direct commit on main)
* Revert-commit 0d52f9f582efb82a12e8d9162b43a01b1aa0200f by ArthurZucker in 29455
* [`Udop imports`] Processor tests were not run. by ArthurZucker in 29456
* Generate: inner decoding methods are no longer public by gante in 29437
* Fix bug with passing capture_* args to neptune callback by AleksanderWWW in 29041
* Update pytest `import_path` location by loadams in 29154
* Automatic safetensors conversion when lacking these files by LysandreJik in 29390
* [i18n-zh] Translate add_new_pipeline.md into Chinese by windsonsea in 29432
* 🌐 [i18n-KO] Translated generation_strategies.md to Korean by AI4Harmony in 29086
* [FIX] `offload_weight()` takes from 3 to 4 positional arguments but 5 were given by faaany in 29457
* [`Docs` / `Awq`] Add docs on exllamav2 + AWQ by younesbelkada in 29474
* [`docs`] Add starcoder2 docs by younesbelkada in 29454
* Fix TrainingArguments regression with torch <2.0.0 for dataloader_prefetch_factor by ringohoffman in 29447
* Generate: add tests for caches with `pad_to_multiple_of` by gante in 29462
* Generate: get generation mode from the generation config instance 🧼 by gante in 29441
* Avoid dummy token in PLD to optimize performance by ofirzaf in 29445
* Fix test failure on DeepSpeed by muellerzr in 29444
* Generate: torch.compile-ready generation config preparation by gante in 29443
* added the max_matching_ngram_size to GenerationConfig by mosheber in 29131
* Fix `TextGenerationPipeline.__call__` docstring by alvarobartt in 29491
* Substantially reduce memory usage in _update_causal_mask for large batches by using .expand instead of .repeat [needs tests+sanity check] by nqgl in 29413
* Fix: Disable torch.autocast in RotaryEmbedding of Gemma and LLaMa for MPS device by currybab in 29439
* Enable BLIP for auto VQA by regisss in 29499
* v4.39 deprecations 🧼 by gante in 29492
* Revert "Automatic safetensors conversion when lacking these files by LysandreJik in 2…
* fix: Avoid error when fsdp_config is missing xla_fsdp_v2 by ashokponkumar in 29480
* Flava multimodal add attention mask by zucchini-nlp in 29446
* test_generation_config_is_loaded_with_model - fall back to pytorch model for now by amyeroberts in 29521
* Set `inputs` as kwarg in `TextClassificationPipeline` by alvarobartt in 29495
* Fix `VisionEncoderDecoder` Positional Arg by nickthegroot in 29497
* Generate: left-padding test, revisited by gante in 29515
* [tests] add the missing `require_sacremoses` decorator by faaany in 29504
* fix image-to-text batch incorrect output issue by sywangyi in 29342
* Typo fix in error message by clefourrier in 29535
* [tests] use `torch_device` instead of `auto` for model testing by faaany in 29531
* StableLM: Fix dropout argument type error by liangjs in 29236
* Make sliding window size inclusive in eager attention by jonatanklosko in 29519
* fix typos in FSDP config parsing logic in `TrainingArguments` by yundai424 in 29189
* Fix WhisperNoSpeechDetection when input is full silence by ylacombe in 29065
* [tests] use the correct `n_gpu` in `TrainerIntegrationTest::test_train_and_eval_dataloaders` for XPU by faaany in 29307
* Fix eval thread fork bomb by muellerzr in 29538
* feat: use `warning_advice` for tensorflow warning by winstxnhdw in 29540
* [`Mamba doc`] Post merge updates by ArthurZucker in 29472
* [`Docs`] fixed minor typo by j-gc in 29555
* Add Fill-in-the-middle training objective example - PyTorch by tanaymeh in 27464
* Bark model Flash Attention 2 Enabling to pass on check_device_map parameter to super() by damithsenanayake in 29357
* Make torch xla available on GPU by yitongh in 29334
* [Docs] Fix FastSpeech2Conformer model doc links by khipp in 29574
* Don't use a subset in test fetcher if on `main` branch by ydshieh in 28816
* fix error: TypeError: Object of type Tensor is not JSON serializable … by yuanzhoulvpi2017 in 29568
* Add missing localized READMEs to the copies check by khipp in 29575
* Fixed broken link by amritgupta98 in 29558
* Tiny improvement for doc by fzyzcjy in 29581
* Fix Fuyu doc typos by zucchini-nlp in 29601
* Fix minor typo: softare => software by DriesVerachtert in 29602
* Stop passing None to compile() in TF examples by Rocketknight1 in 29597
* Fix typo (determine) by koayon in 29606
* Implemented add_pooling_layer arg to TFBertModel by tomigee in 29603
* Update legacy Repository usage in various example files by Hvanderwilk in 29085
* Set env var to hold Keras at Keras 2 by Rocketknight1 in 29598
* Update flava tests by ydshieh in 29611
* Fix typo ; Update quantization.md by furkanakkurt1335 in 29615
* Add tests for batching support by zucchini-nlp in 29297
* Fix: handle logging of scalars in Weights & Biases summary by parambharat in 29612
* Examples: check `max_position_embeddings` in the translation example by gante in 29600
* [`Gemma`] Supports converting directly in half-precision by younesbelkada in 29529
* [Flash Attention 2] Add flash attention 2 for GPT-J by bytebarde in 28295
* Core: Fix copies on main by younesbelkada in 29624
* [Whisper] Deprecate forced ids for v4.39 by sanchit-gandhi in 29485
* Warn about tool use by LysandreJik in 29628
* Adds pretrained IDs directly in the tests by LysandreJik in 29534
* [generate] deprecate forced ids processor by sanchit-gandhi in 29487
* Fix minor typo: infenrece => inference by DriesVerachtert in 29621
* [`MaskFormer`, `Mask2Former`] Use einsum where possible by amyeroberts in 29544
* Llama: allow custom 4d masks by gante in 29618
* [PyTorch/XLA] Fix extra TPU compilations introduced by recent changes by alanwaketan in 29158
* [docs] Spanish translate chat_templating.md & yml addition by njackman-2344 in 29559
* Add support for FSDP+QLoRA and DeepSpeed ZeRO3+QLoRA by pacman100 in 29587
* [`Mask2Former`] Move normalization for numerical stability by amyeroberts in 29542
* [tests] make `test_trainer_log_level_replica` to run on accelerators with more than 2 devices by faaany in 29609
* Refactor TFP call to just sigmoid() by Rocketknight1 in 29641
* Fix batching tests for new models (Mamba and SegGPT) by zucchini-nlp in 29633
* Fix `multi_gpu_data_parallel_forward` for `MusicgenTest` by ydshieh in 29632
* [docs] Remove broken ChatML format link from chat_templating.md by aaronjimv in 29643
* Add newly added PVTv2 model to all README files. by robinverduijn in 29647
* [`PEFT`] Fix `save_pretrained` to make sure adapters weights are also saved on TPU by shub-kris in 29388
* Fix TPU checkpointing inside Trainer by shub-kris in 29657
* Add `dataset_revision` argument to `RagConfig` by ydshieh in 29610
* Fix PVT v2 tests by ydshieh in 29660
* Generate: handle `cache_position` update in `generate` by gante in 29467
* Allow apply_chat_template to pass kwargs to the template and support a dict of templates by Rocketknight1 in 29658
* Inaccurate code example within inline code-documentation by MysteryManav in 29661
* Extend import utils to cover "editable" torch versions by bhack in 29000
* Trainer: fail early in the presence of an unsavable `generation_config` by gante in 29675
* Pipeline: use tokenizer pad token at generation time if the model pad token is unset. by gante in 29614
* [tests] remove deprecated tests for model loading by faaany in 29450
* Fix AutoformerForPrediction example code by m-torhan in 29639
* [tests] ensure device-required software is available in the testing environment before testing by faaany in 29477
* Fix wrong condition used in `filter_models` by ydshieh in 29673
* fix: typos by testwill in 29653
* Rename `glue` to `nyu-mll/glue` by lhoestq in 29679
* Generate: replace breaks by a loop condition by gante in 29662
* [FIX] Fix speech2test modeling tests by ylacombe in 29672
* Revert "Fix wrong condition used in `filter_models`" by ydshieh in 29682
* [docs] Spanish translation of attention.md by aaronjimv in 29681
* CI / generate: batch size computation compatible with all models by gante in 29671
* Fix `filter_models` by ydshieh in 29710
* FIX [`bnb`] Make `unexpected_keys` optional by younesbelkada in 29420
* Update the pipeline tutorial to include `gradio.Interface.from_pipeline` by abidlabs in 29684
* Use logging.warning instead of warnings.warn in pipeline.__call__ by tokestermw in 29717

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* windsonsea
* [i18n-zh] Translated torchscript.md into Chinese (29234)
* [i18n-zh] Translated task/asr.md into Chinese (29233)
* [i18n-zh] Translate fsdp.md into Chinese (29305)
* [i18n-zh] Sync source/zh/index.md (29331)
* [i18n-zh] Translate add_new_pipeline.md into Chinese (29432)
* hoangsvit
* [i18n-vi] Translate README.md to Vietnamese (29229)
* EduardoPach
* Fixed Deformable Detr typo when loading cuda kernels for MSDA (29294)
* Adding SegGPT (27735)
* RaymondLi0
* Starcoder2 model - bis (29215)
* njackman-2344
* [Docs] Spanish Translation -Torchscript md & Trainer md (29310)
* [docs] Spanish translate chat_templating.md & yml addition (29559)
* tanaymeh
* Add Fill-in-the-middle training objective example - PyTorch (27464)
* Hvanderwilk
* Update legacy Repository usage in various example files (29085)
* FoamoftheSea
* Add PvT-v2 Model (26812)
* saurabhdash2512
* Cohere Model Release (29622)

Phi-1 and Phi-1.5

The Phi-1 model was proposed in [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li.

The Phi-1.5 model was proposed in [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463) by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.

* Add Phi-1 and Phi-1_5 by susnato in 26170
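
A minimal usage sketch; the checkpoint id below is an assumption, and any Phi checkpoint on the Hub is loaded the same way:

python
from transformers import AutoModelForCausalLM, AutoTokenizer

# "microsoft/phi-1_5" is an assumed Hub id, shown for illustration
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))
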

TVP

The text-visual prompting (TVP) framework was proposed in the paper [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2303.04995) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.

This research addresses temporal video grounding (TVG), which is the process of pinpointing the start and end times of specific events in a long video, as described by a text sentence. Text-visual prompting (TVP) is proposed to enhance TVG. TVP involves integrating specially designed patterns, known as ‘prompts’, into both the visual (image-based) and textual (word-based) input components of a TVG model. These prompts provide additional spatial-temporal context, improving the model’s ability to accurately determine event timings in the video. The approach employs 2D visual inputs in place of 3D ones: although 3D inputs offer more spatial-temporal detail, they are also more time-consuming to process, so the use of 2D inputs with the prompting method aims to provide similar levels of context and accuracy more efficiently.

* TVP model by jiqing-feng in 25856

DINOv2 depth estimation

Depth estimation is added to the DINOv2 implementation.

* Add DINOv2 depth estimation by NielsRogge in 26092
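
As a hedged sketch of how the new depth head can be used (the checkpoint id is an assumption; other `dpt-dinov2` variants follow the same pattern):

python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, DPTForDepthEstimation

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# assumed checkpoint id, shown for illustration
processor = AutoImageProcessor.from_pretrained("facebook/dpt-dinov2-small-kitti")
model = DPTForDepthEstimation.from_pretrained("facebook/dpt-dinov2-small-kitti")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth  # (batch, height, width) depth map
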

ROCm support for AMD GPUs

AMD GPUs running on the ROCm software stack are [now supported across the board](https://huggingface.co/blog/huggingface-and-optimum-amd) and fully tested in our CI with MI210/MI250 GPUs. We further enable specific hardware acceleration for ROCm in Transformers, such as [Flash Attention 2](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2), [GPTQ quantization](https://huggingface.co/docs/transformers/quantization#autogptq) and DeepSpeed.

* Add RoCm scheduled CI & upgrade RoCm CI to PyTorch 2.1 by fxmarty in 26940
* Flash Attention 2 support for RoCm by fxmarty in 27611
* Reflect RoCm support in the documentation by fxmarty in 27636
* restructure AMD scheduled CI by ydshieh in 27743

PyTorch `scaled_dot_product_attention` native support

PyTorch's [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) operator is now supported [in the most-used Transformers models](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention) and **used by default with `torch>=2.1.1`**, allowing dispatch to [memory-efficient attention and Flash Attention](https://pytorch.org/blog/accelerating-large-language-models/) backend implementations with no package other than `torch` required. This should [significantly speed up](https://pytorch.org/blog/out-of-the-box-acceleration/) attention computation on hardware that supports these fast paths.

While Transformers automatically handles the dispatch to use SDPA when available, it is possible to force the usage of a given attention implementation (`"eager"` being the manual implementation, where each operation is implemented [step by step](https://github.com/huggingface/transformers/blob/9f18cc6df0b7e0d50f78b9e9fcb3aafa7b5160fe/src/transformers/models/llama/modeling_llama.py#L413-L431)):
python
# or attn_implementation="sdpa", or attn_implementation="flash_attention_2"
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", attn_implementation="eager")


**[Training benchmark](https://gist.github.com/fxmarty/7e75cc3942d6974e4849093ebea0a331), run on A100-SXM4-80GB.**

| Model | Batch size | Sequence length | Time per batch (`"eager"`, s) | Time per batch (`"sdpa"`, s) | **Speedup** | Peak memory (`"eager"`, MB) | Peak memory (`"sdpa"`, MB) | **Memory savings** |
|-----------|------------|-----------------|-------------------------------|------------------------------|-------------|-----------------------------|----------------------------|-----------------------|
| llama2 7b | 4 | 1024 | 1.065 | 0.90 | **19.4%** | 73878.28 | 45977.81 | **60.7%** |
| llama2 7b | 4 | 2048 | OOM | 1.87 | / | OOM | 78394.58 | **SDPA does not OOM** |
| llama2 7b | 1 | 2048 | 0.64 | 0.48 | **32.0%** | 55557.01 | 29795.63 | **86.4%** |
| llama2 7b | 1 | 3072 | OOM | 0.75 | / | OOM | 37916.08 | **SDPA does not OOM** |
| llama2 7b | 1 | 4096 | OOM | 1.03 | / | OOM | 46028.14 | **SDPA does not OOM** |
| llama2 7b | 2 | 4096 | OOM | 2.05 | / | OOM | 78428.14 | **SDPA does not OOM** |

**[Inference benchmark](https://gist.github.com/fxmarty/5113e4304fbdd38c9c3702ce44683f6a), run on A100-SXM4-80GB.**

| Model | Batch size | Prompt length | Num new tokens | Per token latency `"eager"` (ms) | Per token latency `"sdpa"` (ms) | **Speedup** |
|------------------|------------|---------------|----------------|----------------------------------|---------------------------------|-------------|
| llama2 13b | 1 | 1024 | 1 (prefill) | 178.66 | 159.36 | **12.11%** |
| llama2 13b | 1 | 100 | 100 | 40.35 | 37.62 | **7.28%** |
| llama2 13b | 8 | 100 | 100 | 40.55 | 38.06 | **6.53%** |
| Whisper v3 large | 1 | / | 62 | 20.05 | 18.90 | **6.10%** |
| Whisper v3 large | 8 | / | 77 | 25.42 | 24.77 | **2.59%** |
| Whisper v3 large | 16 | / | 77 | 28.51 | 26.32 | **8.34%** |

* F.scaled_dot_product_attention support by fxmarty in 26572

New Cache abstraction & Attention Sinks support

We are rolling out a new abstraction for the `past_key_values` cache, which enables the use of different types of caches. For now, only `llama` and `llama`-inspired architectures (`mistral`, `persimmon`, `phi`) support it, with other architectures scheduled to have support in the next release. By default, a growing cache (`DynamicCache`) is used, which preserves the existing behavior.

This release also includes a new `SinkCache` cache, which implements the [Attention Sinks paper](https://arxiv.org/abs/2309.17453). With `SinkCache`, the model is able to continue generating high-quality text well beyond its training sequence length! Note that it does not expand the context window, so it can’t digest very long inputs — it is suited for streaming applications such as multi-round dialogues. Check this [colab](https://colab.research.google.com/drive/1S0oIPaqxAVp0oWEwTadhZXDjhWiTyF12?usp=sharing) for an example.
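
A minimal sketch of streaming generation with `SinkCache` (the checkpoint id is an assumption; any of the supported architectures works the same way):

python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

# assumed checkpoint id, shown for illustration
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", device_map="auto")

inputs = tokenizer("Tell me a story.", return_tensors="pt").to(model.device)
# keep the first 4 "sink" tokens plus a sliding window over the most recent tokens
cache = SinkCache(window_length=1024, num_sink_tokens=4)
outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
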

![image](https://github.com/huggingface/transformers/assets/12240844/c6fd4077-b884-4d9c-be55-324474a1cc76)

* Generate: New `Cache` abstraction and Attention Sinks support by tomaarsen in 26681
* Generate: SinkCache can handle iterative prompts by gante in 27907

Safetensors as a default

We continue to roll out features that make safetensors the default serialization format across the board, in PyTorch, Flax, and TensorFlow.
When using a PyTorch model and forcing the load of `safetensors` files with `use_safetensors=True`, if the repository does not contain the safetensors files, they will now be converted on the fly server-side.

* Default to msgpack for safetensors by LysandreJik in 27460
* Fix `from_pt` flag when loading with `safetensors` by LysandreJik in 27394
* Make using safetensors files automated. by Narsil in 27571
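
A minimal sketch of the behavior (assuming the target repository hosts only a `pytorch_model.bin`):

python
from transformers import AutoModel

# requesting safetensors from a repo that only has pickle weights now triggers
# a one-time server-side conversion instead of raising an error
model = AutoModel.from_pretrained("bert-base-uncased", use_safetensors=True)
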

Breaking changes

pickle files

We now disallow the use of `pickle.load` internally for security purposes. To circumvent this, you can set the `TRUST_REMOTE_CODE=True` environment variable to indicate that you would still like to load these files.

* 🚨🚨🚨 Disallow `pickle.load` unless `TRUST_REMOTE_CODE=True` by ydshieh in 27776
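
As a sketch, assuming the flag is read from the environment when the file is loaded:

python
import os

# only opt in for checkpoints you fully trust: this re-enables `pickle.load`
os.environ["TRUST_REMOTE_CODE"] = "True"
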


Beam score calculation for decoder-only models

In the previous implementation of beam search, when `length_penalty` was active, the beam score for decoder-only models was penalized by the total length of both the prompt and the generated sequence. However, the length of the prompt should not be included in the penalization step -- this release fixes it.
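
A simplified sketch of the change (illustrative only; the function name and arguments are assumptions, not the actual internals):

python
def beam_score(sum_logprobs: float, prompt_len: int, gen_len: int,
               length_penalty: float, fixed: bool = True) -> float:
    # before the fix, decoder-only models penalized by prompt + generated length;
    # after the fix, only the generated length enters the penalization
    effective_len = gen_len if fixed else prompt_len + gen_len
    return sum_logprobs / (effective_len ** length_penalty)
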

* 🚨🚨 Fix beam score calculation issue for decoder-only models by VsonicV in 27351
* Fix remaining issues in beam score calculation by VsonicV in 27808
* Fix beam score calculation issue for Tensorflow version by VsonicV in 27814
* Fix beam score calculation issue for JAX version by VsonicV in 27816

Slight API changes/corrections

* ⚠️ [VitDet] Fix test by NielsRogge in 27832
* [⚠️ removed a default argument] Make `AttentionMaskConverter` compatible with `torch.compile(..., fullgraph=True)` by fxmarty in 27868

Bugfixes and improvements

* Enrich TTS pipeline parameters naming by ylacombe in 26473
* translate peft.md to chinese by jiaqiw09 in 27215
* Removed the redundant SiLUActivation class. by hi-sushanta in 27136
* Fixed base model class name extraction from PeftModels by kkteru in 27162
* Fuyu protection by LysandreJik in 27248
* Refactor: Use Llama RoPE implementation for Falcon by tomaarsen in 26933
* [`PEFT` / `Tests` ] Fix peft integration failing tests by younesbelkada in 27258
* Avoid many failing tests in doctesting by ydshieh in 27262
* [docs] Custom model doc update by MKhalusova in 27213
* Update the ConversationalPipeline docstring for chat templates by Rocketknight1 in 27250
* Fix switch transformer mixed precision issue by timlee0212 in 27220
* [`Docs` / `SAM` ] Reflect correct changes to run inference without OOM by younesbelkada in 27268
* [Docs] Model_doc structure/clarity improvements by MKhalusova in 26876
* [`FA2`] Add flash attention for for `DistilBert` by susnato in 26489
* translate autoclass_tutorial to chinese by jiaqiw09 in 27269
* translate run_scripts.md to chinese by jiaqiw09 in 27246
* Fix tokenizer export for LLamaTokenizerFast by mayank31398 in 27222
* Fix daily CI image build by ydshieh in 27307
* Update doctest workflow file by ydshieh in 27306
* Remove an unexpected argument for FlaxResNetBasicLayerCollection by pingzhili in 27272
* enable memory tracker metrics for npu by statelesshz in 27280
* [`PretrainedTokenizer`] add some of the most important functions to the doc by ArthurZucker in 27313
* Update sequence_classification.md by akshayvkt in 27281
* Fix VideoMAEforPretrained dtype error by ikergarcia1996 in 27296
* Fix `Kosmos2Processor` batch mode by ydshieh in 27323
* [docs] fixed links with 404 by MKhalusova in 27327
* [Whisper] Block language/task args for English-only by sanchit-gandhi in 27322
* Fix autoawq docker image by younesbelkada in 27339
* Generate: skip tests on unsupported models instead of passing by gante in 27265
* Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function by zuazo in 26834
* [`FA2`] Add flash attention for `GPT-Neo` by susnato in 26486
* [`Whisper`] Add conversion script for the tokenizer by ArthurZucker in 27338
* Remove a redundant variable. by hi-sushanta in 27288
* Resolve AttributeError by utilizing device calculation at the start of the forward function by folbaeni in 27347
* Remove padding_masks from `gpt_bigcode`. by susnato in 27348
* [`Whisper`] Nit converting the tokenizer by ArthurZucker in 27349
* FIx Bark batching feature by ylacombe in 27271
* Allow scheduler parameters by Plemeur in 26480
* translate the en tokenizer_summary.md to Chinese by ZouJiu1 in 27291
* translate model_sharing.md and llm_tutorial.md to chinese by jiaqiw09 in 27283
* Add numpy alternative to FE using torchaudio by ylacombe in 26339
* moving example of benchmarking to legacy dir by statelesshz in 27337
* Fix example tests from failing by muellerzr in 27353
* Fix `Kosmos-2` device issue by ydshieh in 27346
* MusicGen Update by sanchit-gandhi in 27084
* Translate index.md to Turkish by mertyyanik in 27093
* Remove unused param from example script tests by muellerzr in 27354
* [Flax Whisper] large-v3 compatibility by sanchit-gandhi in 27360
* Fix tiny model script: not using `from_pt=True` by ydshieh in 27372
* translate big_models.md and performance.md to chinese by jiaqiw09 in 27334
* Add Flash Attention 2 support to Bark by ylacombe in 27364
* Update deprecated `torch.range` in `test_modeling_ibert.py` by kit1980 in 27355
* translate debugging.md to chinese by jiaqiw09 in 27374
* Smangrul/fix failing ds ci tests by pacman100 in 27358
* [`CodeLlamaTokenizer`] Nit, update __init__ to make sure the AddedTokens are not normalized because they are special by ArthurZucker in 27359
* Change thresh in test by muellerzr in 27378
* Put doctest options back to `pyproject.toml` by ydshieh in 27366
* Skip failing cache call tests by amyeroberts in 27393
* device-agnostic deepspeed testing by statelesshz in 27342
* Adds dvclive callback by dberenbaum in 27352
* use `pytest.mark` directly by ydshieh in 27390
* Fix fuyu checkpoint repo in `FuyuConfig` by ydshieh in 27399
* Use editable install for git deps by muellerzr in 27404
* Final fix of the accelerate installation issue by ydshieh in 27408
* Fix RequestCounter to make it more future-proof by Wauplin in 27406
* remove failing tests and clean FE files by ylacombe in 27414
* Fix `Owlv2` checkpoint name and a default value in `Owlv2VisionConfig` by ydshieh in 27402
* Run all tests if `circleci/create_circleci_config.py` is modified by ydshieh in 27413
* add attention_mask and position_ids in assisted model by jiqing-feng in 26892
* [`Quantization`] Add str to enum conversion for AWQ by younesbelkada in 27320
* update Bark FA2 docs by ylacombe in 27400
* [`AttentionMaskConverter`] ]Fix-mask-inf by ArthurZucker in 27114
* At most 2 GPUs for CI by ydshieh in 27435
* Normalize floating point cast by amyeroberts in 27249
* Make `examples_torch_job` faster by ydshieh in 27437
* Fix line ending in `utils/not_doctested.txt` by ydshieh in 27459
* Fix some Wav2Vec2 related models' doctest by ydshieh in 27462
* Fixed typo in error message by cmcmaster1 in 27461
* Remove-auth-token by ArthurZucker in 27060
* [`Llama + Mistral`] Add attention dropout by ArthurZucker in 27315
* OWLv2: bug fix in post_process_object_detection() when using cuda device by assafbot in 27468
* Fix docstring for `gradient_checkpointing_kwargs` by tomaszcichy98 in 27470
* Install `python-Levenshtein` for `nougat` in CI image by ydshieh in 27465
* Add version check for Jinja by Rocketknight1 in 27403
* Fix Falcon tokenizer loading in pipeline by Rocketknight1 in 27316
* [`AWQ` ] Addresses TODO for awq tests by younesbelkada in 27467
* Perf torch compile by jiaqiw09 in 27422
* Fixed typo in pipelines.md documentation by adismort14 in 27455
* Fix FA2 import + deprecation cycle by SunMarc in 27330
* [`Peft`] `modules_to_save` support for peft integration by younesbelkada in 27466
* [`CI-test_torch`] skip `test_tf_from_pt_safetensors` for 4 models by ArthurZucker in 27481
* Fix M4T weights tying by ylacombe in 27395
* Add speecht5 batch generation and fix wrong attention mask when padding by Spycsh in 25943
* Clap processor: remove wasteful np.stack operations by m-bain in 27454
* [Whisper] Fix pipeline test by sanchit-gandhi in 27442
* Revert "[time series] Add PatchTST by amyeroberts in 25927)"
* translate hpo_train.md and perf_hardware.md to chinese by jiaqiw09 in 27431
* Generate: fix `ExponentialDecayLengthPenalty` doctest by gante in 27485
* Update and reorder docs for chat templates by Rocketknight1 in 27443
* Generate: `GenerationConfig.from_pretrained` can return unused kwargs by gante in 27488
* Minor type annotation fix by vwxyzjn in 27276
* Have seq2seq just use gather by muellerzr in 27025
* Update processor mapping for hub snippets by amyeroberts in 27477
* Track the number of tokens seen to metrics by muellerzr in 27274
* [`CI-test_torch`] skip test_tf_from_pt_safetensors and `test_assisted_decoding_sample` by ArthurZucker in 27508
* [Fuyu] Add tests by NielsRogge in 27001
* [Table Transformer] Add Transformers-native checkpoints by NielsRogge in 26928
* Update spelling mistake by LimJing7 in 27506
* [`CircleCI`] skip test_assisted_decoding_sample for everyone by ArthurZucker in 27511
* Make some jobs run on the GitHub Actions runners by ydshieh in 27512
* [`tokenizers`] update `tokenizers` version pin by ArthurZucker in 27494
* [ `PretrainedConfig`] Improve messaging by ArthurZucker in 27438
* Fix wav2vec2 params by muellerzr in 27515
* Translating `en/model_doc` docs to Japanese. by Yuki-Imajuku in 27401
* Fixing the failure of models without max_position_embeddings attribute. by AdamLouly in 27499
* Incorrect setting for num_beams in translation and summarization examples by Rocketknight1 in 27519
* Fix bug for T5x to PyTorch convert script with varying encoder and decoder layers by JamesJiang97 in 27448
* Fix offload disk for loading derivated model checkpoint into base model by SunMarc in 27253
* translate model.md to chinese by statelesshz in 27518
* Support ONNX export for causal LM sequence classifiers by dwyatte in 27450
* [`pytest`] Avoid flash attn test marker warning by ArthurZucker in 27509
* docs: add docs for map, and add num procs to load_dataset by pphuc25 in 27520
* Update the TF pin for 2.15 by Rocketknight1 in 27375
* Revert "add attention_mask and position_ids in assisted model" by patrickvonplaten in 27523
* Set `usedforsecurity=False` in hashlib methods (FIPS compliance) by Wauplin in 27483
* Raise error when quantizing a quantized model by SunMarc in 27500
* Disable docker image build job `latest-pytorch-amd` for now by ydshieh in 27541
* [`Styling`] stylify using ruff by ArthurZucker in 27144
* Generate: improve assisted generation tests by gante in 27540
* Updated albert.md doc for ALBERT model by ENate in 27223
* translate Trainer.md to chinese by jiaqiw09 in 27527
* Skip some fuyu tests by ydshieh in 27553
* Fix AMD CI not showing GPU by ydshieh in 27555
* Generate: fix flaky tests by gante in 27543
* Generate: update compute transition scores doctest by gante in 27558
* fixed broken link by VpkPrasanna in 27560
* Broken links fixed related to datasets docs by VpkPrasanna in 27569
* translate deepspeed.md to chinese by jiaqiw09 in 27495
* Fix broken distilbert url by osanseviero in 27579
* Adding leaky relu in dict ACT2CLS by rafaelpadilla in 27574
* Fix idx2sym not loaded from pretrained vocab file in Transformer XL by jtang98 in 27589
* Add `convert_hf_to_openai.py` script to Whisper documentation resources by zuazo in 27590
* docs: fix 404 link by panpan0000 in 27529
* [ examples] fix loading jsonl with load dataset in run translation example by mathiasesn in 26924
* [`FA-2`] Add fa2 support for `from_config` by younesbelkada in 26914
* timm to pytorch conversion for vit model fix by staghado in 26908
* [Whisper] Add `large-v3` version support by flyingleafe in 27336
* Update Korean tutorial for using LLMs, and refactor the nested conditional statements in hr_argparser.py by YeonwooSung in 27489
* Fix torch.fx import issue for torch 1.12 by amyeroberts in 27570
* dvclive callback: warn instead of fail when logging non-scalars by dberenbaum in 27608
* [`core` / `gradient_checkpointing`] add support for old GC method by younesbelkada in 27610
* [ConvNext] Improve backbone by NielsRogge in 27621
* Generate: Update docs regarding reusing `past_key_values` in `generate` by gante in 27612
* Idefics: Fix information leak with cross attention gate in modeling by leot13 in 26839
* Fix flash attention bugs with Mistral and Falcon by fxmarty in 27625
* Fix tracing dinov2 by amyeroberts in 27561
* remove the deprecated method `init_git_repo` by statelesshz in 27617
* Explicitely specify `use_cache=True` in Flash Attention tests by fxmarty in 27635
* Harmonize HF environment variables + other cleaning by Wauplin in 27564
* Fix `resize_token_embeddings` by czy-orange in 26861
* [`dependency`] update pillow pins by ArthurZucker in 27409
* Simplify the implementation of jitter noise in moe models by jiangwangyi in 27643
* Fix `max_steps` documentation regarding the end-of-training condition by qgallouedec in 27624
* [Whisper] Add sequential longform decoding by patrickvonplaten in 27492
* Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration by dg845 in 24799
* update Openai API call method by Strive-for-excellence in 27628
* update d_kv'annotation in mt5'configuration by callanwu in 27585
* [`FA2`] Add flash attention for opt by susnato in 26414
* Extended semantic segmentation to image segmentation by merveenoyan in 27039
* Update TVP arxiv link by amyeroberts in 27672
* [DPT, Dinov2] Add resources by NielsRogge in 27655
* Update tiny model summary file by ydshieh in 27388
* Refactoring Trainer, adds `save_only_model` arg and simplifying FSDP integration by pacman100 in 27652
* Skip pipeline tests for 2 models for now by ydshieh in 27687
* Deprecate `TransfoXL` by ydshieh in 27607
* Fix typo in warning message by liuxueyang in 27055
* Docs/Add conversion code to the musicgen docs by yoinked-h in 27665
* Fix semantic error in evaluation section by anihm136 in 27675
* [`DocString`] Support a revision in the docstring `add_code_sample_docstrings` to facilitate integrations by ArthurZucker in 27645
* Successfully Resolved The ZeroDivisionError Exception. by hi-sushanta in 27524
* Fix `TVPModelTest` by ydshieh in 27695
* Fix sliding_window hasattr in Mistral by IlyaGusev in 27041
* Fix Past CI by ydshieh in 27696
* fix warning by ArthurZucker in 27689
* Reorder the code on the Hub to explicit that sharing on the Hub isn't a requirement by LysandreJik in 27691
* Fix mistral generate for long prompt / response by lorabit110 in 27548
* Fix oneformer instance segmentation RuntimeError by yhshin11 in 27725
* fix assisted decoding assistant model inputs by jiqing-feng in 27503
* Update forward signature test for vision models by NielsRogge in 27681
* Modify group_sub_entities in TokenClassification Pipeline to support label with "-" by eshoyuan in 27325
* Fix owlv2 code snippet by NielsRogge in 27698
* docs: replace torch.distributed.run by torchrun by panpan0000 in 27528
* Update chat template warnings/guides by Rocketknight1 in 27634
* translation main-class files to chinese by jiaqiw09 in 27588
* Translate `en/model_doc` to JP by rajveer43 in 27264
* Fixed passing scheduler-specific kwargs via TrainingArguments lr_scheduler_kwargs by CharbelAD in 27595
* Fix AMD Push CI not triggered by ydshieh in 27732
* Add BeitBackbone by NielsRogge in 25952
* Update tiny model creation script by ydshieh in 27674
* Log a warning in `TransfoXLTokenizer.__init__` by ydshieh in 27721
* Add madlad-400 MT models by jbochi in 27471
* Enforce pin memory disabling when using cpu only by qgallouedec in 27745
* Trigger corresponding pipeline tests if `tests/utils/tiny_model_summary.json` is modified by ydshieh in 27693
* CLVP Fixes by susnato in 27547
* Docs: Fix broken cross-references, i.e. `~transformer.` -> `~transformers.` by tomaarsen in 27740
* [docs] Quantization by stevhliu in 27641
* Fix precision errors from casting rotary parameters to FP16 with AMP by kevinhu in 27700
* Remove `check_runner_status.yml` by ydshieh in 27767
* uses dvclive_test mode in examples/pytorch/test_accelerate_examples.py by dberenbaum in 27763
* Generate: `GenerationConfig` throws an exception when `generate` args are passed by gante in 27757
* Fix unsupported setting of self._n_gpu in training_args on XPU devices by Liangliang-Ma in 27716
* [SeamlessM4Tv2] Fix links in README by xenova in 27782
* [i18n-fr] Translate installation to French by NoB0 in 27657
* Fixes for PatchTST Config by wgifford in 27777
* Better error message for bitsandbytes import by SunMarc in 27764
* [MusicGen] Fix audio channel attribute by sanchit-gandhi in 27440
* [JAX] Replace uses of jax.devices("cpu") with jax.local_devices(backend="cpu") by hvaara in 27593
* Improve forward signature test by NielsRogge in 27729
* Fix typo in max_length deprecation warnings by siegeln in 27788
* Add `persistent_workers` parameter to `TrainingArguments` by Sorrow321 in 27189
* [`ModelOnTheFlyConversionTester`] Mark as slow for now by ArthurZucker in 27823
* Fix `TvpModelIntegrationTests` by ydshieh in 27792
* Fix `Owlv2ModelIntegrationTest::test_inference_object_detection` by ydshieh in 27793
* Keypoints 0.0 are confusing ../transformers/models/detr/image_processing_detr.py which are fixed by hackpk in 26250
* [Seamless v1] Link to v2 docs by sanchit-gandhi in 27827
* [Whisper] Fix doctest in timestamp logits processor by sanchit-gandhi in 27795
* Added test cases for rembert refering to albert and reformer test_tok… by nileshkokane01 in 27637
* [Hot-Fix][XLA] Re-enable broken _tpu_save for XLATensors by yeounoh in 27799
* single word should be set to False by ArthurZucker in 27738
* [Seamless v2] Add FE to auto mapping by sanchit-gandhi in 27829
* translate internal folder files to chinese by jiaqiw09 in 27638
* Translate `en/tasks` folder docs to Japanese 🇯🇵 by rajveer43 in 27098
* pin `ruff==0.1.5` by ydshieh in 27849
* Make image processors more general by NielsRogge in 27690
* Faster generation using AWQ + Fused modules by younesbelkada in 27411
* Generate: Update VisionEncoderDecoder test value by gante in 27850
* [`ClipVision`] `accelerate` support for clip-vision by younesbelkada in 27851
* Add Llama Flax Implementation by vvvm23 in 24587
* Move tensors to same device to enable IDEFICS naive MP training by willemsenbram in 27746
* Update `VitDetModelTester.get_config` to use `pretrain_image_size` by ydshieh in 27831
* fix(whisper): mutable generation config by badayvedat in 27833
* Documentation: Spanish translation of perplexity.mdx by aaronjimv in 27807
* [`Docs`] Update broken image on fused modules by younesbelkada in 27856
* Update CUDA versions for DeepSpeed by muellerzr in 27853
* removed the delete doc workflows by MKhalusova in 27852
* Avoid class attribute `_keep_in_fp32_modules` being modified by ydshieh in 27867
* [`Flash Attention 2`] Add flash attention 2 for GPT-Neo-X by younesbelkada in 26463
* Translating en/model_doc folder docs to Japanese(from `blip` to `clap`) 🇯🇵 by rajveer43 in 27673
* Fix bug of _prepare_4d_attention_mask by jiqing-feng in 27847
* [i18n-fr] Translate autoclass tutorial to French by NoB0 in 27659
* [`FA-2`] Add Flash Attention to `Phi` by susnato in 27661
* fix: fix gradient accumulate step for learning rate by pphuc25 in 27667
* Allow `# Ignore copy` by ydshieh in 27328
* update `create_model_card` to properly save peft details when using Trainer with PEFT by pacman100 in 27754
* update version of warning notification for `get_default_device` to v4.38 by statelesshz in 27848
* Fix device of masks in tests by fxmarty in 27887
* Show new failing tests in a more clear way in slack report by ydshieh in 27881
* Fix TF loading PT safetensors when weights are tied by Rocketknight1 in 27490
* Generate: All logits processors are documented and have examples by gante in 27796
* [docs] Custom semantic segmentation dataset by stevhliu in 27859
* Updates the distributed CPU training documentation to add instructions for running on a Kubernetes cluster by dmsuehir in 27780
* Translate `model_doc` files from `clip` to `cpm` to JP by rajveer43 in 27774
* Fix: Raise informative exception when `prefix_allowed_tokens_fn` return empty set of tokens by Saibo-creator in 27797
* Added passing parameters to "reduce_lr_on_plateau" scheduler by CharbelAD in 27860
* fix: non-atomic checkpoint save by thundergolfer in 27820
* Fix CLAP converting script by ylacombe in 27153
* mark `test_initialization` as flaky in 2 model tests by ydshieh in 27906
* Fix `notification_service.py` by ydshieh in 27903
* Fix 2 tests in `FillMaskPipelineTests` by ydshieh in 27889
* Llama conversion script: adjustments for Llama Guard by pcuenca in 27910
* fix llava by ArthurZucker in 27909
* Allow `resume_from_checkpoint` to handle `auto_find_batch_size` by muellerzr in 27568
* [Doc] Spanish translation of pad_truncation.md by aaronjimv in 27890
* fix typo in image_processing_blip.py Wwhether -> Whether by zhc7 in 27899
* [CLAP] Replace hard-coded batch size to enable dynamic ONNX export by xenova in 27790
* [integration] Update Ray Tune integration for Ray 2.7 by justinvyu in 26499
* Fix typo by f4hy in 27918
* [DETA] fix backbone freeze/unfreeze function by SangbumChoi in 27843

Significant community contributions

The following contributors have made significant changes to the library over the last release:

* jiaqiw09
* translate peft.md to chinese (27215)
* translate autoclass_tutorial to chinese (27269)
* translate run_scripts.md to chinese (27246)
* translate model_sharing.md and llm_tutorial.md to chinese (27283)
* translate big_models.md and performance.md to chinese (27334)
* translate debugging.md to chinese (27374)
* Perf torch compile (27422)
* translate hpo_train.md and perf_hardware.md to chinese (27431)
* translate Trainer.md to chinese (27527)
* translate deepspeed.md to chinese (27495)
* translation main-class files to chinese (27588)
* translate internal folder files to chinese (27638)
* susnato
* [`FA2`] Add flash attention for for `DistilBert` (26489)
* [`FA2`] Add flash attention for `GPT-Neo` (26486)
* Remove padding_masks from `gpt_bigcode`. (27348)
* Add CLVP (24745)
* Add Phi-1 and Phi-1_5 (26170)
* [`FA2`] Add flash attention for opt (26414)
* CLVP Fixes (27547)
* [`FA-2`] Add Flash Attention to `Phi` (27661)
* jiqing-feng
* add attention_mask and position_ids in assisted model (26892)
* TVP model (25856)
* fix assisted decoding assistant model inputs (27503)
* Fix bug of _prepare_4d_attention_mask (27847)
* psinthong
* [time series] Add PatchTST (25927)
* Yuki-Imajuku
* Translating `en/model_doc` docs to Japanese. (27401)
* dg845
* Add UnivNet Vocoder Model for Tortoise TTS Diffusers Integration (24799)
* rajveer43
* Translate `en/model_doc` to JP (27264)
* Translate `en/tasks` folder docs to Japanese 🇯🇵 (27098)
* Translating en/model_doc folder docs to Japanese(from `blip` to `clap`) 🇯🇵 (27673)
* Translate `model_doc` files from `clip` to `cpm` to JP (27774)
* NoB0
* [i18n-fr] Translate installation to French (27657)
* [i18n-fr] Translate autoclass tutorial to French (27659)
* ajati
* [Time series] Add PatchTSMixer (26247)
* vvvm23
* Add Llama Flax Implementation (24587)

1.2.0

New model architecture: DistilBERT

Huggingface's new transformer architecture, **DistilBERT**, described in [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5) by Victor Sanh, Lysandre Debut and Thomas Wolf.

This new model architecture comes with two pretrained checkpoints:

- `distilbert-base-uncased`: the base DistilBert model
- `distilbert-base-uncased-distilled-squad`: DistilBert model fine-tuned with distillation on SQuAD.

New GPT2 checkpoint: GPT-2 large (774M parameters)

The third OpenAI GPT-2 checkpoint is available in the library: 774M parameters, 36 layers, and 20 heads.

New XLM multilingual checkpoints: 17 & 100 languages

We have added two new [XLM models in 17 and 100 languages](https://github.com/facebookresearch/XLM#pretrained-cross-lingual-language-models) which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.

Back on `torch.hub` with all the architectures

The Pytorch-Transformers `torch.hub` interface is based on Auto-Models, which are generic classes designed to be instantiated using `from_pretrained()` with a model architecture guessed from the pretrained checkpoint name (e.g. `AutoModel.from_pretrained('bert-base-uncased')` will instantiate a `BertModel` and load the 'bert-base-uncased' checkpoint into it). There are currently 4 classes of Auto-Models: `AutoModel`, `AutoModelWithLMHead`, `AutoModelForSequenceClassification` and `AutoModelForQuestionAnswering`.
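
A minimal sketch of the `torch.hub` entry points (names follow the hubconf of that release; treat them as assumptions if you are on a different version):

python
import torch

# other entry points of the era: 'modelWithLMHead', 'modelForSequenceClassification',
# 'modelForQuestionAnswering', 'tokenizer', 'config'
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-uncased')
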

New dependency: `sacremoses`

Support for XLM is improved by carefully reproducing the original tokenization workflow (work by shijie-wu in 1092). We now rely on [`sacremoses`](https://github.com/alvations/sacremoses), a Python port of the Moses tokenizer, truecaser and normalizer by alvations, for XLM word tokenization.

In a few languages (Thai, Japanese and Chinese) the XLM tokenizer requires additional dependencies. These additional dependencies are optional at the library level; using the XLM tokenizer in these languages without them will raise an error message with installation instructions. The additional optional dependencies are:
- pythainlp: Thai tokenizer
- kytea: Japanese tokenizer, a wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
- jieba: Chinese tokenizer *

\* XLM used the Stanford Segmenter. However, the wrapper (nltk.tokenize.stanford_segmenter) is slow due to JVM overhead, and it will be deprecated. Jieba is a lot faster and pip-installable, but there is some mismatch with the Stanford Segmenter. A workaround could be an argument allowing users to segment the sentence themselves and bypass the segmenter; for reference, nltk.tokenize.stanford_segmenter is also included in the PR.

Bug fixes and improvements to the library modules

- Bertology script has seen major improvements (tuvuumass )
- Iterative tokenization is now faster and accepts arbitrary numbers of added tokens (samvelyan)
- Added RoBERTa to AutoModels and AutoTokenizers (LysandreJik )
- Added GPT-2 Large 774M model (thomwolf )
- Added language model fine-tuning with GPT/GPT-2 (CLM), BERT/RoBERTa (MLM) (LysandreJik thomwolf )
- Multi-GPU training has been patched (FeiWang96 )
- Scripts are updated to reflect Pytorch 1.1.0 changes (scheduler, optimizer) (Morizeyao, adai183 )
- Updated the in-depth BERT fine-tuning scripts to `pytorch-transformers` (Morizeyao )
- Models saved with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (LysandreJik thomwolf)
- Add `proxies` and `force_download` options to `from_pretrained()` method to be able to use proxies and update cached models/tokenizers (thomwolf)
- Add shortcut to each special tokens with `_id` properties (e.g. `tokenizer.cls_token_id` for the id in the vocabulary of `tokenizer.cls_token`) (thomwolf)
- Fix GPT2 and RoBERTa tokenizer so that sentences to be tokenized always begin with at least one space (see note by [fairseq authors](https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py#L38-L56)) (thomwolf)
- Fix and clean up byte-level BPE tests (thomwolf)
- Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against common tests. (LysandreJik )
- Fix a warning raised when the decode method is called for a model with no `sep_token` like GPT-2 (LysandreJik )
- Updated the tokenizers saving method (boy2000-007man)
- SpaCy tokenizers have been updated in the tokenizers (GuillemGSubies )
- Stable `EnvironmentErrors` have been added to utility files (abhishekraok )
- Fixed distributed barrier hang (VictorSanh )
- Encoding functions now return the input tokens instead of throwing an error when not implemented in child class (LysandreJik )
- Change layer norm code to PyTorch's native layer norm (dhpollack)
- Improved tokenization for XLM for multilingual inputs (shijie-wu)
- Add language input and access to language to id conversion in XLM tokenizer (thomwolf)
- Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (thomwolf)
- Added new AutoModels: `AutoModelWithLMHead`, `AutoModelForSequenceClassification`, `AutoModelForQuestionAnswering` (LysandreJik)
- Torch.hub is now based on AutoModels (LysandreJik thomwolf)
- Fix Transformer-XL attention mask dtype to be bool (CrafterKolyan)
- Adding DistilBert model architecture and checkpoints (VictorSanh LysandreJik thomwolf)
- Fixes to DistilBert configuration and training script (stefan-it)
- Fix XLNet attention mask for fp16 (ziliwang)
- Documentation auto-deploy (LysandreJik)
- Fix to add a tuple of tokens (epwalsh)
- Update fp16 apex implementation in scripts (anhnt170489)
- Fix XLNet bias resizing when adding/removing tokens (LysandreJik)
- Fix tokenizer reloading in example scripts (rabeehk)
- Fix byte-level decoding error when using added tokens (thomwolf LysandreJik)
- Fix epsilon value in RoBERTa pretrained checkpoints (julien-c)
