Model releases
Qwen2
Qwen2 is the new series of large language models from the Qwen team, following the previously released Qwen series (Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.).
Qwen2 is a series of decoder-only language models available in several sizes; for each size, we release both the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped-query attention, and a mixture of sliding-window and full attention. Additionally, it ships an improved tokenizer that adapts to multiple natural languages and to code.
* Add qwen2 by JustinLin610 in 28436
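A minimal usage sketch for a chat variant (the checkpoint name below is illustrative; substitute any released Qwen2-compatible checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint name -- swap in any Qwen2-compatible checkpoint
checkpoint = "Qwen/Qwen1.5-0.5B-Chat"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Chat models are prompted through the tokenizer's chat template
messages = [{"role": "user", "content": "Briefly explain grouped-query attention."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```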
Phi-2
Phi-2 is a transformer language model trained by Microsoft with exceptionally strong performance for its small size of 2.7 billion parameters. It was previously available as a custom code model, but has now been fully integrated into transformers.
* [Phi2] Add support for phi2 models by susnato in 28211
* [Phi] Extend implementation to use GQA/MQA. by gugarosa in 28163
* update docs to add the `phi-2` example by susnato in 28392
* Fixes default value of `softmax_scale` in `PhiFlashAttention2`. by gugarosa in 28537
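Since Phi-2 is now natively integrated, it can be loaded without `trust_remote_code`. A minimal sketch using the `microsoft/phi-2` checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Phi-2 is notably strong at code completion for its size
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```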
SigLIP
The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. SigLIP replaces the loss function used in CLIP with a simple pairwise sigmoid loss, which results in better zero-shot classification accuracy on ImageNet.
* Add SigLIP by NielsRogge in 26522
* [SigLIP] Don't pad by default by NielsRogge in 28578
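A minimal zero-shot classification sketch; note that because the processor no longer pads by default (see the PR above), `padding="max_length"` should be passed explicitly to match how the model was trained:

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Pad explicitly to max_length, matching the training setup
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid (not softmax) gives independent per-text probabilities
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```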
ViP-LLaVA
The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
VipLlava enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.
* Adds VIP-llava to transformers by younesbelkada in 27932
* Fix Vip-llava docs by younesbelkada in 28085
FastSpeech2Conformer
The FastSpeech2Conformer model was proposed with the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.
FastSpeech 2 is a non-autoregressive model for text-to-speech (TTS) synthesis that builds upon FastSpeech, improving training speed, inference speed, and voice quality. It consists of a variance adapter (duration, energy, and pitch predictors) and a waveform and mel-spectrogram decoder.
* Add FastSpeech2Conformer by connor-henderson in 23439
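A minimal end-to-end TTS sketch combining the acoustic model with its HiFi-GAN vocoder (checkpoint names as released on the Hub):

```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")

inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")

# Non-autoregressive: the full waveform is produced in a single forward pass
output = model(inputs["input_ids"], return_dict=True)
waveform = output["waveform"]
```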
Wav2Vec2-BERT
The Wav2Vec2-BERT model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires fine-tuning for downstream tasks such as automatic speech recognition (ASR) or audio classification.
* Add new meta w2v2-conformer BERT-like model by ylacombe in 28165
* Add w2v2bert to pipeline by ylacombe in 28585
4-bit serialization
Enables saving and loading transformers models in 4-bit format: you can now push bitsandbytes 4-bit weights to the Hugging Face Hub. To save 4-bit models and push them to the Hub, install the latest `bitsandbytes` package from PyPI (`pip install -U bitsandbytes`), load your model in 4-bit precision, and call `save_pretrained` / `push_to_hub`. An example repo is available [here](https://huggingface.co/ybelkada/Mixtral-8x7B-Instruct-v0.1-bnb-4bit).
```python
from transformers import AutoModelForCausalLM

model_id = "facebook/opt-125m"

# Load the model in 4-bit precision via bitsandbytes
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

# The 4-bit weights can now be serialized and pushed to the Hub
model.push_to_hub("ybelkada/opt-125m-bnb-4bit")
```
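Loading the serialized 4-bit weights back is a plain `from_pretrained` call, since the quantization config travels with the checkpoint:

```python
from transformers import AutoModelForCausalLM

# Reload the 4-bit checkpoint pushed above; no extra quantization arguments needed
model = AutoModelForCausalLM.from_pretrained("ybelkada/opt-125m-bnb-4bit", device_map="auto")
```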
* [bnb] Let's make serialization of 4bit models possible by poedator in 26037
* [`Docs`] Add 4-bit serialization docs by younesbelkada in 28182
4D Attention mask
Enable passing 4D attention masks to models that support them. This is useful for reducing the memory footprint of certain generation tasks; see the sketch after the PR below.
* 4D `attention_mask` support by poedator in 27539
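A minimal sketch, assuming a Llama-family model with 4D-mask support (the checkpoint is illustrative). The mask follows the standard additive convention: `0.0` where attention is allowed and the dtype minimum where it is masked:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative Llama-architecture checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", attn_implementation="eager"
)

input_ids = torch.tensor([[1, 450, 4996, 17354, 1701, 29916]])
seq_len = input_ids.shape[1]

# Custom 4D additive mask of shape (batch, heads-or-1, query_len, kv_len)
upper = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
mask_4d = torch.zeros(1, 1, seq_len, seq_len)
mask_4d.masked_fill_(upper, torch.finfo(mask_4d.dtype).min)

outputs = model(input_ids, attention_mask=mask_4d)  # the 4D mask is used as-is
print(outputs.logits.shape)
```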
Improved quantization support
You can now customise which modules are quantized and which are left in their original precision.
* [`Awq`] Enable the possibility to skip quantization for some target modules by younesbelkada in 27950
* add `modules_in_block_to_quantize` arg in GPTQconfig by SunMarc in 27956
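A short sketch of the two new knobs (the module names depend on the architecture and are illustrative here):

```python
from transformers import AwqConfig, GPTQConfig

# AWQ: skip quantization for selected modules (e.g. Mixtral's gate layers)
awq_config = AwqConfig(modules_to_not_convert=["gate"])

# GPTQ: quantize only an explicit subset of modules in each block;
# each inner list is quantized together, group after group
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    modules_in_block_to_quantize=[
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.o_proj"],
    ],
)
```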
Added fused modules support
* [docs] Fused AWQ modules by stevhliu in 27896
* [`Awq`] Add llava fused modules support by younesbelkada in 28239
* [`Mixtral` / `Awq`] Add mixtral fused modules for Awq by younesbelkada in 28240
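Fused modules are enabled through the quantization config; for example (the AWQ checkpoint below is illustrative):

```python
from transformers import AutoModelForCausalLM, AwqConfig

# do_fuse swaps attention/MLP modules for fused kernels;
# fuse_max_seq_len bounds the sequence length the fused attention can handle
quantization_config = AwqConfig(bits=4, do_fuse=True, fuse_max_seq_len=512)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",  # illustrative AWQ checkpoint
    quantization_config=quantization_config,
)
```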
SDPA Support for LLaVa, Mixtral, Mistral
* Fix SDPA correctness following torch==2.1.2 regression by fxmarty in 27973
* [`Llava` / `Vip-Llava`] Add SDPA into llava by younesbelkada in 28107
* [`Mixtral` & `Mistral`] Add support for sdpa by ArthurZucker in 28133
* [SDPA] Make sure attn mask creation is always done on CPU by patrickvonplaten in 28400
* Fix SDPA tests by fxmarty in 28552
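SDPA is opted into through `attn_implementation` at load time, for example:

```python
import torch
from transformers import AutoModelForCausalLM

# Use PyTorch's scaled_dot_product_attention kernels instead of the eager path
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```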
Whisper: Batched state-of-the-art long-form transcription
All decoding strategies (temperature fallback, compression/log-prob/no-speech thresholds, ...) of OpenAI's long-form transcription (see https://github.com/openai/whisper or section 4.5 of the paper) have been added. Unlike the original OpenAI implementation, Transformers' long-form transcription is fully compatible with pure FP16 and batching!
For more information see: https://github.com/huggingface/transformers/pull/27658.
* [Whisper] Finalize batched SOTA long-form generation by patrickvonplaten in 27658
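A minimal batched long-form sketch; the fallback parameter names follow the PR linked above, and dummy waveforms stand in for real >30s audio:

```python
import numpy as np
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Two dummy 40-second waveforms at 16 kHz (substitute real audio)
audio_batch = [np.random.randn(16_000 * 40) for _ in range(2)]
inputs = processor(
    audio_batch, sampling_rate=16_000, return_tensors="pt",
    truncation=False, padding="longest", return_attention_mask=True,
)

# OpenAI-style temperature fallback with thresholds, now batched
generated = model.generate(
    **inputs,
    condition_on_prev_tokens=False,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold=-1.0,
    compression_ratio_threshold=1.35,
    return_timestamps=True,
)
print(processor.batch_decode(generated, skip_special_tokens=True))
```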
Generation: assisted generation upgrades, speculative decoding, and ngram speculation
[Assisted generation](https://huggingface.co/blog/assisted-generation) was reworked to accept arbitrary sources of candidate sequences. This enabled us to smoothly integrate [ngram speculation](https://twitter.com/joao_gante/status/1747322413006643259), and opens the door for new candidate generation methods. Additionally, we've added the [speculative decoding](https://arxiv.org/abs/2211.17192) strategy on top of assisted generation: when you call assisted generation with an assistant model and `do_sample=True`, you'll benefit from the faster speculative decoding sampling 🏎️💨
* Generate: `assisted_decoding` now accepts arbitrary candidate generators by gante in 27751
* Generate: assisted decoding now uses `generate` for the assistant by gante in 28031
* Generate: speculative decoding by gante in 27979
* Generate: fix speculative decoding by gante in 28166
* Adding Prompt lookup decoding by apoorvumang in 27775
* Fix _speculative_sampling implementation by ofirzaf in 28508
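Both paths are exposed through `generate`; a short sketch using OPT models as the main/assistant pair:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
assistant = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # smaller, same tokenizer

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Speculative decoding: the assistant drafts tokens, the main model verifies them
outputs = model.generate(**inputs, assistant_model=assistant, do_sample=True, max_new_tokens=32)

# Ngram (prompt lookup) speculation: candidates come from the prompt itself, no assistant needed
outputs = model.generate(**inputs, prompt_lookup_num_tokens=3, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```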
torch.load pickle protection
`torch.load` calls now pass `weights_only=True`, which restricts unpickling to tensor data and protects against arbitrary code execution from maliciously crafted checkpoints.
* make torch.load a bit safer by julien-c in 27282
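The same flag can be used directly when loading untrusted checkpoints yourself (the file name below is a placeholder):

```python
import torch

# weights_only=True limits unpickling to tensors and primitive containers,
# so a malicious pickle cannot run arbitrary code on load
state_dict = torch.load("pytorch_model.bin", weights_only=True)
```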
Build methods for TensorFlow Models
Unlike PyTorch, TensorFlow models build their weights "lazily" after model initialization, using the shape of their inputs to figure out what their weight shapes should be. We previously needed a full forward pass through TF models to ensure that all layers received an input they could use to build their weights, but with this change we now have proper `build()` methods that can correctly infer shapes and build model weights. This avoids a whole range of potential issues and significantly accelerates model load times.
* Proper build() methods for TF by Rocketknight1 in 27794
* Replace build() with build_in_name_scope() for some TF tests by Rocketknight1 in 28046
* More TF fixes by Rocketknight1 in 28081
* Even more TF test fixes by Rocketknight1 in 28146
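The underlying pattern is the standard Keras one: weights are created in `build()` from the input shape, so no forward pass is required. A toy illustration of that pattern (not the transformers implementation itself):

```python
import tensorflow as tf

class TinyDense(tf.keras.layers.Layer):
    """Toy layer showing lazy weight creation in build()."""

    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        # Weight shapes are inferred from the input shape -- no forward pass needed
        self.kernel = self.add_weight("kernel", shape=(input_shape[-1], self.units))
        super().build(input_shape)

    def call(self, inputs):
        return tf.matmul(inputs, self.kernel)

layer = TinyDense(4)
layer.build((None, 8))  # weights exist now, built directly from a shape
print(layer.kernel.shape)  # (8, 4)
```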