Model releases
Qwen2
Qwen2 is the new series of large language models from the Qwen team, following the previously released Qwen series (Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.).
Qwen2 is a series of decoder-only language models available in several sizes; for each size, we release both the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped-query attention, and a mixture of sliding-window and full attention. Additionally, it ships an improved tokenizer that adapts to multiple natural languages and to code.
* Add qwen2 by JustinLin610 in 28436
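A minimal usage sketch for a chat variant (the checkpoint name below is illustrative; substitute any released Qwen2-compatible checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint name -- swap in any Qwen2-compatible checkpoint
checkpoint = "Qwen/Qwen1.5-0.5B-Chat"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Chat models are prompted through the tokenizer's chat template
messages = [{"role": "user", "content": "Briefly explain grouped-query attention."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```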
Phi-2
Phi-2 is a transformer language model trained by Microsoft with exceptionally strong performance for its small size of 2.7 billion parameters. It was previously available as a custom code model, but has now been fully integrated into transformers.
* [Phi2] Add support for phi2 models by susnato in 28211
* [Phi] Extend implementation to use GQA/MQA. by gugarosa in 28163
* update docs to add the `phi-2` example by susnato in 28392
* Fixes default value of `softmax_scale` in `PhiFlashAttention2`. by gugarosa in 28537
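Since Phi-2 is now natively integrated, it can be loaded without `trust_remote_code`. A minimal sketch using the `microsoft/phi-2` checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Phi-2 is notably strong at code completion for its size
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```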
SigLIP
The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. SigLIP replaces the loss function used in CLIP with a simple pairwise sigmoid loss, which results in better zero-shot classification accuracy on ImageNet.
* Add SigLIP by NielsRogge in 26522
* [SigLIP] Don't pad by default by NielsRogge in 28578
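A minimal zero-shot classification sketch; note that because the processor no longer pads by default (see the PR above), `padding="max_length"` should be passed explicitly to match how the model was trained:

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Pad explicitly to max_length, matching the training setup
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Sigmoid (not softmax) gives independent per-text probabilities
probs = torch.sigmoid(outputs.logits_per_image)
print(probs)
```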
ViP-LLaVA
The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.
VipLlava enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.
* Adds VIP-llava to transformers by younesbelkada in 27932
* Fix Vip-llava docs by younesbelkada in 28085
FastSpeech2Conformer
The FastSpeech2Conformer model was proposed with the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.
FastSpeech 2 is a non-autoregressive model for text-to-speech (TTS) synthesis that builds upon FastSpeech, improving training speed, inference speed, and voice quality. It consists of a variance adapter (duration, energy, and pitch predictors) and a waveform and mel-spectrogram decoder.
* Add FastSpeech2Conformer by connor-henderson in 23439
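A minimal end-to-end TTS sketch combining the acoustic model with its HiFi-GAN vocoder (checkpoint names as released on the Hub):

```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")

inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")

# Non-autoregressive: the full waveform is produced in a single forward pass
output = model(inputs["input_ids"], return_dict=True)
waveform = output["waveform"]
```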
Wav2Vec2-BERT
The Wav2Vec2-BERT model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires fine-tuning for downstream tasks such as automatic speech recognition (ASR) or audio classification.
* Add new meta w2v2-conformer BERT-like model by ylacombe in 28165
* Add w2v2bert to pipeline by ylacombe in 28585
4-bit serialization
Enables saving and loading transformers models in 4-bit format: you can now push bitsandbytes 4-bit weights to the Hugging Face Hub. To save 4-bit models and push them to the Hub, install the latest `bitsandbytes` package from PyPI (`pip install -U bitsandbytes`), load your model in 4-bit precision, and call `save_pretrained` / `push_to_hub`. An example repo is available [here](https://huggingface.co/ybelkada/Mixtral-8x7B-Instruct-v0.1-bnb-4bit).
```python
from transformers import AutoModelForCausalLM

model_id = "facebook/opt-125m"

# Load the model in 4-bit precision via bitsandbytes
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

# The 4-bit weights can now be serialized and pushed to the Hub
model.push_to_hub("ybelkada/opt-125m-bnb-4bit")
```
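Loading the serialized 4-bit weights back is a plain `from_pretrained` call, since the quantization config travels with the checkpoint:

```python
from transformers import AutoModelForCausalLM

# Reload the 4-bit checkpoint pushed above; no extra quantization arguments needed
model = AutoModelForCausalLM.from_pretrained("ybelkada/opt-125m-bnb-4bit", device_map="auto")
```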
* [bnb] Let's make serialization of 4bit models possible by poedator in 26037
* [`Docs`] Add 4-bit serialization docs by younesbelkada in 28182
4D Attention mask
Enable passing 4D attention masks to models that support them. This is useful for reducing the memory footprint of certain generation tasks; see the sketch after the PR below.
* 4D `attention_mask` support by poedator in 27539
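A minimal sketch, assuming a Llama-family model with 4D-mask support (the checkpoint is illustrative). The mask follows the standard additive convention: `0.0` where attention is allowed and the dtype minimum where it is masked:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative Llama-architecture checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", attn_implementation="eager"
)

input_ids = torch.tensor([[1, 450, 4996, 17354, 1701, 29916]])
seq_len = input_ids.shape[1]

# Custom 4D additive mask of shape (batch, heads-or-1, query_len, kv_len)
upper = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
mask_4d = torch.zeros(1, 1, seq_len, seq_len)
mask_4d.masked_fill_(upper, torch.finfo(mask_4d.dtype).min)

outputs = model(input_ids, attention_mask=mask_4d)  # the 4D mask is used as-is
print(outputs.logits.shape)
```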
Improved quantization support
You can now customise which modules are quantized and which are left in their original precision.
* [`Awq`] Enable the possibility to skip quantization for some target modules by younesbelkada in 27950
* add `modules_in_block_to_quantize` arg in GPTQconfig by SunMarc in 27956
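A short sketch of the two new knobs (the module names depend on the architecture and are illustrative here):

```python
from transformers import AwqConfig, GPTQConfig

# AWQ: skip quantization for selected modules (e.g. Mixtral's gate layers)
awq_config = AwqConfig(modules_to_not_convert=["gate"])

# GPTQ: quantize only an explicit subset of modules in each block;
# each inner list is quantized together, group after group
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    modules_in_block_to_quantize=[
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.o_proj"],
    ],
)
```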
Added fused modules support
* [docs] Fused AWQ modules by stevhliu in 27896
* [`Awq`] Add llava fused modules support by younesbelkada in 28239
* [`Mixtral` / `Awq`] Add mixtral fused modules for Awq by younesbelkada in 28240
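Fused modules are enabled through the quantization config; for example (the AWQ checkpoint below is illustrative):

```python
from transformers import AutoModelForCausalLM, AwqConfig

# do_fuse swaps attention/MLP modules for fused kernels;
# fuse_max_seq_len bounds the sequence length the fused attention can handle
quantization_config = AwqConfig(bits=4, do_fuse=True, fuse_max_seq_len=512)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",  # illustrative AWQ checkpoint
    quantization_config=quantization_config,
)
```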
SDPA Support for LLaVa, Mixtral, Mistral
* Fix SDPA correctness following torch==2.1.2 regression by fxmarty in 27973
* [`Llava` / `Vip-Llava`] Add SDPA into llava by younesbelkada in 28107
* [`Mixtral` & `Mistral`] Add support for sdpa by ArthurZucker in 28133
* [SDPA] Make sure attn mask creation is always done on CPU by patrickvonplaten in 28400
* Fix SDPA tests by fxmarty in 28552
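SDPA is opted into through `attn_implementation` at load time, for example:

```python
import torch
from transformers import AutoModelForCausalLM

# Use PyTorch's scaled_dot_product_attention kernels instead of the eager path
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)
```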
Whisper: Batched state-of-the-art long-form transcription
All decoding strategies (temperature fallback, compression/log-prob/no-speech thresholds, ...) of OpenAI's long-form transcription (see https://github.com/openai/whisper or section 4.5 of the paper) have been added. Unlike the original OpenAI implementation, Transformers' long-form transcription is fully compatible with pure FP16 and batching!
For more information see: https://github.com/huggingface/transformers/pull/27658.
* [Whisper] Finalize batched SOTA long-form generation by patrickvonplaten in 27658
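A minimal batched long-form sketch; the fallback parameter names follow the PR linked above, and dummy waveforms stand in for real >30s audio:

```python
import numpy as np
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Two dummy 40-second waveforms at 16 kHz (substitute real audio)
audio_batch = [np.random.randn(16_000 * 40) for _ in range(2)]
inputs = processor(
    audio_batch, sampling_rate=16_000, return_tensors="pt",
    truncation=False, padding="longest", return_attention_mask=True,
)

# OpenAI-style temperature fallback with thresholds, now batched
generated = model.generate(
    **inputs,
    condition_on_prev_tokens=False,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold=-1.0,
    compression_ratio_threshold=1.35,
    return_timestamps=True,
)
print(processor.batch_decode(generated, skip_special_tokens=True))
```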
Generation: assisted generation upgrades, speculative decoding, and ngram speculation
[Assisted generation](https://huggingface.co/blog/assisted-generation) was reworked to accept arbitrary sources of candidate sequences. This enabled us to smoothly integrate [ngram speculation](https://twitter.com/joao_gante/status/1747322413006643259), and opens the door for new candidate generation methods. Additionally, we've added the [speculative decoding](https://arxiv.org/abs/2211.17192) strategy on top of assisted generation: when you call assisted generation with an assistant model and `do_sample=True`, you'll benefit from the faster speculative decoding sampling 🏎️💨
* Generate: `assisted_decoding` now accepts arbitrary candidate generators by gante in 27751
* Generate: assisted decoding now uses `generate` for the assistant by gante in 28031
* Generate: speculative decoding by gante in 27979
* Generate: fix speculative decoding by gante in 28166
* Adding Prompt lookup decoding by apoorvumang in 27775
* Fix _speculative_sampling implementation by ofirzaf in 28508
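Both paths are exposed through `generate`; a short sketch using OPT models as the main/assistant pair:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
assistant = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # smaller, same tokenizer

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Speculative decoding: the assistant drafts tokens, the main model verifies them
outputs = model.generate(**inputs, assistant_model=assistant, do_sample=True, max_new_tokens=32)

# Ngram (prompt lookup) speculation: candidates come from the prompt itself, no assistant needed
outputs = model.generate(**inputs, prompt_lookup_num_tokens=3, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```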
torch.load pickle protection
`torch.load` calls now pass `weights_only=True`, which restricts unpickling to tensor data and protects against arbitrary code execution from maliciously crafted checkpoints.
* make torch.load a bit safer by julien-c in 27282
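The same flag can be used directly when loading untrusted checkpoints yourself (the file name below is a placeholder):

```python
import torch

# weights_only=True limits unpickling to tensors and primitive containers,
# so a malicious pickle cannot run arbitrary code on load
state_dict = torch.load("pytorch_model.bin", weights_only=True)
```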
Build methods for TensorFlow Models
Unlike PyTorch, TensorFlow models build their weights "lazily" after model initialization, using the shape of their inputs to figure out what their weight shapes should be. We previously needed a full forward pass through TF models to ensure that all layers received an input they could use to build their weights, but with this change we now have proper `build()` methods that can correctly infer shapes and build model weights. This avoids a whole range of potential issues and significantly accelerates model load times.
* Proper build() methods for TF by Rocketknight1 in 27794
* Replace build() with build_in_name_scope() for some TF tests by Rocketknight1 in 28046
* More TF fixes by Rocketknight1 in 28081
* Even more TF test fixes by Rocketknight1 in 28146
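The underlying pattern is the standard Keras one: weights are created in `build()` from the input shape, so no forward pass is required. A toy illustration of that pattern (not the transformers implementation itself):

```python
import tensorflow as tf

class TinyDense(tf.keras.layers.Layer):
    """Toy layer showing lazy weight creation in build()."""

    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        # Weight shapes are inferred from the input shape -- no forward pass needed
        self.kernel = self.add_weight("kernel", shape=(input_shape[-1], self.units))
        super().build(input_shape)

    def call(self, inputs):
        return tf.matmul(inputs, self.kernel)

layer = TinyDense(4)
layer.build((None, 8))  # weights exist now, built directly from a shape
print(layer.kernel.shape)  # (8, 4)
```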