Torchaudio

Latest version: v2.3.0


Page 6 of 15

2.1

Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for the detail of the new FFmpeg integration mechanism.
1. Update to libsox integration
TorchAudio now depends on libsox installed separately from torchaudio. The SoX I/O backend no longer supports file-like objects. (These are supported by the FFmpeg and soundfile backends.)
Please refer to https://pytorch.org/audio/2.1/installation.html#optional-dependencies for the detail.

New Features

I/O
- Support overwriting PTS in `torchaudio.io.StreamWriter` (3135)
- Include format information after filter `torchaudio.io.StreamReader.get_out_stream_info` (3155)
- Support CUDA frame in `torchaudio.io.StreamReader` filter graph (3183, 3479)
- Support YUV444P in GPU decoder (3199)
- Add additional filter graph processing to `torchaudio.io.StreamWriter` (3194)
- Cache and reuse HW device context in GPU decoder (3178)
- Cache and reuse HW device context in GPU encoder (3215)
- Support changing the number of channels in `torchaudio.io.StreamReader` (3216)
- Support encode spec change in `torchaudio.io.StreamWriter` (3207)
- Support encode options such as compression rate and bit rate (3179, 3203, 3224)
- Add `420p10le` support to `torchaudio.io.StreamReader` CPU decoder (3332)
- Support multiple FFmpeg versions (3464, 3476)
- Support writing opus and mp3 with soundfile (3554)
- Add switch to disable sox integration and ffmpeg integration at runtime (3500)

Ops
- Add `torchaudio.io.AudioEffector` (3163, 3372, 3374)
- Add `torchaudio.transforms.SpecAugment` (3309, 3314)
- Add `torchaudio.functional.forced_align` (3348, 3355, 3533, 3536, 3354, 3365, 3433, 3357)
- Add `torchaudio.functional.merge_tokens` (3535, 3614)
- Add `torchaudio.functional.frechet_distance` (3545)

Models
- Add `torchaudio.models.SquimObjective` for speech enhancement (3042, 3087, 3512)
- Add `torchaudio.models.SquimSubjective` for speech enhancement (3189)
- Add `torchaudio.models.decoder.CUCTCDecoder` (3096)

Pipelines
- Add `torchaudio.pipelines.SquimObjectiveBundle` for speech enhancement (3103)
- Add `torchaudio.pipelines.SquimSubjectiveBundle` for speech enhancement (3197)
- Add `torchaudio.pipelines.MMS_FA` Bundle for forced alignment (3521, 3538)

Tutorials
- Add tutorial for `torchaudio.io.AudioEffector` (3226)
- Add tutorials for CTC forced alignment API (3356, 3443, 3529, 3534, 3542, 3546, 3566)
- Add tutorial for `torchaudio.models.decoder.CUCTCDecoder` (3297)
- Add tutorial for real-time av-asr (3511)
- Add tutorial for TorchAudio-SQUIM pipelines (3279, 3313)
- Split HW acceleration tutorial into nvdec/nvenc tutorials (3483, 3478)

Recipe
- Add TCPGen context-biasing Conformer RNN-T (2890)
- Add AV-ASR recipe (3278, 3421, 3441, 3489, 3493, 3498, 3492, 3532)
- Add multi-channel DNN beamforming training recipe (3036)

Backward-incompatible changes

Third-party libraries

In this release, the following third-party libraries are removed from TorchAudio binary distributions. TorchAudio now searches for and links these libraries at runtime. Please install them to use the corresponding APIs.

SoX

`libsox` is used for various audio I/O and filtering operations.

Pre-built binaries are available via package managers, such as `conda`, `apt` and `brew`. Please refer to the respective documentation.

The APIs affected include:

- `torchaudio.load` ("sox" backend)
- `torchaudio.info` ("sox" backend)
- `torchaudio.save` ("sox" backend)
- `torchaudio.sox_effects.apply_effects_tensor`
- `torchaudio.sox_effects.apply_effects_file`
- `torchaudio.functional.apply_codec` (also deprecated, see below)

Changes related to the removal: 3232, 3246, 3497, 3035

Flashlight Text

`flashlight-text` is the core of the CTC decoder.

Pre-built packages are available on PyPI. Please refer to https://github.com/flashlight/text for details.

The APIs affected include:

- `torchaudio.models.decoder.CTCDecoder`

Changes related to the removal: 3232, 3246, 3236, 3339
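
Because these libraries are now located at runtime, it can help to check whether they are discoverable before calling the affected APIs. The following is a minimal pre-flight sketch using only the Python standard library. It does not reproduce how TorchAudio itself searches for the libraries, the helper name `find_optional_dependencies` is hypothetical, and it assumes that `flashlight-text` installs a top-level package named `flashlight`.

```python
import ctypes.util
import importlib.util


def find_optional_dependencies():
    """Report whether libsox and flashlight-text look discoverable on this machine."""
    return {
        # find_library returns a library name or path, or None when nothing is found.
        "libsox": ctypes.util.find_library("sox") is not None,
        # Assumption: flashlight-text installs a top-level package named "flashlight".
        "flashlight-text": importlib.util.find_spec("flashlight") is not None,
    }


if __name__ == "__main__":
    for name, found in find_optional_dependencies().items():
        print(f"{name}: {'found' if found else 'not found'}")
```

Note that `ctypes.util.find_library` only consults the system's standard search locations, so a library installed into a non-default prefix may be reported as missing even though it can be loaded.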

Kaldi

A custom-built `libkaldi` was used to implement `torchaudio.functional.compute_kaldi_pitch`. This function, along with the libkaldi integration, is removed in this release. There is no replacement.

Changes related to the removal: 3368, 3403

I/O
- Switch to the backend dispatcher (3241)

To make I/O operations more flexible, TorchAudio introduced the backend dispatcher in v2.0, and users could opt-in to use the dispatcher.
In this release, the backend dispatcher becomes the default mechanism for selecting the I/O backend.

You can pass the `backend` argument to the `torchaudio.info`, `torchaudio.load` and `torchaudio.save` functions to select the I/O backend library on a per-call basis. (If it is omitted, an available backend is selected automatically.)

If you want to use the global backend mechanism, you can set the environment variable `TORCHAUDIO_USE_BACKEND_DISPATCHER=0`.
Please note, however, that the global backend mechanism is deprecated and is going to be removed in the next release.

Please see 2950 for details of the migration work.
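
The per-call selection described above can be pictured with a small, library-free sketch. This is not TorchAudio's implementation: the function name `select_backend` and the priority order after FFmpeg are assumptions made for illustration, and the notes only state that an available backend is chosen automatically when `backend` is omitted.

```python
def select_backend(available, backend=None, priority=("ffmpeg", "sox", "soundfile")):
    """Pick an I/O backend name.

    `available` is the set of backends usable on this machine.
    `backend` mirrors the per-call `backend` argument; None means "choose for me".
    The default `priority` order is an assumption for this sketch.
    """
    if backend is not None:
        if backend not in available:
            raise RuntimeError(f"Requested backend {backend!r} is not available")
        return backend
    # No explicit request: fall back to the first available backend in priority order.
    for name in priority:
        if name in available:
            return name
    raise RuntimeError("No I/O backend is available")


print(select_backend({"sox", "soundfile"}))              # sox
print(select_backend({"ffmpeg", "sox"}, backend="sox"))  # sox
```

An explicit `backend` wins over the priority order, and an unavailable request fails loudly instead of silently falling back.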

- Remove Tensor binding from StreamReader (3093, 3272)

Previously, `torchaudio.io.StreamReader` accepted a byte string wrapped in a 1D `torch.Tensor` object. This is no longer supported.
Please wrap the underlying data with `io.BytesIO` instead.

- Make I/O optional arguments kw-only (3208, 3227)

The optional arguments of `add_[audio|video]_stream` methods of `torchaudio.io.StreamReader` and `torchaudio.io.StreamWriter` are now keyword-only arguments.
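
To illustrate what keyword-only means for calling code, here is a small sketch using a hypothetical stand-in function. The parameter names below are illustrative and are not the full signature of the real `add_audio_stream` method. Only the bare `*` in the signature matters.

```python
def add_audio_stream(frames_per_chunk, *, stream_index=None, filter_desc=None):
    """Hypothetical stand-in; parameters after the bare `*` are keyword-only."""
    return {
        "frames_per_chunk": frames_per_chunk,
        "stream_index": stream_index,
        "filter_desc": filter_desc,
    }


# Passing optional arguments by keyword works.
config = add_audio_stream(4096, stream_index=0)

# Passing them positionally raises TypeError.
try:
    add_audio_stream(4096, 0)
except TypeError as err:
    print(f"TypeError: {err}")
```

Code that passed optional arguments positionally needs to be updated to name them explicitly.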

- Drop the support of FFmpeg < 4.1 (3561, 3557)

Previously TorchAudio supported FFmpeg 4 (>=4.1, <=4.4). In this release, TorchAudio supports FFmpeg 4, 5 and 6 (>=4.4, <7). With this change, support for FFmpeg 4.1, 4.2 and 4.3 is dropped.
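
The supported range can be expressed as a simple version check. The helper below is a hypothetical sketch, not part of TorchAudio, and it assumes a plain dotted numeric version string such as `4.4.2` or `6.0`.

```python
def is_supported_ffmpeg(version):
    """Return True if `version` (e.g. "4.4.2" or "6.0") falls in the range >=4.4, <7."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    # Tuple comparison checks major first, then minor.
    return (4, 4) <= (major, minor) < (7, 0)


for v in ["4.1", "4.3.6", "4.4", "5.1.3", "6.0", "7.0"]:
    print(v, is_supported_ffmpeg(v))
```

Vendor-suffixed version strings would need to be cleaned up before being passed to a check like this.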

Ops
- Use named file in `torchaudio.functional.apply_codec` (3397)

In previous versions, TorchAudio shipped a custom-built `libsox` so that it could perform in-memory decoding and encoding.
Now in-memory decoding and encoding are handled by the FFmpeg binding, and with the switch to dynamic `libsox` linking, `torchaudio.functional.apply_codec` no longer processes audio in memory. Instead, it writes to a temporary file.
For in-memory processing, please use `torchaudio.io.AudioEffector`.

- Switch to `lstsq` when solving InverseMelScale (3280)

Previously, `torchaudio.transforms.InverseMelScale` ran an SGD optimizer to find the inverse of the mel-scale transform. This approach has a number of issues, as listed in 2643.

This release switches to `torch.linalg.lstsq`.
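
As a rough sketch of what a least-squares inverse means here, assume a mel filterbank matrix $A$ of shape $n_{\text{mels}} \times n_{\text{freqs}}$ and a mel spectrogram $M = A S$ computed from a linear-frequency spectrogram $S$. These symbols are introduced for illustration and are not taken from the release notes. The inverse is then posed as:

$$
\hat{S} = \arg\min_{S} \lVert A S - M \rVert_F^2
$$

Since typically $n_{\text{mels}} < n_{\text{freqs}}$, the system is underdetermined and many $S$ reproduce $M$ exactly. The minimum-norm choice among them is $\hat{S} = A^{+} M$, where $A^{+}$ is the Moore-Penrose pseudoinverse. Unlike an iterative SGD search, this is a direct solve with no learning rate or iteration count to tune.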

Models
- Improve RNN-T streaming decoding (3295, 3379)

The `infer` method of `torchaudio.models.RNNTBeamSearch` has been updated to accept a series of previous hypotheses.

```python
import torchaudio
from torchaudio.models import RNNTBeamSearch

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
decoder: RNNTBeamSearch = bundle.get_decoder()

# No decoder state or hypotheses exist before the first chunk.
state, hypothesis = None, None
while streaming:
    ...
    hypo, state = decoder.infer(
        features,
        length,
        beam_width,
        state=state,
        hypothesis=hypothesis,
    )
    ...
    hypothesis = hypo
    # Previously this had to be hypothesis = hypo[0]
```

Deprecations

Ops

- Update and deprecate `torchaudio.functional.apply_codec` function (3386)

Due to the removal of the custom libsox binding, `torchaudio.functional.apply_codec` no longer supports in-memory processing. Please migrate to `torchaudio.io.AudioEffector`.

Please refer to the following for the detailed usage of `torchaudio.io.AudioEffector`.

- https://pytorch.org/audio/2.1/generated/torchaudio.io.AudioEffector.html
- https://pytorch.org/audio/stable/tutorials/effector_tutorial.html

Bug Fixes

Models
- Fix the negative sampling in ConformerWav2Vec2PretrainModel (3085)
- Fix extract_features method for WavLM models (3350)

Tutorials
- Fix backtracking in forced alignment tutorial (3440)
- Fix initialization of `get_trellis` in forced alignment tutorial (3172)

Build
- Fix MKL issue on Intel mac build (3307)

I/O
- Suppress warning when saving vorbis with sox backend (3359)
- Fix g722 encoding in `torchaudio.io.StreamWriter` (3373)
- Refactor arg mapping in ffmpeg save function (3387)
- Fix save INT16 sox backend (3524)
- Fix SoundfileBackend method decorators (3550)
- Fix PTS initialization when using NVIDIA encoder (3312)

Ops
- Add non-default CUDA device support to `lfilter` (3432)

Improvements
I/O
- Set "experimental" automatically when using native opus/vorbis encoder (3192)
- Improve the performance of NV12 frame conversion (3344)
- Improve the performance of YUV420P frame conversion (3342)
- Refactor backend implementations (3547, 3548, 3549)
- Raise an error if `torchaudio.io.StreamWriter` is not opened (3152)
- Warn if decoding YUV images with different plane size (3201)
- Expose AudioMetadata (3556)
- Refactor the internal of `torchaudio.io.StreamReader` (3157, 3170, 3186, 3184, 3188, 3320, 3296, 3328, 3419, 3209)
- Refactor the internal of `torchaudio.io.StreamWriter` (3205, 3319, 3296, 3328, 3426, 3428)
- Refactor the FFmpeg abstraction layer (3249, 3251)
- Migrate the binding of FFmpeg utils to PyBind11 (3228)
- Simplify sox namespace (3383)
- Use const reference in sox implementation (3389)
- Ensure StreamReader returns tensors with `requires_grad` set to False (3467)
- Set the default threads to 1 in StreamWriter (3370)
- Remove ffmpeg fallback from sox_io backend (3516)

Ops
- Add arbitrary dim Tensor support to mask_along_axis{,_iid} (3289)
- Fix resampling to support dynamic input lengths for onnx exports. (3473)
- Optimize Torchaudio Vad (3382)

Documentation
- Build and use GPU-enabled FFmpeg in doc CI (3045)
- Misc tutorial update (3449)
- Update notes on FFmpeg version (3480)
- Update documentation about dependencies (3517)
- Update I/O and backend docs (3555)

Tutorials
- Update data augmentation tutorial (3375)
- Add more explanation about `n_fft` (3442)

Build
- Resolve some compilation warnings (3471)
- Use pre-built binaries for ffmpeg extension (3460)
- Add aarch64 workflow (3553)
- Add CUDA 12.1 builds (3284)
- Update CUDA to 12.1 U1 (3563)

Recipe
- Fix Adam and AdamW initializers in wav2letter example (3145)
- Update Librispeech RNNT recipe to support Lightning 2.0 (3336)
- Update HuBERT/SSL training recipes to support Lightning 2.x (3396)
- Add wav2vec2 loss function in self_supervised_learning training recipe (3090)
- Add Wav2Vec2DataModule in self_supervised_learning training recipe (3081)

Other
- Use FFmpeg6 in build doc (3475)
- Use FFmpeg6 in unit test (3570)
- Migrate `torch.norm` to `torch.linalg.vector_norm` (3522)
- Migrate `torch.nn.utils.weight_norm` to `nn.utils.parametrizations.weight_norm` (3523)

2.1.0

Highlights

2.0.2

This is a minor release, which is compatible with PyTorch 2.0.1 and includes bug fixes, improvements and documentation updates. There is no new feature added.

Bug fix
* 3239 Properly set samples passed to encoder (3204)
* 3238 Fix virtual function issue with CTC decoder (3230)
* 3245 Fix path-like object support in FFmpeg dispatcher (3243, 3248)
* 3261 Use scaled_dot_product_attention in Wav2vec2/HuBERT's SelfAttention (3253)
* 3264 Use scaled_dot_product_attention in WavLM attention (3252, 3265)

**Full Changelog**: https://github.com/pytorch/audio/compare/v2.0.1...v2.0.2

2.0.1

Highlights

2.0

- Data augmentation operators, e.g. convolution, additive noise, speed perturbation
- WavLM and XLS-R models and pre-trained pipelines
- Backend dispatcher powering revised `info`, `load`, `save` functions
- Dropped support of Python 3.7
- Added Python 3.11 support

[Beta] Data augmentation operators
The release adds several data augmentation operators under `torchaudio.functional` and `torchaudio.transforms`:
- `torchaudio.functional.add_noise`
- `torchaudio.functional.convolve`
- `torchaudio.functional.deemphasis`
- `torchaudio.functional.fftconvolve`
- `torchaudio.functional.preemphasis`
- `torchaudio.functional.speed`
- `torchaudio.transforms.AddNoise`
- `torchaudio.transforms.Convolve`
- `torchaudio.transforms.Deemphasis`
- `torchaudio.transforms.FFTConvolve`
- `torchaudio.transforms.Preemphasis`
- `torchaudio.transforms.Speed`
- `torchaudio.transforms.SpeedPerturbation`

The operators can be used to synthetically diversify training data to improve the generalizability of downstream models.

For usage details, please refer to the documentation for [`torchaudio.functional`](https://pytorch.org/audio/2.0.0/functional.html) and [`torchaudio.transforms`](https://pytorch.org/audio/2.0.0/transforms.html), and tutorial [“Audio Data Augmentation”](https://pytorch.org/audio/2.0.0/tutorials/audio_data_augmentation_tutorial.html).

[Beta] WavLM and XLS-R models and pre-trained pipelines
The release adds two self-supervised learning models for speech and audio.
- [WavLM](https://ieeexplore.ieee.org/document/9814838) that is robust to noise and reverberation.
- [XLS-R](https://arxiv.org/abs/2111.09296) that is trained on cross-lingual datasets.

Besides the model architectures, torchaudio also supports corresponding pre-trained pipelines:
- `torchaudio.pipelines.WAVLM_BASE`
- `torchaudio.pipelines.WAVLM_BASE_PLUS`
- `torchaudio.pipelines.WAVLM_LARGE`
- `torchaudio.pipelines.WAV2VEC_XLSR_300M`
- `torchaudio.pipelines.WAV2VEC_XLSR_1B`
- `torchaudio.pipelines.WAV2VEC_XLSR_2B`

For usage details, please refer to [`factory function`](https://pytorch.org/audio/2.0.0/generated/torchaudio.models.Wav2Vec2Model.html#factory-functions) and [`pre-trained pipelines`](https://pytorch.org/audio/2.0.0/pipelines.html#id3) documentation.

Backend dispatcher
Release 2.0 introduces new versions of the I/O functions `torchaudio.info`, `torchaudio.load` and `torchaudio.save`, backed by a dispatcher that selects one of the FFmpeg, SoX, and SoundFile backends, subject to library availability. Users can enable the new logic in Release 2.0 by setting the environment variable `TORCHAUDIO_USE_BACKEND_DISPATCHER=1`; the new logic will be enabled by default in Release 2.1.

```python
import torchaudio

# Fetch metadata using FFmpeg
metadata = torchaudio.info("test.wav", backend="ffmpeg")

# Load audio (with no backend parameter value provided, function prioritizes using FFmpeg if it is available)
waveform, rate = torchaudio.load("test.wav")

# Write audio using SoX
torchaudio.save("out.wav", waveform, rate, backend="sox")
```

Please see [the documentation for `torchaudio`](https://pytorch.org/audio/2.0.0/torchaudio.html#future-api) for more details.

Backward-incompatible changes
- Dropped Python 3.7 support (3020)
Following the upstream PyTorch (https://github.com/pytorch/pytorch/pull/93155), the support for Python 3.7 has been dropped.

- Default to "precise" seek in `torchaudio.io.StreamReader.seek` (2737, 2841, 2915, 2916, 2970)
Previously, the `StreamReader.seek` method sought to the key frame closest to the given timestamp. A new option, `mode`, has been added that can switch the behavior to seeking to the frame of any type, including non-key frames, that is closest to the given timestamp. This behavior is now the default.

- Removed deprecated/unused/undocumented functions from datasets.utils (2926, 2927)
The following functions are removed from `datasets.utils`:
- `stream_url`
- `download_url`
- `validate_file`
- `extract_archive`

Deprecations
Ops
- Deprecated 'onesided' init param for MelSpectrogram (2797, 2799)
`torchaudio.transforms.MelSpectrogram` assumes the `onesided` argument to be always `True`. The forward path fails if its value is `False`. Therefore this argument is deprecated. Users specifying this argument should stop specifying it.

- Deprecated `"sinc_interpolation"` and `"kaiser_window"` option value in favor of `"sinc_interp_hann"` and `"sinc_interp_kaiser"` (2922)
The valid values of the `resampling_method` argument of the resampling operations (`torchaudio.transforms.Resample` and `torchaudio.functional.resample`) have changed. `"kaiser_window"` is now `"sinc_interp_kaiser"` and `"sinc_interpolation"` is now `"sinc_interp_hann"`. The old values will continue to work, but users are encouraged to update their code.
For the reasoning behind this change, please refer to 2891.
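
For codebases that pass these option values around as configuration, a small translation helper can ease the migration. The helper below is a hypothetical sketch, not a TorchAudio API. Only the two old-to-new name pairs come from the note above.

```python
import warnings

# Old option values and their replacements, as listed in the release note.
_RENAMED_METHODS = {
    "sinc_interpolation": "sinc_interp_hann",
    "kaiser_window": "sinc_interp_kaiser",
}


def normalize_resampling_method(name):
    """Map a deprecated `resampling_method` value to its new name (hypothetical helper)."""
    if name in _RENAMED_METHODS:
        new_name = _RENAMED_METHODS[name]
        warnings.warn(
            f"{name!r} is deprecated; use {new_name!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    # Already a current name (or an unknown one): pass it through unchanged.
    return name
```

The returned value can then be passed as `resampling_method` to `torchaudio.functional.resample` or `torchaudio.transforms.Resample`.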

- Deprecated sox initialization/shutdown public API functions (3010)
`torchaudio.sox_effects.init_sox_effects` and `torchaudio.sox_effects.shutdown_sox_effects` are deprecated. They were required to use libsox-related features, but have been called automatically since v0.6, and the initialization/shutdown mechanism has been moved elsewhere. These functions are now no-ops. Users can simply remove the calls to these functions.

Models
- Deprecated static binding of Flashlight-text based CTC decoder (3055, 3089)
Since v0.12, TorchAudio binary distributions have included the CTC decoder based on the flashlight-text project. In a future release, TorchAudio will switch to dynamic binding of the underlying CTC decoder implementation and stop shipping the core CTC decoder implementations. Users who would like to use the CTC decoder need to separately install the CTC decoder from the upstream flashlight-text project. Other functionalities of TorchAudio will continue to work without flashlight-text.
**Note:** The API and numerical behavior do not change.
For more detail, please refer to 3088.

I/O
- Deprecated file-like object support in sox_io (3033)
In preparation for the switch to dynamically bound libsox, file-like object support in the sox_io backend has been deprecated. It will be removed in the 2.1 release in favor of the dispatcher. This deprecation affects the following functionalities.
* I/O: `torchaudio.load`, `torchaudio.info` and `torchaudio.save`.
* Effects: `torchaudio.sox_effects.apply_effects_file` and `torchaudio.functional.apply_codec`.
For I/O, to continue using file-like objects, please use the new dispatcher mechanism.
For effects, replacement functions will be added in the next release.
- Deprecated the use of Tensor as a container for byte string in StreamReader (3086)
`torchaudio.io.StreamReader` supports decoding media from byte strings contained in 1D tensors of `torch.uint8` type. Using the `torch.Tensor` type as a container for byte strings is now deprecated. To pass byte strings, please wrap the string with `io.BytesIO`.
<table class="tg">
<thead>
<tr>
<th class="tg-0pky">Deprecated</th>
<th class="tg-0pky">Migration</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tg-dvpl"><code>data = b"..."</code></br><code>src = torch.frombuffer(data, dtype=torch.uint8)</code></br><code>StreamReader(src)</code></td>
<td class="tg-dvpl"><code>data = b"..."</code></br><code>src = io.BytesIO(data)</code></br><code>StreamReader(src)</code></td>
</tr>
</tbody>
</table>

Bug Fixes
Ops
- Fixed contiguous error when backpropagating through `torchaudio.functional.lfilter` (3080)

Pipelines
- Added layer normalization to wav2vec2 large+ pretrained models (2873)
In self-supervised learning models such as Wav2Vec 2.0, HuBERT, or WavLM, layer normalization should be applied to waveforms if the convolutional feature extraction module uses layer normalization and is trained on a large-scale dataset. After adding layer normalization to those affected models, the Word Error Rate is significantly reduced.
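
To make the operation concrete, the following pure-Python sketch normalizes a single waveform to zero mean and unit variance, which is what layer normalization without learned scale or bias does over the whole waveform. The function name and the `eps` value are assumptions for illustration. In practice the equivalent tensor operation would be applied to batched waveforms.

```python
import math


def layer_norm_waveform(samples, eps=1e-5):
    """Normalize one waveform to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(samples) / len(samples)
    # Biased (population) variance, as used by layer normalization.
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return [(x - mean) / math.sqrt(var + eps) for x in samples]


normalized = layer_norm_waveform([0.1, -0.2, 0.4, 0.3])
print(sum(normalized) / len(normalized))  # close to 0
```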

Without the change in 2873, the WER results are:
| Model | dev-clean | dev-other | test-clean | test-other |
|:------------------------------------------------------------------------------------------------|-----------:|-----------:|-----------:|-----------:|

1.37

