Torchaudio


0.13.1

This is a minor release, which is compatible with PyTorch 1.13.1 and includes bug fixes, improvements, and documentation updates. There are no new features added.

Bug Fixes
IO
* Make buffer size configurable in ffmpeg file object operations and set size in backend (2810)
* Fix issue with missing video frames in StreamWriter (2789)
* Fix decimal FPS handling in StreamWriter (2831)
* Fix wrong frame allocation in StreamWriter (2905)
* Fix duplicated memory allocation in StreamWriter (2906)
Model
* Fix HuBERT model initialization (2846, 2886)
Recipe
* Fix issues in HuBERT fine-tuning recipe (2851)
* Fix automatic mixed precision in HuBERT pre-training recipe (2854)

0.13


0.13.0

- Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
- New datasets and metadata mode for the SUPERB benchmark
- Custom language model support for CTC beam search decoding
- StreamWriter for audio and video encoding (see the sketch below)
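
As a quick illustration of the StreamWriter highlight, here is a minimal audio-encoding sketch; the output path and the silent one-second waveform are placeholders, not part of the release notes:

```python
import torch
from torchaudio.io import StreamWriter

# One second of silence, shaped (time, channel) as write_audio_chunk expects.
waveform = torch.zeros(16000, 1)

writer = StreamWriter(dst="output.wav")  # destination path is a placeholder
writer.add_audio_stream(sample_rate=16000, num_channels=1)
with writer.open():
    writer.write_audio_chunk(0, waveform)  # write to output stream index 0
```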

[Beta] Source Separation Models and Bundles
Hybrid Demucs is a music source separation model that uses both spectrogram and time-domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge ([paper](https://arxiv.org/abs/2111.03600)).

The TorchAudio v0.13 release includes the following features (a usage sketch follows the list):
* MUSDB_HQ Dataset, which is used in Hybrid Demucs training ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.datasets.MUSDB_HQ.html#torchaudio.datasets.MUSDB_HQ))
* Hybrid Demucs model architecture ([docs](https://pytorch.org/audio/0.13.0/generated/torchaudio.models.HDemucs.html#torchaudio.models.HDemucs))
* Three factory functions suitable for different sample rate ranges
* Pre-trained pipelines ([docs](https://pytorch.org/audio/0.13.0/pipelines.html#id46)) and [tutorial](https://pytorch.org/audio/0.13.0/tutorials/hybrid_demucs_tutorial.html)
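
A minimal sketch of running one of the pre-trained pipelines; the specific bundle (HDEMUCS_HIGH_MUSDB_PLUS) and the random five-second stereo input are illustrative choices, not prescriptions from the release notes:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model().eval()

# Stereo mixture of shape (batch, channel, time) at the bundle's sample rate.
mixture = torch.randn(1, 2, bundle.sample_rate * 5)
with torch.inference_mode():
    sources = model(mixture)  # (batch, num_sources, channel, time)
```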

SDR results of the pre-trained pipelines on the MUSDB-HQ test set were reported per pipeline and per source (columns: Pipeline, All, Drums, Bass, Other, Vocals); the score rows did not survive extraction.

0.12.1

This is a minor release, which is compatible with [PyTorch 1.12.1](https://github.com/pytorch/pytorch/releases/tag/v1.12.1) and includes small bug fixes, improvements, and documentation updates. There are no new features added.

Bug Fix
- 2560 Fix fallback failure in sox_io backend
- 2588 Fix HuBERT fine-tuning recipe bugs

Improvement
- 2552 Remove unused boost source code
- 2527 Improve speech enhancement tutorial
- 2544 Update forced alignment tutorial
- 2595 Update data augmentation tutorial

For the full feature of v0.12, please refer to [the v0.12.0 release note](https://github.com/pytorch/audio/releases/tag/v0.12.0).

0.12

0.12.0

* CTC beam search decoder
* New beamforming modules and methods
* Streaming API

[Beta] CTC beam search decoder
To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from [Flashlight](https://arxiv.org/pdf/2201.12465.pdf) ([GitHub](https://github.com/flashlight/flashlight)). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.

For usage details, please check out the [documentation](https://pytorch.org/audio/0.12.0/models.decoder.html#ctcdecoder) and [ASR inference tutorial](https://pytorch.org/audio/0.12.0/tutorials/asr_inference_with_ctc_decoder_tutorial.html).
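
A condensed sketch of the decoding flow, assuming a 16 kHz input file named "speech.wav" (a placeholder) and the tuned weights from the ASR inference tutorial:

```python
import torchaudio
from torchaudio.models.decoder import ctc_decoder, download_pretrained_files

# Acoustic model producing (batch, frame, num_tokens) emissions.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M
acoustic_model = bundle.get_model()
waveform, sample_rate = torchaudio.load("speech.wav")
emission, _ = acoustic_model(waveform)

# Token, lexicon, and KenLM files pre-packaged for LibriSpeech.
files = download_pretrained_files("librispeech-4-gram")
decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,
    nbest=1,
    beam_size=50,
    lm_weight=3.23,   # tuned values from the tutorial
    word_score=-0.26,
)
hypotheses = decoder(emission)  # List[List[CTCHypothesis]]
transcript = " ".join(hypotheses[0][0].words)
```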

[Beta] New beamforming modules and methods
To improve flexibility in usage, the release adds two new beamforming modules under `torchaudio.transforms`: [SoudenMVDR](https://pytorch.org/audio/0.12.0/transforms.html#soudenmvdr) and [RTFMVDR](https://pytorch.org/audio/0.12.0/transforms.html#rtfmvdr). They differ from [MVDR](https://pytorch.org/audio/0.11.0/transforms.html#mvdr) mainly in that they:
* Use power spectral density (PSD) and relative transfer function (RTF) matrices as inputs instead of time-frequency masks. These modules can be integrated with neural networks that directly predict complex-valued STFT coefficients of speech and noise.
* Add `reference_channel` as an input argument in the forward method to allow users to select the reference channel in model training or dynamically change the reference channel in inference.

Besides the two modules, the release adds new function-level beamforming methods under `torchaudio.functional`. These include:
* [psd](https://pytorch.org/audio/0.12.0/functional.html#psd)
* [mvdr_weights_souden](https://pytorch.org/audio/0.12.0/functional.html#mvdr-weights-souden)
* [mvdr_weights_rtf](https://pytorch.org/audio/0.12.0/functional.html#mvdr-weights-rtf)
* [rtf_evd](https://pytorch.org/audio/0.12.0/functional.html#rtf-evd)
* [rtf_power](https://pytorch.org/audio/0.12.0/functional.html#rtf-power)
* [apply_beamforming](https://pytorch.org/audio/0.12.0/functional.html#apply-beamforming)

For usage details, please check out the documentation at [torchaudio.transforms](https://pytorch.org/audio/0.12.0/transforms.html#multi-channel) and [torchaudio.functional](https://pytorch.org/audio/0.12.0/functional.html#multi-channel) and the [Speech Enhancement with MVDR Beamforming tutorial](https://pytorch.org/audio/0.12.0/tutorials/mvdr_tutorial.html).
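
A minimal sketch of mask-based MVDR beamforming with the new modules; the STFT shapes and the random "ideal" masks are assumptions, and in practice the masks (or PSD matrices) would come from a neural network:

```python
import torch
import torchaudio.transforms as T

num_channels, num_freqs, num_frames = 6, 257, 100
specgram = torch.randn(num_channels, num_freqs, num_frames, dtype=torch.cfloat)
mask_speech = torch.rand(num_freqs, num_frames)  # placeholder T-F masks
mask_noise = 1.0 - mask_speech

psd = T.PSD()
psd_speech = psd(specgram, mask_speech)  # (freq, channel, channel)
psd_noise = psd(specgram, mask_noise)

mvdr = T.SoudenMVDR()
enhanced = mvdr(specgram, psd_speech, psd_noise, reference_channel=0)  # (freq, time)
```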

[Beta] Streaming API
`StreamReader` is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to
* Decode various audio and video formats, including MP4 and AAC.
* Handle various input forms, such as local files, network protocols, microphones, webcams, screen captures and file-like objects.
* Iterate over and decode media chunk-by-chunk, while changing the sample rate or frame rate.
* Apply various audio and video filters, such as low-pass filter and image scaling.
* Decode video with Nvidia's hardware-based decoder (NVDEC).

For usage details, please check out the [documentation](https://pytorch.org/audio/0.12.0/io.html#streamreader) and tutorials:
* [Media Stream API - Pt.1](https://pytorch.org/audio/0.12.0/tutorials/streaming_api_tutorial.html)
* [Media Stream API - Pt.2](https://pytorch.org/audio/0.12.0/tutorials/streaming_api2_tutorial.html)
* [Online ASR with Emformer RNN-T](https://pytorch.org/audio/0.12.0/tutorials/online_asr_tutorial.html)
* [Device ASR with Emformer RNN-T](https://pytorch.org/audio/0.12.0/tutorials/device_asr.html)
* [Accelerated Video Decoding with NVDEC](https://pytorch.org/audio/0.12.0/hw_acceleration_tutorial.html)

† To use `StreamReader`, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.
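
A minimal sketch of chunked audio decoding; the source file and the chunk size are placeholders:

```python
from torchaudio.io import StreamReader

streamer = StreamReader(src="input.mp4")  # placeholder source
# Decode the first audio stream in chunks of 8000 frames,
# converting the sample rate to 8 kHz on the fly.
streamer.add_basic_audio_stream(frames_per_chunk=8000, sample_rate=8000)
for (chunk,) in streamer.stream():
    print(chunk.shape)  # at most (8000, num_channels) per iteration
```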

Backwards-incompatible changes
I/O
* MP3 decoding is now handled by FFmpeg in sox_io backend. (2419, 2428)
  * FFmpeg is now used as a fallback in the sox_io backend, and MP3 decoding is handled by FFmpeg. To load MP3 audio with `torchaudio.load`, please install a compatible version of FFmpeg (version 4 when using an official binary distribution).
  * Note that, whereas the previous MP3 decoding scheme padded the output audio, the new scheme does not. As a consequence, the new version returns shorter audio tensors (see the example below).
  * `torchaudio.info` now returns `num_frames=0` for MP3.
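
For example (the MP3 path is a placeholder, and FFmpeg must be installed):

```python
import torchaudio

metadata = torchaudio.info("sample.mp3")
print(metadata.num_frames)  # 0: the length is no longer reported for MP3

# Decoding goes through FFmpeg and no longer pads the output, so the
# returned tensor can be shorter than it was in release 0.11.
waveform, sample_rate = torchaudio.load("sample.mp3")
```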

Models
* Change underlying implementation of RNN-T hypothesis to tuple (2339)
  * In release 0.11, `Hypothesis` subclassed `namedtuple`. Containers of `namedtuple` instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, `Hypothesis` has been modified in release 0.12 to instead alias `tuple`. This affects `RNNTBeamSearch`, as it accepts and returns a list of `Hypothesis` instances (see the sketch below).
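
A sketch of what this means for user code, built on the Emformer RNN-T pipeline; the component order (tokens, predictor output, predictor state, score) follows the 0.12 documentation of `Hypothesis`, and the random waveform is a placeholder:

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
decoder = bundle.get_decoder()  # RNNTBeamSearch
feature_extractor = bundle.get_feature_extractor()

waveform = torch.randn(16000)  # placeholder 1-second input at 16 kHz
features, length = feature_extractor(waveform)
hypos = decoder(features, length, beam_width=10)

best = hypos[0]                   # now a plain tuple, not a namedtuple
tokens, score = best[0], best[3]  # in 0.11 this was best.tokens, best.score
transcript = bundle.get_token_processor()(tokens)
```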

Bug Fixes
Ops
* Fix return dtype in MVDR module (2376)
  * In release 0.11, the MVDR module converted the dtype of the input spectrum to `complex128` to improve the precision and robustness of downstream matrix computations. The output dtype, however, was not correctly converted back to the original dtype. Release 0.12 fixes the output dtype to be consistent with the original input dtype, as the check below illustrates.
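
A quick check of the fixed behavior (shapes and masks are placeholders):

```python
import torch
import torchaudio.transforms as T

mvdr = T.MVDR(ref_channel=0)
specgram = torch.randn(6, 257, 100, dtype=torch.cfloat)  # (channel, freq, time)
mask_s = torch.rand(257, 100)
mask_n = 1.0 - mask_s

enhanced = mvdr(specgram, mask_s, mask_n)
assert enhanced.dtype == specgram.dtype  # complex64 in, complex64 out (fixed in 0.12)
```
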
Build
* Fix Kaldi submodule integration (2269)
* Pin jinja2 version for build_docs (2292)
* Use sourceforge url to fetch zlib (2297)

New Features
I/O
* Add Streaming API (2041, 2042, 2043, 2044, 2045, 2046, 2047, 2111, 2113, 2114, 2115, 2135, 2164, 2168, 2202, 2204, 2263, 2264, 2312, 2373, 2378, 2402, 2403, 2427, 2429)
* Add YUV420P format support to Streaming API (2334)
* Support specifying decoder and its options (2327)
* Add NV12 format support in Streaming API (2330)
* Add HW acceleration support on Streaming API (2331)
* Add file-like object support to Streaming API (2400)
* Make FFmpeg log level configurable (2439)
* Set the default ffmpeg log level to FATAL (2447)
Ops
* New beamforming methods (2227, 2228, 2229, 2230, 2231, 2232, 2369, 2401)
* New MVDR modules (2367, 2368)
* Add and refactor CTC lexicon beam search decoder (2075, 2079, 2089, 2112, 2117, 2136, 2174, 2184, 2185, 2273, 2289)
* Add lexicon free CTC decoder (2342)
* Add Pretrained LM Support for Decoder (2275)
* Move CTC beam search decoder to beta (2410)
Datasets
* Add QUESST14 dataset (2290, 2435, 2458)
* Add LibriLightLimited dataset (2302)

Improvements
I/O
* Use FFmpeg-based I/O as fallback in sox_io backend. (2416, 2418, 2423)
Ops
* Raise error for resampling int waveform (2318)
* Move multi-channel modules to a separate file (2382)
* Refactor MVDR module (2383)
Models
* Add an option to use Tanh instead of ReLU in RNNT joiner (2319)
* Support GroupNorm and re-ordering Convolution/MHA in Conformer (2320)
* Add extra arguments to HuBERT pretrain factory functions (2345)
* Add feature_grad_mult argument to HuBERTPretrainModel (2335)
Datasets
* Refactor LibriSpeech dataset (2387)
* Raise RuntimeError when datasets are missing (2430)

Performance
* Make PitchShift faster by caching the resampling kernel (2441)

The following table illustrates the performance improvement over the previous release, comparing the time in milliseconds that `torchaudio.transforms.PitchShift` takes, after its first call, to perform the operation on a `float32` tensor with two channels and 8000 frames, resampled to 44.1 kHz, across various shift steps (columns: TorchAudio version against shift steps 2, 3, 4, and 5; the timing rows did not survive extraction).
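
A sketch matching the benchmark setup; the shift step of 4 is one of the measured columns:

```python
import torch
import torchaudio.transforms as T

pitch_shift = T.PitchShift(sample_rate=44100, n_steps=4)
waveform = torch.randn(2, 8000)  # float32, two channels, 8000 frames

_ = pitch_shift(waveform)        # first call builds and caches the kernel
shifted = pitch_shift(waveform)  # later calls reuse the cached kernel
```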
