Changelogs » Torch



- Add CPU-only binary releases that are 10x smaller in size than the full binary with CUDA capabilities.

As always, links to our binaries are on

New features
- Add Cosine Annealing Learning Rate Scheduler
- Add `reduce` argument to `PoissonNLLLoss` to be able to compute unreduced losses
- Allow `target.requires_grad=True` in `l1_loss` and `mse_loss` (compute loss wrt `target`)
- Add `random_split`, which randomly splits a dataset into non-overlapping new datasets of given lengths
- Introduced scopes to annotate ONNX graphs for better TensorBoard visualization of models
- Allow `map_location` in `torch.load` to be a string, such as `map_location='cpu'` or `map_location='cuda:2'`
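For example, a checkpoint can be remapped to the CPU at load time; a minimal sketch, using an in-memory buffer in place of a real checkpoint file:

```python
import io

import torch

# Save a tensor to an in-memory buffer (standing in for a checkpoint file),
# then load it back onto the CPU regardless of the device it was saved from.
buf = io.BytesIO()
torch.save(torch.arange(4), buf)
buf.seek(0)
t = torch.load(buf, map_location='cpu')
```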

Bug Fixes

Data Loader / Datasets / Multiprocessing
- Made DataLoader workers more verbose on bus error and segfault. Additionally, add a `timeout` option to the DataLoader, which will error if sample loading time exceeds the given value.
- DataLoader workers used to all have the same random number generator (RNG) seed because of the semantics of the `fork` syscall. Now, each worker has its RNG seed set to `base_seed + worker_id`, where `base_seed` is a random int64 value generated by the parent process. You may use `torch.initial_seed()` to access this value in `worker_init_fn`, which can be used to set other seeds (e.g. NumPy) before data loading. `worker_init_fn` is an optional argument that is called on each worker subprocess with the worker id as input, after seeding and before data loading.
- Add additional signal handling in DataLoader worker processes when workers abruptly die.
- A negative value for `num_workers` now gives a `ValueError`
- fixed a typo in `ConcatDataset.cumulative_sizes` attribute name
- Accept longs in default_collate for dataloader in python 2
- Re-initialize autograd engine in child processes
- Fix distributed dataloader so it pins memory to current GPU not GPU 0.
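The per-worker seeding described above can be used from `worker_init_fn` to seed other libraries; a minimal sketch (the dataset, batch size, and worker count here are arbitrary):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # Inside a worker, torch.initial_seed() returns base_seed + worker_id;
    # fold it into 32 bits to seed NumPy for this worker.
    np.random.seed(torch.initial_seed() % 2**32)

# An arbitrary toy dataset; batch size and worker count are illustrative.
dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=2, num_workers=2,
                    worker_init_fn=worker_init_fn)
```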

- Allow cuDNN for fp16 batch norm
- Use `enabled` argument in `torch.autograd.profiler.emit_nvtx` (was being ignored)
- Fix cuBLAS arguments for fp16
- Fix CUDA index_fill_ boundary check with small tensor size
- Fix CUDA Multinomial checks
- Fix CUDA version typo in warning
- Initialize cuda before setting cuda tensor types as default
- Add missing lazy_init in cuda python module
- Lazy init order in set device, should not be called in getDevCount
- Make torch.cuda.empty_cache() a no-op when cuda is not initialized

- Assert MKL ld* conditions for ger, gemm, and gemv

torch operators
- Fix `tensor.repeat` when the underlying storage is not owned by `torch` (for example, coming from numpy)
- Add proper shape checking to
- Add check for slice shape match in index_copy_ and index_add_.
- Fix use after free when advanced indexing tensors with tensors
- Fix triu and tril for zero-strided inputs on gpu
- Fix blas addmm (gemm) condition check
- Fix topk work size computation
- Fix reduction functions to respect the stride of the output
- Improve float precision stability of `linspace` op, fix 4419.

- Fix python gc race condition with THPVariable_traverse

nn layers
- Fix padding_idx getting ignored in backward for Embedding(sparse=True)
- Fix cosine_similarity's output shape
- Add rnn args check
- NLLLoss works for arbitrary dimensions
- More strict shape check on Conv operators
- Fix maxpool3d / avgpool3d crashes
- Fix setting using running stats in InstanceNorm*d

- Fix DataParallel scattering for empty lists / dicts / tuples
- Fix refcycles in DataParallel scatter and gather (fix elevated memory usage)
- Broadcast output requires_grad only if corresponding input requires_grad

- Remove hard file offset reset in load()
- Have __sizeof__ account for size of stored elements
- Fix undefined FileNotFoundError
- make torch.set_num_threads also set MKL threads (take 2)

- Fix wrong learning rate evaluation in CosineAnnealingLR in Python 2

Performance improvements
- slightly simplified math in IndexToOffset
- improve performance of maxpooling backwards
- Add cublas batched gemm support.
- Rearrange dimensions for pointwise operations for better performance.
- Improve memory access patterns for index operations.
- Improve CUDA softmax performance
- Fixed double memory accesses of several pointwise operations.

Documentation and UX Improvements
- Better error messages for blas ops with cuda.LongTensor
- Add missing trtrs, orgqr, ormqr docs
- change doc for Adaptive Pooling
- Fix MultiLabelMarginLoss docs
- More docs for Conv1d Conv2d
- Improve Tensor.scatter_ doc
- [docs] Note zero defaults for hidden state/cell
- Improve doc
- Improve docs for torch and torch.Tensor
- Added explicit tuple dimensions to doc for Conv1d.
- Improve svd doc
- Correct instancenorm input size
- Fix StepLR example docs


>>> sum = torch.tensor([2, 3]).sum()
>>> sum
tensor(5)
>>> sum.size()
torch.Size([])

Accumulating losses

Consider the widely used pattern ``total_loss += loss.data[0]`` before 0.4.0. ``loss`` was a ``Variable`` wrapping a tensor of size ``(1,)``, but in 0.4.0 ``loss`` is now a scalar and has ``0`` dimensions. Indexing into a scalar doesn't make sense (it gives a warning now, but will be a hard error in 0.5.0): use ``loss.item()`` to get the Python number from a scalar.

Note that if you don't convert to a Python number when accumulating losses, you may find increased memory usage in your program. This is because the right-hand side of the above expression used to be a Python float, while it is now a zero-dim Tensor. The total loss thus accumulates Tensors and their gradient history, which may keep large autograd graphs alive for much longer than necessary.
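A minimal sketch of the recommended pattern (the loss here is a stand-in):

```python
import torch

total_loss = 0.0
for _ in range(3):
    x = torch.randn(4, requires_grad=True)
    loss = (x * x).sum()       # a zero-dimensional tensor in 0.4.0
    loss.backward()
    total_loss += loss.item()  # a Python float; no autograd graph retained
```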

Deprecation of ``volatile`` flag

The ``volatile`` flag is now deprecated and has no effect. Previously, any computation that involved a ``Variable`` with ``volatile=True`` was not tracked by ``autograd``. This has now been replaced by a set of more flexible context managers, including ``torch.no_grad()``, ``torch.set_grad_enabled(grad_mode)``, and others.

>>> x = torch.zeros(1, requires_grad=True)
>>> with torch.no_grad():
...     y = x * 2
>>> y.requires_grad
False
>>> is_train = False
>>> with torch.set_grad_enabled(is_train):
...     y = x * 2
>>> y.requires_grad
False
>>> torch.set_grad_enabled(True)  # this can also be used as a function
>>> y = x * 2
>>> y.requires_grad
True
>>> torch.set_grad_enabled(False)
>>> y = x * 2
>>> y.requires_grad
False

``dtypes``, ``devices`` and NumPy-style creation functions

In previous versions of PyTorch, we used to specify data type (e.g. float vs double), device type (cpu vs cuda) and layout (dense vs sparse) together as a "tensor type". For example, ``torch.cuda.sparse.DoubleTensor`` was the ``Tensor`` type representing ``double`` data, living on CUDA devices, with COO sparse tensor layout.

In this release, we introduce ``torch.dtype``, ``torch.device`` and ``torch.layout`` classes to allow better management of these properties via NumPy-style creation functions.


Below is a complete list of available ``torch.dtype`` values (data types) and their corresponding tensor types.

| Data type                 | ``torch.dtype``                        | Tensor types              |
|:------------------------- |:-------------------------------------- | :------------------------ |
| 32-bit floating point     | ``torch.float32`` or ``torch.float``   | ``torch.*.FloatTensor``   |
| 64-bit floating point     | ``torch.float64`` or ``torch.double``  | ``torch.*.DoubleTensor``  |
| 16-bit floating point     | ``torch.float16`` or ``torch.half``    | ``torch.*.HalfTensor``    |
| 8-bit integer (unsigned)  | ``torch.uint8``                        | ``torch.*.ByteTensor``    |
| 8-bit integer (signed)    | ``torch.int8``                         | ``torch.*.CharTensor``    |
| 16-bit integer (signed)   | ``torch.int16``   or ``torch.short``   | ``torch.*.ShortTensor``   |
| 32-bit integer (signed)   | ``torch.int32``   or ``torch.int``     | ``torch.*.IntTensor``     |
| 64-bit integer (signed)   | ``torch.int64``   or ``torch.long``    | ``torch.*.LongTensor``    |
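For example, the aliases and tensor types line up as follows:

```python
import torch

t = torch.zeros(3, dtype=torch.int16)
assert t.dtype is torch.short           # torch.int16 and torch.short are the same dtype
assert t.type() == 'torch.ShortTensor'  # the corresponding CPU tensor type
```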

Use ``torch.set_default_dtype`` and ``torch.get_default_dtype`` to manipulate the default ``dtype`` for floating point tensors.
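For example:

```python
import torch

torch.set_default_dtype(torch.float64)
assert torch.get_default_dtype() is torch.float64
assert torch.tensor([1.5]).dtype is torch.float64  # float literals now default to double

torch.set_default_dtype(torch.float32)  # restore the usual default
```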


A ``torch.device`` contains a device type (``'cpu'`` or ``'cuda'``) and an optional device ordinal (id) for the device type. It can be initialized with ``torch.device('{device_type}')`` or ``torch.device('{device_type}:{device_ordinal}')``.

If the device ordinal is not present, this represents the current device for the device type; e.g., ``torch.device('cuda')`` is equivalent to ``torch.device('cuda:X')`` where ``X`` is the result of ``torch.cuda.current_device()``.
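Constructing a `torch.device` does not require the device to actually be present, so the following runs even without a GPU:

```python
import torch

cpu = torch.device('cpu')
gpu2 = torch.device('cuda:2')  # device type 'cuda', ordinal 2
assert gpu2.type == 'cuda' and gpu2.index == 2
assert cpu.index is None       # no ordinal: the current device of that type
```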


``torch.layout`` represents the data layout of a ``Tensor``. Currently, ``torch.strided`` (dense tensors) and ``torch.sparse_coo`` (sparse tensors with COO format) are supported.
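For example:

```python
import torch

dense = torch.zeros(2, 2)
sparse = dense.to_sparse()
assert dense.layout is torch.strided
assert sparse.layout is torch.sparse_coo
```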

Creating ``Tensor``s

Methods that create a ``Tensor`` now also take ``dtype``, ``device``, ``layout``, and ``requires_grad`` options to specify the desired attributes of the returned ``Tensor``. For example,

>>> device = torch.device("cuda:1")
>>> x = torch.randn(3, 3, dtype=torch.float64, device=device)


>>> torch.tensor([1]) + torch.tensor(2.5)
tensor([3.5000])
>>> torch.tensor(True) + 5
tensor(6)

Type Promotion: in-place operations whose result_type is a lower dtype category (bool < integer < floating-point) than the in-place operand now throw an Error.  ([22273](, [26981](

<p align="center">
<table align="center">
<tr><th>Version 1.2</th><th>Version 1.3</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> int_tensor = torch.tensor(1)
>>> int_tensor.add_(1.5)
tensor(2)
>>> bool_tensor = torch.tensor(True)
>>> bool_tensor.add_(5)
tensor(True)
</pre></sub></td>
<td><sub><pre lang="python">
>>> int_tensor = torch.tensor(1)
>>> int_tensor.add_(1.5)
RuntimeError: result type Float cannot be cast to the desired output type Long
>>> bool_tensor = torch.tensor(True)
>>> bool_tensor.add_(5)
RuntimeError: result type Long cannot be cast to the desired output type Bool
</pre></sub></td>
</tr>
</table>
</p>

These rules can be checked at runtime via `torch.can_cast`.
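For example:

```python
import torch

# Promoting an integer into a floating-point result is allowed...
assert torch.can_cast(torch.int64, torch.float32)
# ...but casting down a category (float -> int, int -> bool) is not.
assert not torch.can_cast(torch.float32, torch.int64)
assert not torch.can_cast(torch.int64, torch.bool)
```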

`torch.flatten`: 0-dimensional inputs now return a 1-dim tensor.  ([25406](

<p align="center">
<table align="center">
<tr><th>Version 1.2</th><th>Version 1.3</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> torch.flatten(torch.tensor(0))
tensor(0)
</pre></sub></td>
<td><sub><pre lang="python">
>>> torch.flatten(torch.tensor(0))
tensor([0])
</pre></sub></td>
</tr>
</table>
</p>
`nn.functional.affine_grid`: when `align_corners = True`, changed the behavior of 2D affine transforms on 1D data and 3D affine transforms on 2D data (i.e., when one of the spatial dimensions has unit size).

Previously, all grid points along a unit dimension were considered arbitrarily to be at -1, now they are considered to be at 0 (the center of the input image).

`torch.gels:` removed deprecated operator, use `torch.lstsq` instead.  ([26480](

`` made a number of Iterator attributes private (e.g. `num_workers`, `pin_memory`).  ([22273](

**[C++]** `Variable::backward` will no longer implicitly create a gradient for non-1-element Variables. Previously, a gradient tensor of all 1s would be implicitly created. This behavior matches the Python API.  ([26150](

auto x = torch::randn({5, 5}, torch::requires_grad());
auto y = x * x;
// ERROR: "grad can be implicitly created only for scalar outputs"

[C++] All option specifiers (e.g. `GRUOptions::bidirectional_`) are now private; use the function variants (`GRUOptions::bidirectional(...)`) instead. ([26419](


[Experimental]: Mobile Support

In PyTorch 1.3, we are launching experimental support for mobile. Now you can run any TorchScript model directly without any conversion. Here is the full list of features in this release:

* Support for full TorchScript inference on mobile;
* Prebuilt LibTorch libraries for Android/iOS on JCenter/CocoaPods;
* Java wrapper for Android with functionality to cover common inference cases (loading and invoking the model);
* Support for all forward ops on mobile CPU (backward ops are not supported yet);
* Some optimized fp32 operator implementations for ARM CPUs (based on Caffe2Go);
* Some optimized int8 operator implementations for ARM CPUs (based on QNNPACK);

We decided not to create a new framework for mobile so that you can use the same APIs you are already familiar with to run the same TorchScript models on Android/iOS devices without any format conversion. This way you can have the shortest path from research ideas to production-ready mobile apps.

The tutorials, demo apps and download links for prebuilt libraries can be found at:

This is an experimental release. We are working on other features like customized builds to make PyTorch smaller, faster and better for your specific use cases. Stay tuned and give us your feedback!

[Experimental]: Named Tensor Support

Named Tensors aim to make tensors easier to use by allowing users to associate explicit names with tensor dimensions. In most cases, operations that take dimension parameters will accept dimension names, avoiding the need to track dimensions by position. In addition, named tensors use names to automatically check that APIs are being used correctly at runtime, providing extra safety. Names can also be used to rearrange dimensions, for example, to support "broadcasting by name" rather than "broadcasting by position".

Create a named tensor by passing a `names` argument into most tensor factory functions.

>>> tensor = torch.zeros(2, 3, names=('C', 'N'))
>>> tensor
tensor([[0., 0., 0.],
        [0., 0., 0.]], names=('C', 'N'))

Named tensors propagate names across operations.

>>> tensor.abs()
tensor([[0., 0., 0.],
        [0., 0., 0.]], names=('C', 'N'))

Rearrange to a desired ordering by using `align_to` .

>>> tensor = tensor.align_to('N', 'C', 'H', 'W')
>>> tensor.names, tensor.shape
(('N', 'C', 'H', 'W'), torch.Size([3, 2, 1, 1]))

And more! Please see our documentation on named tensors.

[Experimental]: Quantization support

PyTorch now supports quantization from the ground up, starting with support for quantized tensors. Convert a float tensor to a quantized tensor and back by:

x = torch.rand(10, 1, dtype=torch.float32)
# xq is a quantized tensor with data represented as quint8
xq = torch.quantize_per_tensor(x, scale=0.5, zero_point=8, dtype=torch.quint8)
# convert back to floating point
xdq = xq.dequantize()

We also support 8 bit quantized implementations of most common operators in CNNs, including:

* Tensor operations:
* view, clone, resize, slice
* add, multiply, cat, mean, max, sort, topk
* Modules/Functionals (in torch.nn.quantized)
* Conv2d
* Linear
* Avgpool2d, AdaptiveAvgpool2d, MaxPool2d, AdaptiveMaxPool2d
* Interpolate
* Upsample
* Fused operations for preserving better accuracy (in torch.nn.intrinsic)
* ConvReLU2d, ConvBnReLU2d, ConvBn2d
* LinearReLU
* add_relu

We also support dynamic quantized operators, which take in floating point activations, but use quantized weights (in torch.nn.quantized.dynamic).

* Linear

Quantization also requires support for methods to collect statistics from tensors and calculate quantization parameters (implementing interface torch.quantization.Observer). We support several methods to do so:

* MinMaxObserver
* MovingAverageMinMaxObserver
* PerChannelMinMaxObserver
* MovingAveragePerChannelMinMaxObserver
* HistogramObserver

For quantization aware training, we support fake-quantization operators and modules to mimic quantization during training:

* `torch.fake_quantize_per_tensor_affine`, `torch.fake_quantize_per_channel_affine`
* `torch.quantization.FakeQuantize`
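A minimal sketch of the per-tensor fake-quantization op (the scale and values here are arbitrary); the result is quantized and dequantized in one step but stays floating point:

```python
import torch

x = torch.tensor([0.04, 0.12, 0.27])
# Hypothetical quantization parameters, chosen for illustration only.
xfq = torch.fake_quantize_per_tensor_affine(
    x, scale=0.1, zero_point=0, quant_min=0, quant_max=255)
assert xfq.dtype is torch.float32  # output stays floating point
```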

In addition, we also support workflows in torch.quantization for:

* post-training dynamic quantization
* static post training quantization
* quantization aware training

All quantized operators are compatible with TorchScript.

For more details, see the documentation at:

Type Promotion

Arithmetic and comparison operations may now perform mixed-type operations that promote to a common dtype.

The example below was not allowed in version 1.2. In version 1.3, the same code returns a tensor with `dtype=torch.float32`.

>>> torch.tensor([1], dtype=torch.int) + torch.tensor([1], dtype=torch.float32)
tensor([2.])

See the full documentation for more details.

* `torch.result_type` Provide function to determine result of mixed-type operations ([26012](
* `torch.can_cast` Expose casting rules for type promotion ([26805](
* `torch.promote_types` Expose promotion logic ([26655](
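For example:

```python
import torch

# result_type: the dtype a mixed-type operation will produce.
assert torch.result_type(torch.tensor([1]), 2.5) is torch.float32
# promote_types: the smallest dtype two dtypes promote to.
assert torch.promote_types(torch.int32, torch.float32) is torch.float32
```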


`nn.functional.affine_grid` / `nn.functional.grid_sample`: using the default value of `align_corners` is now deprecated, because the default will change in the 1.4 release.

The `align_corners` parameter was added in this release; the behavior in the previous release was equivalent to setting it to `True`. This is also the current default value, but it will change to `False` in the 1.4 release. Note that using the default triggers a warning, as demonstrated below; set the value explicitly to remove the warning.

>>> torch.nn.functional.affine_grid(torch.randn(1,2,3),
UserWarning: Default grid_sample and affine_grid behavior will be changed
to align_corners=False from 1.4.0.
See the documentation of grid_sample for details.

>>> torch.nn.functional.affine_grid(torch.randn(1,2,3),

[C++] Deprecate `torch::Tensor::data<T>()` in favor of `torch::Tensor::data_ptr<T>()` ([24847](, [24886](

New Features

TensorBoard: 3D Mesh and Hyperparameter Support

`torch.utils.tensorboard` supports 3D mesh and points plus hyperparameter logging. More details can be found in the documentation for `SummaryWriter`'s `add_mesh` and `add_hparams` methods.

A simple example exercising both methods:

import torch
from torch.utils.tensorboard import SummaryWriter

vertices_tensor = torch.as_tensor([
    [1, 1, 1],
    [-1, -1, 1],
    [1, -1, -1],
    [-1, 1, -1],
], dtype=torch.float).unsqueeze(0)
colors_tensor = torch.as_tensor([
    [255, 0, 0],
    [0, 255, 0],
    [0, 0, 255],
    [255, 0, 255],
], dtype=torch.int).unsqueeze(0)
faces_tensor = torch.as_tensor([
    [0, 2, 3],
    [0, 3, 1],
    [0, 1, 2],
    [1, 3, 2],
], dtype=torch.int).unsqueeze(0)

with SummaryWriter() as w:
    w.add_mesh('my_mesh', vertices=vertices_tensor, colors=colors_tensor, faces=faces_tensor)
    for i in range(5):
        w.add_hparams({'lr': 0.1*i, 'bsize': i},
                      {'hparam/accuracy': 10*i, 'hparam/loss': 10*i})


This release adds macOS support for `torch.distributed` with the Gloo backend. You can more easily switch from development (e.g. on macOS) to deployment (e.g. on Linux) without having to change a single line of code. The prebuilt binaries for macOS (stable and nightly) include support out of the box.

* `torch.distributed.all_reduce_coalesced` Support allreduce of a list of same-device tensors ([24949](, [25470](, [24876](
* `torch.distributed.all_reduce` Add bitwise reduction ops (BAND, BOR, BXOR) ([26824](

Libtorch Binaries with C++11 ABI

We now provide Libtorch binaries for building applications compatible with the C++11 ABI. The download links for libtorch binaries with C++11 ABI can be found in “QUICK START LOCALLY”.

New TorchScript features

* Add `not in` support for TorchScript ([23637](
* You can now raise exceptions in one side of an if branch ([23565](
* Add `torch.jit.is_scripting()` API ([25955](
* Make assertions like `x is not None` unwrap the optional type of `x` ([23949](
* Add dictionary augmented assignment (`+=`) support to TorchScript ([23639](
* Support `grad` and `data` attribute for tensor in TorchScript ([23842](
* Add `ignore` for TorchScript classes ([23614](
* Support nn.GRU in script ([23266](
* Support tensor as a key type in TorchScript ([23638](
* Add support for ModuleDict ([25715](
* Bind `set_grad_enabled()` into TorchScript ([25350](
* Add `in` membership checks for lists ([25796](
* Add `tuple` keyword ([25474](
* Add `__getitem__` to class types ([25664](
* Add `__setitem__` to class types ([25750](
* Make JIT dicts ordered, matching Python 3.6+ semantics ([26465](
* Added invert bitwise operation to TorchScript ([22324](
* Add `min()` and `max()` for lists to TorchScript ([26351](
* Support iterables and ranges in list comprehensions ([26768](
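A toy scripted function exercising one of the additions above (`not in` membership checks on lists); this sketch is illustrative, not from the release notes:

```python
from typing import List

import torch

@torch.jit.script
def missing(xs: List[int], x: int) -> bool:
    # `not in` membership checks on lists are now supported in TorchScript
    return x not in xs

assert missing([1, 2, 3], 4)
assert not missing([1, 2, 3], 2)
```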


C++ Frontend Improvements

We are on our way to better API parity between our Python and C++ frontends. Specifically, we made the following improvements:


* Tensor autograd APIs
* `torch::Tensor::data` Added ([26008](
* `torch::Tensor::grad` Don’t create a gradient for non-1-element Variables [BC-breaking] ([26150](
* `torch::Tensor::is_leaf` Added ([26186](
* `torch::Tensor::output_nr` Added ([26216](
* `torch::Tensor::_version` Added ([26217](
* Add support for custom autograd functions in C++ API
* For example usage, please see the PR description and test cases in ([23572](, [23628](, and [23803](
* `torch::autograd::backward` and `torch::autograd::grad` ([24342](
* `torch::autograd::Variable::register_hook` ([24393](

New torch::nn modules

* Containers
* torch::nn::ModuleList ([24317](
* Linear layers
* torch::nn::Identity ([26713](
* Convolution layers
* torch::nn::Fold ([24160](
* Pooling layers
* torch::nn::MaxPool1d / MaxPool2d / MaxPool3d ([24860](, [26521](
* torch::nn::AvgPool1d / AvgPool2d / AvgPool3d ([25800](
* torch::nn::AdaptiveMaxPool1d / AdaptiveMaxPool2d / AdaptiveMaxPool3d ([26755](, [26772](, [26775](
* Loss functions
* torch::nn::L1Loss ([25902](
* Distance functions
* torch::nn::CosineSimilarity ([26424](
* torch::nn::PairwiseDistance ([26424](

New torch::nn::functional functions

* Pooling functions
* torch::nn::functional::max_pool1d / max_pool2d / max_pool3d ([26262](
* torch::nn::functional::max_pool1d_with_indices / max_pool2d_with_indices / max_pool3d_with_indices ([26521](
* torch::nn::functional::avg_pool1d / avg_pool2d / avg_pool3d ([26262](
* torch::nn::functional::adaptive_max_pool1d / adaptive_max_pool2d / adaptive_max_pool3d ([26755](, [26772](, [26775](
* torch::nn::functional::adaptive_max_pool1d_with_indices / adaptive_max_pool2d_with_indices / adaptive_max_pool3d_with_indices ([26755](, [26772](, [26775](
* Distance functions
* torch::nn::functional::cosine_similarity ([26424](
* torch::nn::functional::pairwise_distance ([26424](

Tensor Construction API

* Add support for multidimensional inputs to `torch::tensor` ([26210](, [26890](, [26756](
* From now on, we can use `torch::tensor({{1, 2}, {3, 4}})` in C++ to construct the same tensor as `torch.tensor([[1, 2], [3, 4]])` in Python. Some caveats are noted in [this comment](
* Add support for bool and BFloat16 dtypes to `torch::tensor` ([23337](

Other C++ Improvements

* Add `torch::nn::Module::unregister_module` function, for unregistering a submodule from a `torch::nn::Module` ([26088](

Distributed Improvements

* `torch.distributed` Detect and handle NCCL errors appropriately instead of blocking peers until timeout in `ProcessGroupNCCL` ([25012](, [25905](
* `torch.distributed` Make scatter/gather arguments optional ([25575](
* `torch.distributed.launch` Add a -m flag to allow users to launch python modules ([24910](
* `torch.distributed` Add function to get NCCL version for logging ([26583](
* `torch.distributed` Add timeout parameter to connect function in TCPStore ([26554](
* `torch.distributed` use timeout in connect function to prevent against infinite loop ([26364](
* `torch.nn.modules.batchnorm` Allow SyncBatchNorm to run without DDP in inference mode ([24815](

Performance Improvements

* `torch.argmax/argmin` Rewrite as TensorIterator reductions ([26181](
* `torch.erfinv` Vectorize unary operator ([26629](
* `torch.sin/cos/tan` Use intrinsics for trigonometric functions on CPU ([26431](
* Fix possible deadlock in SharedCache inside a forked child proc ([25158](
* `torch.qr` Fix a regression ([23591](
* `nn.Conv` Use Caffe2's implementation of grouped depthwise 3x3 convolutions ([26556](
* `nn.Conv` Use parallel_for in DepthwiseConvKernel ([26879](
* `nn.Conv` Change shape for conv and unary ops ([25477](
* Fix pin_memory_thread not exiting quickly ([23646](
* Increase predefined_minimum_secs to reduce variation ([23734](
* Enhance Tensor indexSelect performance ([23055](
* Separate input shapes to reduce default execution time ([24136](
* constraints.lower_cholesky Vectorize LowerCholeskyTransform ([24131](
* Speed up an integer to the power of a positive integer on CPU ([26020](
* [ROCm] Enable jit fusion ([22872](
* [ROCm] Use MIOpen for transpose convolutions ([26172](

JIT Improvements

* Enable CPU fused kernel on Windows ([25578](
* Expose an API to iterate all the registered operators ([23207](
* Include recursive class compilations in error call stack ([23454](
* Substantial improvements to saved model format speed and size.
* Compress debug symbols when serializing TorchScript models. ([23659](
* Compress all non-Tensor components of a serialized TorchScript model. ([23723](
* Perform string uniquing by value in pickle serialization. ([23741](
* Implement a bunch of pickle serialization features that optimize for size. ([23759](
* Implement more size-oriented opcodes in the depickler. ([26454](
* Cache node operators to speed up optimization ([24827](
* Allow forward hooks in tracing ([23613](
* Add Pickler C++ API ([23241](
* Open up AliasAnalysisKind for any ops ([23810](
* Add the ability to compile exports on traced modules ([24298](
* Make `NoneType` a subtype of `Optional[T]` ([25361](

ONNX Exporter Improvements

In PyTorch 1.3, we have added support for exporting graphs with ONNX IR v4 semantics, and set it as default. We have achieved good initial coverage for ONNX Opset 11, which was released recently with ONNX 1.6. Further enhancement to Opset 11 coverage will follow in the next release. We have enabled export for about 20 new PyTorch operators. Also, we have focused on enabling the export for all models in torchvision. We have introduced some necessary groundwork for that in this release, e.g., accepting PyTorch models with inputs/outputs of Dict or String. We continue to work on torchvision models, such as FasterRCNN and MaskRCNN, to enable their export.

Adding Support for ONNX IR v4

* Provide an option to exclude the weights from model inputs ([23284](
* Make graph inputs without weights the default ([26146](

Adding Support for ONNX Opset 11

* Introduce ONNX Opset 11 support ([23739](
* Add export for torch.Interpolate in Opset 11 ([24805](, [27179](
* Add export for tensor.gather, tensor.scatter and tensor.scatter_add in Opset 11 ([24790](
* Add export for tensor.clamp in Opset 11 ([25797](
* Add export for torch.topk and torch.sort in Opset 11 ([25739](

Exporting More Torch Operators/Models to ONNX

* Export torch.pixel_shuffle ([23739](
* Export torch.multinomial ([23581](
* Export torch.norm’s frobenius_norm ([23536](
* Export torch.std ([22310](
* Export torch.empty and torch.empty_like ([24166](
* Export torch.rsqrt ([24153](
* Export torch.log1p ([25808](
* Export torch.unique ([25050](
* Export torch.gelu ([24475](
* Export tensor.index_fill and tensor.index_copy ([23052](
* Export torch.round ([26126](
* Export torch.baddbmm ([25738](
* Export torch.remainder ([24410](
* Export torch.cumsum ([24476](
* Export tensor.size with negative axis ([26436](
* Export RNN/LSTM with h0/c0 initial state ([22813](

Enhancing ONNX Export Infra

* Enable exporting PyTorch models which have Dict and String as inputs and outputs ([25889](
* Systematically solving mismatched types caused by implicit type conversion for binary arithmetic operators by adding an ONNX type conversions pass. ([24378](
* Correctly validate dynamic axes names. ([23974](
* Enable ONNX Runtime tests for Opset 10 and partially for Opset 11 ([22993](

Other Improvements

* Error checking: many operators now perform a strides check on the output tensor and raise an error if it contains internal overlaps that would lead to incorrect results ([23063](
* `torch.det/logdet/slogdet` Allowing batching ([22909](
* `torch.logical_not` Add new operator ([23839](
* `torch.logical_xor` Add new operator ([23847](
* `torch.symeig` Improve the stability of gradient updates ([23018](
* `torch.eye` Enable for bool and half ([24148](
* `torch.tril / triu` Enable for bool and half ([24163](
* `torch.logical_not/xor` support non-bool tensors. ([23916](, [23978](
* `torch.index_select` Implement indexing methods for sparse tensors ([24937](
* `torch.lu_solve` Enable broadcasting of batch dimensions ([24333](
* `torch.cholesky` Enable batches greater than 262140 ([24438](
* `torch.det` Simplify generation of singular matrices to avoid numerical issue on PowerPC ([25773](
* `torch.erfinv` In the CUDA implementation, use erfinv() for double to preserve accuracy ([25337](
* `torch.erfinv` Add a float version of erfinv on CPU ([26070](
* `` Updates autograd engine to respect streams set in forward ([8354](
* `torch.backends.mkldnn.enabled` Allow disabling MKLDNN at runtime ([25459](
* `torch.cholesky_solve` Add derivative ([26185](
* `torch.cholesky_inverse` Add derivative ([26451](
* `torch.polygamma` Ensure that n is non-negative ([26294](
* `torch.pinverse` Enable batching ([26095](
* `torch.digamma/trigamma` Fix type mismatches on CUDA ([25791](
* `torch.where` Enable for bool tensor on CUDA ([26430](
* `torch.load` default encoding change to 'utf-8' ([26421](
* `torch.repeat_interleave` Respect the current stream ([26946](
* `torch.bernoulli_` Implement for bool tensors ([25076](
* `torch.norm` Fix nuclear norm with requires_grad=True ([26303](
* `torch.hub.download_url_to_file` Make function public ([26723](
* `nn.modules.conv` add padding_mode to repr ([23996](
* `nn.Transformer` Extend to support BERT (gelu) ([24181](
* `nn.BatchNorm2d` Add support for non-affine batch norm with float stats and half inputs ([22750](
* `nn.Parameter` Fix type hints ([25586](
* `nn.CTCLoss` Improve error message ([26325](
* `nn.Conv` Allow batch size of 0 ([26214](
* `nn.LSTM/GRU` enable double backward for non-cudnn ([26660](
* `optim.Adagrad` Add epsilon argument ([24980](
* `optim.LBFGS`  Change default tolerance_grad to 1e-7 ([25240](
* `optim.lr_scheduler.OneCycleLR` Add new 1cycle learning rate scheduler ([25324](
* `optimizer.step` Fix type annotation ([26930](
* `bfloat16` Add support for sub, mul, and div on CPU ([22851](
* `bfloat16` Enabled comparison ops on CPU ([24182](
* `bfloat16` Enabled masked methods ([24183](
* `bfloat16` Enabled and ([24224](
* `bfloat16` Enable log_softmax and CrossEntropyLoss ([24457](
* `bfloat16` Enabled conv methods ([26167](
* `bfloat16` Enabled dtype on CUDA ([26407](
* `quasirandom.SobolEngine` Use random seed if not specified ([24884](
* `` Add possible out of shared memory error message ([25730](
* `cuda.set_rng_state` Add type hint ([26200](
* Zero sized tensor support for repeat_interleave ([23717](
* Recommend `~` and `bitwise_not()` when user tries to apply neg (`-`) on a bool tensor. ([23621](
* Fix double backward of inplace op on view ([23502](
* `autograd.grad` Validate shapes of outputs ([25349](
* Enable libflame as a LAPACK choice ([25795](
* Fix race condition in CUDA initialization ([25788](
* Include `iteration_` in SGD optimizer serialization ([26906](
* [C++] `torch::tensor` Fix an ambiguous overload issues in constructor ([26890](
* [XLA] Check device before accessing data_ptr in PackLayer ([26056](
* [XLA] Allow overwriting catch-all kernels ([25947](

Bug Fixes

TensorBoard Bug Fixes

* `SummaryWriter.add_graph`: Fix empty graph output in some cases ([25599](
* Update Caffe2 contrib TensorBoard logging to not require TensorFlow ([25259](
* `SummaryWriter.make_video`: Fix write_gif call to moviepy for newer lib ([21218](

C++ API Bug fixes

* Fixes mismatch of device and data type when computing `step_size` in LBFGS optimizer ([25909](


* Fix list comprehension that change the type of the original iterable ([24271](
* Fix double copying of constants during recursive scripting ([24412](
* Fix frontend error message ([23576](
* Clear recursive error stack on each compilation ([23458](
* Fix bugs in assignment to optionals ([25059](
* Make `torch.jit.Attribute` work when `PYTORCH_ENABLED=0` ([23851](
* Fix unicode in comments causing compilation errors ([24218](
* Correctly raise an error if an `nn.Module` has not been initialized but you try to script it ([24852](
* Fix annotated assignment to variables ([25094](
* dictPop: dereference dict.find() iterator before calling dict.erase() ([25056](
* Fix closures that always throw ([25278](
* Add source location to class instantiation error ([24990](
* Fix `AliasAnalysisKind::PURE` on MSVC ([25375](
* Emit script function calls during tracing. ([25089](
* Resolve `NamedTuple` types properly in Python ([26443](
* Fix schema matching of tuples to vartype lists ([25944](
* Correctly preserve ignored function return value type ([25262](
* Fix missing newline in compiled from source range highlight ([25802](
* Fix use-after-free bug in `optional` ([25965](
* Fix torch.arange traced as constant ([25363](
* Preserve module names in recursive script ([24505](
* Properly resolve ignored module method type annotations ([26683](
* Make `is_optional` check more robust ([26312](
* Fix builtin lookup for Python functions ([26688](
* Typevar matching fix + implicit conversions from Scalar to int/float ([26453](
* Fix range for non-int inputs and pow implementation ([26926](

Other Bug Fixes

* `Tensor.pin_memory` Avoid copying tensors that are already pinned ([23484](
* `torch.cdist` Fix incorrect gradients on CUDA non-batch tensors ([22915](
* `torch.from_numpy` Fix failure on windows for int32 ([25139](
* `torch.tensor` Fix memory leak creating a tensor from numpy ([24267](
* `torch.index` Don't save `self` in `index` backward ([25594](
* `torch.bincount` Fix int32 overflow on CUDA ([25748](
* `torch.bernoulli` Fix the distribution sampler ([26864](
* `torch.pow` Fix precision ([25476](
* `torch.cdist` Fix gradient computation when first arg is 1xn ([26254](
* `torch.scatter_add_` Fix scatter CPU kernel when (input size, src size) > index size ([25839](
* `nn.ConvTranspose2d` Fixed an error with float16 inputs and weights on CUDA.  ([23552](
* `nn.CTCLoss` Fix zero-length targets on CUDA ([23298](
* `nn.Conv2d` Correct an overflow in an error message ([25146](
* `optim.Adam` Apply a small mathematical fix ([23737](
* `dataloader` Fix IndexError on shutdown if not all workers are started ([23761](
* `Tensor.repeat` Fix crash for 0 repeats ([23766](
* `torch.pin_memory` only use one thread ([25111](
* `distributions.Uniform,HalfCauchy,Gamma` Fix `log_prob` when value is a float ([23017](
* Fix typing error for Padding with asymmetric signatures ([24895](
* Avoid race condition in `intrusive_ptr.reset_()` ([24464](
* `torch.hub`: Fix SSL cert issue for hub in Python 2 ([25042](
* Fix int overflow issue in CUDA kernels. ([24818](
* `Module.cuda` Fix type hints ([25018](
* Fix bug in assertNotEqual for int tensors ([25412](
* Fix the `in` operator incorrectly returning true ([24156](
* Fix bugs in bulk loader when `batch_size=None` or with namedtuple ([26065](
* Fix serialization issue in big endian arch ([26383](
* Fix `Vec256::abs()` for floating point when applied on -0.0 ([26422](
* Fix cyclic reference in _LRScheduler ([25776](
* Fix a build failure on s390x ([26233](
* [XLA] Fix tensor construction from array ([24283](

Documentation Updates


* `torch.distributed` Error phrasing in torch.distributed helper functions ([25574](
* `torch.distributions.negative_binomial` clarified ambiguous doc string in NegativeBinomial ([25923](


* Add technical documentation for the serialization format ([23456](
* Fix trace docs ([24191](
* Add `trace_module` to docs ([24258](
* Cleanup distinction around `script` and `trace` ([24208](
* Fix `item()` call in docs ([25404](
* Misc doc updates / fixes ([24371](, [24445](

Other documentation improvements

* `torch.record_stream` Add documentation ([24078](
* `torch.fold` Describe the relation between fold and unfold operations ([24840](
* `torch.argmax` Fix incorrect doc ([23775](
* `torch.random` add docs ([23553](
* `torch.empty_strided` Add docs ([23735](
* `torch.bitwise_not` Document for bool tensors ([23800](
* `torch.cdist` Add documentation ([25221](
* `torch.where` Update parameter names in doc ([25554](
* `torch.atan2` Clarify and correct the doc ([26180](
* `nn.functional.bilinear` Added documentation ([24951](
* `nn.functional.upsample` Fix align_corners doc ([23707](
* `nn.Transformer` Fixed an error in the example ([24837](
* `optim.lr_scheduler.CosineAnnealingWarmRestarts` Add documentation ([25421](
* `optim.SGD` Updated with subscripts ([23985](
* `optim.RMSprop` Highlighting in the doc that square root comes before adding epsilon ([26735](
* `autograd.detect_anomaly` Add a warning ([26615](
* Improve dataloader docs on when auto-batching is disabled ([23671](
* Updated docs and added deprecation warnings to acknowledge a bool tensor ([22261](
* Document benchmarking practice for CUDA ([23910](
* Add ASAN instructions to ([24848](


```python
>>> torch.normal(torch.zeros(3), torch.ones(3), out=torch.randn(2))
RuntimeError: inconsistent tensor, output size ([2]) is not the same as broadcasted mean and std size (3)
```

`Tensor.geometric_` no longer supports integral Tensors ([31878](

Previously, on CPU devices, `Tensor.geometric_` supported Tensors with integral dtype. Now, it only supports floating point. We removed support for this because it doesn’t make sense for `geometric_` to operate on integral dtypes.

Changed `torch.floor_divide` `input` positional argument name to `self`  ([34552](

Before PyTorch 1.5, `torch.floor_divide` took two positional arguments: `torch.floor_divide(input, other)`. We’ve changed the name of the `input` argument to `self`; this will break code that called `torch.floor_divide` via keyword argument. For example:

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
torch.floor_divide(input=x, other=y)
</pre></sub></td>
<td><sub><pre lang="python">
# Either of the following works:
torch.floor_divide(self=x, other=y)
torch.floor_divide(x, y)
</pre></sub></td>
</tr>
</table>
</p>


RNN / GRU / LSTM layers ([34322](

* Instead of returning `RNNOutput`, RNN / GRU `forward` method now returns `std::tuple<Tensor, Tensor>`, and LSTM `forward` method now returns `std::tuple<Tensor, std::tuple<Tensor, Tensor>>`, matching Python API.
* LSTM forward method’s hidden state parameter now has type `torch::optional<std::tuple<Tensor, Tensor>>`, matching Python API.
* RNN / LSTM / GRU layers now have `forward_with_packed_input` method which accepts `PackedSequence` as input and optionally hidden state, matching the `forward(PackedSequence, ...)` variant in Python API.
* RNN / LSTM / GRU layers no longer have these fields: `w_ih` / `w_hh` / `b_ih` / `b_hh`. Instead, to access the weights and biases of the gates, users should do e.g. `rnn->named_parameters()["weight_ih_l0"]`, which mirrors the Python API `rnn.weight_ih_l0`.
* In `RNNOptions`
* `tanh()` / `relu()` / `activation` are removed. Instead, `nonlinearity` is added which takes either `torch::kTanh` or `torch::kReLU`
* `layers` is renamed to `num_layers`
* `with_bias` is renamed to `bias`
* In `LSTMOptions`
* `layers` is renamed to `num_layers`
* `with_bias` is renamed to `bias`
* In `GRUOptions`
* `layers` is renamed to `num_layers`
* `with_bias` is renamed to `bias`

Upsample layer / F::interpolate function ([35025](

* There are changes to `UpsampleOptions` and `InterpolateFuncOptions`:
* `size` is changed from `std::vector<int64_t>` to `c10::optional<std::vector<int64_t>>`. If you want to pass a list of `int64_t` to this argument, you must pass it as `std::vector<int64_t>`.
* `scale_factor` is changed from `std::vector<double>` to `c10::optional<std::vector<double>>`. If you want to pass a list of `double` to this argument, you must pass it as `std::vector<double>`.
* F::multilabel_margin_loss / F::multilabel_soft_margin_loss functions ([35163](
* `torch::nn::functional::MultiLabelMarginLossFuncOptions` is renamed to `torch::nn::functional::MultilabelMarginLossFuncOptions`
* `torch::nn::functional::MultiLabelSoftMarginLossFuncOptions` is renamed to `torch::nn::functional::MultilabelSoftMarginLossFuncOptions`
* The deprecated `torch::nn::BatchNorm` is removed in favor of `torch::nn::BatchNorm{1,2,3}d`
* The deprecated `torch::nn::FeatureDropout` is removed in favor of `torch::nn::Dropout{2,3}d`
* The deprecated `torch::nn::modules_ordered_dict` is removed. Users should write `Sequential sequential({{"m1", MyModule(1)}, {"m2", MyModule(2)}})` instead.
* The deprecated `torch::nn::init::Nonlinearity` is removed, in favor of these enums: `torch::kLinear `/ `torch::kConv1D` / `torch::kConv2D` / `torch::kConv3D` / `torch::kConvTranspose1D` / `torch::kConvTranspose2D` / `torch::kConvTranspose3D` / `torch::kSigmoid` / `torch::kTanh` / `torch::kReLU` / `torch::kLeakyReLU`
* The deprecated `torch::nn::init::FanMode` is removed, in favor of these enums: `torch::kFanIn` / `torch::kFanOut`


* `Optimizer::step` now accepts a closure function as optional input and returns a tensor, and `LossClosureOptimizer` is removed (34790) (34957). If you had a custom optimizer class defined as:

```cpp
struct MyOptimizer : Optimizer {
  using Optimizer::Optimizer;
  void step() override {...}
};
```

* you would need to update your optimizer class definition as follows:

```cpp
struct MyOptimizer : Optimizer {
  using Optimizer::Optimizer;
  torch::Tensor step(LossClosure closure = nullptr) override {
    // return `torch::Tensor()` if `closure` is nullptr
    // (i.e. we are not computing the loss)
    return torch::Tensor();
  }
};
```

* Adagrad ([29335](
* In `AdagradOptions`, `learning_rate` is renamed to `lr`.
* In `Adagrad`, `sum_buffers` and `step_buffers` are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

```cpp
auto& param_state = static_cast<AdagradParamState&>(

// Use the following to access parameter state:
// param_state.sum()
// param_state.step()
```

* SGD ([32592](
* In `SGDOptions`, `learning_rate` is renamed to `lr`.
* In `SGD`, `momentum_buffers` is now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

```cpp
auto& param_state = static_cast<SGDParamState&>(

// Use the following to access parameter state:
// param_state.momentum_buffer()
```

* Adam ([33730](
* In `AdamOptions`:
* `learning_rate` is renamed to `lr`
* `beta1` and `beta2` are replaced by a tuple `betas`
* In `Adam`, `step_buffers`, `exp_average_buffers`, `exp_average_sq_buffers` and `max_exp_average_sq_buffers` are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

```cpp
auto& param_state = static_cast<AdamParamState&>(

// Use the following to access parameter state:
// param_state.step()
// param_state.exp_avg()
// param_state.exp_avg_sq()
// param_state.max_exp_avg_sq()
```

* RMSprop ([33450](
* In `RMSpropOptions`:
* `learning_rate` is renamed to `lr`
* In `RMSprop`, `square_average_buffers`, `momentum_buffers` and `grad_average_buffers` are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

```cpp
auto& param_state = static_cast<RMSpropParamState&>(

// Use the following to access parameter state:
// param_state.square_avg()
// param_state.momentum_buffer()
// param_state.grad_avg()
```

* LBFGS ([34564]( ([34957](

* In `LBFGSOptions`:
* `learning_rate` is renamed to `lr`
* `max_eval`’s type is changed from `int64_t` to `c10::optional<int64_t>`
* `tolerance_grad`’s type is changed from `float` to `double`
* `tolerance_change`’s type is changed from `float` to `double`
* `history_size`’s type is changed from `size_t` to `int64_t`
* In `LBFGS`, `d`, `H_diag`, `prev_flat_grad`, `t`, `prev_loss`, `ro`, `al`, `old_dirs`, `old_stps`, `func_evals` and `state_n_iter` are now removed, and parameter state should be accessed by calling the accessors on the parameter’s corresponding state object. For example:

```cpp
auto& param_state = static_cast<LBFGSParamState&>(

// Use the following to access parameter state:
// param_state.d()
// param_state.H_diag()
// param_state.prev_flat_grad()
// param_state.t()
// param_state.prev_loss()
// param_state.old_dirs()
// param_state.old_stps()
// param_state.func_evals()
// param_state.n_iter()
```

Removed `AutoGIL/AutoNoGIL` in favor of `pybind11::gil_scoped_*` functions ([34301](

If your code released or acquired the GIL via `AutoNoGIL` or `AutoGIL`, please change the invocations to `pybind11::gil_scoped_release` or `pybind11::gil_scoped_acquire`, respectively.


* `torch::tensor(floating-point values)` will always produce tensor of default dtype, and `torch::tensor(integer values)` will always produce tensor of `torch::kLong` dtype, matching Python API behavior ([32367](
* `torch::Tensor::base()` is renamed to `torch::Tensor::_base()` , matching Python API. (33316)
* Renamed TensorTypeId to DispatchKey ([32154](
* Throw an error if nbytes is called on a sparse tensor. ([33897](


Simple Executor Is Now On By Default

The simple executor skips a number of fusion-related passes and analyses that are very time-consuming. Disabling these optimizations fixes pathologically long compilation times. Users who rely on GPU fusion for their desired performance profile should turn on the profiling executor. We provide C++ and Python APIs to enable it:

* In Python, call `torch._C._jit_set_profiling_mode(True)` before you call your model for the first time.
* In C++, include `<torch/csrc/jit/runtime/graph_executor.h>` and set `getProfilingMode() = true` before you invoke your model for the first time.
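As a minimal sketch from Python (using the internal `torch._C._jit_set_profiling_mode` hook named above; internal `_C` APIs may change between releases):

```python
import torch

# Enable the profiling executor before the first invocation of a scripted model.
# The setter returns the previous value, so the old mode can be restored.
prev = torch._C._jit_set_profiling_mode(True)

@torch.jit.script
def scaled_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + 2.0 * y

out = scaled_add(torch.ones(3), torch.ones(3))  # first call profiles the graph

torch._C._jit_set_profiling_mode(prev)  # restore the previous executor mode
```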


**Remove qconfig_dict in top level eager mode quantization API** ([31972](

In eager mode quantization, one needs to manually insert quant and dequant stubs in a model to specify where activations are quantized. Having a qconfig_dict that specifies the quantization configuration for each module is not useful as one needs to manually modify the model with quant/dequant stubs. The new API makes it explicit that the model needs to be manually modified for quantization.

```python
# Previously, qconfig_dict was an optional argument to prepare:
def prepare(model, qconfig_dict=None, inplace=False): ...

# Now replaced with:
def prepare(model, inplace=False): ...
```
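For illustration, a minimal eager-mode flow under the new API might look like this (the toy module `M` and its layers are made up for the example):

```python
import torch

class M(torch.nn.Module):
    """Toy model with manually inserted quant/dequant stubs."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # quantization starts here
        self.fc = torch.nn.Linear(4, 4)
        self.dequant = torch.quantization.DeQuantStub()  # ...and ends here

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

m = M()
m.qconfig = torch.quantization.default_qconfig  # per-module config replaces qconfig_dict
prepared = torch.quantization.prepare(m)        # note: no qconfig_dict argument
out = prepared(torch.randn(2, 4))               # observers record stats during forward
```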


Functional API for Distributed Autograd and Distributed Optimizer

Callers must now pass `context_id` to `torch.distributed.autograd.backward()` and `torch.distributed.optim.DistributedOptimizer.step()`.  ([33711](

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer

with dist_autograd.context() as context_id:
    # Forward pass.
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    loss = rref1.to_here() + rref2.to_here()
    # Backward pass.
    dist_autograd.backward([loss.sum()])
    dist_optim = DistributedOptimizer(
        [rref1, rref2],
</pre></sub></td>
<td><sub><pre lang="python">
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch import optim
from torch.distributed.optim import DistributedOptimizer

with dist_autograd.context() as context_id:
    # Forward pass.
    rref1 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 3))
    rref2 = rpc.remote("worker1", torch.add, args=(torch.ones(2), 1))
    loss = rref1.to_here() + rref2.to_here()
    # Backward pass.
    dist_autograd.backward(context_id, [loss.sum()])
    dist_optim = DistributedOptimizer(
        [rref1, rref2],
</pre></sub></td>
</tr>
</table>
</p>


Disallow sending CUDA tensors over RPC

The motivation is to prevent potential invalid-device errors when the number of devices on the sender and the receiver does not match. However, applications can always move CUDA tensors to CPU before sending (33604).

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
import torch
import torch.distributed.rpc as rpc
rpc.init_rpc("worker0", rank=0, world_size=2)
x = torch.zeros(2, device=0)
ret = rpc.rpc_sync("worker1", torch.add, args=(x, 3))
</pre></sub></td>
<td><sub><pre lang="python">
import torch
import torch.distributed.rpc as rpc
rpc.init_rpc("worker0", rank=0, world_size=2)
x = torch.zeros(2, device=0)
ret = rpc.rpc_sync("worker1", torch.add, args=(x.cpu(), 3))
</pre></sub></td>
</tr>
</table>
</p>

New Features


Added new functional autograd API ([34066](

* See Highlights for more details

New `__torch_function__` API Override Mechanism ([30730](, [32194](, [32799](, [34240](, [34303](

We introduced `__torch_function__`, an API override mechanism for subclassing `torch.Tensor` in Python. This is useful for creating custom objects that implement the `torch.*` APIs. It currently supports overriding most `torch.*` and `torch.nn.functional` APIs; we’ve also planned future support for subclassing `torch.Tensor` (see tracking issue [22402](
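As a sketch of the protocol with a toy class (the classmethod-with-`types` form shown here matches later releases; the signature in the initial release differed slightly):

```python
import torch

class DiagonalScalar:
    """Toy duck-typed 'tensor': the value `value` repeated n times on a diagonal."""
    def __init__(self, n, value):
        self.n, self.value = n, value

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.sum and len(args) == 1:
            return args[0].n * args[0].value  # sum of the diagonal entries
        return NotImplemented

total = torch.sum(DiagonalScalar(4, 2.0))  # dispatches to our override
```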

New Operators

* `torch.logical_and` and `torch.logical_or` operations added ([30521](
* `torch.square` added ([30719](
* `torch.bitwise_and` added ([31104](
* `torch.cummax`, `torch.cummin` added ([32169](, [32238](, [32537](, [33492](
* `torch.floor_divide` ,  `Tensor.floor_divide` added ([30493](, [34552](
* `torch.true_divide` , `Tensor.true_divide` added, analogous to Python’s and NumPy’s (true) division ([34236](, [34794](
* `nn.functional.hardsigmoid` added ([34545](
* Added PCA and SVD for low-rank matrices (`torch.pca_lowrank`, `torch.svd_lowrank`), and `torch.lobpcg` for positive-definite generalized eigenvalue problems ([34721](
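A quick sketch of a few of these new operators in action:

```python
import torch

# Element-wise logical ops on bool tensors
both = torch.logical_and(torch.tensor([True, False, True]),
                         torch.tensor([True, True, False]))

# true_divide always yields a floating-point result, like Python's `/`
q = torch.true_divide(torch.tensor([7, -7]), torch.tensor([2, 2]))

# square, and cummax, which returns (values, indices)
s = torch.square(torch.tensor([1, 2, 3]))
vals, idx = torch.cummax(torch.tensor([1, 3, 2]), dim=0)
```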


* `distributions.von_mises` added ([33418](
* `distributions.mixture_same_family` : Added support for mixture distributions ([22742](, [33408](
* `distributions.transforms.TanhTransform` added ([19785](
* `distributions.continuous_bernoulli` added ([34619](


* NN modules / functionals
* `torch::nn::MultiheadAttention` ([27309](
* `torch::nn::RNNCell` / `LSTMCell` / `GRUCell` ([34400](
* `torch::nn::AdaptiveLogSoftmaxWithLoss` ([29076](
* `torch::nn::utils::rnn::PackedSequence` / `pack_padded_sequence` / `pad_packed_sequence` / `pack_sequence` / `pad_sequence` ([32387](, [33652](, [34185](
* C++ tensor indexing ([30424](, [32841](, [30427](, [34255](
* Please see docs:
* Operators
* C++ API parity: `isinf` ([31099](
* Autograd
* Add `at::Tensor::retain_grad` API ([33349](
* C++ extensions
* Add option to use ninja to compile ahead-of-time `cpp_extensions` (32495, [33084](
* Added support for Pytorch C++ extensions to use HIP ([32669](


* Allows Python application to create subclass of C++ `c10d.Store` using pybind11 trampoline class  [30415](


* Loading module from android asset ([30378](
* Torchscript print to logcat ([31456](


* qnnpack TanH ([31013](
* Adding quantized clamp kernel ([30541](
* Quantized H Tangent function ([31031](
* QNNPACK: Add support for dynamic quantization. ([31896](
* Add operator support for dynamic quant on mobile ([32479](
* Adding native qconcat ([32252](
* FP16 dynamic quantized Linear ([32331](
* Add support for Dynamic LSTM quantization on Mobile ([32757](
* Quantized sigmoid function ([31851](
* Quantized leaky relu ([33004](
* Add a quantized batch_norm operator ([33080](
* Add Quantized BatchNorm2d module ([33109](
* Add the 3d avg pool for video related model ([33339](
* Add quantized_hardtanh ([34097](
* Add quantized ELU activation ([34267](
* Add the 3d upsample quantized op for video model ([34594](
* Add the quantized batch_norm3d and also batch_norm3d fused with relu operators ([34702](
* Add quantized implementation of hard sigmoid ([34607](


* [Experimental] Enable autograd profiler to work with RPC ([31381](, [34398](, [30677](, [31346](, [31380](
* [Experimental] Allow calling remote TorchScript functions using RPC ([32466](, [33190](, [32990](, [32959](,  [33526](, [33992](, [33582](, [32197](, [33329](, [34183](



*  `nn.RNN`: Ensure MIOpen is called on same stream as operator ([30672](
* Fixed asserts in CUDA kernels ([31276](, [31297](
* Enable BFloat16 support for convolutions ([30948](
* Abstract atomic add calls ([31992](
* Install complete set of headers for ROCm build ([32076](
* Adjust `elementwise_kernel` settings on ROCm ([32609](
* `nn.BatchNorm{1,2,3}d`: Use `C10_WARP_SIZE` to fix functionality on HIP vs CUDA for gradient computation ([33098](
* Enabled Bfloat16 type for activation functions and `batch_norm` ([32065](
* Added ability to enable/disable MIOpen at runtime ([33118](
* Enable BFloat16 type for pooling ops ([34166](
* `torch.pdist`: improved precision by enabling double `__shfl_down` ([34103](
* Enabled BFloat16 type for loss functions and few misc ops required for resnet50 ([34469](
* Enabled BFloat16 type for EmbeddingBag, Index, and Sigmoid ops ([34630](
* Enabled 3D batch norms through MIOpen ([33262](
* Enabled 3D convolutions through ROCm ([33067](
* `nn.RNN`: Check if weights need to be flattened ([34265](


* NN modules / functionals
* Allow skipping default arguments in module's forward method when module is used in `torch::nn::Sequential` ([33027]( ([33718](
* Make `torch::nn::Sequential::push_back(AnyModule)` methods public ([34208](
* Refactor RNN / GRU / LSTM layers to match Python API ([34322](
* For `Conv{1,2,3}d`, `padding_mode` now accepts `torch::kZeros` / `torch::kReflect` / `torch::kReplicate` / `torch::kCircular`, matching Python API behavior. ([35023](
* Fix `F::interpolate` and `torch::nn::Upsample` implementation to match Python API behavior ([35025]( ([36274](
* Renaming: MultiLabelMarginLossFuncOptions -> MultilabelMarginLossFuncOptions, MultiLabelSoftMarginLossFuncOptions -> MultilabelSoftMarginLossFuncOptions ([35163](
* Optimizers
* All existing optimizers in the C++ API (Adagrad / SGD / Adam / RMSprop / LBFGS) have the following changes to achieve parity with the Python API: ([29335]( ([30739]( ([32592]( ([33730]( ([33450]( ([34790]( ([34564]( ([34957]( ([35001]( ([36033]( (36245)
* step function implementation is changed to behave the same as Python equivalent
* Constructor now accepts `std::vector<OptimizerParamGroup>` as input
* `optimizer.add_param_group(...)` can be used to add parameter group to an existing optimizer
* `optimizer.state()` should be used to access parameter state
* autograd
* Renamed `at::Tensor::base()` to `_base()`, matching Python API (33316)


* Allow TCPStore to pick a port to bind to ([31674](
* Enhance NCCL watchdog to actively abort communicators for timed out ops ([32338](
* Adding DDP Design Note ([32158](
* Recommend using DDP over DataParallel ([35063](


* `distributions.independent`: added explicit string representation ([33676](
* `categorical.sample`: Reduced memory overhead ([34900](
* `distributions.MultivariateNormal`: improved numeric stability and performance ([32092](


* Add module level qpl logging. ([30906](
* Expose setNumThreads to android api ([31033](
* remove unused SparseCPUType from mobile build ([33517](
* make sure mobile build work with dynamic dispatch ([34038](
* support for custom mobile build with dynamic dispatch ([34055](
* Add watchOS support ([33318](
* speed_benchmark_torch switch to log latency from dataset level to row level ([34598](


**Exporting More Torch Operators to ONNX**

In PyTorch 1.5, we have added support for 10 additional operators and also enhanced support for another set of 10+ existing operators. We have also added support for exporting large models (> 2GB) to ONNX. Additionally, we have made enhancements and optimizations to the export of ScriptModules and will continue to do that in the next release. We have also made improvements to the custom op export experience.

* Export dynamic unbind, split and getitem ([29136](
* Export torch.new_zeros ([34077](
* Export Im2col ([30972](
* Export bitwise_not for bool ([28439](
* Export logsoftmax with dim != -1 ([30433](
* Export einsum ([32716](
* Export aten::copy_ and aten::index_put to ONNX opset 11 ([26941](
* Export floor_divide ([31081](
* Export one_hot ([34454](
* Export torch.take ([33061](
* Export bool type index mask ([32445](
* Export split with list of sizes ([33161](
* Export scalar tensor for split ([32493](
* Export flatten to accept negative indices in opset 11 ([30751](
* Export sort with negative axes ([31971](
* Export Interpolate to support scale ([28324](, [31526](, [32554](
* Export quantized concat ([30887](

**Enhancing the Support for ScriptModule**

* Fixed access to element in size tensor for scripting  ([32652](
* Export Conv in TorchScript module ([30618](
* Export Dim operation in TorchScript module ([31928](
* Export randnlike in TorchScript module ([32830](
* Partially support tensor lists in loop/concat/stack ([30126](

**Enhancing Existing Export Logic**

* Updating ONNX checker logic. ([33522](
* Adding ONNX large model export support in  exporter ([33062](
* Extend op registration ([32943](
* Support op registration if name starts with underscore ([32017](

**Optimizing Exported ONNX Graph**

* Try exporting ONNX with force_outplace=False ([29466](
* Enable constant folding ([29834](
* Added constant folding for ONNX mul, div, sqrt ops ([32077](
* Enable constant folding for Reshape ([31054](

**Adding Utility Functions and Refactoring**

* Added ONNX model checker to ONNX export ([32298](
* Export custom ops ([29752](
* Upgrade exported ONNX IR version to 6 ([31025](
* Provide names for operator nodes in ONNX exported graph ([27342](
* Update ONNX landing page since 1.3 ([32805](
* Turn ONNX_ML into a proper build option ([33424](

Operator Benchmark

* Added small input shapes to test operator overhead ([30617](
* Added `binary_test` to benchmark binary ops ([31326](
* Added `Tensor.copy_` operator ([31327](
* Removed option to wipe cache because it did not help with variance ([31334](
* Added `torch.diag` ([32597](


* Guard against copying from quantized Tensor to non-quantized Tensor ([29660](
* Add assert for min, max, qmin, qmax for ChooseQuantizationParams ([32739](
* Support broadcast for quantized mul kernel ([30442](
* Make FakeQuant use `REGISTER_DISPATCH` ([33682](
* Set alias analysis kind to `FROM_SCHEMA` for qadd, qmul, qclamp, qconcat ([33359](
* Migrate `fake_quant_slice` to TensorIterator ([33744](
* Parallelize quantize and dequantize ([33765](
* Make FP16 RNN use new prepack op ([34339](
* Refactor QAT Conv module for better extensibility ([30362](
* Use non-inplace for insert observer pass ([34190](


* Add default arguments for `init_method` ([30208](
* By default ignore RRef leaks during shutdown ([30217](
* Robustify `rpc_agent` handlers with generic Future ([31224](
* Fix error message in incorrect `rref.localValue()` call ([31199](
* Add `RpcAgent::getWorkerInfos()` API to return all `WorkInfo`s in the group ([30241](
* Add local shutdown to process group agent ([30330](
* Add `RRef.str()` API to return a string representation of the RRef ([30609](
* Adding Debug Info for RRef Context ([30610](
* Add `get_metrics` and `get_debug_info` to RPC agent ([30833](
* Adding debugging metrics to process group agent ([30884](
* Add glue code to collect debug info from all components ([30888](
* Make RRef leak detection always print a warning log ([31922](
* Allow multiple backward passes to accumulate gradients. ([32506](
* Allow RRef local creation with IValue objects ([33263](
* Improve ProcessGroup `RpcBackendOptions` Constructor API ([34081](
* Enhanced Error Reporting in Dist Autograd/RPC ([34179](
* Delete all user forks tracked in `RRefContext` before graceful shutdown ([31893](
* Best-effort Error Detection for Using Deleted UserRRefs ([34673](
* Don't run user function until all UserRRefs in the args are confirmed ([34497](
* Support using self as the destination in `rpc.remote` for builtin operators ([34931](
* Add debug info API for distributed autograd. ([30642](
* Propagate errors in `clearAndWaitForOutstandingRpcsAsync`. ([32952](

Type Hints

* DataLoader `default_collate` type hint added ([28935](
* `Tensor.rsub, Tensor.rpow, Tensor.rtruediv, Tensor.map_` type hints were added ([30576](
* `torch.optim`: added more missing type hints ([31130](
* `nn.functional.grid_sample`, `nn.functional.affine_grid`: added missing align_corners annotation ([32492](
* `torch.nn.Parameter` constructor type hint was fixed ([32617](
* `nn.MultiheadAttention`, `nn.Transformer`: added type hints ([28396](
* `torch.optim.LambdaLR` constructor type hint was fixed ([33271](
* `torch.optim`: added missing default value for `LRScheduler.step()` ([32411](
* Make type of `Tensor.type()` more specific ([32353](
* `torch.optim.optimizer.Optimizer`  type hints were fixed ([32900](
* `optim.AdamW` type hints were fixed ([34299](
* ``  subclasses type hints were added ([33679](
* `nn.Sequential`, `nn.ModuleList`, `nn.ParameterList`, `nn.ParameterDict` type hints were fixed ([33686](
* `Tensor.bfloat16()` type hint was added ([33747](
* Binary operator type hints were fixed ([33748](
* `torch.bfloat16`, ``, `Tensor.cuda`, and 10s of other type hints added ([33762](
* `torch.add` type hint was fixed([33935](
* `Tensor.shape` type hint was fixed ([34595](
* Fixed `` imports ([33543](
* `Tensor.__radd__` type hint was fixed ([35231](


* `autograd.detect_anomaly`: added support for Sparse Tensors ([29803](
* `autograd.detect_anomaly`: Error messages now print the current Node name ([33875](
* `autograd.profiler`: added better error message when crashing while profiling multi-worker DataLoader ([31473](
* `autograd.profiler` Enable using `torch.autograd.profiler.record_function` as decorator ([30861](
* `autograd.profiler` Speed up `export_chrome_trace` by up to 4x ([30724](
* `torch.autograd`: added better error message when attempting to fork ([33885](
* `torch.cuda.memory.caching_allocator_alloc`, `torch.cuda.memory.caching_allocator_delete` exposed in Python API ([33860](
* `torch.roll`: added bool tensor support ([31194](
* `torch.flip`: added support for bool tensors ([31267](
* `torch.equal`: added support for bfloat16 CPU scalar types ([30817](
* ``, `torch.load`: added error message for minimum dill version support ([30985](
* `torch.diagonal`: added named tensor support([30193](
* `torch.linspace`: added support for integral types on CPU ([32218](
* `torch.eig`: Added autograd support in the case where eigenvalues are real ([33090](
* `torch.mvlgamma`: improved error message ([32665](
* `torch.no_grad`, `torch.enable_grad`: added support for decorating generator functions ([31792](
* `torch.narrow`: added Tensor overload for `start` ([34317](
* `Tensor.random_`: enabled support for half on CPU ([34030](
* `Tensor.grad`: added warnings when accessing it if it won't be populated for known reasons ([30531](
* `torch.cuda.comm.gather`: improved error message ([27456](
* `nn.functional.max_pool{1,2,3}d`: added named tensor support ([31669](
* `nn.Module.load_state_dict`: Include the contents of the exception in error messages ([32693](
* `nn.MultiheadAttention`: add support for 3D attention mask ([31996](
* `nn.MSELoss` : Added performance warning for using CPU Half ([33021](
* `nn.ModuleList`, `nn.ParameterList`, `nn.ParameterDict`: added more descriptive error messages when attempting to call these like Modules ([29991](
* `nn.init.dirac_`: Added `groups` option for compatibility with initializing group convolutions ([32825](
* Added error message to indicate that reduction operations are not supported for dim >= 64 ([31476](
* Type Promotion: added supports for sparse tensors and arithmetic operations ([30429](
* Enabled indexing for bfloat16 tensors ([31692](
* Add 64-bit indexing support for CUDA Tensors ([33405](
* Added warning when converting a read-only NumPy array to `torch.Tensor` ([33615](
* Set rpath for JNI library on Mac ([32247](
* Updated MAGMA to 2.5.2 for Windows ([30513](, [34205](
* Marked PyTorch incompatible with Python-3.6.0 ([34724](
* Consider `hub_dir` alongside `TORCH_HOME` env variable for storing hub models ([32844](
* Improved dll loading logic on Windows ([33856](
* Error out if legacy ` ` is called on alternate layouts or dtypes ([31485](
* `utils.checkpoint.checkpoint_sequential`: Removed deprecated variadic arguments behavior ([25985](

Bug Fixes


* NN modules / functionals
* `output_ratio` for the `FractionalMaxPool{2,3}d` module and `fractional_max_pool{2,3}d` functional should accept double as data type ([33304](
* For `AdaptiveAvgPool{2,3}d` and `AdaptiveMaxPool{2,3}d`, `output_size` is changed to accept `c10::nullopt` in its elements, matching Python API behavior. ([35022](
* Fix bug in `fractional_max_pool3d_with_indices` implementation ([35024](
* Remove `namespace F = torch::nn::functional` from `torch/nn/modules/batchnorm.h`, so that users don't have to use `F` to alias `torch::nn::functional` if they don't want to ([30684](
* autograd
* For `AutogradContext`, `get_dirty()` is removed and `get_and_bump_dirty()` is added, and the latter always bumps the version counter of the returned tensors (33068)
* Fix allow_unused checking for C++ API (34035)
* Remove `using namespace torch::autograd` from `torch/csrc/api/include/torch/nn/modules/_functions.h` ([34423](
* Operators
* `torch::tensor(floating-point values)` will always produce tensor of default dtype, and `torch::tensor(integer values)` will always produce tensor of `torch::kLong` dtype, matching Python API behavior ([32367](
* Fix `torch::allclose` to handle `std::numeric_limits::lowest()` for integral types ([32978](
* Switch `torch::empty_like` to use `merge_in` to process TensorOptions (33505)


* Allow DDP to detect globally unused parameters ([28883](
* Accept url query when `rank` or `world_size` is specified in Process Group `init_method` URL ([32016](
* Add ability to abort NCCL communicators from the store. ([32895](
* Fix timeout support when initializing process group with TCP store ([33434](
* Abort NCCL communicators before throwing operation timed out ([31128](
* Fix logging for aborted communicators in ProcessGroupNCCL ([33147](
* Fix handling of replica parameters in DataParallel ([33907](
* Specify `requires_grad` for Parameter replica so it's not always set to True by default ([32356](
* Put sparse `allreduce` results to input tensors ([32226](
* Issue a warning when `zero_grad` is used in `DataParallel` ([33064](


* TorchScript compilation fixed for the following ops ([33783](
* `torch.stft`
* ``,
* `torch.lu_unpack`
* `torch.cdist`
* `torch.norm`
* `tensor.tolist()` compilation now supported, requires output type annotation ([33472](
    def foo(float_matrix, scalar_ten):
        # type: (Tensor, Tensor) -> Tuple[List[List[float]], bool]
        out1 : List[List[float]] = float_matrix.tolist()
        out2 = torch.jit.annotate(bool, scalar_ten.tolist())
        return out1, out2
* `torch.rand_like` and other `_like` constructors no longer require additional arguments in TorchScript
* Compilation for `nn.Module` APIs added ([29495](
* `children`
* `named_children`
* `modules`
* `named_modules`
* Support for ModuleList Indexing with Integer Literal ([29236](
* Fixed flipped outputs for `PackedSequence` ([32955](
* Support `index` and `type` properties on `Device` ([32953](
* `device.index`
* `device.type`
* Add remaining `Tensor` properties ([33906](
* `tensor.ndim`
* `tensor.T`
* ``
* `tensor.is_leaf`
* Fix augmented assignment to non-tensor attributes [32993](
* Fixed type resolution for function arguments [29623](
* Previously we resolved types by parsing their names directly, but now TorchScript uses the value of the type directly from Python
* This allows types like `torch.device` to be used
* `len` on tuples containing different types [35768](


* Fix exception message in Java Tensor ([30205](
* Fix the crashes for c++ not able to find java class through Jni ([30390](
* Add DoNotStrip to nativeNewTensor method. ([30472](
* GenericDict/List type use unshapedType() ([30428](
* Support tensors with a storage offset in Java ([31584](
* Fix SIGABORT caused by double exception in PyTorchStreamReader when file not found. ([33243](
* Fix `SELECTED_OP_LIST` file path issue ([33942](
* Fix for handling batch size 0. ([34599](
* fixed AutoGradMode/AutoNonVariableTypeMode uses for mobile callsites
* Use `gettimeofday` on iOS ([30361](


* Fix `weight_norm` export for dim=0 ([31015](
* Fix for constant folding flaky tests ([32546](
* Fix export for avg_pool with default stride  ([33017](
* Fix ONNX CI by moving test data to aws ([33200](
* Fix for random generators export ([33789](
* Fix export of index_put  ([31552](
* Fix for expand -1 dim value ([34069](
* Reduce ONNX test time on CI ([33242](
* ONNX Error Message on Missing Op ([33593](
* Fix exporting `copy_` with index as tensor input ([32801](
* Fix for `rand_like` as well ([33095](
* Added torchvision tests as part of ORT tests ([31835](
* Remove non-ascii character from `torch/onnx/` ([31814](
* Add flag to enable script tests ([32654](
* Skip same tests in ONNX Python3 CI as in Python2 ([31827](
* Fixed `` export ([34794](
* Fixed `aten::size` for opset 11 ([35984](


* Bug fix: Handle missing keys in observer state dict during load ([30357](
* Fix BC for quantized linear ([30481](
* Fix mapping white list to avoid attaching qconfig for DeQuantStub ([30636](
* Fix default instantiation of dynamic quantized LSTM ([31433](
* Use default scale/zero_point in fake_quantize module instead of None ([32318](
* Fix ASAN / potential segfault in quantized Tensor memory allocations. ([29882](
* Don't serialize None values in observer ([32733](
* Enable inplace relu fusion for training ([33105](
* Bug fix in dynamic quantization kernels + better test coverage. ([33320](
* Run weight_post_process for QAT ([33852](
* Fix histogram observer to work with QAT on GPU ([34232](
* Fix the quantized batchnorm2d ([34579](
* Move QScheme ops to c10 ([30134](
* Remove incorrect fp16 dynamic linear/relu op ([32774](


* Fix serialization memory lifetime issue. ([30603](
* Don't crash callee when function does not exist on it, instead return an Exception ([32726](
* Throw the correct Exception on local client based on the `RemoteException` ([32936](
* Attach autograd edges only for tensors requiring grad. ([30904](
* `WireSerializer` should check `has_storage()` ([34626](
* Fixed potential deadlock in python exception handling ([35283](


* `torch.split`: Fixed incorrect gradient computation that assumed the output was not a view ([32044](
* Allowed numpy integer types to be used where we accept Python integers ([30486](
* `torch.unique`, `torch.unique_consecutive`:  fixed bug with zero-element input support ([31211](
* `Tensor.to_sparse`: fixed backward in the non-contiguous tensor case ([31223](
* `torch.index_put`: Added error checks for input tensors’ devices ([31280](
* Ensure we switch the CUDA stream correctly in CUDA operations ([31537](, [31538](, [31541](
* `torch.SparseTensor`: ensure the legacy sparse constructor doesn't interpret Python data as tensor data. ([31490](
* `torch.argmax`, `torch.argmin`: Fixed incorrect behavior on large tensors ([33310](
* `torch.div`: Fixed to throw an error when dividing by integer zero on CPU  ([32629](
* `torch.cos`: Fixed incorrect gradient computation caused by not properly initializing temporary vectors in avx2 code ([32722](, [34281](
* `torch.logspace`: Added missing integer dtype support, fixed precision issues in floating-point implementation ([32744](
* ``: Fixed behavior when passed a `torch.half` input tensor and `torch.float` output tensor ([32831](
* `torch.max`, `torch.min`: Fixed NaN handling ([32541](
* `torch.max`, `torch.min`: Added error check that operand and outputs are on the same device type ([32862](
*  `torch.stack`: Added missing input size checks ([32931](
* `torch.add`: Fixed memory leak on certain platforms ([32478](
* `torch.normal`: Fixed shape checks ([33050](
* `torch.cumsum`: fixed to handle inputs with zero-sized dimensions correctly ([31694](
* `torch.device`: Disallow incorrectly formatted device strings ([29087](
* ``: Disallow passing `out` as one of the input tensors ([30577](
* `torch.pdist`: Added support for large batch sizes ([31593](
* `torch.stft`: Fixed crash when used with `nn.DataParallel` ([31861](
* `torch.autograd`: Ensure the original grad mode is restored during backward ([31884](
* `torch.autograd`: Fixed a race condition by locking graph_task before writing leaf_streams. ([31995](
* `torch.tensordot`: Fixed support for negative dimensions ([31954](
* `torch.cumprod`: Fixed to handle inputs with zero-sized dimensions correctly ([32070](
* `torch.pow`: Fixed the gradient computation when the base is a Tensor or Scalar of zeros ([32062](, [32063](
* `torch.baddbmm`: Fixed bug in corner case ([33538](
* `torch.where`: Added check for consistent devices ([33432](
* `torch.cdist`: Fixed gradient computation for `p=2` and large inputs ([31167](
* ``: Fixed NaN handling ([31666](
* `torch.index_put`: Added handling for large input tensors ([33753](
* `torch.addmm`: Fixed incorrect output when using BLAS backend ([33819](
* `torch.topk` fixed double backward when input has non-finite values ([35253](
* `torch.load`: Avoid problematic pickle usages on Python 3.8.0 and 3.8.1 ([33824](
* ``: Fixed race condition for gradient computation that spans CUDA devices ([31930](
* `Tensor.random_` added check that `from` and `to` are within the Tensor’s dtype bounds ([34033](
* `Tensor.copy_`: Fixed memory overlap check and allowed outputs to be zero-strided tensors if the size is <= 1 along that dimension ([34100](
* `nn.BatchNorm{1,2,3}d`: fixed gradient computation for empty inputs ([32820](
* `nn.BatchNorm`: Fixed behavior for inputs with large batch sizes ([32763](
* `nn.Conv2d`: Fixed 5d weight handling with MKLDNN backend ([34115](
* `nn.Conv3d`: Fixed unstable gradient computation ([34358](
* `nn.Conv{1,2,3}d`: added support for empty batch size([32709](
* `nn.Conv{1,2,3}d`: fixed `CUDNN_STATUS_NOT_SUPPORTED` errors by trying multiple algorithms ([33073](
* `nn.Conv{1,2,3}d`: fixed padding mode support and added additional padding modes (reflection and replication) ([31784](
* `nn.Conv2d`, `nn.Conv3d`, `nn.Conv1d`, `nn.ConvTranspose2d`: Fixed support for batch sizes greater than 2^32 ([31383](, [31379](, [31889](, [34407](, [31510](
* `nn.InstanceNorm`, `nn.GroupNorm`: Added error check for input with exactly one element ([29082](
* `nn.RNN`: Fixed moving RNNs to a device after applying weight norm ([32563](, [32989](
* `nn.MultiLabelMarginLoss`: added support for 0-d tensors ([30765](
* `nn.GroupNorm`: added support for empty batch ([32401](
* `nn.NLLLoss`: fixed to support empty tensors on CUDA ([31491](
* `nn.GroupNorm`: corrected input size check ([33008](
* `nn.MultiLabelMarginLoss`: fixed memory leak on CUDA ([30767](
* `nn.MultiMarginLoss`: fixed error checking on CUDA for the 1D case.  ([30825](
* `nn.Softmax`: Fixed half->float case of softmax backward ([30838](
* `nn.Softshrink`: Added check that lambda is no less than zero ([33201](
* `nn.functional.interpolate`: added support for empty batch size input for interpolate. ([32400](
* `nn.functional.pad`: Also return a new tensor instead of sometimes returning a view ([32350](
* `nn.functional.grid_sample`: Fixed gradient computation at image borders ([32829](
* `nn.functional.leaky_relu_`: disabled incorrect leaky_relu_ negative slope backward calculation ([33639](
* `optim.LambdaLR`: removed unintentional side effects ([32848](
* `optim.Adam`, `optim.AdamW`: Added missing `weight_decay` parameter validation ([33126](
* `optim.MultiStepLR`: Fix “unbound local variable” error by removing return value for `__exit__` ([32997](
* `optim.MultiStepLR`: Fixed broken `step()` method ([33356](
* `torch.autograd`: added new error message if incorrect usage would cause a deadlock ([32295](
* `torch.autograd`: Prohibited copying autograd engines ([34567](
* `torch.autograd`: Fixed incorrect handling of functions that return multiple views ([32790](
* `autograd.Function`: Fixed error if `Function` returned a view in a `torch.no_grad` block ([33896](
* `autograd.Function`: Added more error checks for incorrect behavior ([33069](
* `autograd.Function`: Added nice error message if missing overrides ([33142](
* `autograd.Function`: Fixed version check for `grad_fn` for views ([34145](
* `autograd.profiler`: Fix incorrect chrome trace formatting output for CUDA traces ([33987](
* `multiprocessing.util.register_after_fork`: fixed crash on Windows  ([30809](
* ``: Fixed potential hang when exiting main process ([33721](
* `utils.tensorboard.SummaryWriter` fixed `scale_factor` calculation for uint8 tensor ([31778](
* `utils.tensorboard` Fix for when PyTorch model trace has RecursiveScriptModules ([30430](
* Fixed `CPU_INTEL` flag error on Windows ([30564](
* Don't use `RTLD_GLOBAL` to load `_C`, resolving a multitude of weird segfaults and crashes
when PyTorch is imported along with other packages ([31162](
* Fixed dll load logic for Python 3.8 on Windows ([32215](
* `quasirandom.SobolEngine`: Fixed crash when default tensor type is CUDA ([32496](

* Fixed error message when converting NumPy array with negative strides to a `torch.Tensor` ([33254](
* Fixed crash when indexing a `torch.Tensor` with a single-element array ([33456](
* Fixed crash when converting CUDA tensors and non-strided tensors to NumPy arrays ([33612](
* Prevented crash on exit from static destructor race on Windows ([33955](
* Fixed uncaught `std::domain_error` on macOS ([34301](
* Don’t reset worker affinity when using operators that call into OpenMP ([29006](
* `torch.backends.mkldnn`: changed to be usable without import ([32055](



* Java Tensor hybrid, owns at::Tensor, no memcopy for java outputs. ([30501](
* Tensor prep from image in native ([31426](
* Pass to remove prepacking ops. ([34319](


* Per channel quantization performance improvement ([33772](
* Speed up per-channel min-max observer ([34118](
* Vectorized qmul and more methods on qint data types ([34376](


* Improve `ProcessGroupAgent` serialization speed ([29785](
* Avoid sending large unneeded data over wire in `ProcessGroupAgent`. ([31357](
* Integrate async mode for autograd engine with distributed autograd. ([31508](
* Make handling of `FORWARD_AUTOGRAD_REQ` in `request_callback_impl` nonblocking ([32476](


* Major multithreaded performance regression when doing operator calls resolved ([30333](
* Improved performance of comparison ops on CUDA ([29743](
* `Tensor.view` improved performance ([30554](
* Improved tensor creation overhead ([30452](, [30709](
* `nn.SmoothL1Loss`: vectorized gradient computation on CPU. ([30046](
* `nn.EmbeddingBag`: improved performance on CPU ([30701](, [27477](
* `nn.LayerNorm`: optimized with explicit vectorization using Vec256 ([31127](
* `Tensor.copy_`: fixed kernel speed regression introduced in 29631 ([31279](
* Moved a number of debug asserts to not compile in release builds ([31240](
* `Tensor::has_names` sped up for unnamed tensors ([31436](
* `torch.index_select`: optimized performance on CPU ([30598](
* `nn.Conv{1,2,3}d`: Improved performance by refactoring `bias` handling for cuDNN backend ([31524](
* `torch.norm`: Optimized case where `p = 2` ([31903](
* `nn.utils.clip_grad_norm_`: Refactored the computation for more performance ([32020](
* Made an assert on a hotpath trigger only in DEBUG mode ([32117](
* First steps toward TensorIterator unrolling and vectorized load ([31974](
* `nn.functional.normalize`: changed to use `clamp_min_`  ([32360](
* Stopped refreshing numel on a stride update ([32116](
* `nn.functional.softplus`: vectorized operator and gradient computation on CPU ([32944](
* `torch.gather` regression fixed by not materializing loop vars in error message ([33108](
* `nn.ELU` forward and backward vectorized on CPU ([32985](, [32986](
* ``: optimized performance on CPU ([30806](, [33534](
* `torch.conv3d`: optimized Unfold3d to improve performance ([33191](
* Workaround performance bug and memory leak in GOMP for AMD CPUs ([32875](
* Improved TensorIterator overhead ([33165](
* `torch.conv3d`: optimized Unfold3dAcc to improve gradient computation performance ([33317](
* `torch.roll` improved performance ([33623](
* Bounds checking for functor execution in vectorized/unrolled kernels ([33642](
* `nn.EmbeddingBag`: improved performance on CUDA ([33589](
* Remove unnecessary tensor copies while calling operators ([33732](
* clang intrinsics targeting on Windows ([33958](
* `nn.Dropout`: added vectorized CUDA implementation ([33879](
* `nn.UpSampleNearest{1, 2, 3}d` performance on CPU optimized ([31452](
* Remove `cudaMemcpy` on full memory overlap ([34548](
* CUDA Loops: move address computation into policy, make `policy.load` load all arguments ([33720](
* `nn.BatchNorm{1, 2, 3}d` contiguous case's performance improved ([34530](
* Add the build for runtime dispatch for AVX, AVX2 instruction set ([26125](
* `nn.RReLU` performance improved up to 5x for inference on CPU  ([31094](
* `nn.LogSigmoid` performance improved up to 10x on CPU ([30958](
* `torch.dist` performance improved up to 2x ([29714](
* `torch.max`, `torch.min` performance improved up to 1.5x on CPU ([33936](
* `nn.GLU` performance improved up to 1.5X on CPU  ([33179](
* `nn.LeakyReLU` performance improved up to 4x ([29899](
* `nn.HardTanh` performance improved up to 5x  ([30152](



* Added documentation for `nn.functional.softplus` ([30055](, [32945](
* `torch.max`: Added warning about different, nondeterministic behavior on CPU and CUDA ([31115](
* Clarified the documentation for `nn.NLLLoss`  ([31488](
* Exclude generated source docs from Google search indexing ([31484](
* `torch.poisson` docstring added to documentation ([31667](
* `torch.eq` fixed incorrect examples in documentation ([32399](
* `torch.load`: added warning regarding pickle insecurity ([32593](
* `optim.CosineAnnealingLR`: fixed the usage in examples ([31358](
* Added doc previewing instructions ([31905](
* Removed legacy `.data` usages from the `torch.nn` documentation ([31481](
* Fixed description of convolution modules ([30079](
* `Tensor.t()`, `Tensor.permute()`, `Tensor.unfold()`, and `` clarified to note that they return views ([32512](
* `torch.multiprocessing` Updated documentation indicating that start_method is ignored for `mp.spawn()` ([33070](
* Improved CPU threading documentation ([33083](
* `nn.BCELoss`: documented how it avoids infinite results ([33160](
* `nn.utils.rnn.pack_padded_sequence`: Improved the description of `enforce_sorted`  ([33617](
* `nn.utils.pad_packed_sequence`: doc improvement ([33768](
* `nn.LPPool{1,2}d` : removed nonexistent parameter ([33714](
* Created a Tensor View documentation page that documents all PyTorch operations that return views ([32560](
* Added grad context manager doc to top level torch module. ([33877](
* Enhanced reproducibility documentation ([33795](
* Numerous typo fixes ([30448](, [30518](, [30614](, [30464](, [30608](, [24335](, [34581](, [34624](, [34008](, [31395](, [31677](, [31617](, [31973](, [32068](, [33689](, [30385](, [32003](, [31682](, [30846](, [33478](, [33549](, [32307](, [33144](, [33805](, [33836](, [34053](
* Numerous formatting and/or rendering fixes ([30377](, [30779](, [32667](, [34027](, [32911](, [30814](, [30815](, [31760](, [34503](


* Fix `at::Tensor` docs generation and make it accessible again at ([34467](
* Add docs for all `torch::nn` modules and functionals ([34522](, [34688](, [34752](
* Improve C++ autograd and tensor indexing docs ([35919](
* Fix example in `torch::nn::ModuleList` docs ([34463](


* Reorganize RPC API doc and add introduction ([30491](, [35109](
* Make doc source format consistent in `rpc/init.cpp` ([30515](
* Add examples to RRef doc ([30516](
* Add more details to explain `rpc_backend_options` arg in `init_rpc` ([30855](
* Fix examples in API doc ([30856](
* Fix examples in RRef API doc ([30857](
* Document WorkerInfo and `RpcBackendOptions` structures in RPC docs. ([31077](
* Explain RPC behavior when using Tensor as arg or return value ([31968](
* Update RPC docs to reflect correct use of dist_autograd backwards and dist_optim `step()` ([34670](
* Minor doc tweak to use mp.spawn in example ([30381](
* Update distributed autograd note ([34657](


* Add info about transitive dependencies in case of using local aars ([30128](
* Update Docs for building PyTorch for Android. ([32578](
* Javadoc changes ([31956](


* Updates to quantization documentation ([30288](
* Fix docs so that the example works ([30120](
* Add the explicit per-tensor/per-channel quant info when we print the module ([30591](
* Fixed typos in quantization docs / docstrings ([34182](
* Docs entry for the `is_quantized` ([32075](



How to figure out which line in your code is raising a warning

Attempting to use deprecated behavior will raise warnings. Unfortunately, sometimes it is not entirely obvious what line of code the warning corresponds to, especially if the warning comes from our C++ backend. For example, with a file named `` with the following contents,

import torch
# This is newly deprecated behavior, see the next section
torch.tensor(1) / torch.tensor(2)

running it doesn’t give us the location of the warning:

> python
../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true
division as in Python 3. Use true_divide or floor_divide (// in Python) instead.

We can use the `warnings` module to tell us where the warning is by asking it to treat warnings as errors:

import torch
import warnings
warnings.filterwarnings('error', message='Integer division')
# This is newly deprecated behavior, see the next section
torch.tensor(1) / torch.tensor(2)

Running the file now tells us exactly where the warning is:

> python
Traceback (most recent call last):
File "", line 5, in <module>
torch.tensor(1) / torch.tensor(2)
UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide
or floor_divide (// in Python) instead.
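The same standard-library mechanism works outside PyTorch. Below is a self-contained sketch (the deprecated operation and its warning text are stand-ins for illustration, not PyTorch APIs) showing how `warnings.filterwarnings('error', ...)` turns a matching warning into a catchable exception whose traceback points at the triggering line:

```python
import warnings

def legacy_divide(a, b):
    # stand-in for a deprecated operation that emits a UserWarning
    warnings.warn("Integer division of tensors using div or / is deprecated", UserWarning)
    return a // b

# escalate only warnings whose message starts with "Integer division"
warnings.filterwarnings('error', message='Integer division')

try:
    legacy_divide(3, 2)
    raised = False
except UserWarning:
    # the traceback of this exception includes the exact line that warned
    raised = True
```

Note that `message` is matched as a regular expression against the start of the warning text, so a short unique prefix is enough.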

Deprecated `torch.div` and `torch.addcdiv` integer floor division behavior ([34570](

In 1.5.0 and older PyTorch releases, `torch.div` and the `/` operator perform integer floor division. In a future PyTorch release, `torch.div` (including the `/` operator) will perform "true" division as in Python 3 and NumPy.

To floor divide integer tensors, please use `torch.floor_divide` instead.
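The new semantics mirror Python 3's own division operators, which the note above references; a quick pure-Python sanity check (no PyTorch required):

```python
# Python 3 division semantics that torch.div will adopt:
true_div = 3 / 2    # "true" division, always produces a float: 1.5
floor_div = 3 // 2  # floor division, matching torch.floor_divide: 1
```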

<p align="center">
<table align="center">
<tr valign="top">
<td><sub><pre lang="python">
>>> torch.tensor(3) / torch.tensor(2)
../aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
<td><sub><pre lang="python">
>>> # NB: the following is equivalent to torch.floor_divide(torch.tensor(3), torch.tensor(2))
>>> torch.tensor(3) // torch.tensor(2)
tensor(1)

The fix for `torch.addcdiv` is similar.

<p align="center">
<table align="center">
<tr valign="top">
<td><sub><pre lang="python">
>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> torch.addcdiv(input, tensor, other, value=value)
../aten/src/ATen/native/PointwiseOps.cpp:81: UserWarning: Integer division with addcdiv is deprecated, and in a future release addcdiv will perform a true division of tensor1 and tensor2. The current addcdiv behavior can be replicated using floor_divide for integral inputs (self + value * tensor1 // tensor2) and division for float inputs (self + value * tensor1 / tensor2). The new addcdiv behavior can be implemented with true_divide (self + value * torch.true_divide(tensor1, tensor2)).
<td><sub><pre lang="python">
>>> input = torch.tensor(0)
>>> tensor = torch.tensor(1)
>>> other = torch.tensor(3)
>>> value = 1
>>> (input + torch.floor_divide(value * tensor, other))
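The replacement expressions quoted in the warning can be verified with plain Python integers, using `//` in place of `torch.floor_divide` and `/` in place of `torch.true_divide` (values taken from the example above):

```python
input_, tensor, other, value = 0, 1, 3, 1

# current (deprecated) integral addcdiv behavior: self + value * tensor1 // tensor2
old = input_ + value * tensor // other   # floor division keeps the result integral

# future behavior: self + value * true_divide(tensor1, tensor2)
new = input_ + value * tensor / other    # true division produces a float
```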

Deprecated `torch.full` returning float tensors if no dtype is specified ([34709](

In a future PyTorch release, `torch.full` will infer its dtype from its fill value when the optional dtype and out parameters are unspecified, matching NumPy's inference for `numpy.full`. For example, `torch.full(size, 1)` will return a tensor of `torch.long` dtype, unlike today where it returns a tensor of `torch.float` dtype.
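The planned inference rule can be sketched in plain Python. The helper below and its string results are purely illustrative (not a PyTorch API), and the bool branch is an assumption about how bool fill values would be inferred:

```python
def inferred_full_dtype(fill_value):
    # hypothetical sketch of the planned torch.full dtype inference
    if isinstance(fill_value, bool):
        return "torch.bool"   # assumption, not stated in the note above
    if isinstance(fill_value, int):
        return "torch.long"   # e.g. torch.full(size, 1) -> long, not float
    return "torch.float"      # e.g. torch.full(size, 1.0) -> float
```

Passing an explicit `dtype` to `torch.full` sidesteps the change entirely.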

Deprecated `torch.nn.modules.conv._ConvTransposeMixin` ([31784](

This is an internal-facing class that is not a part of our public API. We’ve refactored some PyTorch internals to work without it and will remove it in a future release.

Deprecated positional args in multiple `torch` function signatures ([32009](, [33428](

Below is a list of deprecated signatures and their replacements.

* `torch.add(self: Tensor, alpha: Scalar, other: Tensor)`, `torch.sub(self: Tensor, alpha: Scalar, other: Tensor)` please use `alpha` as a keyword-only arg instead of positional args
* `torch.addbmm(beta: Scalar, self: Tensor, alpha: Scalar, batch1: Tensor, batch2: Tensor)`: please use `alpha` and `beta` as keyword only args instead of positional args.
* `torch.addcdiv(self: Tensor, value: Scalar, tensor1: Tensor, tensor2: Tensor)`, `torch.addcmul(self: Tensor, value: Scalar, tensor1: Tensor, tensor2: Tensor)`: please use `value` as a keyword-only arg
* `torch.addmm(beta: Scalar, self: Tensor, alpha: Scalar, mat1: Tensor, mat2: Tensor)`, `torch.sspaddmm(beta: Scalar, self: Tensor, alpha: Scalar, mat1: Tensor, mat2: Tensor)` please use `alpha` and `beta` as keyword only args instead of positional args.
* `torch.addmv(beta: Scalar, self: Tensor, alpha: Scalar, mat: Tensor, vec: Tensor)`: please use `alpha` and `beta` as keyword only args instead of positional args.
* `torch.addr(beta: Scalar, self: Tensor, alpha: Scalar, vec1: Tensor, vec2: Scalar)`: please use `alpha` and `beta` as keyword only args instead of positional args.
* `torch.baddbmm(beta: Scalar, self: Tensor, alpha: Scalar, batch1: Tensor, batch2: Tensor)`: please use `alpha` and `beta` as keyword only args instead of positional args.
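In Python terms, "keyword-only" means the parameter is declared after a bare `*` in the signature, so it can no longer be passed positionally. A minimal sketch with plain numbers standing in for tensors (`my_add` is illustrative, not the real `torch.add`):

```python
def my_add(self_value, other, *, alpha=1):
    # alpha must be passed by keyword, e.g. my_add(a, b, alpha=2);
    # my_add(a, 2, b) raises TypeError instead of silently reordering args
    return self_value + alpha * other

result = my_add(0, 3, alpha=2)  # 0 + 2 * 3 == 6
```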

<p align="center">
<table align="center">
<tr valign="top">
<td><sub><pre lang="python">
>>> torch.zeros(2,3).add(2, torch.ones(2, 3))
../torch/csrc/utils/python_arg_parser.cpp:750: UserWarning: This overload of add is deprecated:
add(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add(Tensor other, Number alpha)
tensor([[2., 2., 2.],
[2., 2., 2.]])
<td><sub><pre lang="python">
>>> torch.zeros(2, 3).add(torch.ones(2, 3), alpha=2)
tensor([[2., 2., 2.],
[2., 2., 2.]])

Deprecated modifying in-place a view returned by a custom autograd Function ([32839](

Modifying in-place a view that was created by a custom Function leads to the custom backward not being called or being called with a partial gradient. This behavior will be removed in 1.6.

Please call `clone()` on the output of the Function to avoid incorrect gradient computation.

class Id(Function):
    @staticmethod
    def forward(ctx, input):
        return input.view_as(input)

    @staticmethod
    def backward(ctx, grad_input):
        return grad_input

<p align="center">
<table align="center">
<tr><th>Version 1.5.0 (deprecated)</th><th>Version 1.5.0 (recommended)</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> input = torch.randn(3, requires_grad=True)
>>> other = torch.randn(3)
>>> output = Id.apply(input)
>>> output.copy_(other)
Warning: Incorrect gradients
<td><sub><pre lang="python">
>>> input = torch.randn(3, requires_grad=True)
>>> other = torch.randn(3)
>>> output = Id.apply(input).clone()
>>> output.copy_(other)


Deprecated `Tensor.type()` ([30281](

Please use `Tensor.options()` instead.


* Part of an automated mixed-precision solution ([33366](, [33832](


[ 0.8414,  1.7962,  1.0589],
[-0.1369, -1.0462, -0.4373]], dtype=torch.float64, device='cuda:1')
>>> x.requires_grad   # default is False
False
>>> x = torch.zeros(3, requires_grad=True)
>>> x.requires_grad
True

[``torch.tensor``]( is one of the newly added [tensor creation methods]( It takes in array-like data of all kinds and copies the contained values into a new ``Tensor``. As mentioned earlier, [``torch.tensor``]( is the PyTorch equivalent of NumPy's ``numpy.array`` constructor. Unlike the ``torch.*Tensor`` methods, you can also create zero-dimensional ``Tensor``s (aka scalars) this way (a single Python number is treated as a Size in the ``torch.*Tensor`` methods). Moreover, if a ``dtype`` argument isn't given, it will infer a suitable ``dtype`` from the data. It is the recommended way to create a tensor from existing data, like a Python list. For example,

>>> cuda = torch.device("cuda")
>>> torch.tensor([[1], [2], [3]], dtype=torch.half, device=cuda)
tensor([[ 1],
[ 2],
[ 3]], device='cuda:0')
>>> torch.tensor(1)                # scalar
tensor(1)
>>> torch.tensor([1, 2.3]).dtype   # type inference
torch.float32
>>> torch.tensor([1, 2]).dtype     # type inference
torch.int64

We've also added more tensor creation methods. Some of them have ``torch.*_like`` and/or ``tensor.new_*`` variants.

1. ``torch.*_like`` takes in an input ``Tensor`` instead of a shape. It returns a ``Tensor`` with same attributes as the input ``Tensor`` by default unless otherwise specified:

>>> x = torch.randn(3, dtype=torch.float64)
>>> torch.zeros_like(x)
tensor([ 0.,  0.,  0.], dtype=torch.float64)
>>> torch.zeros_like(x,
tensor([ 0,  0,  0], dtype=torch.int32)

2. ``tensor.new_*`` can also create ``Tensor``s with same attributes as ``tensor``, but it always takes in a shape argument:

>>> x = torch.randn(3, dtype=torch.float64)
>>> x.new_ones(2)
tensor([ 1.,  1.], dtype=torch.float64)
>>> x.new_ones(4,
tensor([ 1,  1,  1,  1], dtype=torch.int32)

To specify the desired shape, you can either use a tuple (e.g., ``torch.zeros((2, 3))``) or variable arguments (e.g., ``torch.zeros(2, 3)``) in most cases.

| Name                                                       | Returned ``Tensor``                                       | ``torch.*_like`` variant | ``tensor.new_*`` variant |
| ---------------------------------------------------------- | --------------------------------------------------------- | ------------------------ | ------------------------ |
| [``torch.empty``](                                            | uninitialized memory                                      | ✔                        | ✔                        |
| [``torch.zeros``](                                            | all zeros                                                 | ✔                        | ✔                        |
| [``torch.ones``](                                             | all ones                                                  | ✔                        | ✔                        |
| [``torch.full``](                                             | filled with a given value                                 | ✔                        | ✔                        |
| [``torch.rand``](                                             | i.i.d. continuous ``Uniform[0, 1)``                       | ✔                        |                          |
| [``torch.randn``](                                            | i.i.d. ``Normal(0, 1)``                                   | ✔                        |                          |
| [``torch.randint``](                                          | i.i.d. discrete Uniform in given range                    | ✔                        |                          |
| [``torch.randperm``](                                         | random permutation of ``{0, 1, ..., n - 1}``              |                          |                          |
| [``torch.tensor``](                                           | copied from existing data (`list`, NumPy `ndarray`, etc.) |                          | ✔                        |
| [``torch.from_numpy``](*                                      | from NumPy ``ndarray`` (sharing storage without copying)  |                          |                          |
| [``torch.arange``](, <br>[``torch.range``](, and <br>[``torch.linspace``](  | uniformly spaced values in a given range                  |                          |                          |
| [``torch.logspace``](                                         | logarithmically spaced values in a given range            |                          |                          |
| [``torch.eye``](                                              | identity matrix                                           |                          |                          |

*: [``torch.from_numpy``]( only takes in a NumPy ``ndarray`` as its input argument.

Writing device-agnostic code

Previous versions of PyTorch made it difficult to write code that was device agnostic (i.e. that could run on both CUDA-enabled and CPU-only machines without modification).

PyTorch 0.4.0 makes this easier in two ways:
* The `device` attribute of a Tensor gives the [``torch.device``]( for all Tensors (`get_device` only works for CUDA tensors)
* The `to` method of ``Tensors`` and ``Modules`` can be used to easily move objects to different devices (instead of having to call `cpu()` or `cuda()` based on the context)

We recommend the following pattern:
```python
# at beginning of the script
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

...

# then whenever you get a new Tensor or Module
# this won't copy if they are already on the desired device
input = data.to(device)
model = MyModule(...).to(device)
```


Full support for Advanced indexing

PyTorch now has full support for advanced indexing, following NumPy's advanced indexing rules. The following examples are now possible:

```python
a = torch.rand(10, 10, 10, 10)

# the indexing elements can have other shapes than 1
b = a[[[3, 2]], :, [[1, 3]]]

# broadcasting also supported in the indices, as well as lists,
# negative indices, slices, ellipses, numbers
c = a[[1, -2], 2:4, :, [1]]

# can also support tensors as indices
index = torch.tensor([2, 4])
d = a[index]

# and the indices can be on the GPU
# or CPU
e = a[index.cuda()]
f = a.cuda()[index]
```


```python
class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)  # stash for use in backward
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result
```
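As a usage sketch (our addition, not from the original notes; the class is restated so the block runs on its own), custom `Function`s are invoked through their `apply` method:

```python
import torch

class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)  # stash for use in backward
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

x = torch.tensor([0.0, 1.0], requires_grad=True)
y = Exp.apply(x)
y.sum().backward()
# d/dx exp(x) = exp(x), so the gradient equals the forward output
assert torch.allclose(x.grad, y.detach())
```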


`torch.optim` optimizers changed to fix in-place checks for the changes made by the optimizer  ([33640](, [34211](

If this causes your code to fail, there are two possible reasons:

Reason 1: The value of that parameter was actually saved and used and we were computing incorrect gradients in previous versions of PyTorch. This would result in an error message mentioning incorrect version numbers. You should replace code that uses `self.my_param` by `self.my_param.clone()` to make sure the saved version is different from the one that is modified by the optimizer. For example:

Before 1.5.0, the following may have worked:

```python
import torch
from torch import optim

def model(input, target, param):
    return (input * param ** 2 - target).norm()

param = torch.randn(2, requires_grad=True)
input = torch.randn(2)
target = torch.randn(2)
sgd = optim.SGD([param], lr=0.001)
loss = model(input, target, param)
```

If after upgrading to 1.5.0, the above fails due to a version counter error, then that means the gradient computed was incorrect. To remedy this, clone `param` before using it in the model:

```python
import torch
from torch import optim

def model(input, target, param):
    return (input * param ** 2 - target).norm()

param = torch.randn(2, requires_grad=True)
input = torch.randn(2)
target = torch.randn(2)
sgd = optim.SGD([param], lr=0.001)
loss = model(input, target, param.clone())
```

Reason 2: You know what you're doing and change the values back to the right thing before the next backward. However, you're running into an error because the version counter cannot be decremented. Open an issue with your particular use case and we will help you to work around the version counter issue.

`torch.utils.cpp_extension` now uses `ninja` as the default compilation backend ([32495](

`ninja` enables parallel compilation of your C++ extension, greatly speeding up compilation. This change will not break most user code; if you do not have `ninja` installed, we fall back to the old `distutils` backend.

However, if you do have `ninja` installed, it is possible that this change will cause your C++ extension build to fail by oversubscribing your system with too many worker processes. There are two potential workarounds to this.

Method 1: If a previously succeeding `python setup.py install` now fails, try setting the `MAX_JOBS` environment variable.

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="sh">
python setup.py install
<td><sub><pre lang="sh">
MAX_JOBS=2 python setup.py install

Method 2: Switch back to the old `distutils` backend inside your `setup.py`

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
cmdclass={'clean': clean,
'build_ext': BuildExtension},
<td><sub><pre lang="python">
cmdclass={'clean': clean,
'build_ext': BuildExtension.with_options(use_ninja=False)},

`torch.optim.Adam`, `torch.optim.SGD`  changed to not modify gradients in-place ([30257](

In previous versions of PyTorch, the Adam and SGD optimizers modified gradients (e.g. `param.grad`) in place via operations like `param.grad += weight_decay * param`. To make this consistent with the behavior of other optimizers and to prevent surprises, we’ve changed them to stop modifying gradients in-place.

This should not have an effect on most PyTorch programs unless they relied on this behavior. The easiest way to replicate the old behavior is to create a custom optimizer that implements it.

`torch.masked_select` now always returns a 1D tensor ([29923](

The behavior of `torch.masked_select` when both "self" and "mask" are 0-dimensional was changed. In previous versions of PyTorch, this would return a 0-dimensional tensor. Now, we return a 1-dimensional tensor to be consistent with other input sizes and our documentation.

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> torch.masked_select(torch.tensor(0), torch.tensor(True))
tensor(0)
<td><sub><pre lang="python">
>>> torch.masked_select(torch.tensor(0), torch.tensor(True))
tensor([0])

`torch.index_select` on a 0-d tensor now returns a 0-d tensor. ([30790](

In previous versions of PyTorch, the output of `torch.index_select` on a 0D input tensor produced a 1D tensor. This was inconsistent with our documentation on it, which stated "The returned tensor has the same number of dimensions as the original tensor (input)." Now, we return a 0D tensor.

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> torch.index_select(torch.tensor(5), 0, torch.tensor([0]))
tensor([5])
<td><sub><pre lang="python">
>>> torch.index_select(torch.tensor(5), 0, torch.tensor([0]))
tensor(5)

`nn.MultiLabelMarginLoss:` 'none' reduction on 1D tensor now returns a 0D tensor ([30768](

In previous versions of PyTorch, the output of `nn.MultiLabelMarginLoss` on 1D and 0D tensors incorrectly produced 1D tensors. Now, those cases return a 0D tensor to be consistent with the 2D tensor case.

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> nn.MultiLabelMarginLoss(reduction='none')(torch.randn(3), torch.zeros(3, dtype=torch.long))
<td><sub><pre lang="python">
>>> nn.MultiLabelMarginLoss(reduction='none')(torch.randn(3), torch.zeros(3, dtype=torch.long))

`nn.MultiMarginLoss:` ‘none' reduction on 1D target now returns a 1D tensor ([30826](

In previous versions of PyTorch, the output of `nn.MultiMarginLoss` on a 1D `target` tensor produced a 0D output. We changed this to return a 1D output tensor, making it consistent with other input sizes, which return an output that matches the target shape.

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> nn.MultiMarginLoss(reduction='none')(torch.tensor([1.]), torch.tensor([0]))
tensor(0.)
<td><sub><pre lang="python">
>>> nn.MultiMarginLoss(reduction='none')(torch.tensor([1.]), torch.tensor([0]))
tensor([0.])

`Tensor.exponential_(lambda)` no longer supports `lambda < 0` ([32501](

`lambda`, the rate parameter of the exponential distribution, mathematically should be greater than 0. We’ve disabled support for `lambda < 0` to be mathematically correct; most users will not have used a lambda less than zero.

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
tensor = torch.empty(3).exponential_(-1.5)
<td><sub><pre lang="python">
tensor = torch.empty(3).exponential_(-1.5)
# RuntimeError: negative lambda not supported!

`nn.BCELoss`, `nn.functional.binary_cross_entropy` no longer accept non-broadcastable inputs that merely have the same number of elements ([31365](

Previously, we accepted input and target tensors whose shapes differed as long as they had the same number of elements. However, this behavior was deprecated and we removed it in 1.5.0. In order to replicate the old behavior, please explicitly `reshape` your input and target tensors to have the same shape.

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> input = torch.rand(3, 3)
>>> target = torch.randn(9)
>>> torch.nn.functional.binary_cross_entropy(input, target)
<td><sub><pre lang="python">
>>> input = torch.rand(3, 3)
>>> target = torch.randn(9)
>>> torch.nn.functional.binary_cross_entropy(input, target.reshape_as(input))

`torch.normal` out argument is now required to have the same size as the computed output ([32031](

Previously, on CPU devices, `torch.normal(mean, std, out=out)`  would resize `out` to the correct size. To be consistent with the CUDA implementation, we’ve changed it so that `out` must either already have the correct size, or be an empty tensor with size `[0]`. To work around this, please ensure that your `out` tensor has the correct size.

<p align="center">
<table align="center">
<tr><th>Version 1.4.0</th><th>Version 1.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> torch.normal(torch.zeros(3), torch.ones(3), out=torch.randn(2))
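As a sketch of the 1.5.0-compatible pattern (the tensor names are illustrative), pre-size the `out` tensor to match the computed output:

```python
import torch

mean, std = torch.zeros(3), torch.ones(3)
out = torch.empty(3)  # already the correct size, so no resize is needed
torch.normal(mean, std, out=out)
assert out.shape == torch.Size([3])
```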


PyTorch 1.4.0 Release Notes
- Highlights
- Backwards Incompatible Changes
* Python
* C++
- New Features
* torch.optim
* Distributed
* RPC [Experimental]
* Mobile
- Improvements
* Distributed
* Mobile
* Named Tensors
* C++ API
* AMD Support
* Quantization
* Visualization
* Other Improvements
- Bug Fixes
* Distributed
* C++ API
* Quantization
* Mobile
* Other Bug fixes
- Deprecations
- Performance

The PyTorch v1.4.0 release is now available.

The release contains over 1,500 commits and a significant amount of effort in existing areas like JIT, ONNX, Distributed, Performance, and Eager Frontend improvements, as well as improvements to experimental areas like mobile and quantization. It also contains new experimental features including RPC-based model parallel distributed training and Java language bindings (inference only).

**PyTorch 1.4 is the last release that supports Python 2**.  For the C++ API, it is the last release that supports C++11: you should start migrating to Python 3 and building with C++14 to make the future transition from 1.4 to 1.5 easier.


PyTorch Mobile - Build level customization

Following the experimental release of [PyTorch Mobile in the 1.3 release](, PyTorch 1.4 adds additional mobile support including the ability to customize build scripts at a fine-grained level. This allows mobile developers to optimize library size by only including the operators used by their models and, in the process, reduce their on-device footprint significantly. Initial results show that, for example, a customized MobileNetV2 build is 40% to 50% smaller than the prebuilt PyTorch mobile library. [Learn more]( about how to create your own custom builds, and please engage with the community on the [PyTorch forums]( to provide any feedback you have.

Distributed Model Parallel Training [Experimental]

With the scale of models, such as RoBERTa, continuing to increase into the billions of parameters, model parallel training has become ever more important to help researchers push the limits. This release provides a distributed RPC framework to support distributed model parallel training. It allows for running functions remotely and referencing remote objects without copying the real data around, and provides autograd and optimizer APIs to transparently run backwards and update parameters across RPC boundaries.

To learn more about the APIs and the design of this feature, see the links below:

* [API documentation](
* [Distributed Autograd design doc](
* [Remote Reference design doc](

For the full tutorials, see the links below:

* [A full RPC tutorial](
* [Examples using model parallel training for reinforcement learning and with an LSTM](

As always, you can connect with community members and discuss more on the [forums](

Java bindings [Experimental]

In addition to supporting Python and C++, this release adds experimental support for Java bindings. Based on the interface developed for Android in PyTorch Mobile, the new bindings allow you to invoke TorchScript models from any Java program. Note that the Java bindings are only available for Linux for this release, and for inference only. We expect support to expand in subsequent releases.

Learn more about how to use PyTorch from Java [here](, and see the full Javadocs API documentation [here](


Pruning functionalities have been added to PyTorch in the `nn.utils.prune` module. This provides out-of-the-box support for common magnitude-based and random pruning techniques, both structured and unstructured, both layer-wise and global, and it also enables custom pruning from user-provided masks.

To prune a tensor, first select a pruning technique among those available in `nn.utils.prune` (or implement your own by subclassing `BasePruningMethod`).
```python
import torch
from torch.nn.utils import prune

t = torch.rand(2, 5)
p = prune.L1Unstructured(amount=0.7)
pruned_tensor = p.prune(t)
```

To prune a module, select one of the pruning functions available in `nn.utils.prune` (or implement your own) and specify which module and which parameter within that module pruning should act on.
```python
from torch import nn
from torch.nn.utils import prune

m = nn.Conv2d(3, 1, 2)
# prune 2 of the 3 input channels (dim=1) with the lowest L2 norm
prune.ln_structured(module=m, name='weight', amount=2, n=2, dim=1)
```

Pruning reparametrizes the module by turning `weight` (in the example above) from a parameter to an attribute, and replacing it with a new parameter called `weight_orig` (i.e. appending `"_orig"` to the initial parameter `name`) that stores the unpruned version of the tensor. The pruning mask is stored as a buffer named `weight_mask` (i.e. appending `"_mask"` to the initial parameter `name`). Pruning is applied prior to each forward pass by recomputing `weight` through a multiplication with the updated mask using PyTorch's `forward_pre_hooks`.
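The reparametrization can be observed directly (the module choice and amount below are illustrative):

```python
import torch
from torch import nn
from torch.nn.utils import prune

m = nn.Linear(3, 2)
prune.random_unstructured(m, name="weight", amount=0.5)

# `weight` is no longer a parameter; it is recomputed as weight_orig * weight_mask
param_names = sorted(name for name, _ in m.named_parameters())
buffer_names = sorted(name for name, _ in m.named_buffers())
print(param_names)   # ['bias', 'weight_orig']
print(buffer_names)  # ['weight_mask']
assert torch.equal(m.weight, m.weight_orig * m.weight_mask)
```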

Iterative pruning is seamlessly enabled by repeatedly calling pruning functions on the same parameter (this automatically handles the combination of successive masks by making use of a `PruningContainer` under the hood).

`nn.utils.prune` is easily extensible to support new pruning functions by subclassing the `BasePruningMethod` base class and implementing the `compute_mask` method with the instructions to compute the mask according to the logic of the new pruning technique.
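For instance, a hypothetical threshold-based technique (the class name and threshold are our own, not from the release notes) only needs to define `compute_mask`:

```python
import torch
from torch import nn
from torch.nn.utils import prune

class ThresholdPruning(prune.BasePruningMethod):
    PRUNING_TYPE = "unstructured"

    def __init__(self, threshold):
        self.threshold = threshold

    def compute_mask(self, t, default_mask):
        # keep only entries whose magnitude exceeds the threshold
        return default_mask * (t.abs() > self.threshold)

m = nn.Linear(4, 2)
ThresholdPruning.apply(m, "weight", threshold=0.25)
# every surviving weight is above the threshold in magnitude
assert bool(torch.all(m.weight[m.weight != 0].abs() > 0.25))
```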

Backwards Incompatible Changes


`torch.optim`: It is no longer supported to use `Scheduler.get_lr()` to obtain the last computed learning rate. To get the last computed learning rate, call `Scheduler.get_last_lr()` instead. ([26423](

Learning rate schedulers are now “chainable,” as mentioned in the *New Features* section below.  `Scheduler.get_lr` was sometimes used for monitoring purposes to obtain the current learning rate.  But since `Scheduler.get_lr` is also used internally for computing new learning rates, this actually returns a value that is “one step ahead.”  To get the last computed learning rate, use `Scheduler.get_last_lr` instead.

Note that `optimizer.param_groups[0]['lr']` remains, as in version 1.3.1, a way of getting the current learning rate used in the optimizer.
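A minimal sketch (the optimizer and constants are illustrative) of reading the current rate both ways:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

params = [torch.nn.Parameter(torch.randn(2))]
optimizer = SGD(params, lr=0.1)
scheduler = StepLR(optimizer, step_size=1, gamma=0.5)

optimizer.step()
scheduler.step()

# Both report the learning rate actually in use after the step
print(scheduler.get_last_lr())          # [0.05]
print(optimizer.param_groups[0]['lr'])  # 0.05
```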

`Tensor.unfold` on a 0-dimensional Tensor now properly returns a 1-dimensional Tensor.

<p align="center">
<table align="center">
<tr><th>Version 1.3.1</th><th>Version 1.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> torch.tensor(5).unfold(dimension=0, size=1, step=1)
tensor(5)
<td><sub><pre lang="python">
>>> torch.tensor(5).unfold(dimension=0, size=1, step=1)
tensor([5])

`torch.symeig` now returns a 0-element eigenvectors tensor when `eigenvectors=False` (the default).

<p align="center">
<table align="center">
<tr><th>Version 1.3.1</th><th>Version 1.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> torch.symeig(torch.randn(3,3)).eigenvectors.shape
torch.Size([3, 3])
<td><sub><pre lang="python">
>>> torch.symeig(torch.randn(3,3)).eigenvectors.shape
torch.Size([0])


* Make `torch.jit.get_trace_graph` private (it is now `torch.jit._get_trace_graph`) ([29149](
* This function was intended only for ONNX integration; use `traced_module.graph` instead, like:

```python
traced_module = torch.jit.trace(my_module, example_inputs)
traced_graph = traced_module.graph
```
* `property` on `ScriptModule`s has been disabled ([28395](
* Scripted `property` accesses were silently broken before: we would evaluate the `get` function once and store that as the attribute permanently. They properly error now; a workaround is to make your `property` a regular method.
* Custom ops: `torch::jit::RegisterOperators` has been removed, use `torch::RegisterOperators` instead ([28229]( The usage and behavior should remain the same.
* Remove `torch.jit._register_*` bindings from Python (e.g. `torch.jit._register_attribute`). These were private functions that were not intended to be used. ([29499](


[C++] The distinction between Tensor and Variable has been eliminated at the C++ level. ([28287](

This change simplifies our C++ API and matches previous changes we did at the python level that merged Tensors and Variables into a single type.

This change is unlikely to affect user code; the most likely exceptions are:

1) [Argument-dependent lookup]( for `torch::autograd` may no longer work.  This can break because Variable is now defined as an alias for Tensor (`using Variable = Tensor;`).  In this case, you must explicitly qualify the calls to `torch::autograd` functions.

2) Because `Variable` and `Tensor` are now the same type, code which assumes that they are different types (e.g., for the purposes of templating, or `std::enable_if` checks) will not work until you delete the (now) redundant overload/specialization.

3) Some operators may trace differently.  If this happens, please [file a bug.](  The most likely situations are:

1. There are now *more* operations in your trace than before (usually, calls to `aten::empty`)
2. There are now *fewer* operations in your trace than before (e.g., the trace complains that `"there is no observable dependence"` with the inputs)

[C++] arguments in `torch::nn::LinearOptions` are renamed to match the Python API. ([27382](

* Arguments that are renamed:
* `in` -> `in_features`
* `out` -> `out_features`
* `with_bias` -> `bias`

[C++] arguments in `torch::nn::Conv{1,2,3}dOptions` are renamed to match the Python API. ([28917]( ([29838](

* Arguments that are renamed:
* `input_channels` -> `in_channels`
* `output_channels` -> `out_channels`
* `with_bias` -> `bias`

[C++] `torch::nn::Conv{1,2,3}dOptions` no longer has the `transposed` argument. ([31005](

* If users have `transposed` originally set to `true` in `torch::nn::Conv{1,2,3}dOptions`, they should migrate their code to use `torch::nn::ConvTranspose{1,2,3}d` layers instead.

[C++] All Reduction enums for `torch::nn` layers and functionals are changed to have `torch::KEnumNAME` syntax. ([27942](, [26837](

* Example: previously, to specify “mean” as the reduction method in a torch::nn layer or functional, we would use `torch::Reduction::Mean`. Now, `torch::Reduction::Mean` has been renamed to the shorter `torch::kMean`.

[C++] `torch::tensor` constructor is improved to match Python API behavior. ([28523]( ([29632]( ([29066](

* Shape checking fixes
* Example 1: previously, `torch::tensor({{1}, {2}})` produced a tensor of sizes `{2}`. Now, it produces a tensor of sizes `{2, 1}`.
* Example 2: previously, `torch::tensor(1.1)` produced a 1-dim tensor. Now it produces a 0-dim tensor.
* Type inference improvements
* Example 1: previously, C++ `torch::tensor` with a double (e.g. `torch::tensor(1.1)`) or a (nested) braced-init-list of doubles (e.g. `torch::tensor({{1.1, 2.2}})`) produced a tensor with dtype `torch::kDouble`. Now it produces a tensor with dtype `torch::get_default_dtype()`.
* Example 2: previously, C++ `torch::tensor` with an integer type (e.g. `torch::tensor(1)`) or a (nested) braced-init-list of integer types (e.g. `torch::tensor({{1, 2}})`) produced a tensor with the same dtype. Now it always produces a tensor of dtype `torch::kLong` (aka. `int64_t`).
* Example 3: previously, when passed a `TensorOptions` without a dtype set to the `torch::tensor` constructor, it always produces a tensor of dtype `torch::get_default_dtype()`. Now it produces a tensor of different dtypes based on the dtype of the braced-init-list and the default dtype.
* Passing a `std::initializer_list` (NOT braced-init-list) to `torch::tensor` will no longer compile, and the user should pass the equivalent braced-init-list to `torch::tensor` instead. For example, write `torch::tensor({1.1, 1.2})` instead of `torch::tensor(std::initializer_list<double>({1.1, 1.2}))`.

[C++] Some activation modules’ `forward` function now take `Tensor` instead of `Tensor&` as input. ([28501](

`torch::nn` layers affected: `ELU` / `SELU` / `Hardtanh` / `LeakyReLU` / `ReLU` / `ReLU6` / `RReLU` / `CELU`
This change ensures that the above layers can be used in a `torch::nn::Sequential` module. If your C++ model uses any of the above layers, you must recompile your C++ code with the new libtorch binary.

New Features


Learning rate schedulers (`torch.optim.lr_scheduler`) now support “chaining.” This means that two schedulers can be defined and stepped one after the other to compound their effect (see the example below). Previously, the schedulers would overwrite each other.

>>> import torch
>>> from torch.optim import SGD
>>> from torch.optim.lr_scheduler import ExponentialLR, StepLR
>>> model = [torch.nn.Parameter(torch.randn(2, 2, requires_grad=True))]
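The snippet above is truncated; a runnable sketch of chaining (the optimizer step and loop are our reconstruction, with illustrative constants) looks like:

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR, StepLR

model = [torch.nn.Parameter(torch.randn(2, 2, requires_grad=True))]
optimizer = SGD(model, lr=0.1)

# Chained: each scheduler multiplies the lr left in place by the previous one
scheduler1 = ExponentialLR(optimizer, gamma=0.9)
scheduler2 = StepLR(optimizer, step_size=3, gamma=0.1)

for epoch in range(4):
    optimizer.step()
    scheduler1.step()
    scheduler2.step()

# Both decays compound instead of overwriting each other
print(optimizer.param_groups[0]['lr'])
```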


Significant Fixes

[Type Promotion]( fixed a bug where type promotion, combined with non-contiguous tensors could compute incorrect results.  ([28253](

<p align="center">
<table align="center">
<tr><th>Version 1.3.0</th><th>Version 1.3.1</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> a = torch.tensor([[True,  True],
[False, True]])
# get a non-contiguous tensor
>>> a_transpose = a.t()
# type promote by comparing across dtypes (bool -> long)
>>> a_transpose == 0
<td><sub><pre lang="python">
>>> a = torch.tensor([[True,  True],
[False, True]])
# get a non-contiguous tensor
>>> a_transpose = a.t()
# type promote by comparing across dtypes (bool -> long)
>>> a_transpose == 0
tensor([[False,  True],
[False, False]])

[Type Promotion]( / Indexing: fixed a bug where mixed-dtype indexing and assignment could lead to incorrect results. Mixed dtype operations of this form are currently disabled, as they were in 1.2. ([28231](

<p align="center">
<table align="center">
<tr><th>Version 1.3.0</th><th>Version 1.3.1</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> a = torch.ones(5, 2, dtype=torch.float)
>>> b = torch.zeros(5, dtype=torch.long)
>>> a[:, [1]] = b.unsqueeze(-1)
>>> a
<td><sub><pre lang="python">
>>> a = torch.ones(5, 2, dtype=torch.float)
>>> b = torch.zeros(5, dtype=torch.long)
>>> a[:, [1]] = b.unsqueeze(-1)
RuntimeError: expected dtype Float but got dtype Long

[torch.where(condition, x, y)]( fixed a bug on CPU where incorrect results could be returned if `x` and `y` were of different dtypes.  Mixed dtype operations of this form are currently disabled, as they were in version 1.2.  ([29078](

<p align="center">
<table align="center">
<tr><th>Version 1.3.0</th><th>Version 1.3.1</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> x = torch.randn(2, 3)
>>> y = torch.randint(0, 10, (2, 3))
>>> torch.where(x < 0, x, y)
<td><sub><pre lang="python">
>>> x = torch.randn(2, 3)
>>> y = torch.randint(0, 10, (2, 3))
>>> torch.where(x < 0, x, y)
RuntimeError: expected scalar type Float but found Long

Other Fixes

* `torch.argmax`: fix regression on CUDA that disabled support for `torch.float16` inputs.  ([28915](
* NamedTensor: fix Python refcounting bug with `Tensor.names`.  ([28922](
* Quantization: support `deepcopy` for quantized tensors.  ([28612](
* Quantization: support `nn.quantized.ReLU` with `inplace=True`.  ([28710](
* Documentation: `torch.lgamma` and `torch.polygamma` are now documented.  ([28964](


Table of Contents

- Breaking Changes
- Highlights
* [Experimental]: Mobile Support
* [Experimental]: Named Tensor Support
* [Experimental]: Quantization support
* Type Promotion
* Deprecations
- New Features
* TensorBoard: 3D Mesh and Hyperparameter Support
* Distributed
* Libtorch Binaries with C++11 ABI
* New TorchScript features
- Improvements
* C++ Frontend Improvements
+ Autograd
+ New torch::nn modules
+ New torch::nn::functional functions
+ tensor Construction API
+ Other C++ Improvements
* Distributed Improvements
* Performance Improvements
* JIT Improvements
* ONNX Exporter Improvements
+ Adding Support for ONNX IR v4
+ Adding Support for ONNX Opset 11
+ Exporting More Torch Operators/Models to ONNX
+ Enhancing ONNX Export Infra
* Other Improvements
- Bug Fixes
+ TensorBoard Bug Fixes
+ C++ API Bug fixes
+ Other Bug Fixes
- Documentation Updates
+ Distributed
+ Other documentation improvements

Breaking Changes

Type Promotion: Mixed dtype operations may return a different dtype and value than in previous versions.  ([22273](, [26981](

Previous versions of PyTorch supported a limited number of mixed dtype operations. These operations could result in loss of precision by, for example, truncating floating-point zero-dimensional tensors or Python numbers.

In Version 1.3, PyTorch supports NumPy-style type promotion (with slightly modified rules, see [full documentation](  These rules generally will retain precision and be less surprising to users.


```python
class MyModule(torch.nn.Module):
    ...

# Construct an nn.Module instance
module = MyModule(args)

# Pass it to `torch.jit.script` to compile it into a ScriptModule.
my_torchscript_module = torch.jit.script(module)
```

`torch.jit.script()` will attempt to recursively compile the given `nn.Module`, including any submodules or methods called from `forward()`. See the [migration guide]( for more info on what's changed and how to migrate.

[JIT] Improved TorchScript Python language coverage

In 1.2, TorchScript has significantly improved its support for Python language constructs and Python's standard library. Highlights include:

* Early returns, breaks and continues.
* Iterator-based constructs, like `for...in` loops, `zip()`, and `enumerate()`.
* `NamedTuples`.
* `math` and `string` library support.
* Support for most Python builtin functions.

See the detailed notes below for more information.
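As a small illustration (the function below is our own, not from the release notes), several of these constructs can be mixed in one scripted function:

```python
import torch
from typing import List

@torch.jit.script
def first_long(words: List[str], min_len: int) -> str:
    # iterator-based loop with enumerate, an early return, and the len() builtin
    for i, w in enumerate(words):
        if len(w) >= min_len:
            return w
    return ""

print(first_long(["a", "bb", "ccc"], 2))  # bb
```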

Expanded ONNX Export

In PyTorch 1.2, working with Microsoft, we’ve added full support to export ONNX Opset versions 7 (v1.2), 8 (v1.3), 9 (v1.4) and 10 (v1.5). We have also enhanced the constant folding pass to support Opset 10, the latest available version of ONNX. Additionally, users are now able to register their own symbolic functions to export custom ops, and specify the dynamic dimensions of inputs during export. Here is a summary of all of the major improvements:

* Support for multiple Opsets including the ability to export dropout, slice, flip and interpolate in Opset 10.
* Improvements to ScriptModule including support for multiple outputs, tensor factories and tuples as inputs and outputs.
* More than a dozen additional PyTorch operators supported including the ability to export a custom operator.

Updated docs can be found [here]( and also a refreshed tutorial using ONNXRuntime can be found [here](

Tensorboard is no Longer Considered Experimental

Read the [documentation]( or simply type `from torch.utils.tensorboard import SummaryWriter` to get started!


We include a standard [nn.Transformer]( module, based on the paper “[_Attention is All You Need_](”.  The `nn.Transformer` module relies entirely on an [attention mechanism]( to draw global dependencies between input and output.  The individual components of the `nn.Transformer` module are designed so they can be adopted independently.  For example, the [nn.TransformerEncoder]( can be used by itself, without the larger `nn.Transformer`. New APIs include:

* `nn.Transformer`
* `nn.TransformerEncoder` and `nn.TransformerEncoderLayer`
* `nn.TransformerDecoder` and `nn.TransformerDecoderLayer`

See the [Transformer Layers]( documentation for more info.

Breaking Changes

Comparison operations (`lt (<), le (<=), gt (>), ge (>=), eq (==), ne, (!=)` ) return dtype has changed from `torch.uint8` to `torch.bool` ([21113](

*Version 1.1:*

>>> torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2])
tensor([1, 0, 0], dtype=torch.uint8)

*Version 1.2:*

>>> torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2])
tensor([True, False, False])

For most programs, we don't expect that any changes will need to be made as a result of this change. There are a couple of possible exceptions listed below.

**Mask Inversion**

In prior versions of PyTorch, the idiomatic way to invert a mask was to call `1 - mask`.  This behavior is no longer supported; use the `~` or `bitwise_not()` operator instead.

*Version 1.1*:

>>> 1 - (torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
tensor([0, 1, 1], dtype=torch.uint8)

*Version 1.2:*

>>> 1 - (torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported.
If you are trying to invert a mask, use the `~` or `bitwise_not()` operator instead.

>>> ~(torch.tensor([1, 2, 3]) < torch.tensor([3, 1, 2]))
tensor([False,  True,  True])

**sum(Tensor) (python built-in) does not upcast `dtype` like `torch.sum`**

Python's built-in `sum` returns results in the same `dtype` as the tensor itself, so it will not return the expected result if the value of the sum cannot be represented in the `dtype` of the tensor.

*Version 1.1*:

# value can be represented in result dtype
>>> sum(torch.tensor([1, 2, 3, 4, 5]) > 2)
tensor(3, dtype=torch.uint8)

# value can NOT be represented in result dtype
>>> sum(torch.ones((300,)) > 0)
tensor(44, dtype=torch.uint8)

# torch.sum properly upcasts result dtype
>>> torch.sum(torch.ones((300,)) > 0)
tensor(300)

*Version 1.2:*

# value cannot be represented in result dtype (now torch.bool)
>>> sum(torch.tensor([1, 2, 3, 4, 5]) > 2)
tensor(True)

# value cannot be represented in result dtype
>>> sum(torch.ones((300,)) > 0)
tensor(True)

# torch.sum properly upcasts result dtype
>>> torch.sum(torch.ones((300,)) > 0)
tensor(300)

**TLDR**: use `torch.sum` instead of the built-in `sum`.  Note that the built-in `sum()` behavior will more closely resemble `torch.sum` in the next release.

Note also that masking via `torch.uint8` Tensors is now deprecated, see the **Deprecations** section for more information.

`__invert__` / `~`: now calls `torch.bitwise_not` instead of `1 - tensor` and is supported for all integral+Boolean dtypes instead of only `torch.uint8`.  ([22326](

*Version 1.1*:

>>> ~torch.arange(8, dtype=torch.uint8)
tensor([ 1, 0, 255, 254, 253, 252, 251, 250], dtype=torch.uint8)

*Version 1.2*:

>>> ~torch.arange(8, dtype=torch.uint8)
tensor([255, 254, 253, 252, 251, 250, 249, 248], dtype=torch.uint8)

`torch.tensor(bool)` and `torch.as_tensor(bool)` now infer `torch.bool` dtype instead of `torch.uint8`.  ([19097](

*Version 1.1:*

>>> torch.tensor([True, False])
tensor([1, 0], dtype=torch.uint8)

*Version 1.2:*

>>> torch.tensor([True, False])
tensor([ True, False])

`nn.BatchNorm{1,2,3}D`: gamma (`weight`) is now initialized to all 1s rather than randomly initialized from *U(0, 1)*.  ([13774](

*Version 1.1:*

>>> torch.nn.BatchNorm2d(5).weight
Parameter containing:


We have just released PyTorch v1.2.0.

It has over 1,900 commits and contains a significant amount of effort in areas spanning JIT, ONNX, Distributed, as well as Performance and Eager Frontend Improvements.


[JIT] New TorchScript API


Note: CUDA 8.0 is no longer supported


TensorBoard (currently experimental)

First-class and native support for visualization and model debugging with [TensorBoard](, a web application suite for inspecting and understanding training runs, tensors, and graphs. PyTorch now supports TensorBoard logging with a simple `from torch.utils.tensorboard import SummaryWriter` command. Histograms, embeddings, scalars, images, text, graphs, and more can be visualized across training runs. TensorBoard support is currently experimental. You can browse the docs [here](


[JIT] Attributes in ScriptModules
Attributes can be assigned on a `ScriptModule` by wrapping them with `torch.jit.Attribute` and specifying the type. Attributes are similar to parameters or buffers, but can be of any type. They will be serialized along with any parameters/buffers when you call ``, so they are a great way to store arbitrary state in your model. See [the docs]( for more info.


class Foo(torch.jit.ScriptModule):
    def __init__(self, a_dict):
        super(Foo, self).__init__(False)
        self.words = torch.jit.Attribute([], List[str])
        self.some_dict = torch.jit.Attribute(a_dict, Dict[str, int])

    def forward(self, input: str) -> int:
        return self.some_dict[input]

[JIT] Dictionary and List Support in TorchScript
TorchScript now has robust support for list and dictionary types. They behave much like Python lists and dictionaries, supporting most built-in methods, as well as simple comprehensions and `for…in` constructs.

[JIT] User-defined classes in TorchScript (experimental)
For more complex stateful operations, TorchScript now supports annotating a class with `torch.jit.script`. Classes used this way can be JIT-compiled and loaded in C++ like other TorchScript modules. See [the docs]( for more info.

@torch.jit.script
class Pair:
    def __init__(self, first, second):
        self.first = first
        self.second = second

    def sum(self):
        return self.first + self.second

DistributedDataParallel new functionality and tutorials

`nn.parallel.DistributedDataParallel`: can now wrap multi-GPU modules, which enables use cases such as model parallel ([tutorial]( on one server and data parallel ([tutorial]( across servers.

Breaking Changes
* `Tensor.set_`: the `device` of a Tensor can no longer be changed via `Tensor.set_`.  This would most commonly happen when setting up a Tensor with the default CUDA device and later swapping in a `Storage` on a different CUDA device.  Instead, set up the Tensor on the correct device from the beginning.  ([18832](
* `lr_scheduler.step()`: the calling order has changed: schedulers should now be stepped *after* `optimizer.step()`; calling `lr_scheduler.step()` before the optimizer update is deprecated. ([7889](
* `torch.unique`: changed the default value of `sorted` to `True`.  ([15379](
* **[JIT]** Renamed the `isTensor` API to `isCompleteTensor`. [18437](
* **[JIT]** Remove GraphExecutor's python bindings. [19141](
* **[C++]**: many methods on `Type` no longer exist; use the functional or Tensor method equivalent.  ([17991](
* **[C++]**: the `Backend` constructor of `TensorOptions` no longer exists.  ([18137](
* **[C++, Distributed]**: c10d `ProcessGroup::getGroupRank` has been removed.  ([19147](

New Features

* `torch.tril_indices`, `torch.triu_indices`: added operator with same behavior as NumPy.  ([14904](, [15203](
* `torch.combinations`, `torch.cartesian_prod`: added new `itertools`-like operators.  ([9393](
* `torch.repeat_interleave`: new operator similar to `numpy.repeat`.  ([18395](
* `torch.from_file`: new operator similar to `Storage.from_file`, but returning a tensor.  ([18688](
* `torch.unique_consecutive`: new operator with semantics similar to `std::unique` in C++.  ([19060](
* `torch.tril`, `torch.triu`, `torch.trtrs`: now support batching.  ([15257](, [18025](
* `torch.gather`: add support for `sparse_grad` option.  ([17182](
* `torch.std`, `torch.max_values`, `torch.min_values`, `torch.logsumexp` can now operate over multiple dimensions at once.  ([14535](, [15892](, [16475](
* `torch.cdist`: added operator equivalent to `scipy.spatial.distance.cdist`.  ([16168](, [17173](
* ``: reports detailed version of all libraries.  ([18579](
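
A short sketch of a few of the new operators above at the Python prompt (assuming a recent torch install):

```python
import torch

# torch.repeat_interleave behaves like numpy.repeat:
x = torch.tensor([1, 2, 3])
print(torch.repeat_interleave(x, 2))  # tensor([1, 1, 2, 2, 3, 3])

# torch.combinations mirrors itertools.combinations (default r=2):
print(torch.combinations(x))          # tensor([[1, 2], [1, 3], [2, 3]])

# torch.tril_indices matches numpy.tril_indices:
# row 0 holds row indices, row 1 holds column indices of the lower triangle.
print(torch.tril_indices(2, 2))       # tensor([[0, 1, 1], [0, 0, 1]])
```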

* `nn.MultiheadAttention`: new module implementing multi-head attention from *Attention Is All You Need*.  ([18334](
* `nn.functional.interpolate`: added support for `bicubic`.  ([9849](
* `nn.SyncBatchNorm`: support synchronous Batch Normalization.  ([14267](
* `nn.Conv`: added support for Circular Padding via `mode='circular'`.  ([17240](
* `nn.EmbeddingBag`: now supports trainable `per_sample_weights`.  ([18799](
* `nn.EmbeddingBag`: add support for `from_pretrained` method, as in `nn.Embedding`.  ([15273](
* `RNNs`: automatically handle unsorted variable-length sequences via `enforce_sorted`.  ([15225](
* `nn.Identity`: new module for easier model surgery.  ([19249](
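
The `nn.Identity` bullet deserves a one-liner: it forwards its input unchanged, which makes "model surgery" a simple assignment. A hypothetical sketch (the toy model is illustrative, not from the release notes):

```python
import torch
from torch import nn

# Strip the classifier head off a small model by swapping it for nn.Identity,
# leaving the 8-dim feature extractor untouched.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model[2] = nn.Identity()  # now forwards the features unchanged

out = model(torch.randn(3, 4))
print(out.shape)  # torch.Size([3, 8])
```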

Tensors / dtypes
* `torch.bool`: added support for `torch.bool` dtype and Tensors with that dtype (1-byte storage).  NumPy conversion is supported, but operations are currently limited.  ([16810](

* `optim.lr_scheduler.CyclicLR`: Support for Cyclical Learning Rate and Momentum.  ([18001](
* `optim.lr_scheduler.CosineAnnealingWarmRestarts`: new scheduler implementing Stochastic Gradient Descent with Warm Restarts.  ([17226](
* Support multiple simultaneous LR schedulers.  ([14010](

* `torch.distributions`: now support multiple inheritance.  ([16772](

* `quasirandom.SobolEngine`: new sampler.  ([10505](

* `nn.parallel.DistributedDataParallel`: now supports modules with unused parameters (e.g. control flow, like adaptive softmax, etc). ([18251](, [18953](

TorchScript and Tracer
* Allow early returns from if-statements. ([154463](
* Add an `ignore` annotation, which statically tells the TorchScript compiler to ignore the Python function. ([16055](
* Simple ``  loops on lists. ([16726](
* Ellipses (`...`) in Tensor indexing. ([17763](
* `None` in Tensor indexing. ([18615](
* Support for basic list comprehensions. ([17267](
* Add implicit unwrapping of optionals on `if foo is not None`. ([15587](
* Tensors, ints, and floats will once again be implicitly cast to bool if used in a conditional. ([18755](
* Implement `to()`, `cpu()`, and `cuda()` on ScriptModules. ([15340]( ,  [15904](
* Add support for various methods on lists: [`clear()`](, [`pop()`](, [`reverse()`](, [`copy()`](, [`extend()`](, [`index()`](, [`count()`](, [`insert()`](, [`remove()`](.
* Add support for `sort()` on lists of specialized type (`Tensors`, `int`, `float`, `bool`). ([19572](
* Add support for various methods on strings: ([`index()`](, [`slice()`](, [`len()`](
* Support `` in TorchScript. ( [15976]( )
* Support for `torch.tensor()` in TorchScript. ([14913](,  [19445](
* Support for `torch.manual_seed()` in TorchScript. ([19510](
* Support for `nn.LSTM` in TorchScript. ([15744](
* Support for `nn.init` in TorchScript. ([19640](
* Add `hash()` builtin. ([18258](
* Add `min()` and `max()` builtins for numerical types. ([15680](
* Add `isinstance()` builtin, which performs a static type check. ([15076](
* Add `train()` / `eval()` / `is_training()` to C++ ScriptModule API. ([16044](
* Allow List arguments to Python functions called from TorchScript. ([15721](
* Allow using `std::vector` and `std::unordered_map` as arguments to custom operators. ([17587](
* Tracer: now allows passing static dicts and lists as trace inputs. ([18092](, [19580](
* Allow generic containers as ScriptModule inputs. ([16482](
* Allow `nn.Sequential` in ModuleList. ([16882](

Experimental Features
* [Quantization] **(API unstable)**: added limited support for quantized datatypes via `torch.qint8` dtype, `torch.quantize_linear` conversion function.  ([18230](
* [MKLDNN tensor] **(API unstable)**: Added limited (opaque) support for `MKLDNN` tensors via `Tensor.to_mkldnn()`; operators are currently limited to ResNext101 operators.  ([17748](


* `torch.min`, `torch.max`, `torch.median`, `torch.mode`, `torch.kthvalue`, `torch.symeig`, `torch.eig`, `torch.pstrf`, `torch.qr`, `torch.geqrf`, `torch.solve`, `torch.slogdet`, `torch.sort`, `torch.topk`, `torch.gels`, `torch.triangular_solve`, `torch.svd` now return namedtuples describing their outputs. ([16186](, [16950](, [17093](, [17195](, [15429](
* `torch.empty` (and other factory functions): now take a `pin_memory` kwarg; can now pin without going through the `torch.Storage` interface.  ([18455](
* `torch.histc`: Now supported on CUDA.  ([15842](
* `torch.unique`: Add `return_counts`.  ([18391](, [18651](
* `torch.logspace`: add the ability to specify a `base`.  ([19542](
* `torch.set_printoptions`: added scientific notation support.  ([16876](
* `torch.btrifact` now handles tensors with greater than 3 dimensions.  ([14964](
* `torch.kthvalue`: now supported on CUDA.  ([17544](
* `torch.abs`: now supported on `uint8` and `int8` dtypes.  ([16893](
* `torch.stack`, ``: now supported for CPU half tensors.  ([16389](
* `torch.cross`: added support for negative dimensions. ([17582](
* `torch.lerp`: add support for `weight` as a Tensor.  ([17348](
* `torch.transpose`: Made consistent with NumPy: 1-d and 0-d arrays are accepted and returned as-is.  ([17462](, [17535](
* `torch.linspace`, `torch.logspace` can now be used with `steps=1` and `start != end`.  ([14748](
* `torch.cholesky`: changed the derivative from a triangular matrix to symmetric matrix.  ([19116](
* `torch.lerp`: Improved numerical stability.  ([18871](
* `torch.logdet`, `torch.slogdet`: improve numerical precision.  ([18449](
* `Tensor.__contains__` is now supported. ([17733](
* `Tensor.fill_` and `torch.zeros` now support half on CPU.  ([17536](
* `Tensor.resize_as_`, `Tensor.view`: now supported on half CPU tensors.  ([18821](
* `Tensor indexing`: allow indexing via NumPy booleans.  ([14932](
* `nn.EmbeddingBag`: enable half precision dense backward.  ([19293](
* `nn.Embedding`: fix dense Embedding to work with double backwards.  ([9078](
* `nn.MaxPool1d`: Allow list and tuples to be passed as `output_size`.  ([16489](
* `nn.CTCLoss`:  support zeroing infinite losses via `zero_infinity` argument.  ([16199](
* `nn.Dropout`: add support for enabling during eval.  ([17549](
* `nn.MSELoss`: add warning about unexpected broadcasting.  ([18349](
* `nn.Module.load_state_dict`: also return `missing_keys` and `unexpected_keys`.  ([18668](
* `nn.parallel.data_parallel`: Enforce devices match `device_ids`.  ([17129](
* `torch.device`: handle in more places that used to accept only device ordinals.  ([14929](
* `dtype.int8` tensors can now be converted to NumPy arrays.  ([14710](
* `nn.functional.gumbel_softmax`: allow multidimensional input with `dim` argument.  ([13339](
* `nn.functional.cosine_similarity`: improved precision.  ([18250](
* `torch.autograd`: Don't keep unnecessary saved_inputs alive, increasing memory efficiency.  ([16583](
* `torch.autograd.profiler`: add Self (non-nested) CPU Time Total, CPU time total ([19378](
* `DataLoader`: support accepting a custom memory pinning function.  ([16743](
* `DataLoader`: retry libshm on EINTR.  ([15964](
* `DataLoader`: fixed an issue with `pin_memory` and `PackedSequence`.  ([18079](
* `data.utils.collate`, `data.utils.pin_memory`: now preserve namedtuples.  ([16440](
* Use `IndexError` instead of `RuntimeError` on many indexing error cases.  ([17049](, [17114](
* Support indexing a `torch.float16` tensor on CPU.  ([17645](
* Add (limited) error checking in case of internal overlap on inplace operators.  ([19317](, [17927](
* `utils.checkpoint.checkpoint`: support `None` as an argument to checkpoint function.  ([17969](
* `torch.autograd`: added more information for `one of the variables needed for gradient computation has been modified by an inplace operation` exception.  ([18523](
* `cuda.synchronize`: add a device argument.  ([19573](
* `cuda.reset_max_memory_*`: now supported.  ([15985](
* `distributions.Independent`:  can now calculate KL Divergence.  ([17681](
* `torch.distributed.new_group`: now supports overriding default backend. ([18595](
* `torch.distributed.init_process_group`: will now propagate timeout to underlying Store. ([16571](
* **[JIT]** Preserve module hierarchy on traced modules. ([15101](
* **[JIT]** Add metadata for TracedModules. ([17311](
* **[JIT]** Improve portability of int and float checks. ([19532](
* **[JIT]** Preserve method parameter names during serialization. ([16750](
* **[JIT]** Add a correctness check for C++ types to custom operators. ([15247](
* **[JIT]** Added a few extra python bindings to help with walking the IR graph from Python. [17822](
* **[JIT Error Messages]** Print out operator suggestions for "unknown builtin op" error. ([15183](
* **[JIT Error Messages]** Better error message when creating a module instance in TorchScript. ([16416](
* **[JIT Error Messages]** Print suggestion to add `nn.Module` attributes to `__constants__` when they are using in TorchScript. ([18164](
* **[JIT Error Messages]** ``: Improve error message when you try to save a ScriptModule. ([15321](
* **[JIT Error Messages]** ``: Improve error message when trying to save a model with Python code. ([16850](
* **[JIT Error Messages]** Better errors when trying to close over a Tensor with grad enabled while tracing. ([18298](, [19645](
* **[JIT Error Messages]** Better error when trying to add a Tensor to `__constants__`. ([16724](
* **[JIT Error Messages]** Better error when a module list isn't added to `__constants__`. ([17167](
* **[JIT Error Messages]** Add a warning when attempting to trace legacy constructors. ([16770](
* **[JIT Error Messages]** Improve hint when trying to trace non-deterministic nodes. ([17957](
* **[C++]** `nn::Module`: added Python interop.  ([13481](
* **[C++]** `autograd::profiler`: is now supported.  ([16580](
* **[C++]** allow detection of C++ ABI flag for cpp extensions from available runtime information.  ([18994](
* **[C++]** `torch.argsort` is now supported in C++.  ([17099](
* **[C++]** `Tensor.isnan`: now supported in C++.  ([15722](
* **[C++]**: Added named submodule support to `nn::Sequential`.  ([17552](
* **[C++]**: Kaiming Initialization.  ([14718](
* **[C++]** `torch::data::transforms::Normalize`: now supported in C++.  ([15891](
* **[C++]**: Support call operator on module holder calling forward.  ([15831](
* **[C++]**: Random and Sequential distributed samplers.  ([16910](
* **[C++]**: pretty printing of C++ Modules.  ([15326](
* **[C++]** Support serializing `std::vector<torch::Tensor>`.  ([19677](
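
Several of the improvements above are visible directly at the Python prompt; a short sketch of the namedtuple return values and the new `torch.unique` counts (assuming a recent torch):

```python
import torch

t = torch.tensor([3.0, 1.0, 2.0])

# Reductions like torch.max now return a namedtuple (values, indices):
result = torch.max(t, dim=0)
print(result.values, result.indices)  # tensor(3.) tensor(0)
values, indices = result              # still unpacks like a plain tuple

# torch.unique gained return_counts (and sorts by default):
u, counts = torch.unique(torch.tensor([1, 1, 2]), return_counts=True)
print(u, counts)                      # tensor([1, 2]) tensor([2, 1])
```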

Bug Fixes

* ``: correct erroneous calculation on large tensors.  ([15653](
* `torch.mean` (and other reductions): fix incorrect calculation on CUDA on large inputs.  ([16023](
* `nn.Conv`: correctly handle non-contiguous inputs on MKLDNN convolution codepath.  ([16300](
* `Tensor.eq_`:  Fix erroneous calculation.  ([15475](
* `torch.mean`: Fix fp16 output calculation.  ([14878](
* `nn.PoissonNLLLoss`:  Properly handle `reduction=None`.  ([17358](
* **[JIT]** Fix bug where custom ops could get optimized out if their outputs weren't used. ([18711](
* **[JIT]** Fix bug where the model serializer would accidentally reorder statements. ([17557](

* `Tensor.round` now consistently rounds half to even.  ([17443](
* `Tensor.resize_`: Fix some 0-element cases.  ([14874](
* `Tensor.numpy`: Fix conversion of `torch.int8` dtype.  ([15194](
* `Tensor.grad`: correctly handle `del`.  ([16525](
* `Tensor.clamp`: correctly handle NaN on CUDA.  ([15479](
* `Tensor.topk`: properly set launch bounds on CUDA.  ([17296](
* `Tensor.kthvalue`: treat NaN as bigger than any number.  ([17824](
* `Tensor.copy_`: Properly synchronize on src and dst streams.  ([16966](
* `Tensor indexing`: Fix incorrect dimension error message.  ([16495](
* `Tensor.coalesce`, `Tensor.clone`, `Tensor.to_dense`: fixed for sparse 0-dimensional tensors.  ([17379](
* `torch.isinf`: Don't error out on integral tensors.  ([15489](
* `torch.argsort`, `torch.sort`: Match NumPy by considering NaNs to be larger than any number.  ([15886](
* `torch.geqrf`, `torch.ormqr`: when an `out` parameter is specified, dispatch to the correct function.  ([16964](
* `torch.cuda.get_device_name` / `torch.cuda.get_device_capability`: Fix handling of optional.  ([17222](
* `Tensor.tril_` / `Tensor.triu_`: properly reuse input memory.  ([17031](
* `torch.arange`: fix shape inconsistency between CPU and CUDA.  ([18462](
* `torch.empty` (and other size-based factory functions): properly enforce non-negative sizes.  ([17077](
* `torch.load`: support serializing / deserializing `pathlib.Path` object.  ([18562](
* `nn.BatchNorm`: correctly handle very large batches.  ([17047](
* `nn.Softmax` / `nn.LogSoftmax`: fix double backward for `torch.half`.  ([17330](
* `nn.Softmax`: handle empty inputs in backward.  ([17259](
* `nn.NLLLoss`: Fix crash when `ignore_index` is out-of-bounds on CPU.  ([17328](
* `nn.Softmax`, `nn.LogSoftmax`: handle 0-element inputs.  ([17651](
* `nn.CTCLoss`: correct error checking.  ([16269](
* `nn.Conv`: better report convolution size mismatch.  ([17436](
* `torch.nn.functional.cosine_similarity`: fix output sometimes returning result > 1.0.  ([18168](
* `nn.parallel.data_parallel`: Fix handling of buffers that require_grad.  ([13352](
* `nn.parallel.data_parallel`: would previously sometimes free tensors before all pending operations finished. ([18465](
* `torch.distributed.broadcast`: fixed repeated calls leading to OOM. ([19219](
* `torch.multiprocessing`: fix serialization of integer `nn.Parameters`.  ([18639](
* `torch.multiprocessing`: Fix handling of `distributions` on CUDA.  ([16854](
* `torch.nonzero`: Fix for 0-dimensional tensors on CUDA.  ([17406](
* `torch.slogdet`: Fix `sign` requiring grad when `input` required grad.  ([16337](
* `torch.cuda.Stream`: Properly restore stream on destination device when switching devices.  ([17439](
* `torch.cuda.Stream`: Fixed synchronization issue when used with non-current device.  ([15689](
* `torch.cuda.Stream`: properly change device in stream context manager.  ([16128](
* `DataLoader`: fixed a hang when no data was read and the buffer size is smaller than the chunk size.  ([17409](
* `DataLoader`: `_utils.collate.default_collate` now converts bool lists to byte Tensors, not integer tensors.
* `DataLoader`: ensure dataset is indexed by integers.  ([17649](
* ``:  Handle transposed dense tensors in backwards.  ([18737](
* `torch.sparse.sum`: Fix parsing of `dim`.  ([16517](
* `` / `torch.sparse.addmm`: fix broadcasting and using uninitialized data.  ([16572](
* `Tensor.to_sparse`: Fix for 0-dimensional tensors.  ([17406](
* `SparseTensor`: fix add with non-contiguous `values` tensors.  ([18179](
* Fix `compare_exchange_weak` in `weak_intrusive_ptr`.  ([16302](
* `utils.model_zoo.load_url`: Fix race condition.  ([16578](
* ``: have `len` properly take into account `num_samples`.  ([15991](
* `torch.distributions`:  Fix precision issue with expansion that prefers `probs` over `logits`.  ([18614](
* `distributions.dirichlet.Dirichlet`: fixed an underflow issue.  ([17488](
* `distributions.binomial.Binomial.log_prob`: fixed numerical stability issue.  ([15962](
* `Caching Allocator`: Free all blocks with outstanding events on OOM-retry.  ([19222](
* `torch.dtype`: fix pickling issue with Python 2.  ([18045](
* ``: Fix SIGCHLD checking.  ([19421](
* `optim.Optimizer`: Properly copy defaults.  ([19308](
* `optim.lr_scheduler.CosineAnnealingLR`: Fix division-by-zero error.  ([19180](
* `optim.lr_scheduler.ReduceLROnPlateau`: fix bug when the argument to `step` is reused outside the function.
* `cuDNN`: fix race condition with multiple threads calling into the same device.  ([15080](
* `cuDNN`: Properly specify accumulation types.  ([16825](
* `cuDNN`: Fix incorrectly selecting slower algorithms in certain cases.  ([15881](
* `cuFFT`:  Properly handle CUDA contexts.  ([19300](
* Fix infinite loop in reduction functions when get_max_threads is nonzero but num_threads is 1.  ([15114](
* Fix tensor printing bug with Python 2.  ([12732](
* `MKLDNN`: fix thread safety.  ([17022](
* **[JIT]** `floordiv`: Fix integer division and divide-by-zero semantics. ([15813](
* **[JIT]** Fix bug in alias analysis that disabled optimizations even in models without mutation. ([18416](
* **[JIT]** `ord()`: Fix handling of utf8 chars. ([19423](
* **[JIT]** Fix error when too many parameters are passed to a fused CUDA kernel. ([18063](
* **[JIT]** Fix bug where common subexpression elimination accidentally introduced aliasing to function outputs. ([19576](
* **[JIT]** Fix infinite loop in `requires_grad` analysis pass. ([18361](
* **[JIT]** Fix ordering of parameters in ``. ([18198](
* **[JIT]** Fix contiguous autodiff and AutoGradZero inconsistency. ([18633](
* **[JIT]** Fix error reporting in NVRTC use of the fuser. ([18327](
* **[JIT]** Ensure GIL is acquired before doing module lookup on import. ([17135](
* **[JIT]** Fix bug where `_unique_state_dict` could contain duplicate Tensors. ([18139](
* **[C++]**: Fix module serialization issue where one submodule doesn't have any parameters, but its submodules do.  ([15033](
* **[C++]**: Add `Stream` and `Event` APIs.  ([15937](
* **[C++]**: Fix Module serialization incompatibility between Python and C++ with weight-less layers.  ([19740](
* **[C++]**: Properly pass `extra_cuda_cflags` to C++ extensions on Windows.  ([18638](
* **[C++]** Make SGD semantics match python.  ([15840](
* **[C++]** `torch::nn::init::orthogonal_`: match Python API.  ([18915](

* `torch.btrifact`: the deprecated `info` argument has been removed.  ([14935](
* `torch.potrs` has been deprecated, use `torch.cholesky_solve` instead.  Note that `upper` defaults to `False`  for `torch.cholesky_solve`, and `True` for `torch.potrs`.  ([15334](
* `torch.pstrf` is deprecated; use `torch.cholesky` instead.  Note that `upper` defaults to `False`  for `torch.cholesky`, and `True` for `torch.pstrf`.  ([17866](
* `torch.potri` is deprecated; use `torch.cholesky_inverse` instead.  Note that `upper` defaults to `False`  for `torch.cholesky_inverse`, and `True` for `torch.potri`.  ([19498](
* `torch.btrifact_with_info` has been deprecated; use `` with `get_infos=True` instead.([18435](
* `torch.btrifact` has been deprecated; use the new name `` instead.  ([18435](
* `torch.gesv` is deprecated; use the new name `torch.solve` instead.  ([18060](
* `torch.trtrs` has been deprecated; use the new name `torch.triangular_solve` instead.  ([18213](
* `torch.btriunpack` has been deprecated; use the new name `torch.lu_unpack` instead.  ([18529](
* `torch.btrisolve` has been deprecated; use the new name `torch.lu_solve` instead.  ([18726](
* **[C++]** `IntList` has been deprecated, use `IntArrayRef` instead, as it better describes the type and ownership semantics in C++.  ([16751](
*  **[C++]** Dispatch macros with `Type` parameters, e.g. `AT_DISPATCH_ALL_TYPES(tensor.type(), ...`, are now deprecated; use `ScalarType` instead, e.g. `AT_DISPATCH_ALL_TYPES(tensor.scalar_type(), ...`.  ([17527](, [17996](
* **[C++]** the deprecated `variable_tensor_functions` have been removed.  ([15003](


* `nn.BatchNorm` CPU inference speed increased up to ~19x.([19152](
* `nn.AdaptiveAvgPool`: speed up common-case of size=1 output by ~30x.  ([17011](
* `nn.EmbeddingBag` CPU performance increased by ~4x.  ([19329](
* `Tensor.copy_`: sped up larger tensor copy ~2-3x, small regression in small tensor copy.  ([18618](
* `torch.nonzero`: is now ~2x faster than numpy on CPU.  ([15190](
* Improve caching allocator for Pascal and newer GPUs; 10-20% better memory utilization on Mask-RCNN.  ([17120](
* `reduction functions`: Speed up some large Tensor cases by 50-80%.  ([17428](
* **[JIT]** Graph fuser: better fusion for backwards graphs in the presence of broadcasting. ([14957](
* **[JIT]** Graph fuser: `batch_norm` fusion for inference. ([15146](
* **[JIT]** Graph fuser: `layer_norm` fusion for inference. ([18266](


* `torch.abs`, `torch.frac`, `torch.reciprocal`, `torch.neg` have been vectorized and parallelized ([19041](
* `torch.bmm`: CPU performance increased by 2x.  ([19338](
* `torch.sort`: CUDA performance increased by ~2x.  ([19379](
* `` on CPU is now ~4x faster in the case where inputs are contiguous and `dim` != 0.  ([17032](
* `torch.multinomial` fixed a 2x performance regression.  ([17121](
* `torch.empty` (and other factory functions): reduce overhead by 20-40%.  ([17565](
* `torch.linspace` has been parallelized on CPU.  ([15320](
* `torch.logspace` has been parallelized on CPU.  ([15438](
* `torch.range` has been parallelized on CPU.  ([15484](
* `torch.arange` has been parallelized on CPU.  ([15667](
* `torch.load`: avoid unnecessary CPU-to-CUDA copy.  ([17297](
* `reduction functions`: improve efficiency on CUDA.  ([16224](, [17040](
* Speed up some GEMM cases on CPU by up to 7x.([17730](
* Tensor iterator loop unrolling.  ([17667](
* `sparse/dense matrix multiply`: improve speed by ~5x.  ([16905](
* `distributions.MultivariateNormal`: sped up.  ([17294](
* **[JIT]** Graph fuser: pow scalar exponent / base autodiff, fusion ([19324](
* **[JIT]** Graph fuser: allow fusion of function float arguments. ([18087](
* **[JIT]** Shape analysis: specialize optional Tensor inputs to graphs. ([18360](
* **[JIT]** Shape analysis: various correctness improvements. ([18271](
* **[JIT]** Shape analysis: `aten::_convolution` now participates in shape analysis. ([16837](
* **[JIT]** Autodiff: coverage for ops used in maskrcnn & BERT. ([16689](
* **[JIT]** Autodiff: support for scalar comparison ops and `randlike`. ([14740](
* **[JIT]** Autodiff: support for `adaptive_avg_pool2d`. ([15459](
* **[JIT]** Autodiff: support for `erf` and `erfc`. ([15139](
* **[JIT]** Autodiff: support for `layernorm`. ([17702](
* **[JIT]** Autodiff: support for `tanh`. ([17816](
* **[JIT]** Autodiff: support for `matmul`/`dropout`. ([17523](
* **[JIT]** Autodiff: specialized CUDA impl for dropout. ([17756](
* **[JIT]** Constant folding: improved inlining of control flow. ([16244](


* `Tensor.scatter_`: add documentation about `value` parameter.  ([17467](
* `Tensor.unfold`: correctly document `dimension` parameter, not `dim`.  ([19020](
* `Tensor.is_floating_point()` is now documented.  ([15704](
* `torch.cholesky`: Fix broken `upper` example in documentation.  ([15215](
* `torch.gesv`: document `out` parameter.  ([15649](
* `torch.mul`: better explain elementwise multiplication.  ([15664](
* `torch.eig`, `torch.symeig`: better explain backwards limitations.  ([15929](
* `torch.ormqr`: fixed output specification.  ([15694](
* `torch.from_numpy`: replaced usage with `torch.as_tensor` in documentation.  ([16587](
* `torch.mvlgamma`: Fix the constant in the docs.  ([17045](
* `torch.mode`: more precisely describe what is returned.  ([17069](
* `torch.upsample`: documentation now matches `torch.interpolate`.  ([17134](
* `torch.arange`: correct `dtype` documentation.  ([18604](
* `torch.cumprod`: document `out` parameter.  ([19340](
* `torch.nonzero`: document indices being returned lexicographically.  ([19539](
* `torch.nn.functional.interpolate`: better explain `align_corners` parameter.  ([14806](
* `torch.nn.functional.pad`: documentation has been made consistent with other functional ops.  ([15984](
* `nn.functional.grid_sample`: clarify behavior of padding.  ([19754](
* `nn.TripletMarginLoss`: correct type of `swap` parameter.  ([18115](
* `nn.CrossEntropyLoss`: clarify `ignore_index` documentation.  ([18117](
* `nn.CrossEntropyLoss`: the input format is more clearly explained.  ([15990](
* `nn.CTCLoss`: Clarify a number of ambiguities.  ([18415](
* `nn.BCEWithLogitsLoss`: add better explanation.  ([19212](
* `nn.BCEWithLogitsLoss`: better explain positive samples.  ([17258](
* `nn.ModuleList` / `nn.ParameterList`: update documentation.  ([17731](
* `nn.Module.load_state_dict`: correct semantics of `strict`.  ([17618](
* `nn.parallel.DataParallel`: more accurately specify how different argument types are handled.  ([15993](
* `nn.parallel.DistributedDataParallel`: Clarified batch size requirements.  ([16010](
* `torch.distributed`: Document mixed-precision training.  ([15440](
* `torch.multiprocessing`: Include example multiprocessing code.  ([16345](
* `torch.autograd`: Better explain computing Jacobian-vector product.  ([15197](
* `torch.cuda.get_rng_state`, `torch.cuda.set_rng_state`: document taking a `device` object.  ([14324](
* `torch.device`: Fix example of passing `device` to tensor factory.  ([16839](
* `DataLoader`: update documentation to describe how workers are managed.  ([18091](
* Unified shape formats throughout the documentation.  ([15741](
* Update documentation for `reduction` arguments to use non-deprecated format.  ([17300](
* `mark_non_differentiable`: document correct semantics.  ([17891](
* Warn about memory overlaps on inplace operations.  ([17576](
* Fix a number of small issues with conv and pooling docstrings.  ([17052](
* Fix a number of small issues with padding and activation docstrings.  ([17197](
* **[C++]**: mention packed accessors in Tensor basics.  ([19464](


Exporting More Torch Operators to ONNX

* Export torch.isnan to ONNX ([17698](
* Export torch.flatten to ONNX ([16240](
* Export torch.where, torch.ceil, torch.floor to ONNX ([18571](
* Export torch.narrow to ONNX ([17550](
* Export torch.argmax and torch.argmin to ONNX ([17382](, [18264](, [18261](
* Export adaptive_avg_pool1D, adaptive_avg_pool2D, adaptive_avg_pool3D, adaptive_max_pool1D, adaptive_max_pool2D, adaptive_max_pool3D to ONNX ([17412](
* Export torch.nonzero to ONNX ([17036](, [18047](
* Export torch.erf to ONNX ([16106](
* Export torch.split ([15092](
* Export,, torch.le,, torch.eq, to ONNX ([15677](
* Export torch.expand and to ONNX ([15050](
* Export torch.nn.LogSigmoid to ONNX ([14830](
* Export torch.nn.RReLU to ONNX ([14781](
* Export torch.reshape and torch.reshape_as to ONNX ([16632](, [16971](
* Replace use of ConstantLike with ConstantOfShape ([16095](, [16214](

Extending Existing Exporting Logic

* Enable dim support in torch.nn.Softmax's export ([18482](
* Support exporting squeeze & unsqueeze with negative dim attribute ([19297](
* Support exporting max_pool1d, max_pool2d, max_pool3d with indices ([16455](
* Add dtype support in torch.logsoftmax and torch.softmax's export ([17672](
* Support ceil_mode in max_pool_1d, max_pool2d, max_pool3d, avg_pool1d, avg_pool2d, avg_pool3d's export ([16769](

Optimizing Exported ONNX Graph

* Add constant folding in ONNX exporter ([18698](
* Retain the parameter names in ONNX exporter ([17551](
* Omit slice op if it is a non-op ([19155](
* Add a flag to strip doc_string from exported ONNX models ([18882](
* Omit torch.dropout if the model is in eval mode ([16547](

Adding Utility Functions and Refactoring

* Remove unused arg f from _model_to_graph(). ([19647](
* Add the support for stable ONNX opsets in exporter ([16068](, [17419](
* Set the default ONNX opset to the latest stable opset (i.e., 9) ([17736](
* Add a utility function to check whether an ONNX export is in progress ([19050](
* Refactoring serialization of ONNX initializers to be name-based ([17830](
* Expose dim() on type and use it in ONNX symbolics ([15933](
* Add scalar_type_to_pytorch_type dict in ONNX symbolic ([15965](
* Add an assertion to check the number of the parameters passed to ONNX exporter ([18145](


* Fix bug caused by mismatched types in rsub ([15707](
* Fix list structure supports in ONNX exporter ([19102](
* Fix case for `activations` attribute in nn.RNN ONNX export. ([19368](
* Minor fix for onnx ConstantOfShape export ([18199](
* Fix export of torch.min and torch.max in their reduction forms ([15241](
* Fix ONNX export of logical ops to have correct output datatype ([15185](
* Fix typo in docstring ([18216](


Note: our conda install commands have slightly changed. Version specifiers such as `cuda100` in `conda install pytorch cuda100 -c pytorch` have changed to `conda install pytorch cudatoolkit=10.0 -c pytorch`

Breaking Changes

There are no breaking changes in this release.

Bug Fixes


- Higher order gradients for CPU Convolutions have been fixed (regressed in 1.0.0 under MKL-DNN setting) 15686
- Correct gradients for non-contiguous weights in CPU Convolutions 16301
- Fix ReLU on CPU Integer Tensors by fixing vec256 inversions 15634
- Fix bincount for non-contiguous Tensors 15109
- Fix torch.norm on CPU for large Tensors 15602
- Fix eq_ to do equality on GPU (was doing greater-equal due to a typo) (15475)
- Workaround a CuDNN bug that gave wrong results in certain strided convolution gradient setups
- blacklist fft algorithms for strided dgrad (16626)


- Fix cuda native loss_ctc for varying input length (15798)
- this avoids NaNs in variable length settings
- C++ Frontend: Fix serialization (15033)
- Fixes a bug when (de-)serializing a hierarchy of submodules in which one submodule doesn't have any parameters, but its submodules do
- Fix derivative for mvlgamma (15049)
- Fix numerical stability in log_prob for Gumbel distribution (15878)
- multinomial: fix detection and drawing of zero probability events (16075)


- PyTorch binaries were [crashing on AWS Lambda]( and a few other niche systems, stemming from CPUInfo handling certain warnings as errors. Updated CPUInfo with relevant fixes.
- MKL-DNN is now statically built, to avoid conflicts with system versions
- Allow ReadyQueue to handle empty tasks (15791)
- Fixes a segfault with a DataParallel + Checkpoint neural network setting
- Avoid integer divide by zero error in index_put_ (14984)
- Fix for model inference crash on Win10 (15919) (16092)
- Use CUDAGuard when serializing Tensors:
- Before this change, `` and `torch.load` would initialize the CUDA context on GPU 0 if it hadn't been initialized already, even if the serialized tensors are only on GPU 1.
- Fix error with handling scalars and __rpow__, for example `1 ** x`, where x is a PyTorch scalar (16687)
- Switch to CUDA implementation instead of CuDNN if batch size >= 65536 for affine_grid (16403)
- CuDNN crashes when batch size >= 65536
- [Distributed] TCP init method race condition fix (15684)
- [Distributed] Fix a memory leak in Gloo's CPU backend
- [C++ Frontend] Fix LBFGS issue around using inplace ops (16167)
- [Hub] Fix github branch prefix v (15552)
- [Hub] url download bugfix for URLs served without Content-Length header


- LibTorch binaries now ship with CuDNN enabled. Without this change, many folks saw significant perf differences while using LibTorch vs PyTorch, this should be fixed now. [14976](
- Make btriunpack work for high-dimensional batches and run faster than before (15286)
- improve performance of unique with inverse indices (16145)
- Re-enable OpenMP in binaries (got disabled because of a CMake refactor)


- create type hint stub files for module torch (16089)
- This will restore auto-complete functionality in PyCharm, VSCode etc.
- Fix sum_to behavior with zero dimensions (15796)
- Match NumPy by considering NaNs to be larger than any number when sorting (15886)
- Fixes various error message / settings in dynamic weight GRU / LSTMs (15766)
- C++ Frontend: Make call operator on module holder call forward (15831)
- C++ Frontend: Add the normalize transform to the core library (15891)
- Fix bug in torch::load and unpack torch::optim::detail namespace (15926)
- Implements Batched upper triangular, lower triangular (15257)
- Add torch.roll to documentation (14880)
- (better errors) Add backend checks for batch norm (15955)


- Add better support for bools in the graph fuser (15057)
- Allow tracing with fork/wait (15184)
- improve script/no script save error (15321)
- Add self to Python printer reserved words (15318)
- Better error when torch.load-ing a JIT model (15578)
- fix select after chunk op (15672)
- Add script standard library documentation + cleanup (14912)



Windows support

PyTorch now officially supports Windows. We provide pre-compiled Conda binaries and pip wheels for Python 3.5 and 3.6.
PyTorch on Windows doesn't support `distributed` training and might be slightly slower than Linux / OSX because Visual Studio supports an older version of OpenMP.

As always, you can use the commands at to install PyTorch on Windows
We have an FAQ that answers most questions you might have around Windows here:

ONNX Improvements

New ONNX operators
- Support export `torch.max(input, dim)` and `torch.min(input, dim)` [6220](
- Add symbolic for `ReLU` to support exporting to ONNX [5759](
- Add `sum`, `prod`, `sqrt` and improve `log_softmax` [4579](
- Add ONNX support for `InstanceNorm` [4626](
- Add ONNX symbolic for `Elu` [3453](
- Add ONNX symbolic for `UpsamplingNearest2d` [3450](

- Print source location when ONNX export fails for a node [5652](
- Export onnx protobuf bindings to python [6651](
- Support `output_padding` in `ConvTranspose` [4583](

Better RNN support
PyTorch can now export a subset of RNNs to ONNX [4409](

- Add Elman RNN export to ONNX [4613](
- Support batch-first in ONNX export of padded sequences [5360](
- Bidirectional Elman RNN export to ONNX [5120](
- Handle sequence lengths correctly when exporting RNNs to ONNX [4695](
- Support GRU export to ONNX [4390](

- Fix a bug in ONNX symbolic of 3d average pooling [6101](
- Fix onnx export of replication/reflection pad [4263](

Miscellaneous improvements
* implement ``__dir__`` for Tensors, so that editors can automatically auto-complete and query for the possible fields in Tensors

* Add ``numpy()`` and ``from_numpy()`` to ``HalfTensor``
* Enable `TensorDataset` to have any number of input tensors.

* Add `padding_value` to `torch.nn.utils.rnn.pad_sequence`
* Add `total_length` option to `pad_packed_sequence`, which is useful when using `DataParallel`, as we can ensure that unpacked sequences all have the same length.
* Improve numerical precision of `torch.arange`, making it consistent with `numpy.arange`
* `torch.load()` and `` support arbitrary file-like objects
* `torch.nn.functional.grid_sample` now supports 2D (spatial) and 3D (volumetric) inputs
* set python random seed in `DataLoader` workers, in order to improve experiment reproducibility
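The padding utilities above fit together roughly as follows; names and shapes here are illustrative, not from the release notes:

```python
import torch
from torch.nn.utils.rnn import (
    pad_sequence, pack_padded_sequence, pad_packed_sequence)

seqs = [torch.ones(4), torch.ones(2)]           # variable-length sequences
padded = pad_sequence(seqs, padding_value=0.0)  # shape (max_len=4, batch=2)
packed = pack_padded_sequence(padded, lengths=[4, 2])
# total_length pins the time dimension of the unpacked result, which keeps
# shapes consistent across DataParallel replicas.
unpacked, lengths = pad_packed_sequence(packed, total_length=4)
assert tuple(unpacked.shape) == (4, 2)
```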

* Add `__delitem__` to `nn.Sequential`. Now one can delete arbitrary elements of a `nn.Sequential`.

For example:

model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 2))
del model[1]  # deletes nn.ReLU

* `ReduceLROnPlateau` is now serializable [5300](
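A minimal sketch of round-tripping the scheduler state (the optimizer and hyperparameters are placeholders):

```python
import torch

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, patience=5)
state = sched.state_dict()  # now serializable, e.g. via

restored = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, patience=2)
restored.load_state_dict(state)
assert restored.patience == 5  # loaded state wins over the constructor arg
```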

* Add option to flush denormal numbers on CPU. [5294](
* PyTorch now exposes the gradients of conv1d, conv2d and conv3d with respect to the input and the weights [5408](
* Add support for calling `pack_padded_sequence` with either list or with a Tensor [5133](
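The denormal-flushing option can be exercised as below; note it returns `False` on CPUs without the required support (SSE3 on x86), so the assertion is guarded:

```python
import torch

supported = torch.set_flush_denormal(True)
if supported:
    # Denormal values are flushed to zero once the mode is enabled.
    tiny = torch.tensor([1e-323], dtype=torch.float64)
    assert tiny.item() == 0.0
torch.set_flush_denormal(False)  # restore the default behavior
```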
- Support negative indexing for ``padding_idx`` in ``nn.Embedding`` [4496](
- Implement backward pass for ``pack_padded_sequence`` [4512](
- Add ``nn.utils.rnn.pad_sequence`` and ``nn.utils.rnn.pack_sequence`` to pad lists of variable length Tensors with ``0`` and to pack a list of variable length Tensors.
- Add ``torch.cuda.memory_cached``, ``torch.cuda.max_memory_cached``, ``torch.cuda.memory_allocated``, and ``torch.cuda.max_memory_allocated`` methods
for checking CUDA memory usage [4511](
- Allow viewing on noncontiguous tensors if the new view size is compatible with the tensor's original size and stride. [4062](
- ``NLLLoss`` and ``CrossEntropyLoss`` now support more than 2 dimensions. [4654](
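For example, the loss functions now accept `(N, C, d1, d2, ...)` inputs directly, which is convenient for dense per-pixel classification (shapes below are illustrative):

```python
import torch
import torch.nn as nn

logits = torch.randn(2, 3, 4, 4)         # batch=2, classes=3, 4x4 map
target = torch.randint(0, 3, (2, 4, 4))  # a class index per pixel
loss = nn.CrossEntropyLoss()(logits, target)
assert loss.dim() == 0  # scalar, averaged over batch and pixels
```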

- Add an option to not show ``model_zoo`` download progress bar [4135](
- You can now assign modules to indices of ``nn.Sequential``. [4931](
- You can create tensors with a numpy ``np.longlong`` array [4367](
- Change the autograd execution order to use good heuristics. This greatly improves memory usage for large models. [4746](

- Add AMSgrad mode to ``Adam`` and ``SparseAdam`` optimizers. [4034](

- Better ``torch.autograd.profiler`` support for CUDA profiling using the ``cudaEvent`` API. [3734](

- ``torch.set_num_threads`` also sets the respective MKL option so you won't need to use an environment variable to control it. [4949](
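A quick sanity check of the thread-count setting:

```python
import torch

torch.set_num_threads(2)  # also sets the MKL thread count
assert torch.get_num_threads() == 2
```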

Performance improvements

- Speed up CPU ``nn.EmbeddingBag``, making training overall 30% faster [5433](
- Move ``nn.MarginRankingLoss``, `nn.CosineEmbeddingLoss`, `nn.HingeEmbeddingLoss`, and `nn.TripletMarginLoss` from Python to our ATen backend, resulting in up to 3x performance gains in some cases.
[5346](,  [5646](, [5080](, [5680](
- Implement ``pin_memory()`` as a NativeFunction [4094](
- Save ``self.numel()`` for backward computation instead of ``self`` to save memory [5747](
- Rearrange dimensions for pointwise operations for up to 10x better performance in one case. [4174](
- Vectorize `normal_` for a 5-6x speed up in a small case [4312](
- Allowing usage of GPU Direct within PyTorch for the Broadcast operation [4183](
- Speed-up ``nn.Linear`` for the 3D input case [5279](
- Speed up `Conv3D` on the CPU by parallelizing ``vol2col`` and ``col2vol`` [4824](
- Add AVX2 implementation for sigmoid function, showing around 10x speedup [5010](
- Use fast integer division algorithm to avoid division ops inside kernels. [5054](
- Improve occupancy for CUDA random number generation [5710](
- Add optimization to norm for common norms [5722](
- Add a fast fused GLU backward [5782](
- Optimize unique sorting by using ``std::vector+sort`` instead of ``std::set``, giving up to 5x speedup. [5913](
- Speed up sum over a dimension [6026](
- Enable MKLDNN convolution forward and backward. [6062](
- Parallelize non-contiguous point-wise operations with OpenMP [2764](
- Add cudnn Tensor Core ops to RNNs for Volta [3409](
- Vectorize ``exp``, ``log``, ``sin``, ``cos`` [6078](
- Reuse intermediate results over multiple backwards grad_inputs [3526](

- DistributedDataParallel: ~10% NCCL backend perf improvement, with mixed-precision support [5064](
- Slightly improve DistributedDataParallel (single-GPU binding) multi-process distributed training performance [4870](

Bug fixes

torch operators
- Improve ``torch.digamma`` precision near poles [6517](
- Fix incorrect behavior of ``Tensor.random_`` on negative inputs [6463](
- Fix undefined behavior in backward pass for ``tensor.permute(dims)`` with negative dims [5945](
- Fix integer overflow in ``torch.remainder`` operator (it would break with a divisor above ``2**48``) [5906](
- Fix memory leak in ``torch.bmm`` [5744](
- Make dimension checker of `scatter_add_` consistent with `scatter_`'s [5659](
- Fix CPU ``torch.multinomial`` with noncontiguous probability tensor input (previously, it would overwrite input data) [5093](
- Fix CUDA ``torch.multinomial`` using incorrect strides and being able to select zero-probability events. [5774](, [5238](
- Support empty index tensor for ``index_select`` [3429](
- Support empty indices tensor in CUDA ``Tensor.put_`` [4486](
- Improve stability of ```` with empty tensors [3602](, [5971](, [5819](
- Fix ``torch.fft`` in the case where any of the input dimensions is not aligned [6118](
- Improve the CUDA btrifact error message [5644](
- Return zeros for eigenvector tensor when not requested in ``torch.symeig`` [3411](
- Fix ``torch.btrifact`` on tensors. [4318](
- Fix ``torch.pstrf`` on tensors. [4883](
- Fix memory leak in `torch.median` [6889](
- Fix SVD backward on non-square matrices when `some=False` [6870](

- Detect re-initialization of ``_C`` shared library that would often result in segfaults on exit [6232](
- Fix indexing with all zero ByteTensors [3926](
- Only allow dense floating-point types as the default tensor type. [5674](
- Initialize CUDA before setting CUDA tensor types as default to prevent crash [4788](
- Fix a bug where ``from_dlpack`` fails if CUDA is not initialized. [4182](
- Fix crash in creating a CUDA tensor with a numpy array [5850](
- Fix broken sharing of empty tensor in multiprocessing on some OSes [6229](

- Restore allow_unused functionality: throw error when differentiated input is unused or unreachable. [6553](
- Fix ``output_nr`` not being incremented correctly. This caused crashes in the backward pass of operations that don't ``requires_grad`` on some inputs. [4812](
- Fix nvprof parsing in the ``torch.autograd.profiler`` [5840](

nn layers
- Support only specifying size in certain dimension for adaptive pooling [3127](
- Fix reflection padding boundary checks to not cause invalid memory access [6438](
- Improve error messages for ``NLLLoss``. [5299](, [6072](
- Fix ``kl_div`` backward on CUDA. Previously it would not respect ``gradOutput`` when computing ``gradInput``. [5814](
- Fix incorrect ``bias`` size assert for ``Linear`` [5992](
- Fix incorrect ``nn.functional.convNd`` and ``nn.functional.conv_transposeNd`` error message [5701](
- Check that shape for input and target matches instead of number of elements for some loss functions [5085](
- Fix ``torch.diag`` backward returning square grad with non-square input [4538](
- Fix convolution type mismatch error message [5815](
- Add ``align_corners`` option to linearly interpolating upsampling and make the default upsampling behavior more consistent with other frameworks [5927](
- Prevent numerical issues with ``poisson_nll_loss`` when log_input=False [3336](

- Ensure convolution weights are contiguous to fix CUDA ``ConvTranspose`` double backward [4543](
- Fix CUDA double backwards [4460](

- Fix embedding with ``sparse=True`` [4686](
- Fix sparse embedding backward when input contains only ``padding_idx`` [6211](
- Handle copying empty sparse tensors to/from CPU, GPU. [5361](

- Add argument checks to the  ```` classes, fixing a bug where ``DataLoader`` tries to load the entire dataset on non-integer ``batch_size``. [6249](
- Set ``dataloader.batch_size = None`` when batch_sampler is given, fixing a bug where ``DataLoader`` would report ``batch_size`` as ``1``. [6108](
- Improve signal handling in ``DataLoader`` [4643](
- Ignore ``FileNotFoundError`` when shutting down [5380](
- Make preprocessing deterministic [4640](

- Cast tensors when loading optimizer state dicts to improve usability [3658](
- List model parameters in deterministic order to improve stability of ``load_state_dict()`` [6031](
- Add parameter range checks for all optimizers [6000](
- Fix ``AMSGrad`` mode for ``SparseAdam`` [4314](

distributed and multi-gpu
- Fix a number of distributed training errors caused by a detach in place error [5829](
- Don't modify requires_grad when running DataParallel in no_grad mode [5880](
- Add GPU guard for ``broadcast_coalesce`` for Distributed Data Parallel stability [5655](


Table of Contents

* **Highlights**
* Brand New Distributed Package
* C++ Frontend [API Unstable]
* Torch Hub
* **Breaking Changes**
* **Additional New Features**
* N-dimensional empty tensors
* New Operators
* New Distributions
* Sparse API Improvements
* Additions to existing Operators and Distributions
* **Bug Fixes**
* Serious
* Backwards Compatibility
* Correctness
* Error checking
* Miscellaneous
* **Other Improvements**
* **Deprecations**
* CPP Extensions
* **Performance**
* **Documentation Improvements**



The JIT is a set of compiler tools for bridging the gap between research in PyTorch
and production. It allows for the creation of models that can run without a dependency on the Python interpreter and which can be optimized more aggressively. Using program annotations, existing models can be transformed into Torch Script, a subset of Python that PyTorch can run directly. Model code is still valid Python code and can be debugged with the standard Python toolchain. PyTorch 1.0 provides two ways in which you can make your existing code compatible with the JIT: `torch.jit.trace` or `torch.jit.script`. Once annotated, Torch Script code can be aggressively optimized and serialized for later use in our new C++ API, which doesn't depend on Python at all.

Write in Python, run anywhere!
def RNN(x, h, W_h, U_h, b_h):
    y = []
    for t in range(x.size(0)):
        h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
        y += [h]
    return torch.stack(y), h
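A hedged sketch of compiling that loop with the script compiler (the shapes and matmul layout here are made up for the example):

```python
import torch

@torch.jit.script
def rnn(x, h, W_h, U_h, b_h):
    y = []
    for t in range(x.size(0)):
        h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
        y.append(h)
    return torch.stack(y), h

x, h = torch.randn(3, 2, 4), torch.zeros(2, 4)
W = torch.randn(4, 4)
y, h_out = rnn(x, h, W, W, torch.zeros(4))
assert tuple(y.shape) == (3, 2, 4)  # (time, batch, hidden)
```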

As an example, see a tutorial on [deploying a seq2seq model](,
[loading an exported model from C++](, or [browse the docs](

Brand New Distributed Package

The [torch.distributed]( package and [torch.nn.parallel.DistributedDataParallel]( module are backed by a brand new re-designed distributed library.  The main highlights of the new library are:
* New `torch.distributed` is performance driven and operates entirely asynchronously for all backends: `Gloo`, `NCCL`, and `MPI`.
* Significant Distributed Data Parallel performance improvements especially for hosts with slower networks such as ethernet-based hosts
* Adds async support for all distributed collective operations in the [torch.distributed]( package.
* Adds the following CPU ops in the Gloo backend: [send](, [recv](, [reduce](, [all_gather](, [gather](, [scatter](
* Adds [barrier]( op in the NCCL backend
* Adds [new_group]( support for the NCCL backend
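A single-process sketch of the async collectives (the address and port are placeholders):

```python
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.ones(3)
work = dist.all_reduce(t, async_op=True)  # returns a work handle immediately
work.wait()                               # block until the reduction finishes
dist.destroy_process_group()
assert torch.equal(t, torch.ones(3))      # world_size=1: reduce is identity
```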

C++ Frontend _**[API Unstable]**_.

The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high performance, low latency and bare metal C++ applications. It provides equivalents to `torch.nn`, `torch.optim`, `` and other components of the Python frontend. Here is a minimal side-by-side comparison of the two language frontends:

<p align="center">
<table align="center">
<tr valign="top">
<td><sub><pre lang="python">
import torch
model = torch.nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
prediction = model.forward(torch.randn(3, 5))
loss = torch.nn.functional.mse_loss(prediction, torch.ones(3, 1))
<td><sub><pre lang="cpp">
#include &lt;torch/torch.h&gt;
torch::nn::Linear model(5, 1);


**This is a pre-release preview, do not rely on the tag to have a fixed set of commits, or rely on the tag for anything practical / important**

Table of Contents

* [Highlights](highlights)
* [JIT](jit)
* [torch.distributed new "C10D" library](torchdistributed-new-c10d-library)
* [C++ Frontend [API Unstable]](c-frontend-api-unstable)
* [Breaking Changes](breaking-changes)
* [Additional New Features](additional-new-features)
* [N-dimensional empty tensors](n-dimensional-empty-tensors)
* [New Operators](new-operators)
* [New Distributions](new-distributions)
* [Additions to existing Operators and Distributions](additions-to-existing-operators-and-distributions)
* [Bug Fixes](bug-fixes)
* [Serious](serious)
* [Backwards Compatibility](backwards-compatibility)
* [Correctness](correctness)
* [Error checking](error-checking)
* [Miscellaneous](miscellaneous)
* [Other Improvements](other-improvements)
* [Deprecations](deprecations)
* [CPP Extensions](cpp-extensions)
* [Performance](performance)
* [Documentation Improvements](documentation-improvements)



The JIT is a set of compiler tools for bridging the gap between research in PyTorch
and production. It includes a language called Torch Script (don't worry it is a subset of Python,
so you'll still be writing Python), and two ways in which you can make your existing code compatible with the JIT.
Torch Script code can be aggressively optimized and it can be serialized for later use in our new C++ API, which doesn't depend on Python at all.

Write in Python, run anywhere!
def RNN(x, h, W_h, U_h, b_h):
    y = []
    for t in range(x.size(0)):
        h = torch.tanh(x[t] @ W_h + h @ U_h + b_h)
        y += [h]
    return torch.stack(y), h

As an example, see a tutorial on [deploying a seq2seq model](,
[loading an exported model from C++](, or [browse the docs](

torch.distributed new "C10D" library

The [torch.distributed]( package and [torch.nn.parallel.DistributedDataParallel]( module are backed by the new "C10D" library.  The main highlights of the new library are:
* C10D is performance driven and operates entirely asynchronously for all backends: `Gloo`, `NCCL`, and `MPI`.
* Significant Distributed Data Parallel performance improvements especially for hosts with slower networks such as ethernet-based hosts
* Adds async support for all distributed collective operations in the [torch.distributed]( package.
* Adds [send]( and [recv]( support in the Gloo backend

C++ Frontend _**[API Unstable]**_.

The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high performance, low latency and bare metal C++ applications. It provides equivalents to `torch.nn`, `torch.optim`, `` and other components of the Python frontend. Here is a minimal side-by-side comparison of the two language frontends:

<p align="center">
<table align="center">
<tr valign="top">
<td><sub><pre lang="python">
import torch
model = torch.nn.Linear(5, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
prediction = model.forward(torch.randn(3, 5))
loss = torch.nn.functional.mse_loss(prediction, torch.ones(3, 1))
<td><sub><pre lang="cpp">
#include &lt;torch/torch.h&gt;
torch::nn::Linear model(5, 1);









* Add `allgather_coalesced` API to `ProcessGroup` ([28634,]([29059](
* Add `abort` API in `ProcessGroupGloo` Send/Recv Work ([29928](
* Add `--no_python` flag to allow using a bash script wrapper in the launch command ([29144](

RPC [Experimental]

`torch.distributed.rpc` is a newly introduced package. It contains basic building blocks to run functions remotely in model training and inference, which will be useful for scenarios like distributed model parallel or implementing parameter server frameworks. More specifically, it contains four pillars: RPC, Remote Reference, Distributed Autograd, and Distributed Optimizer. Please refer to the [documentation]( and the [tutorial]( for more details.

* Add `rpc_sync` and `rpc_async` for builtin operators and Python user functions ([23228](, [23569](, [28392](
* Add `remote` and `RRef` for builtin operators and Python user functions ([25169](, [25499](
* Distributed Autograd - FAST mode backward pass implementation. ([27022](, [27576](
* Integrate `remote` and `RRef` with distributed autograd ([28630](, [28656](
* Add a distributed optimizer ([29304](, [30062](
* Add python API for `get_gradients()` method to retrieve gradients from distributed autograd context. ([28926](
* Support creating local `RRef`s on local values and to-self `remote` calls ([28948](, [29634](
* Support custom pickler for RPC ([30185](
* Add default RPC agent options based on the backend type ([30201](
* Add local `shutdown` to `ProcessGroup` agent ([30330](


* `script::Module`: implement more of the nn.Module API ([28828](
* In particular, adds the (optionally recursive) methods that iterate over submodules, parameters, etc.
* Adds a pybind-like `attr()` method to simplify attribute access.
* Add support for `staticmethod` on `ScriptModule`s ([27163](
* Support Module Containers as Iterables ([26465](
* Support Iterables In List Comprehensions ([26768)](
* Dictionaries now preserve insertion order, and `OrderedDict` is supported ([26465](
* Add support for `hasattr()` ([29332](
* TorchScript classes can now be callable ([26743](
* Add `clone_instance` for `ScriptModule`s ([30168](
* Add `torch.memory_format` support to the TorchScript ([28544](
* Custom `forward()` is now allowed on container modules ([28988](
* Calls to submodules are now preserved in the traced graph ([29261](
* Add support for module containers to be used as iterables ([28255](
* Make JIT Serialization support arbitrary std::function<> IO ([28039](
* Support `layout()` in script ([27100](
* Methods and functions are no longer inlined in the serialized file format ([26706](


* Build level customization
* Add custom build script to only include selected operators ([30144](
* Dump operator names used by a script module ([29374](, [30467](
* Disable JIT optimizer in Android wrapper for mobile custom build ([30285](
* FBJNI Gradle ABI_FILTERS parameter ([30135](




* Add timeout support in `ProcessGroupNCCL` ([27224](
* Ensure that DDP wrapped module has parameters that require gradients ([25858](
* Making `torch/csrc/cuda` NCCL usage safe for NCCL 2.5 ([29014](
* Enable `test_distributed` for ROCm but only with NCCL backend ([28814](

RPC Improvements

* Separate out RPC to `rpc_sync` and `rpc_async` APIs ([26570](
* Make python user function serialization format to be consistent with builtin operators ([27136](
* Clean up distributed autograd context on all participants on exit ([27951](
* Improve error handling for distributed autograd engine. ([27940](
* Scope pybind11 functions to `torch.distributed.{autograd,rpc}` ([27529](
* Lift `rpc_timeout` to `RpcAgent` to make it reusable for other `RpcAgent` implementations. ([29341](
* Support sending message to self in `process_group_agent` ([29253](
* Properly shutdown RPC even in the case of `clean_shutdown=False`. ([29148](
* Ensure `initializedContextIds_` map is cleaned up appropriately in distributed autograd engine. ([29787](
* Add hash and equality operators for `WorkerInfo` ([29958](
* Add `RpcAgentOptions` struct type to bundle arguments for different `RpcAgent`s ([29972](
* Mark timeout `FutureMessage`s and throw exceptions in `ProcessGroupAgent` ([29601](
* Re-throw python remote exception when using remote reference to itself ([29930](
* By default ignore `RRef` leaks during shutdown ([30217](


* Add Design doc for Distributed Autograd Engine ([29175](, [30068](, [29927](
* Add Design doc for Remote Reference ([30066](
* Add documentation page for `torch.distributed.rpc` ([29276](, [28030](, [29971](, [30160](, [30050](, [30069](, [30179](, [30218](, [30240](, [30243](, [30259](


* Add known worker IDs to distributed autograd context ([26324](
* Minor tweaks to RPC message API ([28326](
* Rename `PythonUDF{Call,Resp}` ([27530](
* Use `std::shared_ptr` for `DistAutogradContext` ([29770](
* Mark `c10d::~NCCLUtils` as noexcept ([29118](


* Move custom passes to last optimization step ([29256](
* Represent the original Python name of a module type the same way in traced and scripted modules. ([29912](
* Only print original SourceRange on highlight ([29708](
* Error message and ergonomic improvements:
* Show full call stack in TorchScript exception even when calls were inlined. ([29911](
* Reduce error context from 10 -> 3 ([26765](
* Fix error report highlight for unmatched type annotation ([27195](
* Make default string arguments in schemas human readable ([27088](
* Print which output didn't have dependence during trace checking. ([29047](
* Improvements to save/load and serialization performance:
* Modules can now share JIT types if their implementation is the same, improving save/load performance ([26666](
* Improve float pickling speed. ([28553](
* Pickler: convert `std::stringstream` cases for improved performance. ([29351](
* Buffer to speed Unpickler ([27727](
* Buffer in Pickler to improve performance. ([27720](
* In `torch::save()` avoid zip compressing small header records. ([28180](
* String optimizations related to serialization. ([28230](
* Clean up serialized source format ([28129](
* API for finding a common ancestor block for a pair of nodes ([28864](
* Make inserted child module names unique ([27237](
* Better hashing for constant pool ([27733](
* Improve error messages when a method or attribute is missing ([27110](
* Display original source range in `Node::print` ([27524](
* Always use the closure to resolve variable names ([27515](


* Improve Java API / JNI
* Add module method to allow explicitly destructing native part ([27090](
* Add methods to write image tensor content to buffer ([27359](
* Various improvements to Android API ([27454](, [27455](
* Add support for PyTorch JNI build ([29412](, [42faf961c8](, [d22f61432d](
* Various fixes to PyTorch JNI ([29350](, [29861](, [30206](, [30207](
* Improve support for older Android NDK
* Introduce math_compat.h for older Android versions ([28567](
* Define std::strtoll for older Android ([28603](
* Improve error message, documentation, debuggability
* Enable full error message for mobile builds ([29926](
* Update iOS ([27145](
* Update Android ([28533](
* Rename function parameters to avoid [-Werror,-Wshadow] ([30276](
* Fix exception message in Java Tensor ([30776](
* Improve support for benchmark and profiling
* Add Android and iOS test app for benchmark and profiling ([28405](, [28406](, [28469](, [28622](
* Integration with mobile benchmark in PEP ([28437](
* Subscribe for record function and if android do atrace ([28708](
* Improve build / CI
* Improve Android Gradle build and publishing ([26833](, [27389](, [29262](, [29738](
* Misc fixes to the Android test project ([27453](
* Improve XCode build script ([27358](, [28996](, [29002](
* Add testing code to iOS CI jobs ([27593](, [27594](, [27784](, [30133](
* Misc fixes to the iOS TestApp ([27591](, [28356](, [28809](, [29247](, [29962](, [29963](
* Add support for host build to pytorch_android ([27662,]([27664](
* Add host build Gradle publishing ([29749](
* Add mobile build CI with host toolchain ([30292](

Named Tensors

* `torch.addcdiv`, `torch.addcmul` Added named tensor support ([28975](
* `torch.{ones,zeros,full,rand,randn}_like` Added named tensor support ([28981](
* `torch.cdist` Added named tensor support ([29129](
* `torch.equal` Added named tensor support ([29322](
* Added named tensor support for comparison ops ([27162](
* `Tensor.align_to` Fixed error message ([27221](
* `Tensor.align_to` Make method-only. ([27304](
* `Tensor.align_to` Accept partially named tensors ([27308](
* `torch.mean(Tensor, Dimname)` Fixed autograd support ([29199](
* `Tensor.unflatten` Fix when dim is a negative integer (31208) ([31432](
* Fix type errors in examples about Named Tensor ([27828](
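A small sketch of the factory support listed above (the names are illustrative):

```python
import torch

t = torch.zeros(2, 3, names=("N", "C"))
u = torch.ones_like(t)  # *_like factories now propagate names
assert tuple(u.names) == ("N", "C")
```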


New torch::nn modules

* Convolution layers
* torch::nn::ConvTranspose{1,2,3}d / Unfold ([29721]( ([27809](
* Pooling layers
* torch::nn::AdaptiveAvgPool{1, 2, 3}d / MaxUnpool{1, 2, 3}d / LPPool{1, 2}d / FractionalMaxPool{2,3}d ([26808](, [26818](, [26819]( ([26896](, [26915](, [27027]( ([27800](, [28492](, [29584]( ([29933](
* Loss layers
* torch::nn::HingeEmbeddingLoss / CosineEmbeddingLoss /MultiMarginLoss ([27101]( ([27345]( ([27424]( ([27770](
* torch::nn::TripletMarginLoss / SoftMarginloss / MultiLabelMargin / MarginRankingLoss / MultiLabelSoftMarginLoss ([27713](, [27956]( ([27660]( ([27659]( ([29000]( ([27669](
* torch::nn::MSELoss / KLDivLoss / BCELoss / SmoothL1Loss / PoissonNLLLoss / BCEWithLogitsLoss ([27156]( ([28806]( ([30146]( ([27661]( ([28755]( ([28783](
* torch::nn::NLLLoss / CrossEntropyLoss / CTCLoss ([29812]( ([28654](
* Normalization Layers
* torch::nn::LayerNorm / InstanceNorm{1,2,3}d / BatchNorm{1,2,3}d / GroupNorm / LocalResponseNorm / CrossMapLRN2d ([28032]( ([28790]( ([28176](, [28936]( ([29920]( ([28759]( ([29039](
* Activation Layers
* torch::nn::ELU / LeakyReLU / SELU / PReLU / ReLU / ReLU6 / RRelu / CELU / GLU ([27028)]( ([27059]( ([27434]( ([27429]( ([27435]( ([27436]( ([27437]( ([27487]( ([29922](
* torch::nn::Sigmoid / LogSigmoid / LogSoftmax / Softmax / Softmax2d / Softplus / Softmin / Softsign / Softshrink / Hardshrink / Hardtanh / Tanh / Threshold ([27488]( ([27060]( ([27462]( ([27446]( ([27509]( ([27489]( ([27459]( ([27535]( ([27534]( ([27035]( ([27537]( ([27038]( ([27536]( ([27538](
* Dropout Layers
* torch::nn::Dropout / Dropout{2, 3}d / AlphaDropout / FeatureAlphaDropout ([29761]( ([28424](
* Padding Layers
* torch::nn::ReflectionPad{1, 2}d / ReplicationPad{1,2,3}d / ZeroPad2d / ConstantPad{1,2,3}d ([28538]( ([28539]( ([28540]( ([28541](
* Embedding layers
* torch::nn::Embedding / EmbeddingBag ([26358](
* Linear layers
* torch::nn::Bilinear / Flatten ([26082]( ([28072](
* Vision layers
* torch::nn::Upsample / PixelShuffle ([28413]( ([28140](

New torch::nn::functional functions

* Convolution functions
* torch::nn::functional::conv{1,2,3}d / conv_transpose{1,2,3}d / fold / unfold ([28917]( ([29721]( ([28732]( ([27809](
* Pooling functions
* torch::nn::functional::adaptive_avg_pool{1, 2, 3}d / lp_pool{1, 2}d / fractional_max_pool{2, 3}d / fractional_max_pool{2, 3}d_with_indices ([26808](, [26818](, [26819]( ([27800](, [28492]( ([29584]( ([29933](
* Loss functions
* torch::nn::functional::hinge_embedding_loss / multi_margin_loss / multilabel_soft_margin_loss / triplet_margin_loss / soft_margin_loss / margin_ranking_loss ([27101]( ([27424]( ([27669]( ([27713]( ([27660]( ([29000](
* torch::nn::functional::poisson_nll_loss / nll_loss / cross_entropy / binary_cross_entropy_with_logits ([28755]( ([29812]( ([28783](
* torch::nn::functional::l1_loss / kl_div / mse_loss / binary_cross_entropy / smooth_l1_loss / ctc_loss ([27156]( ([28806]( ([30146]( ([27661]( ([28654](
* Normalization functions
* torch::nn::functional::layer_norm / instance_norm / clip_grad_norm_ / batch_norm / group_norm / local_response_norm / normalize ([28032]( ([28790](, [30684]( ([26140](, [29584](, [30216]( ([28176](, [28936]( ([29920]( ([28759]( ([27280](
* Activation functions
* torch::nn::functional::elu / leaky_relu / selu / prelu / relu / relu6 / rrelu / celu / glu / gelu ([27028]( ([27059]( ([27434]( ([27429]( ([27435]( ([27436]( ([27437]( ([27487]( ([29922]( ([28433](
* torch::nn::functional::log_sigmoid / log_softmax / softmax / softplus / softmin / softsign / softshrink / hardshrink / tanhshrink / hardtanh / gumbel_softmax / threshold ([27060]( ([27462]( ([27446]( ([27489]( ([27459]( ([27535]( ([27534]( ([27035]( ([27537]( ([27038]( ([28121]( ([27538](
* Embedding functions
* torch::nn::functional::embedding  / embedding_bag / one_hot ([28669]( ([29673]( ([27177](
* Linear functions
* torch::nn::functional::linear / bilinear ([27382]( ([26082](
* Padding functions
* torch::nn::functional::pad ([26601](, [28760](
* Vision functions
* torch::nn::functional::affine_grid / grid_sample / interpolate / pixel_shuffle ([27263]( ([28354]( ([28413]( ([28140](
* Distance functions
* torch::nn::functional::pdist ([27122](
* Utility functions
* torch::nn::utils::clip_grad_value_ / parameters_to_vector / vector_to_parameters ([28736](, [29584]( ([30216]( ([29267](

AMD Support

* New features integration
* Enabled RCCL Integration ([23884](, [27383](, [27518](, [29385](
* Enabled rocTX and rocTracer Integration ([27416](
* Improved hiprtc integration ([27390](
* bfloat16 enablement (initial) on ROCm ([27719](
* Build/CI
* Upgrade to ROCm 2.9 ([27417](
* Upgrade ROCm CI to Python3.6 ([30119](, [27353](
* Distribute hipify scripts as part of torch package ([27425](
* Build and test gfx908 architecture ([27388](
* Add `torch.version.hip` ([29815](
* Build fixes ([29547](, [29009](


In PyTorch 1.4, we have mainly focused on expanding the coverage for ONNX Opset 11 and on enabling the export of torchvision models. Most of the torchvision models can be exported to ONNX (Opset 11, with fixed input size), including FasterRCNN, MaskRCNN, and KeypointRCNN. We have also enhanced export support for some tensor indexing scenarios, with more enhancements to come in the next release. In addition, 20+ new PyTorch operators are enabled in the ONNX exporter.

Expanding Coverage for ONNX Opset 11

* `torch.sort/torch.topk` are supported in Opset 11 ([25739](
* `torch.size/torch.squeeze/torch.unsqueeze/` are supported in Opset 11 ([27578](
* `torch.masked_select/torch.masked_scatter` are supported in Opset 11 ([25949](
* `torch.arange` is supported in Opset 11 ([26875](
* `avg_pool, constant_pad_nd, reflection_pad, replication_pad` Support enhanced in Opset 11 ([28225](
* `torch.hardtanh` is supported in Opset 11 ([30169](
* Enable ONNX constant folding for opset 11 ([29011](

Exporting More Torch Operators/Models to ONNX

* `torch.remainder` is enabled in exporter ([24410](
* `torch.unfold` is enabled in exporter ([24970](
* `torch.slice/` with negative index are enabled in exporter ([25273](, [26549](
* `torch.ones/torch.ones_like/torch.zeros/torch.zeros_like/torch.full/torch.full_like` with default dtype are enabled in exporter ([27577](
* `torch.unbind` is enabled in exporter ([27247](
* `torch.nn.functional.interpolate` export is enhanced ([27179](, [27566](, [28560](, [29489](
* `torch.det` is enabled in exporter ([26958](
* `torch.group_norm` is enabled in exporter ([27071](
* `torch.meshgrid` is enabled in exporter ([26037](
* `torch.randn/torch.randn_like` are enabled in exporter ([28470](, [29354](
* `torch.weight_norm` enabled in exporter ([28618](
* `torch.scalar_tensor` is enabled in exporter ([28713](
* `torch.logdet` is enabled in exporter ([29767](
* `torch.batch_norm` 2D with affine=False is enabled in exporter ([29458](
* `torch.bitshift` is enabled in exporter ([28210](

Enhancing Export/Test Infra

* Use deepcopy inputs in ONNX ORT test cases ([27186](
* Return NotImplemented from all binary math ops ([27423](
* Disable ONNX IR v4 semantics for opset 8 or lower ([28990](
* Add ONNX tests for torchvision models ([30121](
* Keep output type information while exporting ONNX graph ([25906](


Quantization updates include a mix of bug fixes and feature improvements, with the latter expanding operator coverage and improving performance. We have also made significant progress toward graph mode quantization support.

* Feature improvements:
* Enabling intra-op parallelism ([26692](
* Enabling inplace relu ([28710](
* Quantized Tensor support copy ([28612](
* Add quantized torch mean implementation ([27675](
* Add quantized avg_pool2d for pytorch mobile ([27631](
* Add nn.quantized.Conv3d ([29813](
* Adding inplace quantized relu6 ([29245](
* Fast histogram observer ([29790](
* PackedSequence support for quantized LSTM ([29585](
* Improve legacy QuantizedLinear functions to reduce overhead ([29773](
* Add support for quantized operator conversion from PT to C2 via ONNX ([29694](
* Enable per-channel dynamic quantization ([30122](
* Scripting support:
* Make PerChannelMinMaxObserver scriptable using `torch.jit.ignore` ([29416](
* Make HistogramObserver scriptable with `torch.jit.ignore` ([27950](
* Fix tracing for dynamic quantized LSTM ([29331](
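For illustration, the dynamic-quantization entry point touched by several of the items above can be exercised as follows (the tiny model is an assumption for this sketch):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()

# Swap nn.Linear modules for dynamically quantized versions:
# weights are stored in int8, activations are quantized on the fly.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = qmodel(torch.randn(2, 8))
```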


* Fixed graph visualization: displaying proper names after recent JIT changes ([30244](
* Support logging embedding for TensorBoard visualizations to generic filesystem ([27716](

Other Improvements

* `torch.argmax/argmin` Allow half type ([28787](
* `torch.cuda.memory_stats / memory_summary` instrumentation for CUDA memory allocator ([27361](
* `torch.set_num_threads` Allow calling multiple times with TBB ([27190](
* `torch.set_num_threads` Allow calling multiple times in parallel native ([27947](
* `torch.logical_xor` Allow non-bool tensors ([27248](
* `torch.promote_types` Nicer error message. ([27941](
* `torch.batch_norm_elemt` Add an out-variant ([27621](
* `torch.lerp` Implement derivative with respect to weight ([28219](
* `torch.complex32` Add type promotion support ([27929](
* `torch.unique` Support bool tensors ([28374](
* `torch.reshape` Improve backward for viewable geometries ([28901](
* `` Generalized factorization ([28608](
* `torch.equal` Add the intra-op parallelism ([28810](
* `torch.randint` Accept generator=None ([29748](
* `torch.bfloat16` Enabled for cuda ([27259](
* `torch.multinomial` Enable for torch.half ([29266](
* `nn.RNN` Respect the current stream in cudnn ([27026](
* `nn.RNN` Preserve nonlinearity attribute ([28058](
* `nn.Linear` Support 0-batch size. ([27211](
* `nn.functional.binary_cross_entropy` implement double backwards ([26983](
* `nn.AdaptiveAvgPool2d` Add support for NHWC memory format ([24396](
* `nn.GELU` Add GELU activation ([28944](
* `nn.LayerNorm` Handle batch size of zero ([28614](
* `nn.BatchNorm` Add NHWC support on cudnn ([23861](
* `nn.BatchNorm2d` support torch.channels_last ([28982](
* `nn.BatchNorm2d` Handle empty inputs ([30035](
* `nn.LayerNorm` Enable the intra-op parallelism ([28464](
* `nn.utils.prune` Add pruning functionality ([24076](
* `nn.Sequential` Make iterable ([28987](
* `dtype.is_signed` Ability to differentiate signed dtypes ([29511](
* `optim.lr_scheduler.MultiplicativeLR` Add new multiplicative learning rate scheduler. ([27254](
* `cuda.comm.scatter, gather` Add channel-last support ([28077](
* `at::parallel_for` Choose number of OMP threads based on GRAIN_SIZE ([26963](
* Return NotImplemented from unsupported tensor arithmetic operators ([26507](
* Automatically select proper tqdm submodule ([27108](
* Pickle support for sparse tensors ([27062](
* Vectorized complex unary and binary op support. ([26500](
* Complex support for reduce and linpack ops on CPU ([27653](
* Complex support for compare and pointwise ops on CPU ([28735](
* Make PyTorch Python 3.8 compatible ([29302](
* Buffer python warning to avoid deadlocks ([26613](
* Use NNPACK for strided convolutions. ([29084](
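Two of the smaller improvements above can be checked directly from Python:

```python
import torch

# torch.promote_types reports the result dtype of mixed-dtype operations
assert torch.promote_types(torch.int32, torch.float32) == torch.float32

# dtype.is_signed differentiates signed from unsigned dtypes
assert torch.int8.is_signed
assert not torch.uint8.is_signed
```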

Bug Fixes


* Ensure NCCL error handling code is disabled for NCCL versions < 2.4 ([27124](
* Fix segmentation fault in `FileStore` with concurrent accesses. ([28812](
* Fix DDP incompatibility issue with `nn.MultiheadAttention` ([26826](


* Add `ProcessGroupAgent` termination detection algorithm ([26984](
* Fix pybind11 warnings in Python RPC handler implementation ([27284](
* Defer creating `ProcessGroupAgent` listener thread until contexts are initialized ([28013](
* Fix Python RPC handler exit crash ([27251](
* Fix distributed autograd initialization ([29069](
* Always include autograd context id in `rpc_*` / `remote` requests ([29781](
* Make `RRefContext` singleton leaky, deal with module destruct order race. ([30172](

C++ API Bug Fixes

* at::Tensor::requires_grad_ now supported ([26332](
* torch::isfinite now supported ([30083](
* torch::nn::modules_ordered_dict is deprecated ([28774](
* Add reset_parameters to torch::nn modules ([29832](
* Allow passing undefined Tensor to Module::register_parameter ([27948](
* Exclude undefined tensors in the result of Module::parameters() / named_parameters() / buffers() / named_buffers() ([30626](
* Include hierarchy information in C++ API loading error messages ([28499](
* Fix a bug: the C++ L-BFGS optimizer does not work properly if there are one or more registered tensors with no grad in the model ([27606](
* Use c10::variant-based enums for Nonlinearity and FanMode ([27933]( Support for `torch::nn::init::Nonlinearity` and `torch::nn::init::FanMode` will be removed in 1.5.


* Make dropout properly condition on training. ([29436](
* Fix aten::grad to return optional list ([29577](
* Fix `torch.arange` dtype
* Fix type sharing on loaded ScriptModules ([29826](
* Fix type sharing between traced modules ([29583](
* Check for mutable default parameters ([29833](
* Fix tracing of autograd functions ([29791](
* Check for unrolled loop in break & continue ([29474](
* Fix negative string indexing ([22700](
* Make jit.trace_module reentrant ([29411](
* Fix jit outplace tracing and reapply changes to _like operators. ([28839](
* Properly guard against inheritance on TorchScript classes ([28407](
* Fix when giving jit format warning about unsupported options ([28616](
* Fix handling of function attributes. ([28569](
* Fix pushLong() issue in pickler. ([28057](
* Fix broken name mangling ([27511](
* Fix segfault while printing value type for an error msg in emitListComprehension ([27261](
* Fix `toIValue` dict iteration ([26856](
* Fix race condition in Function::optimized_graph(). ([27012](
* Sanitize module names on legacy import ([27764](
* Python None should have its type inferred as NoneType ([26665](
* Properly set existing attributes under recursive script ([27514](


* Skip copy_same_type_transpose_ for quantized tensor ([29609](
* Add note that cuda quantization is not supported ([27829](
* Rename _intrinsic to intrinsic ([27194](
* Better error message for quantized dispatch ([28635](
* Update the misleading comments for zero_points and scale in dynamic quant linear module [1/2] ([28767](
* Avoid the misleading zero_point and scale [2/2] ([28827](
* Add the warning message for API with linear modules ([28766](
* Do not insert observers for empty sequential modules ([28384](
* Fix the padding issue of quantized average pool operator ([28260](


* Fix deadlock issues in ThreadPool ([29885](
* Disable ProfilingGraphExecutorImpl for mobile ([30067](

Other Bug fixes

* `torch.kthvalue` Fix CUDA shared memory out of bound access in findPattern ([28989](
* `` Fix source files not being saved ([28965](
* `torch.load` Fix OSError loading files larger than 2GB. ([27125](
* `torch.linspace` clearer error message for negative step sizes. ([28274](
* `torch.histc` Add range checks to avoid segfaults ([27712](
* `` Fix `thread_local` issue on CPU ([28546](
* `torch.max_pool2d` Limit tensor size to max CUDA grid size ([28931](
* `torch.renorm` Fix a memory leak in CUDA renorm. ([29873](
* `torch.index_add` Fix bug in atomicAdd on CUDA for some dtypes ([29231](
* `torch.addmm` Fix handling of empty tensors ([28613](
* `nn.CTCLoss` Fix incorrect gradient for large target sizes ([27460](
* `nn.functional.ctc_loss` Fix incorrect gradient on cudnn ([27039](
* `nn.Embedding` Fix incorrect gradient at padding_idx in the CUDA kernel. ([27731](

* `nn.LayerNorm` Fix an illegal memory access error ([28196](
* `nn.Conv2d` handle zero stride ([28784](
* `nn.PoissonNLLLoss` Fix incorrect result with `full=True` ([28637](
* `nn.AvgPool2d` fix an overflow for 2^31-1 sized inputs ([30793](
* `nn.RNNBase` Fix an issue with use of children of RNN third party device types ([28562](
* `nn.Upsample` Fix “invalid configuration argument” error ([28927](
* `nn.Upsample` Fix a CUDA launch config failure ([29016](
* `optim.lr_scheduler.OneCycleLR` Correctly handle div_factor parameter ([28217](
* `` Ensure all tensors are moved ([27245](
* `EventList.total_average` Fix a regression caused by missing __iadd__ ([27498](
* `Tensor.record_stream` Ensure stream is recorded for shifted view tensors ([27371](
* `torch.hub` Handle branch names containing a slash. ([27960](
* Fix error handling in Magma kernels ([29003](
* Fix avx for c++14 ([28207](
* Fix illegal memory access thread safety issue in sparse CUDA ([29426](
* `__cuda_array_interface__` Fix stride calculation ([31450](


**Python 2 support is deprecated and will not be supported in the 1.5 release.**

`torch.optim`: `Scheduler.step(epoch)` is now deprecated; use `Scheduler.step()` instead.  ([26432](

For example:

>>> for epoch in range(10):
>>>    optimizer.step()
>>>    scheduler.step(epoch)
DeprecationWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case:
warnings.warn(EPOCH_DEPRECATION_WARNING, DeprecationWarning)
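The chainable form simply drops the epoch argument. A minimal sketch of the new usage (the optimizer and scheduler choices here are illustrative):

```python
import torch

params = [torch.zeros(3, requires_grad=True)]
optimizer = torch.optim.SGD(params, lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for epoch in range(2):
    optimizer.step()
    scheduler.step()  # no epoch argument

# The learning rate halves each epoch: 0.1 -> 0.05 -> 0.025
lr = optimizer.param_groups[0]["lr"]
```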

**[C++]** C++11 is deprecated and will not be supported in the 1.5 release.

**[C++]** `Tensor::is_variable()` has been deprecated.  As noted in the **Backwards Incompatible Changes** section, the distinction between variable and non-variable has been eliminated, so this check is no longer meaningful.  Generally, `is_variable()` will now return true except in some special circumstances (see [29653]( for more details).  ([29653](

**[C++]** `torch::nn::modules_ordered_dict` has been deprecated.  It is generally no longer necessary and can just be removed.  ([28774](

`torch.jit.quantized` API has been deprecated in favor of  `torch.quantization.quantize_dynamic` ([28766](


A benchmark suite is available to easily measure the performance of operators with a range of input shapes. The generated benchmark data fully characterize the performance of operators in terms of execution time. For more details, see the benchmarks/operator_benchmark directory.

* `torch.nn.functional.threshold`, `torch.nn.functional.layer_norm`, `torch.cdist` Performance of threshold (CPU), layer norm (CUDA) and cdist operations was improved ([27155](, [27634](, [25799](
* `torch.Tensor.fill_` Performance for half and bfloat16 types on CPU was improved  ([28397](
* `torch.nn.MaxPool2d` implementation for channels_last format was added ([24872](
* There is a fast pass reducing the overheads of pointwise operations relying on TensorIterator under certain conditions (contiguous inputs, no broadcast) ([29180](
* Overhead of operations with scalars/number literals was reduced ([29915](
* In case of type promotion on the GPU, the values are converted on the fly, without explicit casting of the full tensor ([30018](
* reorder_dimensions in TensorIterator favors output write locality, improving overall performance when operating on discontiguous tensors ([28615](
* Float pickling speed was improved ([28553](
* GRAIN_SIZE for intra-op parallelization was unified between TH and ATen operations ([28770](
* `tensor.numel`  devirtualized, improving performance ([27294](



*Version 1.2:*

>>> torch.nn.BatchNorm2d(5).weight
Parameter containing:
tensor([1., 1., 1., 1., 1.], requires_grad=True)

A number of deprecated Linear Algebra operators have been removed ([22841](

| Removed        | Use Instead  |
| ------------- | ------------- |
| `btrifact`    | `lu` |
| `btrifact_with_info`      | `lu` with `get_infos=True`      |
| `btrisolve` | `lu_solve`     |
| `btriunpack` | `lu_unpack`    |
| `gesv` | `solve`     |
| `pstrf` | `cholesky`     |
| `potrf` | `cholesky`     |
| `potri` | `cholesky_inverse`     |
| `potrs` | `cholesky_solve`     |
| `trtrs` | `triangular_solve`     |

Sparse Tensors: Changing the sparsity of a Tensor through `.data` is no longer supported.  ([17072](

>>> x = torch.randn(2,3)
>>> = torch.sparse_coo_tensor((2, 3))
RuntimeError: Attempted to call `variable.set_data(tensor)`,
but `variable` and  `tensor` have incompatible tensor type.

Sparse Tensors: in-place shape modifications of Dense Tensor Constructor Arguments will no longer modify the Sparse Tensor itself ([20614](

*Version 1.1:*

>>> i = torch.tensor([[0, 1]])
>>> v = torch.ones(2)
>>> s = torch.sparse_coo_tensor(i, v)
>>> i.resize_(1, 1)
>>> v.resize_(1)

>>> s.coalesce().indices().shape
torch.Size([1, 1])

>>> s.coalesce().values().shape
torch.Size([1])

Notice `indices()` and `values()` reflect the resized tensor shapes.

*Version 1.2:*

>>> i = torch.tensor([[0, 1]])
>>> v = torch.ones(2)
>>> s = torch.sparse_coo_tensor(i, v)
>>> i.resize_(1, 1)
>>> v.resize_(1)

>>> s.coalesce().indices().shape
torch.Size([1, 2])

>>> s.coalesce().values().shape
torch.Size([2])

Notice `indices()` and `values()` reflect the original tensor shapes.

Sparse Tensors: Accumulating dense gradients into a sparse `.grad` will no longer retain Python object identity.  ([17072](

*Version 1.1:*

>>> m = torch.nn.Embedding(10, 3, sparse=True)
>>> m(torch.tensor([[1,2,4,5],[4,3,2,9]])).sum().backward()
>>> assert m.weight.grad.layout == torch.sparse_coo
>>> m_weight_grad_saved = m.weight.grad

# accumulate dense gradient into sparse .grad, change sparsity
>>> m.weight.sum().backward()
>>> assert m.weight.grad.layout == torch.strided
# m_weight_grad_saved still refers to the .grad of m's weight
# even though the sparsity has changed
>>> assert id(m_weight_grad_saved) == id(m.weight.grad)

*Version 1.2:*

>>> m = torch.nn.Embedding(10, 3, sparse=True)
>>> m(torch.tensor([[1,2,4,5],[4,3,2,9]])).sum().backward()
>>> assert m.weight.grad.layout == torch.sparse_coo
>>> m_weight_grad_saved = m.weight.grad

# accumulate dense gradient into sparse .grad, change sparsity
>>> m.weight.sum().backward()
>>> assert m.weight.grad.layout == torch.strided
# m_weight_grad_saved NO LONGER refers to the .grad of m's weight
>>> assert id(m_weight_grad_saved) == id(m.weight.grad)
AssertionError

`nn.utils.convert_sync_batchnorm` has been replaced with `nn.SyncBatchNorm.convert_sync_batchnorm` ([18787](

Example of new usage:

>>> # Network with nn.BatchNorm layer
>>> module = torch.nn.Sequential(
>>>     torch.nn.Linear(20, 100),
>>>     torch.nn.BatchNorm1d(100)
>>> ).cuda()
>>> # creating process group (optional)
>>> process_group = torch.distributed.new_group(process_ids)
>>> sync_bn_module = torch.nn.SyncBatchNorm.convert_sync_batchnorm(module, process_group)

Error Checking: `torch.addcmul` and `torch.lerp` operators enforce stronger shape requirements on the output tensor (`out=` keyword argument) and do not allow output tensor to be resized if it is also used as one of the inputs.

*Version 1.1:*

>>> x=torch.zeros(1)
>>> torch.addcmul(x, x, torch.zeros(2,3), out=x)
tensor([[0., 0., 0.],
[0., 0., 0.]])

*Version 1.2:*

>>> x=torch.zeros(1)
>>> torch.addcmul(x, x, torch.zeros(2,3), out=x)
RuntimeError: output with shape [1] doesn't match the broadcast shape [2, 3]

If you run into this error, please ensure the `out` parameter is of the correct output shape (post-broadcasting).

Error Checking: Improved Variable version tracking ([20391](, [22821](, [21865](

PyTorch’s autograd system uses a version tracking mechanism to ensure that Tensors that are saved for backwards computations retain their correct values when the backward pass is computed (i.e. that they haven’t been updated in-place since they were saved).  See [In Place Correctness Checks]( in the docs for more information.

In PyTorch 1.2 we have enhanced the version tracking in a number of cases, which may flag issues that were not caught previously.  There is now additional tracking through the `Variable()` constructor, the `nn.Parameter()` constructor, after setting `.data`, and via `nn.Module._apply` (internal API).

*Track changes through Variable constructor:*

>>> x = torch.ones(1, requires_grad=True)+1
>>> y = x*x

# do an in-place update through the Variable constructor
>>> torch.autograd.Variable(x).add_(1)
>>> y.backward()
RuntimeError: one of the variables needed for gradient computation has been modified
by an inplace operation: [torch.FloatTensor [1]] is at version 1; expected version 0

*Track changes on an nn.Parameter:*

>>> x = torch.ones(1)
>>> p = torch.nn.Parameter(x)
>>> y = p * p

# do an in-place update on a saved Parameter
>>> x.add_(1)
>>> y.sum().backward()
RuntimeError: one of the variables needed for gradient computation has been modified
by an inplace operation: [torch.FloatTensor [1]] is at version 1; expected version 0

*Track changes after setting `.data`:*

>>> x = torch.zeros(1, requires_grad=True)+1
>>> y = x * x
>>> = torch.zeros(1, requires_grad=True)+1

>>> x.add_(1)
>>> y.backward()
RuntimeError: one of the variables needed for gradient computation has been modified
by an inplace operation: [torch.FloatTensor [1]], which is output 0 of AddBackward0,
is at version 1; expected version 0 instead.

[JIT] Python called from scripted modules must be `ignore`d

`torch.jit.script` now recursively compiles everything it finds in the original function, so if you have Python functions called from within your scripted function or module, you must now explicitly `ignore` them. See [the new API guide]( for more details.

*Version 1.1*

def my_unscriptable_python_fn():
    weird_stuff()

@torch.jit.script
def fn():
    # This gets inserted as a Python call, and only errors on `save()`.
    my_unscriptable_python_fn()

*Version 1.2*

@torch.jit.ignore  # this needs to be added ...
def my_unscriptable_python_fn():
    weird_stuff()

@torch.jit.script
def fn():
    # ... or else recursive compilation will attempt to compile this call
    my_unscriptable_python_fn()

NOTE: This is also a change to the behavior of the `torch.jit.ignore` decorator. In version 1.1, `ignore` tells the compiler to omit compiling a function entirely, to mark Python functions that you know will not be called after export. In version 1.2, `ignore` tells the compiler to insert a call back to the Python interpreter instead of trying to compile the function.

To get the old behavior, use `torch.jit.ignore(drop_on_export=True)` (`torch.jit.ignore` with no arguments is equivalent to `torch.jit.ignore(drop_on_export=False)`).

[JIT] `optimize` for ScriptModules is now a context manager

Whether optimization passes are run is now a thread-local flag. This better reflects how optimization actually happens in the JIT (i.e. it is decided at runtime, not compilation time).

*Version 1.1*

@torch.jit.script(optimize=False)
def fn(inputs):
    ...

fn(inputs)

*Version 1.2*

@torch.jit.script
def fn(inputs):
    ...

with torch.jit.optimized_execution(False):
    fn(inputs)

[jit] `script::Module` is now a reference type

To better align with the [PyTorch C++ API philosophy](, `script::Module` and `script::Method` are now reference types. Our APIs have been updated to use `script::Module` instead of `std::shared_ptr<script::Module>`.

*Version 1.1*

using torch::jit::script::Module;

std::shared_ptr<Module> m = torch::jit::load("");

*Version 1.2*

using torch::jit::script::Module;

Module m = torch::jit::load("");

[C++ only] mean() / sum() / prod() APIs have changed slightly ([21088](

*Version 1.1 API*:

Tensor sum(IntArrayRef dim, bool keepdim=false) const;
Tensor sum(IntArrayRef dim, ScalarType dtype) const;

*Version 1.2 API*:

Tensor sum(IntArrayRef dim, bool keepdim=false,
c10::optional<ScalarType> dtype=c10::nullopt) const;

that is, to override `dtype`, `keepdim` must now be provided.

Binary distribution and nightly changes

We have streamlined our conda and wheel binary distributions, so that it is easier than ever to install the version of PyTorch appropriate for your needs. The install instructions on have been updated, but if you have tooling to download and install PyTorch, here is a detailed description of the changes we made:

**Wheels now have local version identifiers.** Wheels that are for non-default CUDA configurations (the default CUDA version for this release is 10.0) now have local version identifiers like +cpu and +cu92. This means that, when installing, it is no longer necessary to specify a full wheel URL—just specify an appropriate version constraint like `torch==1.2.0+cu92`.

*Version 1.1 (for Python 3.7 on Linux only):*

pip install numpy
pip install

*Version 1.2 (works for all versions of Python, and both Linux and Mac):*

pip install torch==1.2.0+cpu -f

**CPU-only binaries on conda can be selected with the cpuonly feature.** We’ve eliminated the pytorch-cpu conda package; instead, the CPU-only conda package can be enabled by installing the cpuonly metapackage. Similarly, there is no longer both a torchvision and torchvision-cpu package; the cpuonly feature will ensure that the CPU version of torchvision is selected.

*Version 1.1:*

conda install -c pytorch pytorch-cpu

*Version 1.2:*

conda install -c pytorch pytorch cpuonly

**Conda nightlies now live in the pytorch-nightly channel and no longer have “-nightly” in their name.** We have added a new dedicated channel for nightlies called pytorch-nightly; all nightlies (pytorch, torchvision, torchaudio, etc.) will now be uploaded to this channel, but with the same name as their corresponding stable versions (unlike before, when we had a separate pytorch-nightly, torchvision-nightly, etc. packages.) This makes it more difficult to accidentally install a copy of the nightly and stable at the same time.

*Version 1.1:*

conda install -c pytorch pytorch-nightly

*Version 1.2:*

conda install -c pytorch-nightly pytorch

**Wheel nightlies no longer have -nightly in their name.** Similar to the changes we made in Conda, we no longer suffix wheel nightlies with “-nightly”, to make it harder to accidentally install a copy of nightly and stable at the same time.

*Version 1.1:*

pip install --pre torch_nightly -f

*Version 1.2:*

pip install --pre torch -f

New Features

Tensor Type Support

* `torch.bool`: added support for many operators (masking, comparison, arithmetic operators) to achieve feature parity with `torch.uint8`.  See the **Breaking Changes** section for details about how this could affect existing programs. ([21032](, etc.)
* `torch.sparse.HalfTensor`: Added support for `torch.float16` sparse Tensors on both CPU and CUDA.  ([19695](
* `torch.bfloat16`: Added basic creation and serialization support for [Brain Floating Point Tensors]( ([21522](, [21523](, [21860](, [22852](
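For example, comparisons now produce `torch.bool` tensors that compose naturally as masks:

```python
import torch

t = torch.tensor([1, 2, 3, 4])
mask = t > 2                      # a torch.bool tensor
assert mask.dtype == torch.bool
assert t[mask].tolist() == [3, 4]

# bool masks compose with logical operators
both = (t > 1) & (t < 4)
assert t[both].tolist() == [2, 3]
```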

NN Package

* `nn.Transformer`: added implementation of Transformer from [Attention is All You Need]( ([20170](, [22588](
* `nn.Embedding`: support `float16` embeddings on CUDA.  ([19695](
* `nn.Flatten`: added a Module that performs `torch.flatten`. ([22245](
* `nn.functional.gelu`: Added support for [Gaussian Error Linear Units]( ([20665](, [21237](
* `nn.Module hooks`: add ability to replace input/output via `forward_pre_hook` and `forward_hook`. ([22285](
* `nn.Module`: add `requires_grad_() `method for turning on/off `requires_grad` for Module parameters. ([22576](
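A quick sketch of two of the additions, `nn.Flatten` and `Module.requires_grad_` (the shapes below are chosen arbitrarily):

```python
import torch
import torch.nn as nn

# nn.Flatten keeps the batch dimension and flattens the rest
flatten = nn.Flatten()
x = torch.randn(2, 3, 4)
y = flatten(x)
assert y.shape == (2, 12)

# requires_grad_ toggles requires_grad for all parameters of a Module
m = nn.Linear(4, 2).requires_grad_(False)
assert all(not p.requires_grad for p in m.parameters())
```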


* `Tensor.to_sparse`: now supports autograd. ([20458](
* `Tensor.fill_diagonal_`: operator to fill the main diagonal of a Tensor. ([21892](
* `torch.qr`: supports autograd. ([21274](
* `torch.bitwise_not`: add operator for boolean/integer types.  Also have python `~` operator use this. ([22283](, [22320](
* `torch.trapz`: integrate using the trapezoid rule; equivalent to [numpy.trapz]( ([21610](
* `torch.var_mean` / `torch.std_mean`: compute variance and mean at the same time.([18731](
* `torch.utils.ThroughputBenchmark`: benchmark utility for measuring the throughput of PyTorch operators. ([20766](
* `Logging`: lightweight at-most-once logging to record operators that are used (`c10::Logging`). ([20745](
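A few of these operators in action (values picked for easy mental checking):

```python
import torch

# torch.trapz: trapezoid-rule integration with unit spacing;
# the area under [1, 2, 3] is 1.5 + 2.5 = 4
y = torch.tensor([1.0, 2.0, 3.0])
assert torch.trapz(y).item() == 4.0

# torch.var_mean computes variance and mean in one call
var, mean = torch.var_mean(y)
assert mean.item() == 2.0 and var.item() == 1.0

# torch.bitwise_not backs the ~ operator for bool/integer tensors
assert (~torch.tensor([True, False])).tolist() == [False, True]
```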

Optim Package

* `optim.AdamW`: introduce AdamW optimizer from [Decoupled Weight Decay Regularization]( ([21250](
* `optim.LBFGS`: added support for strong Wolfe line search. ([8824](

Distributed Package

* `DistributedDataParallel`: support CPU modules.  ([20236](
* `DistributedDataParallel`: support sparse tensors. ([19146](
* `DistributedDataParallel`: support local gradient accumulation. ([21736](


* `IterableDataset`: introduces a new type of Dataset designed for data read from a stream. ([19228](

Tensorboard Package

* TensorBoard support in PyTorch has improved and is no longer experimental!
* `SummaryWriter.flush`: now supported. ([20607](
* `SummaryWriter.add_mesh`: add support for 3D point clouds. ([20413](

JIT Features

* Improved support for iterator infrastructure. TorchScript now supports looping through a `List`, `Tuple`, `Dict`, `Tensor`, `String` and you can also use `zip()`, `enumerate()`, and ``. ([21801](, [22006](, [21990](, [21985](
* Support `in` membership checks. ([21527](
* Improved support for strings and the string libraries. ([20826](, [20188](, [20761](, [21656](, [20617](
* Improved `math` support. ([20979](, [19707](, [21151](, [21131](, [21129](, [21130](, [21512](, [21126](, [21127](, [21128](
* Support for various other Python builtin functions. ([21451](
* Support for `NamedTuple`. ([21428](
* All the rest of the `dict` methods. ([21979](
* `sorted()` keyword for lists and dicts. ([23274](
* Add support for breaks and continues. ([21692](
* Improved custom operator API with several bugfixes and new features. It now allows more primitive types, supports `torch::List`, `torch::Dict` and `torch::Optional`, supports dispatch (i.e. registering a different function for CPU and CUDA for the same operator).
* Support `nn.GRU` in script. ([23266](
* Support `pack_padded_sequence` and `pad_packed_sequence`. ([23249](
* Support `torch._C._get_tracing_state` in TorchScript. ([23248](
* Support `torch.as_tensor` in TorchScript. ([23247](
* add support for recursive compilation on `Modules`. ([20708](
* add `all` builtin. ([20521](
* Add `Final[T]` annotated members to `__constants__`. ([21603](
* Add `save()` to scripted `Function`s. ([20386](
* Support for serializing class attributes. ([22953](
* Support for class annotations. ([21379](
* support Python 3.8 `Constant` node. ([22007](
* Support for type annotations instead of `torch.jit.annotate()`. ([21390](
* Support operator overloading for user-defined classes. ([20033](
* Support recursive `ModuleList` / `Sequential`. ([21306](
* Trace multiple methods in a single `Module`. ([19905](
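For instance, the improved iterator support means idioms like `zip()` over lists now compile directly (a small sketch, not taken from the changelog):

```python
import torch
from typing import List

@torch.jit.script
def weighted_sum(xs: List[float], ws: List[float]) -> float:
    # zip() over lists now works inside TorchScript
    total = 0.0
    for x, w in zip(xs, ws):
        total += x * w
    return total

assert weighted_sum([1.0, 2.0], [3.0, 4.0]) == 11.0
```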


* `Tensor.pin_memory()`: only ask for context on current device. ([22229](
* `Tensor.view()`: suggest using `reshape()` instead of `contiguous()` when the input is non-contiguous. ([20968](
* `Tensor.numpy()`: throw `TypeError` instead of `ValueError` if the type isn’t supported. ([21608](
* `torch.norm`: add support for `p="nuc"` with `dim` specified. ([21022](
* `torch.qr`: support batching of input matrices. ([20689](
* `torch.qr`: support `some` parameter akin to NumPy's `mode` option. ([20689](
* `torch.det` / `torch.logdet` / `torch.slogdet`: added batching support. ([22909](
* `torch.cdist`: support batching. ([20934](
* `torch.symeig`: support batching. ([21858](
* `torch._dirichlet_grad`: support CUDA. ([21191](
* `torch.randperm`: support `torch.float16`. ([22102](
* `torch.Size` is now pickle-able in Python2. ([20952](
* `torch.tensor` / `torch.as_tensor`: infer device if input supports Numba’s [`__cuda_array_interface__`]( ([20584](
* `torch.isinf` / `torch.isfinite`: throw `TypeError` instead of `ValueError` when a non-tensor is passed in. ([20817](
* `nn.MultiheadedAttention`: add functional support. ([20415](
* `nn.MultiheadedAttention`: added support for key/value to have different number of features. ([21288](
* `nn.MultiheadAttention`: allow static key/values. ([21288](
* `nn.Conv{1,2,3}D`: support `torch.int64` dtype in forward. ([20730](, [22594](
* `nn.AvgPool{1,2,3}D`: support `torch.int64` dtype in forward. ([22433](
* `nn.Module`: make `_save_to_state_dict` overrideable. ([21933](
* `autograd`: Checkpointing of modules inside large fanout networks no longer hits a recursion error. ([22397](
* `autograd`: Track in-place changes of Tensors through `Module._apply` (internal API). ([21865](
* `autograd.profiler`: Add shape aggregation support.  [20035](
* `autograd.profiler`: Profile custom c10 ops. ([20175](
* `DataLoader`: support setting `batch_size=0` to disable automatic batching (collation) in `DataLoader` for easier bulk loading.  ([19228](
* `DataLoader`: add `multiprocessing_context` parameter. ([22990](
* `DataLoader`: added error detection for `worker_init_fn`. ([20150](
* `DataLoader`: Retry on `EINTR`. ([21723](
* `torch.cuda.set_rng_state` / `torch.cuda.get_rng_state`: accept string as `device` parameter. ([23448](
* `CUDA`: add warning when using Turing GPUs and CUDA <= 9000. ([21468](
* `CUDA`: warn on conditions that can trigger a cuBLAS 9.0 bug.  ([22034](
* `CPU`: Improve CPUAllocator OOM message. ([20618](
* `[memory_format]`: added support for `torch.empty`, `torch.empty_like`, `Tensor.contiguous()`, `Tensor.is_contiguous()` to specify / check the order in which dimensions are laid out in memory. ([20455](, [20558](
* `distributions.MultivariateNormal`: fix precision matrix instability. ([21366](
* `distributions.transforms.SigmoidTransform`: fix numerical instability. ([19802](
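A minimal sketch of the `memory_format` support listed above, assuming a 4D (NCHW) tensor; the values are illustrative:

```python
import torch

# allocate a tensor laid out channels-last (NHWC strides)
x = torch.empty(2, 3, 4, 5, memory_format=torch.channels_last)

# is_contiguous() can now be checked against a specific memory format
nchw_contig = x.is_contiguous()                                   # default layout
nhwc_contig = x.is_contiguous(memory_format=torch.channels_last)

# contiguous() converts back to the default layout
y = x.contiguous(memory_format=torch.contiguous_format)
```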

Distributed Improvements

* `DistributedDataParallel`: Support DDP forward/backward calls even if no module parameter is used. ([19821](
* `DistributedDataParallel`: Only call into reducer if grad is enabled. ([19897](
* `DistributedDataParallel`: Require finalizing DDP backward only when gradients are actually computed; this allows an application to completely discard DDP outputs and move on to the next iteration. ([19901](
* `DistributedDataParallel`: Improve DDP backward reduction error messages. ([20586](
* `DistributedDataParallel`: make DDP failure recoverable. ([21591](
* `DistributedDataParallel`: Delay reduction of unused parameters until the first autograd hook is called. ([22219](
* `c10d:` support tensors shared across processes. ([21449](
* `c10d:` `ProcessGroupMPI` Add device guard around MPI operations. ([22446](
* ``: Make shuffling optional. ([22479](

Tensorboard Improvements

* Usage of kwarg-only arguments has been removed. ([21786](

Numpy Compatibility Improvements

* `Tensor.T:` added numpy-like support for reversing dimensions. ([20598](
* `Tensor.ndim`: NumPy equivalent property for the number of dimensions. ([20565](
* `Tensor.nonzero`: added `as_tuple` argument (default `False`) that when `True`, will return a tuple of Tensors, which matches the behavior of [numpy.nonzero]( ([20293](
* `torch.dtype`: support passing in NumPy dtypes as arguments. ([21215](
* `torch.normal`: add `size` parameter when called with two floats. ([20545](
* `torch.where`: add one-argument overload that is an alias for Numpy-like `nonzero`. ([21986](
* support a number of argument name overrides, e.g. `axis` instead of `dim`. ([20451](
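A short sketch of the NumPy-compatibility additions above; the tensors are illustrative:

```python
import torch

a = torch.zeros(2, 3)
# Tensor.T reverses dimensions and Tensor.ndim counts them, NumPy-style
t_shape = a.T.shape   # (3, 2)
ndim = a.ndim         # 2

# nonzero(as_tuple=True) matches numpy.nonzero's tuple-of-arrays output
b = torch.tensor([0, 1, 2, 0])
(idx,) = b.nonzero(as_tuple=True)
```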

JIT Improvements

* The original source code debug information is now saved with the model. If a model is saved and then loaded into another process, the loaded process can now print out error messages that point to the original source code. ([22177](, [22178](, [22179](, [22180](
* Error message source range highlighting now includes filename, line number, and column number. ([21157](
* Better Constant Propagation through Tuples. ([22561](
* Add `start` and `step` parameters for `range` in TorchScript. ([20795](
* Support for threading options for TorchScript inference ([doc](
* Add `max_pool2d` to symbolic derivatives. ([19661](
* Optimize `matmul` memory usage for certain cases. ([23433](
* Avoid kernel launches for zero-sized tensor inputs. ([22790](
* Add support for steps (strides) in tensor slices. ([20929](
* Added error for classes that don't have an `__init__` function. ([21880](
* Allow classes to be used in their own methods. ([20106](
* Better error message when a variable is conditionally defined. ([20911](
* Consider contained types in alias analysis. ([21431](
* Convenience APIs for script objects. ([20226](
* Don't print backtrace for interpreter errors. ([20925](
* Improve error msg for missing attribute. ([20779](
* Improve error msg on inferred type. ([21058](
* Improve error msg on recursive class defs. ([21842](
* Include module names in recursive error stacks. ([22921](
* Improve recursive scripting error message. ([21841](
* Index into a tuple with non constant integer. ([20081](
* Allow `ScriptModule` buffer attributes to also cast device/type. ([19700](
* Lower batchmm to non-diff optimization. ([19987](
* Make `` an attribute instead of a parameter. ([21078](
* Make `strtod_c` compatible with different gcc abi. ([21293](
* make magic methods work with casts too. ([20654](
* Improve performance of alias analysis. ([20899](
* Print a warning if a type annotation prefix is invalid according to mypy. ([20884](
* schema_matching.cpp: improve error messages. ([21141](
* Resolve with closed over variables instead of stack frame. ([22270](
* Report errors through call stack. ([22280](
* Reduce number of stack manipulation instructions in interpreter. ([21240](
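To illustrate the looping improvements (e.g. `enumerate()` and `range` with `start`/`step`) inside TorchScript, here is a hypothetical scripted function, not taken from the release notes:

```python
import torch
from typing import List

# enumerate() is now usable inside TorchScript functions
@torch.jit.script
def weighted_sum(xs: List[float]) -> float:
    total = 0.0
    for i, x in enumerate(xs):
        total += float(i) * x
    return total
```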

C++ API Improvements

* `nn::PoissonNLLLoss`: Added support. ([19316](
* `nn::Module`: added `replace_module` API to overwrite submodules in C++ Frontend. ([22546](
* `nn:Module::register_module` / `register_parameter` / `register_buffer`: make public ([23196](
* `data::datasets::ChunkDataReader`: fix include headers and a vector issue. ([19485](
* `data::datasets::ChunkDataset`: add new `get_batch` method. ([21797](
* `data::datasets::ChunkDataset`: add checkpoint support. ([21889](
* `data::datasets::ChunkDataset`: add support for cross-chunk shuffling. ([22347](
* `data::datasets::ChunkDataset`: add sorting policy. ([23053](

MKLDNN Tensor Improvements

Add support for a number of operators on MKLDNN Tensors including:
* `Tensor.is_mkldnn`: ([22386](
* `Tensor.transpose()`: ([21943](
* `Tensor.zero_()`: ([20573](
* `torch.empty`: ([21184](
* `torch.mul`: ([20575](
* `nn.AdaptiveAvgPool{1,2,3}D`: ([19818](
* `nn.Sigmoid`: ([20820](
* `nn.Softmax`: ([21516](
* `nn.Module`: support saving/loading MKLDNN modules. ([20799](
* `nn.MaxPool{1,2,3}D`: support `ceil_mode`. ([21310](

Bug Fixes

* Indexing: fix advanced indexing where there are more than (2^31)-1 bytes in the output. ([20919](
* Indexing: fix indexing when there are more than 65535 elements in a non-indexing first dimension on CUDA. ([23123](
* Indexing: fix issue with slicing empty tensors. ([20914](
* `Tensor.index_copy_:` fix segfault by properly checking dimension is in range. ([21617](
* `Tensor.copy_`: Fix a bug where non-blocking was not being respected. ([20305](
* `Tensor.clone`: Fix an issue with MKLDNN tensors. ([20943](
* Tensor subclassing: give a proper error instead of crashing. ([20283](
* ``:  Fix segfault with tensors that can't be indexed with 32-bit ints. ([21530](
* `torch.range` / `torch.linspace` / `torch.logspace`: properly respect the current `Stream`. ([21619](
* ``: return the identity permutation instead of zeros when not using pivoting. ([22242](
* `torch.einsum`: Fix an issue where the backward pass would potentially be skipped. ([22111](
* `torch.cosh`: Fix an issue where `torch.cos` was instead calculated with `torch.double` dtype and vectorized instructions. ([20797](
* `torch.triu` / `torch.tril`: handle strides correctly for in-place versions. ([22730](
* `torch.triu` / `torch.tril`: Fix handling of batches > 65535 on CUDA. ([21067](
* `torch.inverse` / `torch.solve` / `torch.cholesky_solve` /  `torch.triangular_solve`: Fix batch sizes > 65535 on CUDA. ([21689](
* `torch.histc`: return `dtype` is now the same as the input tensor on CUDA, matching CPU behavior. ([20369](
* `torch.histc`: properly return 1-dim tensor on CPU with 0-dim input and 1 bin. ([21497](
* `torch.randperm`: handle non-contiguous `out` parameter. ([23043](
* `torch.unique`: Fix empty tensor handling when `dim` is passed as an argument. ([19000](
* `torch.min` / `torch.max`: properly error on empty tensor inputs, as with CPU tensors. ([19612](
* `CUDA`: fix launch parameters for reductions. ([22827](
* `torch.hub`: fix an issue with `find_module`. ([20782](
* `autograd`: Fix a number of custom autograd `Function` corner cases by inverting the relationship between PyFunction and THPFunction. ([22983](
* `autograd`: give "Trying to backward through the graph a second time" error instead of internal assert when the buffers are a list of Tensors (with indexing). ([21533](
* `optim.lr_scheduler.CosineAnnealingLR`: rename from CosineAnnealingLr. ([23242](
* `distributions.Binomial`: Fix overflow of `log_prob` when `logits` is large. ([20679](
* `distributions.SigmoidTransform`: Fix numerical issues that could result in `inf` / `-inf` return values. ([20288](
* `distributions.Categorical.sample`: fix a view bug. ([23328](
* `CUDA`: Give proper error message for bad cuda forks. ([23322](
* `pickle`: Fix Unpickling error when loading multiple objects from a file. ([20270](
* `NCCL`: Fix race condition. ([23040](

torch.nn Bug Fixes

* `nn.Conv{1,2,3}D`: fix memory leak on MKLDNN code path. ([22392](
* `nn.Conv{1,2,3}D`: properly unpickle older pickled versions. ([21687](
* `nn.CTCLoss`: fix backward on CUDA when 2d target tensor is larger than `max_target_length`. ([20971](
* `nn.CTCLoss`: fix some numerical stability issues. ([21392](
* `nn.CTCLoss`: disable buggy non-deterministic CudNN algorithm. ([22977](
* `nn.CTCLoss`: fixed empty target handling. ([21910](, [23298](
* `nn.SyncBatchNorm`: fix syncing of running statistics when count size differs between GPUs. ([22248](
* `nn.SyncBatchNorm`: retain `requires_grad` value when converting from `nn.BatchNorm`. ([22569](
* `nn.SyncBatchNorm`: correctly handle `process_group` in `convert_sync_batchnorm`. ([19240](
* `nn.MultiheadedAttention`: fix for `torch.float16` dtype. ([21658](
* `nn.EmbeddingBag`: fix NaN output when input is empty. ([21400](
* `nn.Dropout`: fix python crash (with SIGFPE) when called on an empty cuda tensor. ([20541](
* `nn.MaxPool`: fix output size calculation in some corner cases. ([22304](
* `nn.MaxPool`: return valid indices if all entries are `-inf`. ([23161](
* `nn.Softmax`: respect the current Stream. ([22470](
* `nn.LogSoftmax`: fix numerical stability issues. ([21672](
* `nn.Module.load_state_dict`: break ref cycle. ([20397](
* `nn.Module`: fix loading in 32-bit environments. ([20900](
* `nn.utils.rnn.pack_padded_sequence`: Fix segfault on empty tensors. ([21461](
* `nn.utils.spectral_norm`: fix loading `state_dict` when `strict=False`. ([22545](
* `CudNN`: Fix uninitialized PoolWindow on Windows. ([22405](

Distributed Bug fixes

* `nn.parallel.DataParallel`: fix error in `no_grad` mode. ([21262](
* `torch.distributed.all_gather`: fix errors for views and aliases. ([21490](
* `c10d`: fix collective communication errors on empty tensors. ([20658](

JIT Bug Fixes

* Fix specialized list from dict keys. ([23267](
* Switch keys to be sequential and stable in pickle serialization. ([23280](
* `deepCopy` also copies type information of lists. ([23271](
* `dictKeys` and `dictItems` ops on typed dicts return typed lists. ([23270](
* Fix pickler bug where it would not load if no tensors were saved. ([23263](
* Avoid multiple writes to files on export. ([21186](
* Better error msg for mismatched `dict` key type. ([22231](
* Better error msg for using Python `builtin_function_or_method`. ([22935](
* Better error msg in `__getstate__` to let a user know that ScriptModules can't be deep-copied at the moment. ([20885](
* Better error msg when seeing an unsupported builtin function. ([21068](
* `dropout` derivative should respect the `train` flag. ([20760](
* Fix `__constants__` for some nn modules. ([21071](
* Fix `ScriptModule.__dir__()`. ([22426](
* Fix 3x DenseNet compile time regression by restoring earlier-out tests in AliasDB::writesToAlias. ([21425](
* Fix a bug in loop unrolling. ([21239](
* Fix alias annotations for dict ops. ([22900](
* Fix inaccurate SourceRange reporting. ([21109](
* Fix broken indexing when using None and ellipses indexing together. ([22905](
* Fix bug in `CompilationUnit::define`. ([21886](
* Fix compilation order for class methods. ([20094](
* Fix dead code elimination over loops. ([22632](
* Fix dead code elimination in onnx export. ([22476](
* Fix incorrect default on `Graph::toString`. ([21370](
* Fix optional type promotion for classes. ([21593](
* Fix optional type unification. ([19813](
* Fix `NameError` with `PYTORCH_JIT=0`. ([20120](
* Fix overspecializing constants in compilation. ([22816](
* Fix `pow()` bug on overloads. ([20824](
* Fix recursive method compilation. ([21862](
* Fix reflection on weak modules, copy attributes. ([20190](
* Fix slow unpickling. ([21542](
* Fix input/output type mismatch. ([20829](
* Fix insert_guard for norm decomposition. ([19646](
* Fix Trace inlining of graphs with optional inputs. ([22686](
* Fix tracing bugs where using `1 - x` in C++ would cause the size of 1 to get hardcoded. ([20932](
* Fix tuple indexing bug. ([21521](
* Fix type hints for `None` constants. ([23029](
* Fix weak module `cuda()` `_flat_weights` bug. ([21107](
* Fix `WeakIValueEq`. ([21891](
* Fixed gcd to use 64 bit integers. ([21041](
* Fixed `list()` not making a copy. ([22093](
* Fix race condition on `Module::forward` method. ([21398](
* Made `a += b` for lists do an in place add. ([21896](
* Made `floor/ceil` return ints. ([21124](
* Fix out-of-memory on GPU caused by the `weak_script` decorators. ([20588](
* Override print when python is present. ([21625](
* Set `__file__` for `torch.ops`. ([21888](
* Set correct list type in pybind_utils. ([23188](

C++ Frontend bug fixes

* `nn::RNN`: Fix assertions in bidirectional RNN. ([22850](
* `nn::MaxPool ` / ` nn::AvgPool`: expand incomplete kernel size, as in Python. ([22073](, [22075](
* `Optim`: Fix memory leak when `weight_decay` is applied to `Adam`, `Adagrad`,  `RMSProp`. ([23125](
* `Optim::SGD`: fix memory leak with weight_decay. ([23007](
* `torch::autograd::Scatter` `/ torch::autograd::Gather`: Fix nullptr bug. ([20286](
* `torch::nn::parallel::data_parallel`: fix gradient computation error. ([20910](
* [C++ Extensions] Fix an issue when building multiple extensions in the same directory. ([20221](


**Masking via `torch.uint8` Tensors is now deprecated in favor of masking via `torch.bool` Tensors.**

See the **Breaking Changes** section for more details about `torch.bool` Tensors and comparison operators.

**`torch.masked_select`, `torch.masked_fill`, `torch.masked_scatter` now expect `torch.bool` masks rather than `torch.uint8`.**

>>> a = torch.tensor([1, 2, 3])
>>> b = torch.tensor([3, 1, 2])

>>> a.masked_select(torch.tensor([0, 1, 1], dtype=torch.uint8))
UserWarning: masked_select received a mask with dtype torch.uint8,
this behavior is now deprecated, please use a mask with dtype torch.bool instead.

tensor([2, 3])

# instead use torch.bool
>>> a.masked_select(torch.tensor([False,  True,  True]))
tensor([2, 3])

**Comparison operators with `out=` parameters now expect `torch.bool` dtype rather than `torch.uint8`.**

>>> a = torch.tensor([1, 2, 3])
>>> b = torch.tensor([3, 1, 2])
>>> res = torch.empty_like(a, dtype=torch.uint8)
>>>, b, out=res)
UserWarning: received 'out' parameter with dtype torch.uint8, this behavior
is now deprecated, please use 'out' parameter with dtype torch.bool instead.

tensor([0, 1, 1], dtype=torch.uint8)

# instead use torch.bool
>>> res = torch.empty_like(a, dtype=torch.bool)
>>>, b, out=res)
tensor([False, True, True])

Legacy `autograd.Function` (Function without static forward method) is now deprecated

>>> class MyLegacyFunction(Function):
>>>     def forward(self, x):
>>>         return x
>>>     def backward(self, grad_output):
>>>         return grad_output
>>> MyLegacyFunction()(torch.randn((3,), requires_grad=True))
UserWarning: Legacy autograd function with non-static forward method is deprecated
and will be removed in 1.3. Please use new-style autograd function
with static forward method.

# instead use a new-style autograd Function
>>> class MyFunction(Function):
>>>     @staticmethod
>>>     def forward(ctx, x):
>>>         return x
>>>     @staticmethod
>>>     def backward(ctx, grad_output):
>>>         return grad_output
>>> MyFunction.apply(torch.randn((3,), requires_grad=True))

See the [torch.autograd.Function]( documentation for more details.

`torch.gels`: has been renamed to `torch.lstsq`; `torch.gels` will work for this release but is now deprecated.  ([23460](


* Advanced Indexing: significantly improve performance of advanced indexing backward. ([20557](
* `Tensor.copy_`: increase broadcasting CUDA copy performance by 25%. ([20685](
* `torch.matmul`: Optimize the case `A.ndim <= 2 && B.ndim >= 3`, showing up to 15x speedup. ([20448](
* `torch.bmm`: Improve performance by up to 3x for small cases on CPU by applying TensorAccessor. ([20266](
* `torch.inverse`: Move workspace query and allocation outside loop to improve performance by up to 5x. ([20904](
* `torch.topk`: Optimize CPU perf using parallel and partial sort, up to 6x improvement. ([22865](
* `torch.cdist`: Improve CPU perf by up to 10x for some cases. ([20605](
* `torch.normal`: Move `normal`, `normal_means`, `normal_stddevs`, and `normal_means_stddevs` to ATen, increasing performance by up to 3x. ([21287](
* `torch.bernoulli`: Speedup bernoulli_scalar_cuda_kernel with grid-stride loop, increasing performance by up to 2x. ([21300](
* `torch.coalesce`: Use `_sparse_coo_tensor_unsafe` in `coalesce` for up to 10x speedup. ([21214](
* `torch.sinh` / `torch.cosh`: Parallelize and vectorize on CPU. ([21115](
* `torch.lerp`: Vectorize on CPU. ([22038](
* `torch.eye`: Parallelize on CPU. ([21077](
* `torch.randperm`: Parallelize initialization in randperm on CPU. ([21529](
* Vectorization: Don't split 256-bit AVX2 load/store intrinsics. ([20609](

Torch.NN Performance Improvements

* `nn.Softmax`: Add persistent CUDA kernels that increase performance 2-10x on small inputs. ([20827](
* `nn.Embedding` / `nn.EmbeddingBag`: Optimize CUDA kernel, increasing performance up to 2.7x. ([22016](
* `nn.Linear`: optimize BERT model perf by using mkldnn inner product. ([21851](
* `nn.Conv{1,2,3}D`: improve perf for depthwise convolutions in `torch.float16` on Volta and Turing GPUs. ([22302](
* `nn.RNN`: optimize on CPU by fusing matmul ops. ([22512](
* `nn.Upsample`: a number of significant perf improvements on CUDA. ([21879](, [21694](
* `nn.functional.layer_norm`: optimize a fast path for layer_norm, increasing perf by up to 4x on CPU. ([20345](, [20883](


* `torch.bool`: document the Boolean tensor type. ([21601](
* `torch.as_strided`: add docs. ([22842](
* `torch.empty_strided`: add docs. ([23740](
* `torch.lerp`: clarify broadcasting requirements. ([23268](
* `torch.enable_grad` / `torch.no_grad` / `torch.set_grad_enabled`: clarify the interaction between these features. ([23310](
* `torch.autograd.grad_mode`: Document that no_grad is thread local. ([21755](
* `torch.multiprocessing`: Explain refcounting of CUDA tensors. ([19904](
* `torch.Tensor`: Add a warning about memory usage. ([20801](
* ``: Document RNG state consumption. ([22540](
* `torch.optim.lr_scheduler.CyclicLR`: Clarify `base_momentum` and `max_momentum`. ([20880](
* Document production environment features. ([23010](
* Add note about contributing recently released research. ([23513](
* Clarify performance implications of deterministic mode. ([21337](
* Update cuda pinned memory note to include ``. ([20977](

Torch.NN Documentation

* `nn.functional / nn.init`: Break up NN in docs so they load faster. ([21291](
* `nn.functional.conv{1,2,3}d`: Remove `padding_mode`.  ([20891](
* `nn.functional.upsample` / `nn.functional.interpolate`: add note about overshooting with `mode='bicubic'`. ([23321](
* `nn.init.zeros_` / `nn.init.ones_`: add documentation. ([23145](
* `nn.MultiheadAttention`: Add documentation for `add_bias_kv`, `add_zero_attn`, and `attn_mask`. ([20071](
* `nn.MultiheadAttention`: Fix documentation for attention mask shape. ([20850](
* `nn.Softmax`: Fixed to specify dimension to prevent warning in 1.1.0. ([20310](

Contributor Documentation

* Updated web links on contribution_guide and governance documentation. ([21243](
* Improve documentation for publishing hub models. ([21307](
* Suggest a faster linker in the contributing guide. ([21334](
* Add CUDA C++11 and profiling notes to the contribution guide. ([21386](

Build Documentation

* Add magma for CUDA 10.1 to Windows docs. ([19914](
* Improve build-from-source instructions. ([20088](
* Add `ninja` to build instructions. ([20079](
* Update libtorch build docs. ([21150](

TensorBoard Documentation

* Tensorboard Documentation has been greatly improved!  Browse the latest version [here](

Torch HUB Documentation

* Improve docs for publishing hub models. ([21307](
* Update docs of entry point in hub. ([21568](


In PyTorch 1.2, we have added full support for ONNX Opsets 7, 8, 9, and 10 in the ONNX exporter, and we have also enhanced the constant folding pass to support Opset 10. The export of ScriptModules has better support. Additionally, users are now able to register their own symbolics to export custom ops, and to specify the dynamic dimensions of inputs during export.

Supporting More ONNX Opsets

* Add basic support for multiple ONNX Opsets, including Opset 10. ([19294](
* Support ONNX Opset 7 and 8 in PyTorch ONNX Exporter. ([22421](, [20036](
* Export `Dropout` for Opset 10. ([20710](
* Export `Slice` and `Flip` for Opset 10. ([20533](
* Export `Interpolate (Resize)` for Opset 10. ([21434](

Enhancing the Support for ScriptModule

* Support multiple outputs in ScriptModule in ONNX Exporter. ([20256](
* Support tensor factories in ScriptModule in ONNX Exporter. ([20255](
* Support tuples as inputs and outputs in ScriptModule. ([20784](

Exporting More Torch Operators to ONNX

* Export custom ops. ([21321](
* Export `torch.arange`. ([22601](
* Export `torch.masked_fill`. ([22521](
* Export `torch.floor`, `torch.ceil`, `torch.log2` and `prim::shape`. ([17895](
* Export `torch._dim_arange`. ([20078](
* Export `torch.randn_like`. ([20093](
* Export `torch._standard_gamma`. ([20126](
* Export `torch.topk`. ([21104](
* Export `__and__`, `__or__`. ([17894](
* Export `torch.sign`. ([20470](
* Export `torch.scatter`. ([18543](
* Export `torch.rand`. ([20559](
* Export `torch.gather`. ([21235](
* Export `torch.cosine_similarity`. ([21884](
* Export `torch.sum`. ([22240](
* Export `torch.logsumexp`. ([22306](
* Export `torch.layer_norm`. ([22265](

Extending Existing Exporting Logic

* Support `torch.min` and `torch.max` with dim. ([19689](
* Support `maxpool` with dilations. ([18721](
* Support `RNN` with `batch_first=True`. ([19766](
* Support `Upsample` with dynamic input. ([20116](
* Improve support for Loop export. ([20445](
* Enable `torch.full` with scalar parameters. ([21931](
* Added support for exporting models with variable length input/output to ONNX. ([20034](

Optimizing Exported ONNX Graph

* Support constant folding in Opset 10. ([22515](
* Support negative indexing for `Slice` in constant folding optimization. ([21811](


* Fix the shape of `PReLU` weight. ([21330](
* Fix the export for `torch.pixel_shuffle`. ([21486](
* Fix the export for `torch.full`. ([21669](
* Update logic for folding `onnx::Constant` nodes. ([20109](


`named_parameters` to filter out specific parameter types

Let's say that you want to add weight decay to all parameters of your model except for the biases. How do you get only the biases of your model?
We introduce [nn.Module.named_parameters]( for this.
It joins `named_children` and `named_modules` in helping you filter specific attributes of models.

Example of filtering out the biases of a model and giving them a `weight_decay` of 0:

import torch
import torch.nn as nn
import torch.optim as optim

m = nn.Sequential(
    nn.Linear(10, 20),
    nn.Linear(20, 20),
)

weights, biases = [], []
for name, p in m.named_parameters():
    if 'bias' in name:
        biases += [p]
    else:
        weights += [p]

optimizer = optim.SGD([
    {'params': weights},
    {'params': biases, 'weight_decay': 0}
], lr=1e-2, momentum=0.9, weight_decay=1e-5)

Performance Improvements
- `cumsum` and `cumprod` are now significantly faster on the GPU, using Thrust primitives where appropriate.
- `LSTMCell` and `GRUCell` are now significantly faster on the GPU via a fused kernel.
- The default CuDNN algorithm has been changed to `PRECOMP_GEMM`, a much faster algorithm that requires a small amount of workspace. Previously it was `IMPLICIT_GEMM`, which required zero workspace but was significantly slower.
- 5% to 10% improvement in the data loader by collating batches directly into shared memory.
- SVD is now computed on the GPU via divide-and-conquer (sgesdd), which gives a 2x to 5x speedup.
- The commonly used function `expand` has been moved to C for better performance in smaller models.

Bug Fixes
- Added contiguous checks on weight and bias for a large range of THNN functions
- make the range of `random_` correct when both the lower and upper bounds are specified
- `parallel_apply` now can take arguments that are unhashable
- Reshape `grad` correctly in the Dot function (inputs don't have to be 1D vectors...)
- Added `Variable.type_as`
- Unify argument names of `norm` and `renorm` to have `p=norm_type, dim=dim`
- `btrisolve` works on CPU doubles
- IPython autocomplete for `torch.nn.Module` fixed by implementing `__dir__`
- `device_ids` can now be `None` again in `F.data_parallel` and will use all available GPUs
- workaround cudnn bugs in BatchNorm (<5.1.10) and Dilation (6.0.20)
- Padding bugfix in Conv1d CPU
- `remainder` and `cremainder` are fixed for integer types
- fix memory leak in `btrisolve` and `getri`
- If an `nn.Module`'s source can't be retrieved due to an exception, serialization is now non-fatal
- `collate_fn` now retains the type of the numpy array
- `is_tensor` and `is_storage` are now fixed for old-style Python classes
- `` now supports keyword arguments
- CUDA collectives supported coalescing, but the inputs were all assumed
to be of the same Tensor type. This is fixed.
- Fix a deadlock bug in autograd because of an underlying glibc bug in specific
linux distros (ArchLinux in particular)
- `abs` is now fixed for `char` and `short` cuda types
- fix `torch.diag` autograd when giving a dimension argument
- fix grouped convolution on CPU when `bias=False`
- expose `dilated` convolutions for `ConvTranspose*d`
- Fix a bug in `HingeEmbeddingLoss` where `margin` can now be specified via kwargs

Improved error messages
- Fix errors and messages when no CUDA devices are available.


We can now index with a mask that has fewer dimensions than the tensor being indexed:

c = a[mask, :5]

Fast Fourier Transform
- Add new FFT methods [5856](
- Add ``torch.stft`` (short time Fourier transform) and hann/hamming/bartlett window functions. [4095](
- Support arbitrary number of batch dimensions in *FFT [6528](
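A small sketch of `torch.stft` on a synthetic signal. Note that the signature has evolved since this release: in current PyTorch versions `stft` returns a complex tensor and requires `return_complex=True`, whereas the original API stacked real/imaginary parts in the last dimension. The signal and parameters here are illustrative:

```python
import torch

# 400-sample test signal
x = torch.sin(torch.linspace(0, 50, 400))

# short-time Fourier transform with a Hann window
spec = torch.stft(x, n_fft=64, hop_length=16,
                  window=torch.hann_window(64), return_complex=True)
# rows are frequency bins: n_fft // 2 + 1 = 33
```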

New and updated Torch operators

* Added `torch.log2` and `torch.log10` [6272](
* Added `torch.isnan` [5273](
* Add `torch.reshape`, which is similar to `numpy.reshape`. It is roughly equivalent to `tensor.contiguous().view()`, but avoids copying in certain cases [5575](
* Add CPU implementation of `torch.unique`, which outputs the unique elements of a Tensor [5503](
* Add `torch.det`, `torch.logdet` and `torch.slogdet`, for computing the (log-)determinant of square 2D tensors. For negative determinants, `torch.logdet` returns `nan`, while `torch.slogdet` returns the sign of the log-determinant and the log of the absolute value of the determinant. [3816]( and [5393](
* Add `nn.functional.gumbel_softmax`, which lets you use the reparametrization trick for discrete variables [3341](
* Add `torch.take` and `Tensor.put_`. Those functions are equivalent to numpy.take and numpy.put, and are the base for full support of advanced indexing in PyTorch [3263](
* Add `torch.randint`, similar to `numpy.random.randint` [6136](
* Add `torch.diagonal` and `torch.diagflat`, similar to `numpy.diagonal` and `numpy.diagflat`. They are meant as a replacement for `torch.diag`, which handled both the cases of constructing a diagonal tensor as well as extracting the diagonal of a matrix [5622](
* Add `torch.einsum`, equivalent to `numpy.einsum`. einsum allows you to perform operations using Einstein's notation. [5503](
a = torch.arange(0, 9).reshape(3, 3)
# the following transposes a
b = torch.einsum('ij->ji', (a,))

- Add ``torch.expm1``, a numerically stable ``exp(x)-1`` for small ``x``. [4350](
- Allow users to specify individual split sizes with ``torch.split`` [3837](
- Add ``torch.where(condition, tensor1, tensor2)`` that returns a tensor of elements selected from ``tensor1`` or ``tensor2`` based on ``condition``. [4259](
- Add ``Tensor.norm(dim)`` for sparse tensors. [4882](
- Implement ``torch.neg`` for all types. [4075](
- Implement gradient calculation for ``torch.trtrs``. [3972](
- Deprecate out-of-place ``Tensor.resize`` and ``Tensor.resize_as``. These have weird semantics and are hard to use correctly. Please use their in-place variants ``Tensor.resize_`` and ``Tensor.resize_as_``. [4886](
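A few of the operators above in one illustrative snippet (the tensors are made up for demonstration):

```python
import torch

# torch.where picks elementwise from two tensors based on a condition
cond = torch.tensor([True, False, True])
picked = torch.where(cond, torch.tensor([1, 2, 3]), torch.tensor([-1, -2, -3]))

# torch.split now accepts individual split sizes
chunks = torch.split(torch.arange(10), [3, 7])

# torch.expm1 is a numerically stable exp(x) - 1 for small x
small = torch.expm1(torch.tensor([0.0]))
```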

Rename `async` argument in ``.cuda()`` to `non_blocking`

The `async` keyword argument in conversion calls is now deprecated in PyTorch and has been replaced by `non_blocking`. This was necessary because `async` became a reserved keyword in Python 3.7.

Neural Networks

A new autograd container that lets you trade compute for memory

The new `checkpoint` container allows you to only store a subset of the outputs necessary for backpropagation. If an output is missing (to save memory), the `checkpoint` container will recompute the intermediate outputs from the closest checkpoint, so that memory usage can be reduced (with an increase in computation time).
Here is an example:
input = torch.rand(1, 10)
# suppose we have a very deep model
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)
output = model(input)

The above model uses a lot of memory because it needs to keep the intermediate values of every operation for backpropagation. `checkpoint` lets you reduce the memory requirements:


# create the input tensors and set requires_grad=True
# NOTE: requires_grad=True for the input is a current
# limitation of checkpointing. At least one of the
# model inputs should have requires_grad=True.
# If you don't do it, you might have empty gradients.
input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]

# define functions that decide where we will checkpoint
# and store intermediate gradients. In this case,
# we will only store one intermediate gradient,
# in the middle of the model

def run_first_half(*args):
    x = args[0]
    for layer in layers[:500]:
        x = layer(x)
    return x

def run_second_half(*args):
    x = args[0]
    for layer in layers[500:-1]:
        x = layer(x)
    return x

# now use the new checkpoint functionality
from torch.utils.checkpoint import checkpoint

x = checkpoint(run_first_half, input)
x = checkpoint(run_second_half, x)
# the last layer needs to be run without checkpointing
x = layers[-1](x)
x.sum().backward()  # works!

For sequential modules (which can have arbitrary blocks inside), a helper function `checkpoint_sequential` is provided, which takes care of the most common use-cases:
```python
input = torch.rand(1, 10, requires_grad=True)
layers = [nn.Linear(10, 10) for _ in range(1000)]
model = nn.Sequential(*layers)

from torch.utils.checkpoint import checkpoint_sequential

# split the model into two blocks
num_segments = 2
x = checkpoint_sequential(model, num_segments, input)
x.sum().backward()  # works!
```
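
The trade-off behind both APIs can be illustrated with a small pure-Python toy (no autograd; the layer functions and segment count here are made up for illustration): the forward pass stores only one activation per segment instead of one per layer, and backward would recompute the layers inside each segment on demand.

```python
# 6 toy "layers"; each just adds a constant
layers = [lambda x, k=k: x + k for k in range(6)]

def run_segment(x, segment):
    for layer in segment:
        x = layer(x)
    return x

def checkpointed_forward(x, num_segments=2):
    seg_len = len(layers) // num_segments
    saved = []  # only segment inputs are kept for backward
    for i in range(0, len(layers), seg_len):
        saved.append(x)
        x = run_segment(x, layers[i:i + seg_len])
    return x, saved

out, saved = checkpointed_forward(0)
assert out == sum(range(6))  # same result as running all layers
assert len(saved) == 2       # but only 2 activations stored instead of 6
```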

bottleneck - a tool to identify hotspots in your code

``torch.utils.bottleneck`` ([5216](, [6425]( is a tool that can be used as an initial step for
debugging bottlenecks in your program. It summarizes runs of your script with
the Python profiler and PyTorch’s autograd profiler. See the [bottleneck docs]( for more details.

reduce=False Losses
As of this release, all of our loss functions support the ``reduce`` keyword. Specifying ``reduce=False`` gives a Tensor per unit of loss instead of a single reduced loss. [4924](, [5346](, [5646](, [4231](, [4705](,  [5680](
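
Conceptually, `reduce=False` just skips the final averaging step. A pure-Python sketch with a hand-rolled MSE (an illustration, not the torch module):

```python
def mse(pred, target, reduce=True):
    # one squared error per sample; reduce=True averages them
    per_sample = [(p - t) ** 2 for p, t in zip(pred, target)]
    return sum(per_sample) / len(per_sample) if reduce else per_sample

pred, target = [1.0, 2.0, 4.0], [1.0, 2.0, 2.0]
assert mse(pred, target, reduce=False) == [0.0, 0.0, 4.0]  # per-sample losses
assert mse(pred, target) == 4.0 / 3                        # single reduced loss
```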

New modules and module improvements

* Add `DistributedDataParallelCPU`. This is similar to `DistributedDataParallel`, but with specific support for models running on the CPU (unlike `DistributedDataParallel`, which targets GPUs), and supports the `mpi`, `gloo` and `tcp` backends [5919](
* Add [Group Normalization]( (`nn.GroupNorm`), an alternative to batch normalization that doesn't suffer from the same issues as `BatchNorm` for small batch sizes
* Add [Layer Normalization]( (``nn.LayerNorm``), an alternative for batch normalization often used in NLP tasks. [4922](
* Add Local Response Normalization (``nn.LocalResponseNorm``). [4922](
* `MaxPool3d` now supports double backwards. `MaxPool3d` and `MaxUnpool3d` now use indices consistent with the rest of the pooling layers. [5328](
* All loss functions now support a reduce argument to return a batch of losses. [264](
* Add util to clip gradient value in torch.nn.utils.clip_grad and add param to He initialization scheme in `torch.nn.init`. [6173](
* Renamed ``torch.nn.init.*`` methods to have a trailing underscore, as they operate in-place, and deprecated the old versions [6093](
* Added support for returning dictionaries in `DataParallel` [6113](
* Added support for N-D tensors in `torch.nn.Bilinear` [5764](
* Add `Embedding.from_pretrained` factory. This allows initializing an Embedding layer with an existing tensor, bypassing the random initialization of its weights.
* You can now slice ``nn.Sequential``, ``nn.ModuleList``, and ``nn.ParameterList`` [4491](
* Registered ``nn.Module`` integer parameters and buffers are now immune to ``module.float()``, ``module.double()`` ``module.half()`` calls. [3820](

`torch.distributions` has expanded to include 24 [basic probability distributions]( `Bernoulli`, `Beta`, `Binomial`, `Categorical`, `Cauchy`, `Chi2`, `Dirichlet`, `Exponential`, `FisherSnedecor`, `Gamma`, `Geometric`, `Gumbel`, `Laplace`, `LogNormal`, `Multinomial`, `MultivariateNormal`, `Normal`, `OneHotCategorical`, `Pareto`, `Poisson`, `RelaxedBernoulli`, `RelaxedOneHotCategorical`, `StudentT`, and `Uniform`.

The [`Distribution`]( interface has expanded to include many methods including `.cdf()`, `.icdf()`, `.mean()`, `.variance()`, `.entropy()`, and `.perplexity()`. Distributions now split tensor dimensions into [`sample_shape`](, [`batch_shape`](, and [`event_shape`]( Most continuous distributions now also implement a differentiable `.rsample()` method to compute [pathwise derivatives](, aka the reparameterization trick (check `.has_rsample` for availability):
```python
>>> loc = torch.tensor(0., requires_grad=True)
>>> scale = torch.tensor(1., requires_grad=True)
>>> samples = Normal(loc, scale).rsample(sample_shape=(1000,))
>>> loss = (samples - 0.5).pow(4).mean()  # average over 1000 monte carlo samples
>>> grad(loss, [loc, scale])
(tensor(-7.5092), tensor(15.2704))
```
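
The trick that makes `.rsample()` differentiable can be sketched without torch at all: rewrite the sample as a deterministic function of parameter-free noise, so derivatives with respect to the parameters are well defined. A minimal pure-Python illustration:

```python
import random

random.seed(0)
loc, scale = 0.0, 1.0

# noise drawn from a fixed, parameter-free distribution
eps = random.gauss(0.0, 1.0)

# the sample is now a deterministic, differentiable function of loc and scale
sample = loc + scale * eps

# pathwise derivatives, computed by hand for this one-liner:
d_sample_d_loc = 1.0
d_sample_d_scale = eps
assert sample == loc + scale * eps
```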

Most discrete distributions implement an [`.enumerate_support()`]( method to make it easy to sum over all possible sample values (check `.has_enumerate_support` for availability).

[`kl_divergence`]( is defined for many pairs of distributions, e.g.:

```python
>>> x = torch.tensor(1.0, requires_grad=True)
>>> kl = kl_divergence(Uniform(-x, x), Normal(0., 1.))
>>> grad(kl, [x])[0]
```

Distribution Transforms
New distributions can be created by combining [`TransformedDistribution`]( with any number of [`Transform`]( objects from the [`torch.distributions.transforms`]( library, including: `ExpTransform`, `PowerTransform`, `SigmoidTransform`, `AbsTransform`, `AffineTransform`, `SoftmaxTransform`, `StickBreakingTransform`, `LowerCholeskyTransform`, and their inverses via the [`.inv`]( property.

Distribution Constraints
Distributions provide metadata about the constraints of their `.support` and about their arguments (`.arg_constraints`). These [`Constraint`]( objects are registered with transforms using [`transform_to()` and `biject_to()`]( Together, constraints and transforms make it easy to specify new distributions in a generic way:

```python
>>> scale = torch.tensor(1., requires_grad=True)
>>> p = Normal(0., scale)
>>> assert p.arg_constraints['scale'] == constraints.positive
>>> prior = TransformedDistribution(Normal(0., 1.),
...                                 transform_to(constraints.positive))
```

Constraints in the [`torch.distributions.constraints`]( library include: `boolean`, `greater_than(lower_bound)`, `integer_interval(lower_bound, upper_bound)`, `interval(lower_bound, upper_bound)`, `lower_cholesky`, `lower_triangular`, `nonnegative_integer`, `positive`, `positive_definite`, `positive_integer`, `real`, `real_vector`, `simplex`, and `unit_interval`.
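
Conceptually, `transform_to(constraints.positive)` maps an unconstrained real into the constrained space, so optimizers can work in all of R while the distribution always sees a valid parameter. A minimal pure-Python sketch of that idea (using `exp`/`log` as the bijection, which is one valid choice, not necessarily the one the library picks):

```python
import math

def to_positive(unconstrained):
    # unconstrained real -> strictly positive number
    return math.exp(unconstrained)

def from_positive(constrained):
    # inverse mapping, back to the unconstrained space
    return math.log(constrained)

u = -3.7                 # any real value is allowed here
scale = to_positive(u)
assert scale > 0                              # constraint always satisfied
assert abs(from_positive(scale) - u) < 1e-12  # and the map is invertible
```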


Helper utility for launching Distributed Training jobs

We have added a utility function to help launch jobs on a distributed setup.
To launch a script that leverages `DistributedDataParallel` on either a single node or multiple nodes, we can use `torch.distributed.launch` as follows:

```shell
python -m torch.distributed.launch --arg1 --arg2 --arg3
```

The script simplifies day-to-day usability of the `distributed` package.

You can read about its usage here:
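
A script started by the launcher typically receives its local rank as a command-line flag (the launcher also sets rendezvous variables such as `RANK` and `WORLD_SIZE` in the environment). A minimal sketch of the argument-parsing boilerplate, with a simulated launcher input:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args(['--local_rank', '2'])  # simulate the launcher's input

# in a real script you would then pin this process to its GPU, e.g.
# torch.cuda.set_device(args.local_rank)
assert args.local_rank == 2
```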

A new distributed backend based on NCCL 2.0

PyTorch now has a new distributed backend, which leverages NCCL 2.0 for maximum speed.
It also provides new APIs for collective operations on multiple GPUs.
You can enable the new backend by passing `backend="nccl"` to `torch.distributed.init_process_group`.

Other distributed improvements

- Coalesce many small broadcasts to improve performance [4978](
- Add mixed-precision support for distributed training [4891](
- Release NCCL distributed backend. Previously it was marked as ``experimental``. [4921](
- Enable Infiniband support for Gloo data channel with automatic IB device detection [4795](

C++ extensions
Previously, the official way of writing extensions using C or CUDA for custom modules was through the cffi extension. The drawback of this method was that it required a separate step for compiling the CUDA kernels, which could be a bit messy.

PyTorch now provides a better system for [writing your own C++ / CUDA extensions]( Example implementations using this new extension support can be found in the [pytorch/cpp_extensions]( repo.

We provide two compilation modes:
- ahead-of-time compilation: you write a `setup.py` script using the new `CppExtension` or `CUDAExtension`, which is an extension of the `setuptools.Extension` module;
- just-in-time compilation: you pass the list of C++ / CUDA files that you want to compile to `torch.utils.cpp_extension.load`, and it will compile on the fly and cache the libraries for you. Here is an example illustrating how easy it is to implement an extension:

In C++:

```cpp
// my_implementation.cpp
#include <torch/torch.h>
#include <unordered_set>

// we could use templates as well, but let's keep it
// simple
using scalar_t = float;

at::Tensor unique_float(at::Tensor input_) {
  // only works for float tensors...
  AT_ASSERT(input_.type().scalarType() == at::ScalarType::Float, "input must be a float tensor");
  // ...and CPU tensors
  AT_ASSERT(!input_.type().is_cuda(), "input must be a CPU tensor");

  // make the input contiguous, to simplify the implementation
  at::Tensor input = input_.contiguous();

  // get the pointer that holds the data
  scalar_t* input_data =<scalar_t>();
  // let's use a function from the std library to implement
  // the unique function
  std::unordered_set<scalar_t> set(input_data, input_data + input.numel());

  // create the output tensor, with size set.size()
  at::Tensor output = input.type().tensor({static_cast<int64_t>(set.size())});
  scalar_t* output_data =<scalar_t>();
  // copy the content of the set to the output tensor
  std::copy(set.begin(), set.end(), output_data);

  return output;
}

// this defines the functions exposed to Python
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("unique_float", &unique_float, "Unique for float tensors");
}
```

And then in Python:

```python
import torch
from torch.utils.cpp_extension import load as load_ext
# pass the source files; they will be compiled on the fly
# and the result is a python module
_C = load_ext('my_unique_lib', sources=['my_implementation.cpp'])

# now we can use the functions implemented in C++
unique = _C.unique_float
```


Table of Contents

- Breaking Changes
- New Features
- Neural Networks
- Adaptive Softmax, Spectral Norm, etc.
- Operators
- torch.bincount, torch.as_tensor, ...
- torch.distributions
- Half Cauchy, Gamma Sampling, ...
- Other
- Automatic anomaly detection (detecting NaNs, etc.)
- Performance
- Faster CPU ops in a wide variety of cases
- Other improvements
- Bug Fixes
- Documentation Improvements

Breaking Changes

- [`torch.stft`](  has changed its signature to be consistent with librosa
+ Before: `stft(signal, frame_length, hop, fft_size=None, normalized=False, onesided=True, window=None, pad_end=0)`
+ After: `stft(input, n_fft, hop_length=None, win_length=None, window=None, center=True, pad_mode='reflect', normalized=False, onesided=True)`
+ `torch.stft` is also now using FFT internally and is much faster.
- [`torch.slice`](  is removed in favor of the tensor slicing notation
- [`torch.arange`](  now does dtype inference: any floating-point argument is inferred to be the default `dtype`; all integer arguments are inferred to be `int64`.
- [`torch.nn.functional.embedding_bag`]('s old signature `embedding_bag(weight, input, ...)` is deprecated; use `embedding_bag(input, weight, ...)` (consistent with `torch.nn.functional.embedding`) instead
- [`torch.nn.functional.sigmoid`](  and [`torch.nn.functional.tanh`](  are deprecated in favor of [`torch.sigmoid`](  and [`torch.tanh`](
- Broadcast behavior changed in a (very rare) edge case: `[1] x [0]` now broadcasts to `[0]` (it used to be `[1]`)
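
For reference, here is a pure-Python sketch of the NumPy-style broadcasting rule this edge case refers to (a simplification, not the real implementation):

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    # compare sizes right-to-left; a size-1 dim stretches to match,
    # and a size-0 dim now wins over size 1, so (1,) x (0,) -> (0,)
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == 1:
            out.append(y)
        elif y == 1 or x == y:
            out.append(x)
        else:
            raise ValueError('incompatible shapes')
    return tuple(reversed(out))

assert broadcast_shape((1,), (0,)) == (0,)   # the changed edge case
assert broadcast_shape((3, 1), (1, 4)) == (3, 4)
```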

New Features

Neural Networks

- Adaptive Softmax [`nn.AdaptiveLogSoftmaxWithLoss`](

```python
>>> in_features = 1000
>>> n_classes = 200
>>> adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs=[20, 100, 150])
>>> adaptive_softmax
AdaptiveLogSoftmaxWithLoss(
  (head): Linear(in_features=1000, out_features=23, bias=False)
  (tail): ModuleList(
    (0): Sequential(
      (0): Linear(in_features=1000, out_features=250, bias=False)
      (1): Linear(in_features=250, out_features=80, bias=False)
    )
    (1): Sequential(
      (0): Linear(in_features=1000, out_features=62, bias=False)
      (1): Linear(in_features=62, out_features=50, bias=False)
    )
    (2): Sequential(
      (0): Linear(in_features=1000, out_features=15, bias=False)
      (1): Linear(in_features=15, out_features=50, bias=False)
    )
  )
)
>>> batch = 15
>>> input = torch.randn(batch, in_features)
>>> target = torch.randint(n_classes, (batch,), dtype=torch.long)
>>> # get the log probabilities of the targets given the input, and the mean negative log probability loss
>>> adaptive_softmax(input, target)
ASMoutput(output=tensor([-6.8270, -7.9465, -7.3479, -6.8511, -7.5613, -7.1154, -2.9478, -6.9885,
        -7.7484, -7.9102, -7.1660, -8.2843, -7.7903, -8.4459, -7.2371],
       grad_fn=<ThAddBackward>), loss=tensor(7.2112, grad_fn=<MeanBackward1>))
>>> # get the log probabilities of all targets given the input, as a (batch x n_classes) tensor
>>> adaptive_softmax.log_prob(input)
tensor([[-2.6533, -3.3957, -2.7069,  ..., -6.4749, -5.8867, -6.0611],
        [-3.4209, -3.2695, -2.9728,  ..., -7.6664, -7.5946, -7.9606],
        [-3.6789, -3.6317, -3.2098,  ..., -7.3722, -6.9006, -7.4314],
        ...,
        [-3.3150, -4.0957, -3.4335,  ..., -7.9572, -8.4603, -8.2080],
        [-3.8726, -3.7905, -4.3262,  ..., -8.0031, -7.8754, -8.7971],
        [-3.6082, -3.1969, -3.2719,  ..., -6.9769, -6.3158, -7.0805]])
>>> # predict: get the class that maximizes log probability for each input
>>> adaptive_softmax.predict(input)
tensor([ 8,  6,  6, 16, 14, 16, 16,  9,  4,  7,  5,  7,  8, 14,  3])
```
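
The speedup comes from bucketing classes into a small head plus progressively smaller tail clusters. A pure-Python sketch of just the bucketing step (a simplification; the real module also projects the tail clusters to lower-dimensional embeddings, as the repr above shows):

```python
import bisect

# cutoffs=[20, 100, 150] with n_classes=200; the implicit last cutoff is n_classes
cutoffs = [20, 100, 150, 200]

def cluster_of(target):
    # cluster 0 is the cheap head covering classes [0, 20); 1..3 are tail clusters
    return bisect.bisect_right(cutoffs, target)

assert cluster_of(5) == 0     # frequent class, handled by the head
assert cluster_of(120) == 2   # rarer class, second tail cluster
assert cluster_of(199) == 3
```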

- Add spectral normalization [`nn.utils.spectral_norm`](

```python
>>> # Usage is similar to weight_norm
>>> convT = nn.ConvTranspose2d(3, 64, kernel_size=3, padding=1)
>>> # Can specify number of power iterations applied each time, or use default (1)
>>> convT = nn.utils.spectral_norm(convT, n_power_iterations=2)

>>> # apply to every conv and conv transpose module in a model
>>> def add_sn(m):
...     for name, c in m.named_children():
...         m.add_module(name, add_sn(c))
...     if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
...         return nn.utils.spectral_norm(m)
...     return m

>>> my_model = add_sn(my_model)
```
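
Under the hood, spectral norm estimates the largest singular value sigma of the weight by power iteration and divides the weight by it. A pure-Python toy on a 2x2 symmetric matrix (for symmetric matrices a single vector suffices; the real implementation maintains separate left and right vectors):

```python
import math

# toy weight matrix; its singular values are 2 and 0.5
W = [[2.0, 0.0],
     [0.0, 0.5]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

u = [1.0, 1.0]
for _ in range(50):          # n_power_iterations, exaggerated for accuracy
    v = matvec(W, u)         # W is symmetric here, so W^T u == W u
    norm = math.sqrt(sum(x * x for x in v))
    u = [x / norm for x in v]

sigma = sum(x * y for x, y in zip(u, matvec(W, u)))  # Rayleigh quotient
assert abs(sigma - 2.0) < 1e-6
# the normalized weight used in the forward pass would then be W / sigma
```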

- [`nn.ModuleDict`]( and [`nn.ParameterDict`]( containers
- Add `nn.init.zeros_` and `nn.init.ones_`
- Add sparse gradient option to pretrained embedding
- Add max pooling support to [`nn.EmbeddingBag`](
- Depthwise convolution support for MKLDNN
- Add `nn.FeatureAlphaDropout` (featurewise Alpha Dropout layer)


- [`torch.bincount`](  (count frequency of each value in an integral tensor)

```python
>>> input = torch.randint(0, 8, (5,), dtype=torch.int64)
>>> weights = torch.linspace(0, 1, steps=5)
>>> input, weights
(tensor([4, 3, 6, 3, 4]),
 tensor([ 0.0000,  0.2500,  0.5000,  0.7500,  1.0000]))

>>> torch.bincount(input)
tensor([0, 0, 0, 2, 2, 0, 1])

>>> input.bincount(weights)
tensor([0.0000, 0.0000, 0.0000, 1.0000, 1.0000, 0.0000, 0.5000])
```
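
A pure-Python reference makes the semantics explicit (an illustration only, not the real kernel): slot `i` of the output accumulates the (optionally weighted) count of value `i` in the input.

```python
def bincount(values, weights=None):
    # slot i accumulates the (optionally weighted) count of value i
    out = [0.0] * (max(values) + 1)
    for i, v in enumerate(values):
        out[v] += 1 if weights is None else weights[i]
    return out

assert bincount([4, 3, 6, 3, 4]) == [0, 0, 0, 2, 2, 0, 1]
assert bincount([4, 3, 6, 3, 4], [0.0, 0.25, 0.5, 0.75, 1.0]) == \
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.5]
```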

- [`torch.as_tensor`](  (similar to `torch.tensor` but never copies unless necessary)

```python
>>> tensor = torch.randn(3, device='cpu', dtype=torch.float32)
>>> torch.as_tensor(tensor)                       # doesn't copy
>>> torch.as_tensor(tensor, dtype=torch.float64)  # copies due to incompatible dtype
>>> torch.as_tensor(tensor, device='cuda')        # copies due to incompatible device
>>> array = np.array([3, 4.5])
>>> torch.as_tensor(array)                        # doesn't copy, shares memory with the numpy array
>>> torch.as_tensor(array, device='cuda')         # copies due to incompatible device
```

- [`torch.randperm`](  for CUDA tensors
- [`nn.HardShrink`]( for CUDA tensors
- [`torch.flip`](  (flips a tensor along specified dims)
- [`torch.flatten`](  (flattens a contiguous range of dims)
- [`torch.pinverse`](  (computes svd-based pseudo-inverse)
- [`torch.meshgrid`](
- [`torch.unique`](  for CUDA tensors
- [`torch.erfc`](  (complementary error function)
- [`torch.isinf`](  and [`torch.isfinite`](
- [`torch.reshape_as`](
- Support backward for target tensor in [`torch.nn.functional.kl_div`](
- [`torch.logsumexp`](
- Add batched linear solver to `torch.gesv`
- [`torch.sum`](  now supports summing over multiple dimensions
- [`torch.diagonal`](  [`torch.diagflat`](  to take arbitrary diagonals with numpy semantics
- [`tensor.any`]( and [`tensor.all`]( on `ByteTensor` can now accept `dim` and `keepdim` arguments
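
Among these, `torch.logsumexp` exists for numerical stability: `log(sum(exp(x_i)))` overflows for large inputs. A pure-Python sketch of the standard max-shift trick commonly used for this computation (an illustration, not torch's actual kernel):

```python
import math

def logsumexp(xs):
    # subtract the max so every exponent is <= 0: no overflow possible
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def naive(xs):
    return math.log(sum(math.exp(x) for x in xs))

assert abs(logsumexp([1.0, 2.0]) - naive([1.0, 2.0])) < 1e-12
# the naive form would raise OverflowError on this input:
assert logsumexp([1000.0, 1000.0]) == 1000.0 + math.log(2)
```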


- Half Cauchy and Half Normal
- Gamma sampling for CUDA tensors
- Allow vectorized counts in Binomial Distribution


- Autograd automatic anomaly detection for `NaN` and errors occurring in the backward pass. Two functions, [detect_anomaly]( and [set_detect_anomaly](, are provided for this.
- Support `reversed(torch.Tensor)`
- Support `hash(torch.device)`
- Support `gzip` in [`torch.load`](


- Accelerate Bernoulli random number generation on CPU
- Enable cuFFT plan caching (80% speed-up in certain cases)
- Fix unnecessary copying in `bernoulli_`
- Fix unnecessary copying in `broadcast`
- Speed-up multidim `sum` (2x~6x speed-up in certain cases)
- Vectorize CPU `sigmoid` (>3x speed-up in most cases)
- Optimize CPU `nn.LeakyReLU` and `nn.PReLU` (2x speed-up)
- Vectorize `softmax` and `logsoftmax` (4.5x speed-up on single core and 1.8x on 10 threads)
- Speed up `nn.init.sparse` (10-20x speed-up)


Tensor printing

- Tensor printing now includes `requires_grad` and `grad_fn` information
- Improve number formatting in tensor print
- Fix scale when printing some tensors
- Speed up printing of large tensors

Neural Networks

- `NaN` is now propagated through many activation functions
- Add `non_blocking` option to
- Loss modules now allow target to require gradient
- Add `pos_weight` argument to `nn.BCEWithLogitsLoss`
- Support `grad_clip` for parameters on different devices
- Remove the requirement that input sequences to `pad_sequence` be sorted
- `stride` argument for `max_unpool1d`, `max_unpool2d`, `max_unpool3d` now defaults to `kernel_size`
- Allow calling grad-mode context managers (e.g., `torch.no_grad`, `torch.enable_grad`) as decorators
- `torch.optim.lr_scheduler._LRScheduler`'s `__getstate__` now includes optimizer info
- Add support for accepting `Tensor` as input in `clip_grad_*` functions
- Return `NaN` in `max_pool`/`adaptive_max_pool` for `NaN` inputs
- `nn.EmbeddingBag` can now handle empty bags in all modes
- `torch.optim.lr_scheduler.ReduceLROnPlateau` is now serializable
- Allow only tensors of floating point dtype to require gradients
- Allow resetting of BatchNorm running stats and cumulative moving average
- Set the gradient of `LP-Pool`ing to zero if the sum of all input elements to the power of p is zero


- Add ellipsis ('...') and diagonal (e.g. 'ii->i') support to [`torch.einsum`](
- Add `to` method for `PackedSequence`
- Add support for `__floordiv__` and `__rdiv__` for integral tensors
- [`torch.clamp`](  now has subgradient 1 at min and max
- [`torch.arange`](  now uses NumPy-style type inference
- Support infinity norm properly in [`torch.norm`](  and [`torch.renorm`](
- Allow passing an output tensor via the `out=` keyword argument in [``](  and [`torch.matmul`](


- Always enable grad when calculating `lazy_property`

Sparse Tensor

- Add log1p for sparse tensor
- Better support for adding zero-filled sparse tensors

Data Parallel

- Allow modules that return scalars in `nn.DataParallel`
- Allow `nn.parallel.parallel_apply` to take in a list/tuple of tensors


- `torch.Size` can now accept PyTorch scalars
- Move `` to, and `` to ``
- Add serialization for `torch.device`
- Allow copy.deepcopy of `torch.(int/float/...)*` dtype objects
- [`torch.load`](  can now take a `torch.device` as map location

Bug Fixes

- Fix [`nn.BCELoss`]( sometimes returning negative results
- Fix `tensor._indices` on scalar sparse tensor giving wrong result
- Fix backward of `tensor.as_strided` not working properly when input has overlapping memory
- Fix `x.pow(0)` gradient when x contains 0
- Fix CUDA [`torch.svd`](  and [`torch.eig`](  returning wrong results in certain cases
- Fix `nn.MSELoss` having low precision
- Fix segmentation fault when calling `torch.Tensor.grad_fn`
- Fix [`torch.topk`](  returning wrong results when input isn't contiguous
- Fix segfault in convolution on CPU with large `inputs` / `dilation`
- Fix `avg_pool2/3d` `count_include_pad` having default value `False` (should be `True`)
- Fix `nn.EmbeddingBag`'s `max_norm` option
- Fix returning scalar input in Python autograd function
- Fix THCUNN `SpatialDepthwiseConvolution` assuming contiguity
- Fix bug in seeding random module in `DataLoader`
- Don't modify variables in-place for [`torch.einsum`](
- Make return uniform in lbfgs step
- The return value of `uniform.cdf()` is now clamped to `[0..1]`
- Fix advanced indexing with negative indices
- `CUDAGenerator` will not initialize on the current device anymore, which will avoid unnecessary memory allocation on `GPU:0`
- Fix `tensor.type(dtype)` not preserving device
- Batch sampler should return the same results when used alone or in dataloader with `num_workers` > 0
- Fix broadcasting error in LogNormal, TransformedDistribution
- Fix [`torch.max`]( and [`torch.min`](  on CUDA in presence of `NaN`
- Fix [`torch.tensor`]( device-type calculation when used with CUDA
- Fixed a missing `'='` in `nn.LPPoolNd` repr function


- Expose and document `torch.autograd.gradcheck` and `torch.autograd.gradgradcheck`
- Document `tensor.scatter_add_`
- Document variants of [`torch.add`]( and `tensor.add_`, e.g. `tensor.add(value=1, other)` -> Tensor
- Document [`torch.logsumexp`](
- Document [`torch.sparse_coo_tensor`](
- Document [``](
- Document [`torch.nn.GroupNorm`](
- A lot of other various documentation improvements including RNNs, `ConvTransposeNd`, `Fold`/`Unfold`, `Embedding`/`EmbeddingBag`, Loss functions, etc.


PyTorch 0.4.0 release notes

Table of Contents

- Major Core Changes
- Tensor / Variable merged
- Zero-dimensional Tensors
- dtypes
- migration guide
- New Features
- Tensors
- Full support for advanced indexing
- Fast Fourier Transforms
- Neural Networks
- Trade-off memory for compute
- bottleneck - a tool to identify hotspots in your code
- torch.distributions
- 24 basic probability distributions
- Added cdf, variance, entropy, perplexity etc.
- Distributed Training
- Launcher utility for ease of use
- NCCL2 backend
- C++ Extensions
- Windows Support
- ONNX Improvements
- RNN support
- Performance improvements
- Bug fixes

Major Core changes

Here is a summary of the updates to the most important core features users will use daily.

**Major Changes and Potentially Breaking Changes:**
* ``Tensors`` and ``Variables`` have merged
* Some operations now return 0-dimensional (scalar) ``Tensors``
* Deprecation of the ``volatile`` flag

* ``dtypes``, ``devices``, and Numpy-style ``Tensor`` creation functions added
* Support for writing device-agnostic code

We wrote a [migration guide]( that should help you transition your code to new APIs and style. Please read it if you have code in a previous version of PyTorch that you would like to migrate.

**Please read the [migration guide]( if you have code in a previous version of PyTorch that you would like to migrate.**

The contents of this section (Major Core changes) are included in the [migration guide](

Merging [``Tensor``]( and ``Variable`` classes

``torch.autograd.Variable`` and [``torch.Tensor``]( are now the same class.  More precisely, [``torch.Tensor``]( is capable of tracking history and behaves like the old ``Variable``; ``Variable`` wrapping continues to work as before but returns an object of type [``torch.Tensor``](  This means that you don't need the ``Variable`` wrapper everywhere in your code anymore.

The `type()` of a [``Tensor``]( has changed

Note also that the ``type()`` of a Tensor no longer reflects the data type. Use ``isinstance()`` or ``x.type()`` instead:

```python
>>> x = torch.DoubleTensor([1, 1, 1])
>>> print(type(x))  # was torch.DoubleTensor
<class 'torch.autograd.variable.Variable'>
>>> print(x.type())  # OK: 'torch.DoubleTensor'
'torch.DoubleTensor'
>>> print(isinstance(x, torch.DoubleTensor))  # OK: True
True
```

When does [``autograd``]( start tracking history now?

``requires_grad``, the central flag for [``autograd``](, is now an attribute on ``Tensor``s. Let's see how this change manifests in code.

[``autograd``]( uses the same rules previously used for ``Variable``s. It starts tracking history when any input ``Tensor`` of an operation has ``requires_grad=True``. For example,

```python
>>> x = torch.ones(1)  # create a tensor with requires_grad=False (default)
>>> x.requires_grad
False
>>> y = torch.ones(1)  # another tensor with requires_grad=False
>>> z = x + y
>>> # both inputs have requires_grad=False, so does the output
>>> z.requires_grad
False
>>> # then autograd won't track this computation. let's verify!
>>> z.backward()
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
>>> # now create a tensor with requires_grad=True
>>> w = torch.ones(1, requires_grad=True)
>>> w.requires_grad
True
>>> # add to the previous result that has require_grad=False
>>> total = w + z
>>> # the total sum now requires grad!
>>> total.requires_grad
True
>>> # autograd can compute the gradients as well
>>> total.backward()
>>> w.grad
tensor([ 1.])
>>> # and no computation is wasted to compute gradients for x, y and z, which don't require grad
>>> z.grad == x.grad == y.grad == None
True
```

Manipulating ``requires_grad`` flag

Other than directly setting the attribute, you can change this flag **in-place** using [``my_tensor.requires_grad_(requires_grad=True)``](, or, as in the above example, at creation time by passing it in as an argument (default is ``False``), e.g.,

```python
>>> existing_tensor.requires_grad_()
>>> existing_tensor.requires_grad
True
>>> my_tensor = torch.zeros(3, 4, requires_grad=True)
>>> my_tensor.requires_grad
True
```

What about ``.data``?

``.data`` was the primary way to get the underlying ``Tensor`` from a ``Variable``. After this merge, calling ``y =`` still has similar semantics. So ``y`` will be a ``Tensor`` that shares the same data with ``x``, is unrelated with the computation history of ``x``, and has ``requires_grad=False``.

However, ``.data`` can be unsafe in some cases. Any changes on ```` wouldn't be tracked by ``autograd``, and the computed gradients would be incorrect if ``x`` is needed in a backward pass. A safer alternative is to use [``x.detach()``](, which also returns a ``Tensor`` that shares data with ``x`` and has ``requires_grad=False``, but will have its in-place changes reported by ``autograd`` if ``x`` is needed in backward.

Some operations now return 0-dimensional (scalar) ``Tensors``

Previously, indexing into a ``Tensor`` vector (1-dimensional tensor) gave a Python number, but indexing into a ``Variable`` vector gave (inconsistently!) a vector of size ``(1,)``!  Similar behavior existed with reduction functions, i.e. `tensor.sum()` would return a Python number, but `variable.sum()` would return a vector of size `(1,)`.

Fortunately, this release introduces proper scalar (0-dimensional tensor) support in PyTorch!  Scalars can be created using the new `torch.tensor` function (which will be explained in more detail later; for now just think of it as the PyTorch equivalent of `numpy.array`).  Now you can do things like:

```python
>>> torch.tensor(3.1416)         # create a scalar directly
tensor(3.1416)
>>> torch.tensor(3.1416).size()  # scalar is 0-dimensional
torch.Size([])
>>> torch.tensor([3]).size()     # compare to a vector of size 1
torch.Size([1])
>>> vector = torch.arange(2, 6)  # this is a vector
>>> vector
tensor([ 2.,  3.,  4.,  5.])
>>> vector.size()
torch.Size([4])
>>> vector[3]                    # indexing into a vector gives a scalar
tensor(5.)
>>> vector[3].item()             # .item() gives the value as a Python number
5.0
```



- Removed support for CUDA capability 3.0 and 5.0 (they still work for source builds for now, but the commitment to support this forward is removed)


Table of contents

- Breaking changes: removed `reinforce()`
- New features
- Unreduced losses
- A profiler for the autograd engine
- More functions support Higher order gradients
- New features in Optimizers
- New layers and nn functionality
- New Tensor functions and Features
- Other additions
- API changes
- Performance improvements
- Big reduction in framework overhead (helps small models)
- 4x to 256x faster Softmax/LogSoftmax
- More...
- Framework Interoperability
- DLPack Interoperability
- Model Exporter to ONNX (ship PyTorch to Caffe2, CoreML, CNTK, MXNet, Tensorflow)
- Bug Fixes (a lot of them)

Breaking changes

Stochastic functions, i.e. `Variable.reinforce()` were removed because of their limited functionality and broad performance implications. The motivation for stochastic functions was to avoid book-keeping of sampled values. In practice, users were still book-keeping in their code for various reasons. We constructed an alternative, equally effective API, but did not have a reasonable deprecation path to the new API. Hence this removal is a breaking change.

We introduce the [torch.distributions]( package to replace Stochastic functions.

Your previous code typically looked like this:

```python
probs = policy_network(state)
action = probs.multinomial()
next_state, reward = env.step(action)
```

This is the new, equivalent code:

```python
probs = policy_network(state)
# NOTE: categorical is equivalent to what used to be called multinomial
m = torch.distributions.Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
```

New features

Unreduced losses

Some loss functions can now compute per-sample losses in a mini-batch:
- By default PyTorch sums losses over the mini-batch and returns a single scalar loss. This was limiting to users.
- Now, a subset of loss functions allow specifying `reduce=False` to return individual losses for each sample in the mini-batch
- Example: `loss = nn.CrossEntropyLoss(..., reduce=False)`
- Currently supported losses: `MSELoss`, `NLLLoss`, `NLLLoss2d`, `KLDivLoss`, `CrossEntropyLoss`, `SmoothL1Loss`, `L1Loss`
- More loss functions will be covered in the next release

A built-in profiler in the autograd engine

We built a low-level profiler to help you identify bottlenecks in your models

Let us start with an example:

```python
>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
>>> print(prof)
---------------------------------  ----------  ---------
Name                                 CPU time   CUDA time
---------------------------------  ----------  ---------
PowConstant                         142.036us    0.000us
N5torch8autograd9GraphRootE          63.524us    0.000us
PowConstantBackward                 184.228us    0.000us
MulConstant                          50.288us    0.000us
PowConstant                          28.439us    0.000us
Mul                                  20.154us    0.000us
N5torch8autograd14AccumulateGradE    13.790us    0.000us
N5torch8autograd5CloneE               4.088us    0.000us
```

The profiler works for both CPU and CUDA models.
For CUDA models, you have to run your python program with a special `nvprof` prefix. For example:

```shell
nvprof --profile-from-start off -o -- python <your arguments>
```

Then in python:

```python
>>> with torch.cuda.profiler.profile():
...     model(x)  # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)
```

Then, you can load `` in PyTorch and print a summary profile report.

>>> prof = torch.autograd.profiler.load_nvprof('')
>>> print(prof)

[Read additional documentation here](

Higher order gradients

Added higher-order gradients support for the following layers

- ConvTranspose, AvgPool1d, AvgPool2d, LPPool2d, AvgPool3d, MaxPool1d, MaxPool2d, AdaptiveMaxPool, AdaptiveAvgPool, FractionalMaxPool2d, MaxUnpool1d, MaxUnpool2d, nn.Upsample, ReplicationPad2d, ReplicationPad3d, ReflectionPad2d
- PReLU, HardTanh, L1Loss, SoftSign, ELU, RReLU, Hardshrink, Softplus, SoftShrink, LogSigmoid, Softmin, GLU
- MSELoss, SmoothL1Loss, KLDivLoss, HingeEmbeddingLoss, SoftMarginLoss, MarginRankingLoss, CrossEntropyLoss
- DataParallel


- [optim.SparseAdam]( Implements a lazy version of Adam algorithm suitable for sparse tensors.
- In this variant, only moments that show up in the gradient get updated, and only those portions of the gradient get applied to the parameters.
- Optimizers now have an [add_param_group]( function that lets you add new parameter groups to an already constructed optimizer.

New layers and nn functionality

- Added AdaptiveMaxPool3d and AdaptiveAvgPool3d
- Added LPPool1d
- [F.pad]( now has support for:
- 'reflection' and 'replication' padding on 1d, 2d, 3d signals (so 3D, 4D and 5D Tensors)
- constant padding on n-d signals
- nn.Upsample now works for 1D signals (i.e. B x C x L Tensors) in `nearest` and `linear` modes.
- [grid_sample]( now allows padding with the border value via `padding_mode="border"`. `grid_sample` expects a grid in the range of `[-1, 1]`, and if the values are out of these bounds, padding with the value `0.0` is applied by default. However, in a lot of cases, using the border value (i.e. the nearest valid value) helps improve accuracy of the overall model.
- Introducing `nn.utils.parameters_to_vector` and `nn.utils.vector_to_parameters`
- `parameters_to_vector` takes `net.parameters()` and returns a 1D vector that contains all the parameters
- `vector_to_parameters` takes a vector of flattened parameters and copies the values over to a network's parameters
- Convenient for some reinforcement learning algorithms, such as cross-entropy method, TRPO etc., which need to pull all network parameters as one big vector, modify them, and put the modified vector back.
- Allow user to not specify certain input dimensions for `AdaptivePool*d` and infer them at runtime.
- For example:

# target output size of 10x7
m = nn.AdaptiveMaxPool2d((None, 7))

- DataParallel container on CPU is now a no-op (instead of erroring out)
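The `parameters_to_vector` / `vector_to_parameters` round trip described above can be sketched as follows (the tiny `nn.Linear` module here is only for illustration):

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# a tiny module just for illustration: 3*2 weights + 3 biases = 9 parameters
net = nn.Linear(2, 3)

# flatten all parameters into a single 1D vector
flat = parameters_to_vector(net.parameters())
print(flat.shape)  # torch.Size([9])

# perturb the flat vector (as e.g. a cross-entropy-method update would) ...
flat = flat + 0.1

# ... and copy the modified values back into the network's parameters
vector_to_parameters(flat, net.parameters())
```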

New Tensor functions and features
- Introduced `torch.erf` and `torch.erfinv` that compute the error function and the inverse error function of each element in the Tensor.
- adds broadcasting support to bitwise operators
- Added `Tensor.put_` and `torch.take` similar to `numpy.take` and `numpy.put`.
- The take function allows you to linearly index into a tensor without viewing it as a 1D tensor first. The output has the same shape as the indices.
- The put function copies value into a tensor also using linear indices.
- Differences from `numpy` equivalents:
- `numpy.take` has an optional axis argument, which behaves like `index_select`. This `axis` argument is not yet present.
- `numpy.put` repeats the values if necessary to make them as long as indices. This behavior is not yet replicated.
- add `zeros` and `zeros_like` for sparse Tensors.
- 1-element Tensors can now be cast to Python scalars. For example: `int(torch.Tensor([5]))` works now.
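A small sketch of the `torch.take` / `Tensor.put_` linear-indexing behavior described above (the example tensor and indices are made up for illustration):

```python
import torch

x = torch.Tensor([[1, 2, 3],
                  [4, 5, 6]])

# take: linear indexing into the tensor; output has the same shape as the indices
idx = torch.LongTensor([0, 4])      # linear index 0 -> x[0][0], 4 -> x[1][1]
print(torch.take(x, idx))           # values 1. and 5.

# put_: write values at linear indices, in place
x.put_(torch.LongTensor([5]), torch.Tensor([60]))  # linear index 5 -> x[1][2]
print(x[1, 2])                      # 60.
```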

Other additions

- Added `torch.cuda.get_device_name` and `torch.cuda.get_device_capability` that do what the names say. Example:
>>> torch.cuda.get_device_name(0)
'Quadro GP100'
>>> torch.cuda.get_device_capability(0)
(6, 0)

- If one sets `torch.backends.cudnn.deterministic = True`, then the CuDNN convolutions use deterministic algorithms
- `torch.cuda.get_rng_state_all` and `torch.cuda.set_rng_state_all` are introduced to let you save / load the state of the random number generator over all GPUs at once
- `torch.cuda.empty_cache()` frees the cached memory blocks in PyTorch's caching allocator. This is useful when having long-running IPython notebooks while sharing the GPU with other processes.

API changes

- `softmax` and `log_softmax` now take a `dim` argument that specifies the dimension in which slices are taken for the softmax operation. `dim` allows negative dimensions as well (`dim = -1` will be the last dimension)
- `torch.potrf` (Cholesky decomposition) is now differentiable and defined on `Variable`
- Remove all instances of `device_id` and replace it with `device`, to make things consistent
- `torch.autograd.grad` now allows you to specify inputs that are unused in the autograd graph if you use `allow_unused=True`
This is useful when using `torch.autograd.grad` in large graphs with lists of inputs / outputs
For example:
x, y = Variable(...), Variable(...)
torch.autograd.grad(x * 2, [x, y])  # errors
torch.autograd.grad(x * 2, [x, y], allow_unused=True)  # works

- `pad_packed_sequence` now allows a `padding_value` argument that can be used instead of zero-padding
- `Dataset` now has a `+` operator (which uses `ConcatDataset`). You can do something like `MNIST(...) + FashionMNIST(...)` for example, and you will get a concatenated dataset containing samples from both.
- `torch.distributed.recv` allows Tensors to be received from any sender (hence, `src` is optional). `recv` returns the rank of the sender.
- adds `zero_()` to `Variable`
- `Variable.shape` returns the size of the Tensor (now made consistent with Tensor)
- `torch.version.cuda` specifies the CUDA version that PyTorch was compiled with
- Add a missing function `random_` for CUDA.
- `torch.load` and `torch.save` can now take a `pathlib.Path` object, which is a standard Python3 typed filepath object
- If you want to load a model's `state_dict` into another model (for example to fine-tune a pre-trained network), `load_state_dict` was strict on matching the key names of the parameters. Now we provide a `strict=False` option to `load_state_dict` where it only loads in parameters where the keys match, and ignores the other parameter keys.
- added `nn.functional.embedding_bag` that is equivalent to `nn.EmbeddingBag`
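The `strict=False` behavior of `load_state_dict` described above can be sketched as follows (the two `nn.Sequential` models are hypothetical stand-ins for a pretrained network and a fine-tuning model):

```python
import torch
import torch.nn as nn

# hypothetical stand-ins: a pretrained feature extractor, and a new model
# that shares its keys ("0.weight", "0.bias") but adds a classifier head
pretrained = nn.Sequential(nn.Linear(8, 4))
model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 2))

# strict=True (the default) would raise, because `model` has parameters
# that the checkpoint lacks; strict=False loads the matching keys and
# ignores the rest
model.load_state_dict(pretrained.state_dict(), strict=False)
```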

Performance Improvements

- The overhead of `torch` functions on Variables was around 10 microseconds. This has been brought down to ~1.5 microseconds by moving most of the core autograd formulas into C++ using our ATen library. This speeds up models that are very small, such as small LSTMs and other common models seen in NLP.
- softmax and log_softmax are now [4x to 256x faster]( on the GPU after rewriting the gpu kernels
- 2.5x to 3x performance improvement of the distributed AllReduce (gloo backend) by enabling GPUDirect
- nn.Embedding's renorm option is much faster on the GPU. For embedding dimensions of `100k x 128` and a batch size of 1024, it is 33x faster.
- All pointwise ops now use OpenMP and get multi-core CPU benefits
- Added dedicated CUDA kernels for group convolutions where `groups == nInputPlane` (depthwise convolution). Speedups range from 5x to 1000x for tested layer sizes. See the [benchmark table]( for more details as well as [this table](
- Fixed `optim.SGD`'s memory usage for sparse gradients (for ex. `nn.Embedding(..., sparse=True)`), reducing the usage on a user-provided test script by 10x.
- Optional NNPack integration for faster CPU convolutions (not part of binaries)
- Reduce overhead of broadcasting if Tensors aren't broadcastable
- `torch.nn.utils.weight_norm` over the right-most dimensions is faster
- Backward of `torch.norm` is sped up by ~1.5x
- Improve the performance of `pack_padded_sequence`
- Add a single-argument version of `torch.arange`. For example `torch.arange(10)`

Framework Interoperability

DLPack Interoperability

[DLPack Tensors]( are cross-framework Tensor formats. We now have `torch.utils.to_dlpack(x)` and `torch.utils.from_dlpack(x)` to convert between DLPack and torch Tensor formats. The conversion has zero memory copy and hence is very efficient.

Model exporter to ONNX

[ONNX]( is a common model interchange format that can be executed in Caffe2, CoreML, CNTK, MXNet, Tensorflow at the moment. PyTorch models that are ConvNet-like and RNN-like (static graphs) can now be shipped to the ONNX format.

- There is a new module [torch.onnx]( which provides the API for exporting ONNX models.

- The operations supported in this release are:
- add, sub (nonzero alpha not supported), mul, div, cat, mm, addmm, neg, tanh, sigmoid, mean, t, transpose, view, split, squeeze
- expand (only when used before a broadcasting ONNX operator; e.g., add)
- prelu (single weight shared among input channels not supported)
- threshold (non-zero threshold/non-zero value not supported)
- Conv, ConvTranspose, BatchNorm, MaxPool, RNN, Dropout, ConstantPadNd, Negate
- elu, leaky_relu, glu, softmax, log_softmax, avg_pool2d
- unfold (experimental support with ATen-Caffe2 integration)
- Embedding (no optional arguments supported)
- FeatureDropout (training mode not supported)
- Index (constant integer and tuple indices supported)

Usability Improvements

- More cogent error messages during indexing of Tensors / Variables
- Add proper error message for specifying dimension on a tensor with no dimensions
- better error messages for Conv*d input shape checking
- More user-friendly error messages for LongTensor indexing
- Better error messages and argument checking for Conv*d routines
- Trying to construct a Tensor from a Variable fails more appropriately
- If you are using a PyTorch binary with insufficient CUDA version, then a `warning` is printed to the user.
- Fixed incoherent error messages in `load_state_dict`
- Fix error message for type mismatches with sparse tensors

Bug fixes


- Fix CUDA lazy initialization to not trigger on calls to `torch.manual_seed` (instead, the calls are queued and run when CUDA is initialized)


- if `x` is 2D, `x[[0, 3],]` was needed to trigger advanced indexing. The trailing comma is no longer needed, and you can do `x[[0, 3]]`
- `x.sort(descending=True)` used to incorrectly fail for Tensors. Fixed a bug in the argument checking logic to allow this.
- Tensor constructors with numpy input: `torch.DoubleTensor(np.array([0,1,2], dtype=np.float32))`
- torch will now copy the contents of the array in a storage of appropriate type.
- If types match, it will share the underlying array (no-copy), with equivalent semantics to initializing a tensor with another tensor.
- On CUDA, `torch.cuda.FloatTensor(np.random.rand(10,2).astype(np.float32))` will now work by making a copy.
- `ones_like` and `zeros_like` now create Tensors on the same device as the original Tensor
- `torch.multinomial` on the CPU would reshape the input `prob_dist` in-place. Fixed this to make sure the `prob_dist` input's shape is unchanged after the call to `multinomial`
- `expand` and `expand_as` allow expanding an empty Tensor to another empty Tensor
- when `[..., None, ...]` was given (i.e. newaxis placement in indexing was specified), PyTorch had different behavior from NumPy. This is made consistent with NumPy in all cases.
- Fix exponential distribution implementation to never sample infinity - cuRAND returns numbers in (0, 1]
- torch.HalfTensor supports `numpy()` and `torch.from_numpy`
- Add additional size checking for `torch.scatter`
- fix `torch.tril` and `torch.triu` on the GPU for storage-offset Tensors (would return incorrect result).
- Fix a memory leak in CUDA qr decomposition
- Fix stream-awareness issues in THCUNN kernels
- Fix kwargs parsing in `torch.topk`
- Fixed `random_` on CPU (which previously had a max value of 2^32) for DoubleTensor and LongTensor
- Fix `ZeroDivisionError: float division by zero` when printing certain Tensors
- `torch.gels` when `m > n` had a truncation bug on the CPU and returned incorrect results. Fixed.
- Add a check in tensor.numpy() that checks if no positional arguments are passed
- Before a Tensor is moved to CUDA pinned memory, added a check to ensure that it is `contiguous`
- `any` and `all` work on empty Tensors on the cpu (previously errored out)
- Fix `symeig` on CUDA for large matrices. The bug is that not enough space was being allocated for the workspace, causing some undefined behavior.
- Improved the numerical stability of `torch.var` and `torch.std` by using Welford's algorithm
- The Random Number Generator returned `uniform` samples with inconsistent bounds (inconsistency in cpu implementation and running into a cublas bug).
- Now, all `uniform` sampled numbers will return within the bounds `[0, 1)`, across all types and devices
- Fix `torch.svd` to not segfault on large CUDA Tensors (fixed an overflow error in the magma bindings)
- Allows empty index Tensor for `index_select` (instead of erroring out)
- Previously when `eigenvector=False`, `symeig` returns some unknown value for the eigenvectors. Now we zero them out.
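For reference, the Welford update used to stabilize `torch.var` and `torch.std` can be sketched in pure Python (an illustrative sketch of the algorithm only, not the actual C implementation):

```python
def welford_var(xs):
    """Single-pass variance via Welford's online update rule."""
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    # unbiased sample variance (what torch.var computes by default)
    return m2 / (n - 1) if n > 1 else 0.0

print(welford_var([1.0, 2.0, 3.0, 4.0]))  # 1.666... (= 5/3)
```

Accumulating `m2` (the sum of squared deviations) incrementally avoids the catastrophic cancellation of the naive `E[x^2] - E[x]^2` formula.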


- Fix bug with 'coalesced' calculation in sparse 'cadd'
- Fixes `.type()` not converting indices tensor.
- Fixes sparse tensor coalesce on the GPU in corner cases


- Fixed crashes when calling backwards on leaf variable with requires_grad=False
- fix bug on Variable `type()` around non-default GPU input.
- when `torch.norm` returned `0.0`, the gradient was `NaN`. We now use the subgradient at `0.0`, so the gradient is `0.0`.
- Fix a correctness issue with advanced indexing and higher-order gradients
- ``'s backward was failing on the GPU due to a type error, fixed.
- Advanced Indexing on Variables now allows the index to be a LongTensor backed Variable
- Variable.cuda() and Tensor.cuda() are consistent in kwargs options


- `torch.optim.lr_scheduler` is now imported by default.


- Returning a dictionary from a nn.Module's forward function is now supported (used to throw an error)
- When `register_buffer("foo", ...)` is called and the buffer name already exists, it now raises a `KeyError` instead of silently failing
- Fixed loading of older checkpoints of RNN/LSTM which were missing `_data_ptrs` attributes.
- `nn.Embedding` had a hard error when using the `max_norm` option. This is fixed now.
- when using the `max_norm` option, the passed-in indices were written to by the underlying implementation. This is fixed by passing a clone of the indices to the renorm kernel.
- `F.affine_grid` now can take non-contiguous inputs
- EmbeddingBag can accept both 1D and 2D inputs now.
- Workaround a CuDNN bug where batch sizes greater than 131070 fail in CuDNN BatchNorm
- fix nn.init.orthogonal to correctly return orthonormal vectors when rows < cols
- if BatchNorm has only `1` value per channel in total, raise an error in training mode.
- Make cuDNN bindings respect the current cuda stream (previously raised incoherent error)
- fix grid_sample backward when gradOutput is a zero-strided Tensor
- Fix a segmentation fault when reflection padding is out of Tensor bounds.
- If LogSoftmax has only 1 element, `-inf` was returned. Now this correctly returns `0.0`
- Fix pack_padded_sequence to accept inputs of arbitrary sizes (not just 3D inputs)
- Detect pointer aliasing in cuDNN RNN flatten_parameters and avoid that path.
- Fixed ELU higher order gradients when applied in-place
- Workaround a CuDNN RNN bug for half-precision
- Prevent numerical issues with `poisson_nll_loss` when `log_input=False` by adding a small epsilon

distributed and multi-gpu

- Allow kwargs-only inputs to DataParallel. This used to fail: `n = nn.DataParallel(Net()); out = n(input=i)`
- DistributedDataParallel calculates num_samples correctly in python2
- Fix the case of DistributedDataParallel when 1-GPU per process is used.
- Fixed DataParallel to specify GPUs that don't include GPU-0
- `DistributedDataParallel`'s exit no longer errors out; the daemon flag is now set.
- Fix a bug in DistributedDataParallel in the case when model has no `buffers` (previously raised incoherent error)
- Fix `__getstate__` to be functional in `DistributedDataParallel` (was returning nothing)
- Fix a deadlock in the NCCL bindings when GIL and CudaFreeMutex were starving each other


- `model_zoo.load_url` now first attempts to use the `requests` library if available, and then falls back to `urllib`
- Fix error when default_collate is passed a collection of `numpy.str_`


Here comes the next major release of PyTorch, just in time for ICML.  Install it today from our website
Package documentation for this release is available at [](

We're introducing long-awaited features such as Broadcasting, Advanced Indexing, Higher-order gradients and finally: Distributed PyTorch.

**Due to introducing Broadcasting, the code behavior for certain broadcastable situations is different from behavior in 0.1.12. This might lead to silent bugs in your existing code. We've provided easy ways of identifying this ambiguous code in the *Important Breakages and Workarounds* section.**

Table of contents:
- Tensor Broadcasting (numpy-style)
- Advanced Indexing for Tensors and Variables
- Higher-order gradients
- Distributed PyTorch (multi-node training, etc.)
- Neural Network layers and features: SpatialTransformers, WeightNorm, EmbeddingBag, etc.
- New in torch and autograd: matmul, inverse, etc.
- Easier debugging, better error messages
- Bug Fixes
- **Important Breakages and Workarounds**

Tensor Broadcasting (numpy-style)

In short, if a PyTorch operation supports broadcasting, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).

PyTorch Broadcasting semantics [closely follow numpy-style broadcasting](; if you are familiar with numpy broadcasting, things should just work as expected.

General Semantics

Two tensors are “broadcastable” if the following rules hold:
- Each tensor has at least one dimension.
- When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.

For Example:

>>> x=torch.FloatTensor(5,7,3)
>>> y=torch.FloatTensor(5,7,3)
# same shapes are always broadcastable (i.e. the above rules always hold)

# can line up trailing dimensions
>>> x=torch.FloatTensor(5,3,4,1)
>>> y=torch.FloatTensor(  3,1,1)

# x and y are broadcastable:
# 1st trailing dimension: both have size 1
# 2nd trailing dimension: y has size 1
# 3rd trailing dimension: x size == y size
# 4th trailing dimension: y dimension doesn't exist

>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(  3,1,1)
# x and y are not broadcastable, because in the 3rd trailing dimension 2 != 3

If two tensors x, y are "broadcastable", the resulting tensor size is calculated as follows:
- If the number of dimensions of x and y are not equal, prepend 1 to the dimensions of the tensor with fewer dimensions to make them equal length.
- Then, for each dimension size, the resulting dimension size is the max of the sizes of x and y along that dimension.

For Example:

# can line up trailing dimensions to make reading easier
>>> x=torch.FloatTensor(5,1,4,1)
>>> y=torch.FloatTensor(  3,1,1)
>>> (x+y).size()
torch.Size([5, 3, 4, 1])

# error case
>>> x=torch.FloatTensor(5,2,4,1)
>>> y=torch.FloatTensor(  3,1,1)
>>> (x+y).size()
RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1

More details [can be found on the PyTorch documentation site](  Also, each torch function lists its broadcasting semantics in the documentation.

Advanced Indexing for Tensors and Variables

PyTorch now supports a subset of NumPy style [advanced indexing]( This allows users to select arbitrary indices at each dimension of the Tensor, including non-adjacent indices and duplicate indices, using the same `[]`-style operation. This allows for a more flexible indexing strategy without needing calls to PyTorch's `Index[Select, Add, ...]`  functions.

Let's look at some examples:

x = torch.Tensor(5, 5, 5)

**Pure Integer Array Indexing - specify arbitrary indices at each dimension**

x[[1, 2], [3, 2], [1, 0]]
--> yields a 2-element Tensor (x[1][3][1], x[2][2][0])

**also supports broadcasting, duplicates**

x[[2, 3, 2], [0], [1]]
--> yields a 3-element Tensor (x[2][0][1], x[3][0][1], x[2][0][1])

**arbitrary indexer shapes allowed**

x[[[1, 0], [0, 1]], [0], [1]].shape
--> yields a 2x2 Tensor [[x[1][0][1], x[0][0][1]],
[x[0][0][1], x[1][0][1]]]

**can use colon, ellipse**

x[[0, 3], :, :]
x[[0, 3], ...]
--> both yield a 2x5x5 Tensor [x[0], x[3]]

**also use Tensors to index!**

y = torch.LongTensor([0, 2, 4])
x[y, :, :]
--> yields a 3x5x5 Tensor [x[0], x[2], x[4]]

**selection with less than ndim, note the use of comma**

x[[1, 3], ]
--> yields a 2x5x5 Tensor [x[1], x[3]]

Higher order gradients

Now you can evaluate higher order differentials in PyTorch. For example, you can compute Hessian-Vector products, penalize the norm of the gradients of your model, implement Unrolled GANs and Improved WGANs, etc.

In the `0.2` release, we've enabled the ability to compute higher order gradients for all of `torch.XXX` functions and the most popular `nn` layers. The rest will be covered in the next release.

Here's a short example that penalizes the norm of the weight gradients of a Resnet-18 model, so that the volume of weights is slow-changing.

import torch
from torchvision.models import resnet18
from torch.autograd import Variable

model = resnet18().cuda()

# dummy inputs for the example
input = Variable(torch.randn(2, 3, 224, 224).cuda(), requires_grad=True)
target = Variable(torch.zeros(2).long().cuda())

# as usual
output = model(input)
loss = torch.nn.functional.nll_loss(output, target)

grad_params = torch.autograd.grad(loss, model.parameters(), create_graph=True)
# torch.autograd.grad does not accumulate the gradients into the .grad attributes.
# It instead returns the gradients as Variable tuples.

# now compute the 2-norm of the grad_params
grad_norm = 0
for grad in grad_params:
    grad_norm += grad.pow(2).sum()
grad_norm = grad_norm.sqrt()

# take the gradients wrt grad_norm. backward() will accumulate
# the gradients into the .grad attributes
grad_norm.backward()

# do an optimization step
optimizer.step()

We see two new concepts here:

1. [torch.autograd.grad]( is a function that takes in [outputs, list of inputs (for which you want gradients)], and returns the gradients wrt. these inputs as a tuple, rather than accumulating the gradients into the `.grad` attributes. This is useful if you want to further operate on the gradients.
2. You can operate on the gradients, and call `backward()` on them.

The list of `nn` layers that support higher order gradients are:
- `AvgPool*d`, `BatchNorm*d`, `Conv*d`, `MaxPool1d,2d`, `Linear`, `Bilinear`
- `pad`, `ConstantPad2d`, `ZeroPad2d`, `LPPool2d`,  `PixelShuffle`
- `ReLU6`, `LeakyReLU`, `PReLU`, `Tanh`, `Tanhshrink`, `Threshold`, `Sigmoid`, `HardTanh`, `ELU`, `Softsign`, `SeLU`
- `L1Loss`, `NLLLoss`, `PoissonNLLLoss`, `LogSoftmax`, `Softmax2d`
The rest will be enabled in the next release.

To enable higher order gradients, we've introduced a new style of writing `autograd.Function` (the current/old style of writing functions is fully backward compatible). [You can read more about the new style of functions here](

Most of you don't write your own `autograd.Function`s; they are low-level primitives that introduce new operations to the autograd engine, where you specify the forward and backward calls.

Distributed PyTorch

We introduce the [torch.distributed]( package that allows you to exchange Tensors among multiple machines. Using this package, you can scale your network training over multiple machines and larger mini-batches. For example, you are given the primitives to implement [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](

The `distributed` package follows an MPI-style programming model. This means that there are functions provided to you such as `send`, `recv`, `all_reduce` that will exchange Tensors among nodes (machines).

For each of the machines to first identify each other and assign unique numbers to each other (ranks), we provide simple initialization methods:
- shared file system (requires that all processes can access a single file system)
- IP multicast (requires that all processes are in the same network)
- environment variable (requires you to manually assign ranks and know an address of a node reachable from all processes)

Our package documentation contains more details on initialization and available backends, but here's an example of initializing using a multicast address:

import torch.distributed as dist


print('Hello from process {} (out of {})!'.format(
    dist.get_rank(), dist.get_world_size()))

This would print `Hello from process 2 (out of 4)!` on the 3rd machine.

World size is the number of processes that will participate in the job. Each will be assigned a rank, which is a number between 0 and world_size - 1, unique within this job. It will serve as a process identifier and will be used instead of an address to, for example, specify to which process should a tensor be sent.

Here's a snippet that shows how simple point-to-point communication can be performed:

# All processes (receiving ones too!) need to have tensors of appropriate
# size preallocated.
x = torch.Tensor(10)
if dist.get_rank() == 0:
    # Send x to process with rank 1
    dist.send(x, dst=1)
else:  # rank == 1
    # Receive data from process with rank 0 and save result in x
    dist.recv(x, src=0)

Asynchronous p2p functions (`isend`, `irecv`) are available too.

However, some communication patterns appear so often that more efficient collective calls have been developed. They typically engage the whole process group and are much faster than naive algorithms using `send`/`recv`. One example is `all_reduce`:

x = torch.Tensor([dist.get_rank()])
# Add tensors from all processes such that they all receive the result.
# x is an input and output to this operation.
dist.all_reduce(x)

The distributed package is fairly low-level, so it allows you to implement more advanced algorithms and tailor the code to very specific purposes, but data-parallel training is such a common one that we have created high-level helpers for it.

Hence, we've introduced `DistributedDataParallel`, which is meant to be a nearly drop-in replacement for nn.DataParallel.
Here's a code snippet demonstrating changes necessary to add it to existing training code:

# Wrap model in DistributedDataParallel (CUDA only for the moment)
model = torch.nn.parallel.DistributedDataParallel(model.cuda())

# Use a DistributedSampler to restrict each process to a distinct subset
# of the dataset.
train_dataset = ...
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, num_workers=args.workers,
    pin_memory=True, sampler=train_sampler)

for epoch in range(args.num_epochs):
    # Use .set_epoch() method to reshuffle the dataset partition at every iteration
    train_sampler.set_epoch(epoch)
    # training loop
    ...

You can see a fuller [Imagenet training example here](

New nn layers: SpatialTransformers, WeightNorm, EmbeddingBag, etc.

New features
- [forward_pre_hook]( is introduced to execute user-specified closures right before a forward function is called.
- Convenient access to non-leaf gradients:
Currently, to access and inspect gradients of intermediate values, we have to use `hooks`. This is not convenient for doing simple inspections. Hence, we introduce `retain_grad`. It is best explained via an example:

input = Variable(torch.rand(1, 3), requires_grad=True)
h1 = input * 3
out = (h1 * h1).sum()

h1.retain_grad()
out.backward()

# without calling retain_grad(), h1.grad is None

- DataParallel now supports dicts as inputs

New Layers

- Spatial Transformer Networks via `F.grid_sample` and `F.affine_grid`
- `nn.SeLU` and `nn.AlphaDropout` are introduced, from the paper: [Self-Normalizing Neural Networks](
- `nn.GLU` (Gated Linear Unit) is introduced from the paper [Convolutional Sequence to Sequence Learning](
- [Weight Normalization]( is now implemented via [torch.nn.utils.weight_norm](
- You can now ignore specific target indices while computing `cross_entropy_loss` and `nll_loss` using the `ignore_index` argument. This is a cheap and useful way of implementing masking, where you can have a `mask` index that is ignored in computing the loss.
- `F.normalize` implements dimension-wise renormalization
- `F.upsample` and `nn.Upsample` consolidate multiple Upsampling layers into one function. It implements 2d and 3d bilinear/trilinear/nearest upsampling.
- `nn.EmbeddingBag`: When building bag-of-words models, doing an `Embedding` followed by `Sum` or `Mean` is common. For variable length sequences, computing bags of embeddings involves masking. We provide a single `nn.EmbeddingBag` which is much more efficient and faster at computing bags of embeddings, especially for variable length sequences.
- Numerically stable Binary Cross-Entropy loss via `bce_with_logits`
- A negative log-likelihood loss with Poisson distribution of the target via `PoissonNLLLoss`
- `cosine_similarity`: Returns cosine similarity between x1 and x2, computed along dim.
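A minimal sketch of how `nn.EmbeddingBag` handles variable-length sequences via offsets (the vocabulary size and indices are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

# hypothetical 10-word vocabulary, 3-dim embeddings, mean-pooled bags
emb = nn.EmbeddingBag(10, 3, mode='mean')

# two variable-length sequences packed into one 1D index tensor:
# bag 0 = [1, 2, 4], bag 1 = [4, 3]; offsets mark where each bag starts
input = Variable(torch.LongTensor([1, 2, 4, 4, 3]))
offsets = Variable(torch.LongTensor([0, 3]))

out = emb(input, offsets)
print(out.size())  # (2, 3): one pooled embedding per bag
```

No padding or masking is needed: the offsets tensor alone describes the bag boundaries.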

training utilities

*Learning Rate Schedulers:* [torch.optim.lr_scheduler]( provides several dumb and smart methods to adjust the current learning rate. They are quite convenient while experimenting, giving a proxy for what you as the user would likely want to do.

There are various strategies provided, which can be used depending on the appropriate situation, more can be read in the [package docs](
- ReduceLROnPlateau, LambdaLR, StepLR, MultiStepLR, ExponentialLR
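As a sketch of the scheduler API (the `StepLR` schedule, the stand-in model, and the hyperparameters here are illustrative):

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(4, 2)                 # stand-in model for illustration
optimizer = SGD(model.parameters(), lr=0.1)

# decay the learning rate by 10x every 30 epochs
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... forward / backward passes and optimizer.step() go here ...
    optimizer.step()
    scheduler.step()                    # adjust the learning rate once per epoch
```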

*`ConcatDataset`* is a convenient dataset meta-class that can merge and concatenate two individual datasets.

New in torch and autograd

- All reduce functions such as `sum` and `mean` now default to squeezing the reduced dimension. For example `torch.sum(torch.randn(10, 20), 0)` returns a 1D Tensor.
- `x.shape`, similar to numpy. A convenience `property` that is equivalent to `x.size()`
- `torch.matmul`, similar to np.matmul
- bitwise and, or, xor, lshift, rshift
- autograd support for `inverse`, `gesv`, `cumprod`, `atan2`
- unbiased `var` and `std` now available via keyword argument option
- `torch.scatter_add` - torch.scatter, except when duplicate indices are encountered, the values are summed.
- torch.median behaves similarly to torch.sum when no arguments are given, i.e. it reduces all the dimensions and returns a single median value of the flattened Tensor.
- masked_copy_ has been renamed to masked_scatter_ (with deprecation on masked_copy_)
- torch.manual_seed now seeds all CUDA devices as well
- You can now specify the random number generator object via keyword arguments `torch.rand(1000, generator=gen)`
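A small sketch of the `torch.scatter_add` duplicate-index summing described above (the values and indices are made up for illustration):

```python
import torch

x = torch.zeros(5)
idx = torch.LongTensor([0, 1, 1, 3])
src = torch.Tensor([1., 2., 3., 4.])

# out[idx[i]] += src[i]; the duplicate index 1 receives 2 + 3 = 5
out = torch.scatter_add(x, 0, idx, src)
print(out)  # values 1, 5, 0, 4, 0
```

A plain `scatter` would let one of the duplicate writes win arbitrarily; `scatter_add` sums them instead.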

Bug-fixes and small improvements

- Now we emit an error when a Variable is converted to a bool. For example:

b = Variable(torch.zeros(1))
if b[0]:  # errors now

- Fix correctness bugs in qr decomposition on CUDA.
- Support for IBM PowerPC64 platform
- Check that the CuDNN version at compile-time is the same version at run-time.
- Improve error message in CUDA forked subprocess
- Faster transposed-copy on CPU
- Improve error messages in InstanceNorm
- Add more argument checking for various routines, especially BatchNorm and Convolution routines.
- Better error messages around shape reporting across the CPU backend.
- Support more than 8 GPUs per machine (work-around a CUDA p2p restriction)
- Improve error message when accessing attributes that don't exist
- t() of Variable consistent with Tensor
- prevent divide-by-zero when dropout p=1
- fix sharing of CUDA tensors on non-current devices
- when BN epsilon < allowed CuDNN value, fallback to THNN
- Fix thread-trashing when using different number of threads for MKL and OMP
- improve memory usage when using CuDNN RNN
- Fix ZeroPad2d backwards with negative padding
- add dummy property, to provide interpretable error message to users
- Fix in-place division for Python3
- Raise error when call from_numpy on 0-dim array
- Empty Tensors don't error out when shared across multiprocessing
- fix baddbmm for expanded tensors
- Let parallel_apply accept arbitrary inputs
- keyword arguments in Tensor and Variable are now consistent
- fix torch.inverse when Magma is not available
- Add logical not operator for ByteTensor
- add device asserts in scatter/gather kernels

Important Breakages and Workarounds

As you've read, we've introduced two important changes that are not
backward compatible:
- Numpy-style Broadcasting
- Reduction functions such as `sum(1)` now default to `keepdim=False`
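A minimal sketch of the new `keepdim=False` default:

```python
import torch

x = torch.ones(2, 3)

# keepdim now defaults to False: the reduced dimension is squeezed away
print(x.sum(1).size())                # torch.Size([2])

# pass keepdim=True to recover the old 0.1.12 behavior
print(x.sum(1, keepdim=True).size())  # torch.Size([2, 1])
```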

We provide different levels of Python warnings that you can enable to alert you if you are using deprecated behavior or if the behavior of your code has changed.

Here is a code snippet that you can add to the top of your scripts.
Adding this code will generate warnings highlighting incompatible code.

Fix your code to no longer generate warnings.

# insert this at the top of your scripts (usually the main script)
import sys, warnings, traceback, torch
def warn_with_traceback(message, category, filename, lineno, file=None, line=None):
    sys.stderr.write(warnings.formatwarning(message, category, filename, lineno, line))
    traceback.print_stack(sys._getframe(2))
warnings.showwarning = warn_with_traceback; warnings.simplefilter('always', UserWarning);
torch.utils.backcompat.broadcast_warning.enabled = True
torch.utils.backcompat.keepdim_warning.enabled = True

Once all warnings disappear, you can remove the code snippet.

In more detail

Let us now look at the three incompatible changes with examples.

Using the (now deprecated) 1-dimensional view pointwise function

Prior versions of PyTorch allowed certain pointwise functions to execute on tensors with different shapes, as long as the number of elements in each tensor was equal.  The pointwise operation would then be carried out by viewing each tensor as 1-dimensional. PyTorch now supports broadcasting. The “1-dimensional” pointwise behavior is considered deprecated and will generate a Python warning in cases where tensors are not broadcastable, but have the same number of elements.

For example:

>>> torch.add(torch.ones(4), torch.ones(2,2))
__main__:1: UserWarning: self and other not broadcastable, but have the same
number of elements.  Falling back to deprecated pointwise behavior.
 2
 2
 2
 2
[torch.FloatTensor of size 4]

Broadcasting in code where it didn't happen before
The introduction of broadcasting can cause backwards incompatible changes in the case where two tensors do not have the same shape,
but are broadcastable and have the same number of elements.

For example:

>>> torch.add(torch.ones(4,1), torch.randn(4))

would previously produce a Tensor with size: `torch.Size([4,1])`,
but now produces a Tensor with size: `torch.Size([4,4])`.

In order to help identify cases in your code where backwards incompatibilities introduced by broadcasting may exist, you may set `torch.utils.backcompat.broadcast_warning.enabled` to `True`, which will generate a python warning in such cases.

For example:

>>> torch.utils.backcompat.broadcast_warning.enabled=True
>>> torch.add(torch.ones(4,1), torch.ones(4))
__main__:1: UserWarning: self and other do not have the same shape, but are broadcastable, and have the same number of elements.

Note that this setting can trigger warnings for valid uses of broadcasting (including in library code), so you probably want to turn this warning off after migrating your code.
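Broadcasting itself follows the NumPy rules: shapes are right-aligned, a dimension of size 1 stretches to match the other, and missing leading dimensions are treated as size 1. A minimal pure-Python sketch of the shape rule (illustrative only, not PyTorch's actual implementation):

```python
def broadcast_shape(a, b):
    """Compute the broadcast result shape of two shapes (NumPy/PyTorch rules).

    Shapes are right-aligned; a dimension of size 1 stretches to match the
    other, and missing leading dimensions are treated as size 1.
    """
    result = []
    # Walk the shapes from the trailing dimension backwards.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError("shapes %r and %r are not broadcastable" % (a, b))
        result.append(max(da, db))
    return tuple(reversed(result))

print(broadcast_shape((4, 1), (4,)))  # (4, 4), matching the example above
```

This reproduces the behavior change above: `(4, 1)` against `(4,)` now broadcasts to `(4, 4)` instead of staying `(4, 1)`.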

`keepdim=False` for Reduction Functions

To get a warning when using a dimensional reduction function with the default keepdim argument, set `torch.utils.backcompat.keepdim_warning.enabled` to `True`.  For example:

>>> torch.sum(torch.ones(2,3), 1)
__main__:1: UserWarning: backwards compatibility: call to "sum" uses default value for keepdim which has changed default to False.  Consider passing as kwarg.
 3
 3
[torch.FloatTensor of size 2]

As with `torch.utils.backcompat.broadcast_warning.enabled`, this warning can trigger from valid code, so you most likely want to disable this warning after migrating your code.

Note also that using `keepdim=False` can cause your existing code to "just work" with broadcasting.  For example:

# behavior with (old) keepdim=True, causes accidental broadcast
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=True))
5  5  5  5
5  5  5  5
5  5  5  5
5  5  5  5
[torch.FloatTensor of size 4x4]

# new behavior with keepdim=False is equivalent to the non-broadcasted result
>>> torch.add(torch.ones(4), torch.ones(4,4).sum(dim=1, keepdim=False))
 5
 5
 5
 5
[torch.FloatTensor of size 4]


API Changes
- `torch.range` is deprecated in favor of `torch.arange`, which is consistent with numpy and Python's `range`.
- On sparse Tensors, `contiguous` is renamed to `coalesce`, and `coalesce` is now made out-of-place.
(A reminder that the Sparse API is still experimental and evolving, so we don't provide backward compatibility.)

New Features

New layers and functions
- `torch.topk` is now supported for all CUDA types, not just `torch.cuda.FloatTensor`.
- Added a three-way ranking loss: `nn.TripletMarginLoss`
- Added per-instance normalization layers: `nn.InstanceNorm1d`, `nn.InstanceNorm2d`, `nn.InstanceNorm3d`.
Each channel is treated as an instance to normalize, with mean subtraction and std division applied per instance. This is useful when dealing with larger images and smaller mini-batches where BatchNorm-like effects are desired.
- `nn.ZeroPad2d` and `nn.ConstantPad2d` are added.
- `nn.Bilinear` is added, which computes `Y = X1 * W * X2 + b`

Negative dimension support for all functions
Every function that takes a dimension argument now also accepts negative dimensions.

A negative dimension will index the tensor from the last dimension.

For example:

x = torch.randn(10, 20, 30)
y = torch.mean(x, dim=-1)

Here, since `x` has 3 dimensions and `dim = -1`, the last dimension, i.e. `dim=2`, is picked for taking the mean.

The functions with dimension arguments are:

narrow, transpose, size, cat, chunk, gather, index_select, split, squeeze,
stack, unbind, unsqueeze, cumprod, cumsum, mean, median, mode, norm, prod, std,
sum, var, kthvalue, max, min, sort, topk, renorm,
index_add, index_copy, index_fill, scatter, select, unfold
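The negative-dimension handling amounts to adding the number of dimensions to any negative index, a tiny pure-Python sketch (illustrative only):

```python
def normalize_dim(dim, ndim):
    """Map a possibly-negative dimension index to its positive equivalent."""
    if not -ndim <= dim < ndim:
        raise IndexError("dimension out of range for a %d-d tensor" % ndim)
    # dim=-1 -> ndim-1 (the last dimension), dim=-2 -> ndim-2, and so on.
    return dim + ndim if dim < 0 else dim

# For a 3-d tensor, dim=-1 refers to the last dimension (index 2).
print(normalize_dim(-1, 3))  # 2
```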

CUDA support for Sparse Tensors, faster CPU sparse

Now a part of the `torch.sparse` API is also supported for `torch.cuda.sparse.*Tensor`.

Functions that are supported on CUDA are:

sparse_mask, to_dense, coalesce, transpose, spaddmm
spcadd, mul, div, cadd, csub, cmul

`nn.Embedding` now supports sparse even on CUDA (with the `sparse=True` flag) leveraging these sparse functions.

A new hybrid matrix-multiply operation, `hspmm`, multiplies a sparse matrix with a dense matrix and returns a matrix in the form of a hybrid tensor (i.e. 1 sparse dimension, 1 dense dimension).

Several of the CPU sparse functions have more efficient implementations.

In a quickly hacked-up Embedding classifier training script by martinraison, we see CUDA sparse performing as well as CUDA dense:

Table: time in seconds / batch

_      | CPU  | CUDA
Dense  | 10   | 0.86


Minor API Changes

- in `optim.Adamax`, the default learning rate and epsilon have been made
consistent with Lasagne, Keras and TF.
- Previous: `(lr=1e-2, eps=1e-38)`
- Current : `(lr=2e-3, eps=1e-8)`
- **Make `random_` range exclusive** (it used to be exclusive when only the upper bound was specified, and inclusive when both were given).
- `torch.cat` now **disallows concatenating along nonexistent dimensions**
(to make it consistent with numpy and Variable cat)
- `torch.nn.utils.clip_grad_norm` now returns the total norm (say, for logging purposes).
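Gradient clipping by norm computes one total 2-norm over all gradients and rescales them only if it exceeds the threshold; the pre-clipping norm is what gets returned, which is what makes it handy for logging. A pure-Python sketch of the idea (PyTorch's version operates on parameter gradients in place; this stand-in uses lists of floats):

```python
def clip_grad_norm(grads, max_norm):
    """Scale a list of gradient vectors (lists of floats) in place so their
    combined 2-norm does not exceed max_norm; return the original total norm."""
    total_norm = sum(g * g for grad in grads for g in grad) ** 0.5
    if total_norm > max_norm:
        # One global scale factor preserves the relative direction of the update.
        scale = max_norm / (total_norm + 1e-6)
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return total_norm

grads = [[3.0, 4.0]]             # total norm 5.0
norm = clip_grad_norm(grads, 1.0)
print(norm)                      # 5.0 (the pre-clipping norm, useful for logging)
```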

Performance Improvements
- Reduce DataParallel overhead on >4 GPUs
- Improve broadcast/reduce performance by coalescing tensors
- `nn.Embedding`'s backward performance increased for batch sizes > 1024

New Features
- Batch triangular factorization and solves have been interfaced (CPU and GPU) and
are available as `torch.btrifact` and `torch.btrisolve`. See the documentation
for usage.
- All RNG functions now have `generator` specifiable via a keyword argument
- `torch.mode` is now supported on the GPU via a high-performance kernel.

**autograd, nn and optim**
- CuDNN v6 integrated:
- Faster Dilated Convolutions (and less memory hungry)
- 1D FFT-based Convolutions
- Significant performance improvement for Softmax layers
- Speedups across many functions
- Improved CuDNN error messages
- We will integrate persistent RNNs in the next release
- `torch.trace`, `torch.cumsum`, `torch.cross` are now implemented in autograd
- `nll_loss` now supports Spatial inputs (i.e. 4d inputs BCHW) and computes
channel-wise cross-entropy.
- `nn.PReLU` now supports all dimensional Tensors, not just 1d and 2d.
- add `nn.PairwiseDistance` and `F.pairwise_distance` that compute batchwise
pairwise distance between two vectors.
- Adaptive Max and Average Pooling added for 1d, 2d inputs via
`nn.AdaptiveMaxPooling1d`, `nn.AdaptiveAvgPooling2d`, etc.
- RMSProp now has `momentum` and a `centered` option. If `centered` is True,
the gradient is normalized by an estimate of its variance (Graves 2013).
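In the centered variant, the denominator estimates the gradient's variance rather than its raw second moment, by also tracking a running mean of the gradient. A scalar pure-Python sketch of one update step (names and structure are illustrative, not the optimizer's actual internals):

```python
def rmsprop_step(w, grad, state, lr=0.01, alpha=0.99, eps=1e-8,
                 momentum=0.0, centered=False):
    """One scalar RMSProp update; `state` carries the running averages."""
    state["square_avg"] = alpha * state.get("square_avg", 0.0) + (1 - alpha) * grad * grad
    if centered:
        # Track the running mean of the gradient and subtract its square, so the
        # denominator estimates the gradient's variance (Graves 2013).
        state["grad_avg"] = alpha * state.get("grad_avg", 0.0) + (1 - alpha) * grad
        denom = (state["square_avg"] - state["grad_avg"] ** 2) ** 0.5 + eps
    else:
        denom = state["square_avg"] ** 0.5 + eps
    state["momentum_buf"] = momentum * state.get("momentum_buf", 0.0) + grad / denom
    return w - lr * state["momentum_buf"]

state = {}
w = rmsprop_step(1.0, 0.5, state, centered=True)  # a single descent step
```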

- `WeightedRandomSampler` has been added as a custom sampler for the DataLoader.
It samples elements from `[0,..,len(weights)-1]` with the given probabilities
and is useful for sampling from unbalanced datasets where some classes have
many more samples than others. See the docs
for more details.
- DataLoader now allows returning of numpy arrays
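Conceptually, the weighted sampler draws dataset indices with replacement, each index with probability proportional to its weight. A pure-Python sketch of that behavior (illustrative, not the sampler's actual implementation):

```python
import random

def weighted_sample(weights, num_samples, rng=random):
    """Draw num_samples indices from [0, len(weights)) with replacement,
    each index i chosen with probability weights[i] / sum(weights)."""
    population = range(len(weights))
    return rng.choices(population, weights=weights, k=num_samples)

# Oversample class 0 far beyond its frequency in the underlying data.
random.seed(0)
idx = weighted_sample([0.9, 0.05, 0.05], 10)
```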

Bug Fixes
- When loading GPU checkpoints from disk with storage location remapping,
`torch.cuda` was still being imported. This is now fixed, and
you can load GPU checkpoints on machines with no GPUs or CUDA.
- Work around an OSX `fread` bug where loading checkpoints containing any Tensor > 1GB
would give an error.
- Fixed a bug in `torch.cat` where it now does not
accept `reversed` as input (it's not a `PySequence`).
For example:

l = [Variable(torch.ones(1,3)*i) for i in range(3)]
torch.cat(reversed(l), 0)  # errors now

- Fix a memory leak in `torch.from_numpy`
- GPU svd returned a larger matrix than expected in the `some` mode.
This is now fixed to match CPU behavior.
- Fix a bug in CPU max that was introduced in the previous release.

**autograd, nn and optim**
- Reassigning attributes in modules now works correctly.
This example used to not work correctly; `l.a` always remained `None`.
Now it works as one would expect:

l = nn.Linear(10, 20)
l.a = None
l.a = nn.Parameter(torch.randn(2))
# l.a is correctly updated

- Fix bug where adding a hook could replace an existing hook
- Fix `nn.Embedding` and `nn.CosineEmbeddingLoss` to work without
error on non-float CUDA (half, double)
- Fix a bug in `nn.Embedding` when the `max_norm` option was used. Some of the
indices were not respecting `max_norm` and this is fixed.
- Fix corner-case in `Variable`'s SetItem where the gradient was of incorrect shape.
`x.grad` used to be of shape 20, because `y[1]` was of shape 20.

x = Variable(torch.randn(1, 20), requires_grad=True)
y = Variable(torch.zeros(10, 20))
y[1] = x

- Fix a segfault in Conv1d when input doesn't require grad.
- Assertions in `pack_padded_sequence` to check that sequence is of length > 0
- ``'s autograd formula was incorrect if the Tensor contained a 0. This
formula has been fixed.
- Variable `expand` and `expand_as` had incorrect dimension inference when using
broadcasting semantics. The formula has been fixed in these cases.
- Fix a size mismatch in `CosineEmbeddingLoss`. [See this issue]( for more details.
- Fixed a bug in LBFGS that caused it to use uninitialized locals. [See issue](
- Add assertions for negative padding in `nn.Conv*` functions.
- Fix the stddev gradient formula for the stochastic function `normal`.

- Fix issue when returning strings from the DataLoader when `pin_memory=True`
- Binaries no longer dependent on needing a `` at runtime.


New Features

Indexing and Broadcasting Improvements

- Add broadcasting semantics to `expand` / `expand_as`.
- Previously, `expand` had no ability to add new dimensions, and `unsqueeze`
had to be used to first create singleton dimensions before expansion.
- Now, singleton dimensions are automatically prepended to the shape of
the tensor if a matching dimension is found.
Here's an example:

x = torch.rand(5)
y = torch.rand(4, 8, 5)
z = x.expand_as(y)  # z is of shape (4, 8, 5)

x = torch.rand(1, 8, 1)
z = x.expand_as(y)  # z is of shape (4, 8, 5)

- Unsqueeze dimensions using None indexing
a = torch.randn(10)
b = a.unsqueeze(0)
b = a[None, :]      # equivalent operations

- Indexing with steps is supported (only positive steps)
In [1]: a = torch.randn(10)
In [2]: a

[torch.FloatTensor of size 10]

In [3]: a[0:10:3]

[torch.FloatTensor of size 4]

Variable-length mini-batches in Recurrent Networks
`nn.RNN`, `nn.LSTM`, `nn.GRU` now support mini-batches where sequences are of variable length.
You can pass an input of type `PackedSequence`
into these layers.
A `PackedSequence` holds data and a list of sequence sizes of a packed sequence batch.
For example, a `PackedSequence` will hold an input mini-batch of such sequences:

a b c d e
a b c d e f g h
a b
a b c d

Here, each input row is of variable length.

You can construct a `PackedSequence` using the provided function `pack_padded_sequence`.

`pack_padded_sequence` takes a `Variable` containing padded sequences, i.e. a `Tensor`
of `T x B x *`, where `B` is the size of the mini-batch, and each input is either of
length `T` or is padded to length `T`. It also takes a list of lengths of each input.
From these, it constructs a `PackedSequence`

For example, it will take a list of lengths `[8, 5, 4, 2]` and an input of size `8 x 4 x 128`
that corresponds to:

a b c d e f g h
a b c d e 0 0 0
a b c d 0 0 0 0
a b 0 0 0 0 0 0

The output of the RNN layers will also be a `PackedSequence`, which can then be inverted
back to a padded Tensor using the inverse function, `pad_packed_sequence`.
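Conceptually, the packing walks the time steps, keeps only the sequences still active at each step, and records how many that is. A pure-Python sketch over lists of tokens (a hypothetical helper; the real function operates on `T x B x *` tensors):

```python
def pack_padded(sequences, lengths):
    """Pack sequences (sorted longest first) into a flat data list plus
    per-time-step batch sizes, mimicking PackedSequence's layout."""
    data, batch_sizes = [], []
    max_len = lengths[0]  # lengths must be sorted in decreasing order
    for t in range(max_len):
        # Number of sequences still active at time step t.
        active = sum(1 for length in lengths if length > t)
        batch_sizes.append(active)
        # Because lengths are sorted, exactly the first `active` sequences
        # still have a real token at this time step.
        for b in range(active):
            data.append(sequences[b][t])
    return data, batch_sizes

padded = [list("abcdefgh"), list("abcde"), list("abcd"), list("ab")]
data, batch_sizes = pack_padded(padded, [8, 5, 4, 2])
print(batch_sizes)  # [4, 4, 3, 3, 2, 1, 1, 1]
```

Note how the batch sizes shrink as shorter sequences run out; the padding zeros never enter the packed data at all.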

Sparse Tensors (CPU)
Original goals:
- ability to propagate sparse updates in a network (e.g. for updating an embedding matrix)
- ability to efficiently compute "bag-of-words" sentence embeddings (e.g. weighted average of word embeddings)

Implemented features:
- enable backpropagation of sparse gradients without conversion to dense tensors. In most cases a runtime exception is thrown when mixing different gradient types for the same variable
- add some methods for `THSTensor`: `zero`, elementwise `add` and `mul`, scalar `mul` and `div`
- make `addcmul` method of `THTensor` compatible with sparse operands
- make `spmm` method accessible from Python as `dsmm`
- `sparse_mask` method on `THTensor`. This produces a sparse tensor from a dense tensor,
by using a sparse tensor as a mask. A value is only present in the output sparse
tensor if it also exists in the mask.
- update `optim.Adagrad` to use sparse updates when possible.
- **leave `Variable`'s gradient to `None` by default.**
This is because there is no canonical zero gradient anymore (it could be dense or
sparse, and if it is sparse we don't know how many dimensions are sparse)
- N-dimensional values for sparse tensors:
- Basically for things like applying sparse updates to embedding matrices, only the
first dimension (the one that corresponds to the word index) is sparse. The other
dimension is always dense (only whole embedding vectors are updated). An elegant
solution is to make the `values` tensor N-dimensional instead of 1-dimensional.
For an embedding matrix, the sparse gradient will have a `values` tensor of
size `nnz * embedding_size` instead of just `nnz`.
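For intuition, such a gradient can be pictured as a COO-style pair (`indices`, `values`) where each `values` row is a whole dense embedding vector. A pure-Python sketch of accumulating a sparse embedding gradient (illustrative only, not the actual THS implementation):

```python
def sparse_embedding_grad(word_indices, grad_rows):
    """Represent an embedding-matrix gradient as (indices, values): only the
    first dimension is sparse, so `indices` holds the touched row ids and
    `values` holds one dense gradient vector per touched row."""
    indices, values = [], []
    position = {}  # word index -> slot in the values list
    for word, row in zip(word_indices, grad_rows):
        if word not in position:
            position[word] = len(indices)
            indices.append(word)
            values.append(list(row))
        else:
            # Coalesce duplicate word indices by summing their gradients.
            slot = position[word]
            values[slot] = [a + b for a, b in zip(values[slot], row)]
    return indices, values

# Words 3, 7, 3 were used; word 3's two gradient rows are coalesced.
indices, values = sparse_embedding_grad(
    [3, 7, 3], [[1.0, 0.0], [0.5, 0.5], [1.0, 2.0]])
print(indices, values)  # [3, 7] [[2.0, 2.0], [0.5, 0.5]]
```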

Common weight initialization methods for neural networks
By default, all `Linear` and `Conv` layers in PyTorch are initialized according to
a scheme proposed by LeCun '98.

However, there are several other commonly used initialization methods.
We now support many other methods via `torch.nn.init`.
Supported methods include:
`uniform`, `normal`, `constant`, `xavier_uniform`, `xavier_normal`, `kaiming_uniform`,
`kaiming_normal`, `orthogonal`, `sparse`

Here's an example of using these initialization methods:
import math
from torch import nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(5, 10, (3, 3))
        nn.init.xavier_uniform(self.conv1.weight, gain=math.sqrt(2.0))
        nn.init.constant(self.conv1.bias, 0.1)

network = Net()

Other features
- Added a gradient checker utility `torch.autograd.gradcheck` that can
be used to check your implementations. Here's a small example:
from torch.autograd import Variable, gradcheck
inputs = Variable(torch.randn(4, 4), requires_grad=True)
gradcheck(lambda x: 2*x.diag(), (inputs,), eps=1e-3)

- Add a `clip_grad_norm` utility to easily clip gradients via constraints on their norms.
- Document `nn.ModuleList` and `nn.ParameterList`, which are immensely useful when
storing a list of modules in a `Container`
- Optimizers have backward compatibility for old checkpoints.
`__set_state__` and `__get_state__` introduced into optimizers.
- Add Nesterov momentum to `optim.SGD` via the `nesterov=True` kwarg
- DataParallel supports multiple inputs and keyword args (which are also scattered)

m = nn.DataParallel(model)
# now valid
m(x, y, option=z)

See the [documentation]( for exact behavior.
- DataLoader's `default_collate` now also supports numpy arrays
- Added `F.pad` that supports Constant, Reflection and Replication padding in a single
interface.
- `train()` now optionally supports a boolean argument. For example `model.train(False)`
will set it to `eval` mode and `model.train(True)` sets it to `train` mode.
- Added a `DataLoader` sampler: `SubsetRandomSampler` that takes a list of indices
in its constructor and randomly samples from these indices. Useful when you
want to sample only a particular subset of your dataset.
- Transpose supports negative dimensions. For example:
a = torch.randn(2, 3)
b = a.transpose(0, 1)    # both are equivalent
b = a.transpose(-2, -1)  # both are equivalent

Performance Improvements
- CPU Tensor backend gets faster
- Explicit AVX, AVX2 and improved SSE intrinsics to speedup copy, fill, add, mul, div
- Much improved speed for all apply and reduce operations to have better cache hits
- Added OpenMP in TH_TENSOR_APPLY* operations
- Overall, 2x to 10x+ faster on a lot of operations, closer to Numpy speeds
- Runtime dispatch of intrinsics based on CPU features (easy to ship binaries)
- Serialization Improvements
- Fixed bugs on serialization for Tensors > 2GB
- 5x to 10x faster serialization (no longer Tarring Tensors)

Bug Fixes
- Multi-GPU CuDNN RNN now has separate dropout descriptors per GPU
- NLLLoss2d has proper shape checks on GPU and stable sizeAverage formulation
- LogSoftmax2d has a more stable formula
- Fix prodall (prod without dim arguments) to not average
- Return correct number of gradients from cuDNN RNN
- NLLLoss2d has support for weights
- Fix Unpooling bug for MaxPool1d
- Fix Indexing when using only an ellipsis
x = torch.randn(2,2,2,2)
x[...]  # used to fail, fixed now

- Expose stateless methods (`torch.*` methods) for `torch.cuda.HalfTensor`
- Prevent creation of reference cycles (and hence improve memory usage) when
leaf variables are used in in-place operations.
- Fix gradient computation for the indexing operation in the case of sending in `LongTensor` indices.
- Fix a reshaping bug in the grad_input of basic operations such as `+, -, *, /`, etc.
This used to fail, but is fixed now:
x = Variable(torch.randn(4, 6), requires_grad=True)
b = Variable(torch.rand(12, 1) + 1e-2, requires_grad=True)
(x +, 2) + 1e-2))).sum().backward()

- Revert partial indexing with `LongTensor` to return to numpy-compatibility
- References to some Tensors in `BatchNorm` and `Conv` are now freed to improve
memory usage in certain situations. ResNet-152 finetuning with batch_size 16
used to consume the same amount of memory as batch 256 after this fix.
- Fix a bug where `requires_grad` was being propagated forward differently in
CPU mode and CUDA mode.
- Fix bugs in `torch.multinomial` on CUDA, where in rare cases, the sampling
lead to nonsensical values
- Allow backprop through CuDNN RNN in `eval()` mode.
- Support `np.int16` in conversion to `ShortTensor`
- Enable multithreading in MKL (was disabled previously due to a cmake bug).

Improved error messages
- Print a readable error message when arguments are on different GPUs
- Add better error message for conversion of CUDA tensors to numpy
- Add checks for reward type and size in StochasticFunction


Bug fixes:
- Major bugfix in CuDNN bindings for cases of non-contiguous grad-outputs
- also added better error checking and asserts to cudnn RNN and Conv
- Fixed serialization bugs when serializing Tensors > 2GB
- Enable half and double THNN backends
- RNNBase and Embedding fixed to be compatible with DataParallel
- Fix bug in for multi-GPU settings
- Support bias=False in Conv3d
- Change behavior of `detach()` to actually remove the creator (previously was just detaching compute)

Features and performance
- Refactored autograd internals into python-agnostic C++ (662)
- view, unsqueeze and squeeze moved to C for superior performance
- Allow DataParallel to have tuple inputs
- Add a `torch.__version__` string.


A bugfix release with some small features:

New Features
- THPP now has CUDA Tensors
- autograd functions: repeat, var, std, renorm, comparison ops added.
- Merged an initial version of THD (distributed pytorch)
- Indexing support with LongTensor indices
- Add torch.unbind
- Add `ModuleList` and `ParameterList` to store lists of modules / params in an `nn.Module`

Bug and usability fixes
- Fix a bug in FFI utils
- Fix lua-reader for SpatialConvolution
- Fix backward contiguous check in BatchNorm
- Fix travis builds
- Pep8 enforced for the entire codebase
- CuDNN RNN non-contiguous fixes
- Remove circular references in some Autograd functions
- Add CUDA asserts to various kernels for out-of-bounds checks
- Fix non-contiguous bug in
- Fix memory leak in Unpooling

API Changes
- `nn.Billinear*` -> `nn.Bilinear*`
- Return indices as well in autograd for `torch.sort` and `torch.topk`
- `.set_index` -> `._set_index` (made private)
- `normal` and `log_normal` kwarg changed from `var` to `std`
- `Optimizer.state_dict` now has semantics matching `Module state_dict`


A bugfix release with some small features:

New Features
- LBFGS Optimizer added
- Add `state_dict` for optimizers for easy checkpointing
- Add differentiable upsampling modules for 2d (bilinear, nearest)

Bug and usability fixes
- Fix multi-GPU bugs in indexing
- Improve error messages for optimizer
- Fix bug in Conv1d
- Fix bug in Conv*d groups
- Add improved error messages for unsupported CuDNN codepaths
- fix bugs in CuDNN bindings
- Workaround bugs in CuDNN itself (batchnorm-backward, non-contiguous weights)
- Fix lua-reader's BatchNorm and Linear layers
- Fix some memory leaks
- Give fatal errors on Variable comparison
- Fix bug in ELU backward
- Fix index_select backward
- Fix BatchNorm backward in evaluate mode (workaround CuDNN bug)

API Changes
- Adadelta's `step_rate` is renamed to `lr`
- Adam's default learning rate is now the same as LuaTorch's


Our last release (v0.1.5) was on November 14th, 2016

We finished, froze and released (v0.1.6) on Jan 21st, 2017.

A lot has happened since 0.1.5.

- PyTorch public release on 18th Jan, 2017.
- An initial Model Zoo, several common Vision models can be initialized with pretrained weights downloaded from the zoo.
- All of the 100+ `torch.*` functions bar 3 (topk, mode and kthvalue) are GPU-ready, with performance improvements across the board for several existing ones.
- All relevant neural network modules are now CuDNN bound.
- Stochastic functions added to Autograd, for use in reinforcement learning
- A functional interface of the nn library is added
- GPU device initialization has been made lazy (improvement in CUDA initialization time on multi-GPU machines)
- Pinned memory support, and leveraging it in DataLoader
- Made error messages across the board more informative, especially around shape checks
- A rich set of examples and tutorials added to pytorch/examples and pytorch/tutorials
- API Reference at
- Multiprocessing support for CUDA (Python3 only)
- An initial version of CPU Sparse Tensors is added and used in nn.Embedding(sparse=True). More to come on this side.
- Added a lua reader to load existing .t7 files with Torch models
- Various bug-fixes.
- Allow returning of changed gradients in hooks

API Changes
- `Conv*d` and `*Pool*d` layers now take a tuple of kernel sizes/strides/padding instead of `kh`/`kw`.
- `Unpooling*` layers have a changed API
- `Variable.grad` is now a `Variable` (was a `Tensor`)
- `nn.Container` is deprecated and merged into `nn.Module`. Replace all instances of `nn.Container` in your code with `nn.Module`
- `torch.cat` changed API to take an iterable of tensors, along with a dimension (previously varargs of Tensors). Also, `torch.cat`'s default dimension is changed. It's been made an inverse transform for `torch.split` and `torch.chunk`.
- `Variable.no_grad` has been renamed to `Variable.detach`
- RMSProp's initialization of gradients changed from ones to zeros (485)
- Removed `cmin`, `cmax` and `cinv` (functionality of `cmin`, `cmax` split between `max`/`min` and `clamp`; `cinv` renamed to `reciprocal`)
- `register_hook` API changed, names are removed. See:
- `torch.*(..., out=Tensor)` is adopted for output arguments

Model Zoo

A model zoo has been started with several pre-trained vision models available such as AlexNet, ResNet50, etc. The download and usage of the models is seamless with a keyword argument.

import torchvision.models as models
models.alexnet(pretrained=True)

The models are hosted on Amazon S3, and we look forward to more models from the community.
Basic documentation is found here:

You can find specific models listed in the README of torchvision and torchtext

Stochastic Functions in Autograd

We introduced Stochastic functions that need to be provided with a `reward` for their backward pass.
This feature was inspired by Gradient Estimation Using Stochastic Computation Graphs by Schulman et al., and is helpful for implementing reinforcement learning techniques.
Documentation is here:
A showcase of using these nodes is in the REINFORCE example:

Functional interface to nn

PyTorch neural networks have so far been modeled around `nn.Module`. However, for most simple functions such as ReLU, using this is a bit cumbersome.
To simplify this, we've introduced a functional interface to nn, and modified the tutorials to use this API where appropriate.

For example:

import torch.nn as nn
import torch.nn.functional as F

# module style
relu = nn.ReLU()
y = relu(x)

# functional style
y = F.relu(x)

The functional style is convenient when using non-parametric and non-learnable functions.

Documentation for these functions is here:

Faster GPU code

The initialization of the GPU backend has been made lazy. This means that it will automatically be
imported and initialized when needed (and not before-hand). Doing this has improved startup times (especially for multi-GPU systems) and reduced boilerplate code.

We've also integrated support for pinned memory, which accelerates CPU to GPU transfers for specially marked buffers. Using this, we accelerated the multiprocessing data loaders.

A rich set of examples

With the help of some of you, we've added a rich set of examples from Image Super-resolution to Neural Machine Translation.
You can explore more here:

API Reference and Notes

We've fleshed out a full API reference that is mostly complete at
Contributions are welcome :)

We've also added notes such as CUDA Semantics, Extending PyTorch, etc.

Multiprocessing support for CUDA

Until now, Tensor sharing using multiprocessing only worked for CPU Tensors.
We've now enabled Tensor sharing for CUDA tensors when using Python 3.
You can read more notes here:

Lua Reader

A "lua reader" has been integrated, that can load most LuaTorch .t7 files, including `nn` models.
nngraph models are not supported.

Example usage can be found here:


What's new in Alpha-5?

- keyword arguments, improved indexing for all torch and autograd functions!
- Deterministic data loader even under multiple workers
- LAPACK bindings with full CUDA support via MAGMA
- Easier numpy2torch conversion with torch.from_numpy(x)
- Lot more documentation
- fully covered neural networks
- fully covered optim package
- partly covered torch documentation
- Tutorials:
- Increased depth, length and clarity of the tutorials

New Features and modules
- PyTorch Vision: a package to hold common dataloaders, transforms and utilities for images and videos
- Data loaders for: COCO (captioning and detection), Imagenet, CIFAR10/100, LSUN etc.
- Image Transforms: commonly used data augmentation transforms such as random-cropping, normalization
- Unit-tested
- Utilities: saving Tensors as images, creating grids of images from a mini-batch of tensors.
- Recurrent Neural Networks
- A complete and robust implementation of efficient Stacked LSTMs, RNNs, GRUs (bidirectional and otherwise)
- Seamlessly integrated CuDNN is used whenever possible for maximum performance
- A complete word-level language modeling example on the PennTreeBank dataset
- verification that the perplexity matches the reference Torch implementation
- an example of Generative Adversarial Networks:
- DCGAN example in < 250 lines (includes everything)
- Verified the results to match reference implementations
- Multi-GPU ready!
- A redesigned Optim package with the following optimization methods:
- SGD, AdaDelta, Adagrad, Adam, AdaMax, Averaged SGD, RProp, RMSProp
- Fully unit tested against their reference implementations
- Fully documented
- Improved Multi-GPU performance (and more is coming)
- Integrated NVIDIA NCCL for maximizing multi-GPU communication performance

Plans for Alpha-6
- docstrings support and finishing torch and autograd documentation
- Fully verifying the convergence of ResNet / Imagenet training
- More examples around:
- Reinforcement Learning / OpenAI Gym
- Object Detection
- Sequence to Sequence methods
- WaveNet / ByteNet
- More adversarial networks (text2image, etc.)
- More gains in performance, and fully flesh out CuDNN integration
- Half-precision training for GPUs
- A Lua-Torch model loader, and improved legacy.nn support
- Lua bridge, to call your existing lua code


Keyword arguments

All torch and autograd functions used to only support arguments in positional order.
For example, you had to write `torch.clamp(x, -0.5, 0.5)`; now you can also pass arguments by keyword: `torch.clamp(x, min=-0.5, max=0.5)`.



Some interesting stats

On Resnets

Because of our aggressive freeing and allocating of resources, ResNets in PyTorch take less memory than torch-nn:
- 4.4GB in PyTorch
- 6.5GB in Torch-nn
- 4.6GB in Torch-nn with a hacky sharing of gradinput buffers
- On 1-GPU, PyTorch speed is 10s of milliseconds faster than Torch-nn
- On 2-GPUs, PyTorch is the same speed as Torch-nn
- On 4-GPUs, PyTorch is about 10 to 20% slower, but this is because we have just finished implementing Multi-GPU support; we will be closing this perf gap in the next week.

FFI-based C extension

On a small benchmark of adding a constant to a 5x5 tensor at 1000 calls:
- LuaJIT FFI: 0.001 seconds
- Lua 5.2 FFI: 0.003 seconds
- PyTorch CFFI: 0.003 seconds
- Raw Python CFFI / CTypes: 0.001 seconds

What's new in Alpha-4?

- Two Tutorials, now located at: [](
- Tutorial 1: [Introduction to PyTorch for former Torchies](
- Tutorial 2: [Write your own C code that interfaces into PyTorch via FFI](
- Examples:
- A full Imagenet / ResNet example is now located at:
- it works! :)
- Has performant Multi-GPU support
- More improved error messages and shape checks across the board in pytorch, TH, THNN
- `torch.*` functions now don't use `CamelCase`; they use `underscore_case`. Example: `torch.index_add_`

New Features and modules
- Multi-GPU primitives
- A custom CUDA allocator to maximize autograd performance (backported to Torch too)
- More autograd functions. Now it's almost API complete for all differentiable `torch.*` functions.
- CuDNN Integration
- Multiprocess DataLoader in `torch.utils` (used in the imagenet example)
- Extensions API to interface to your C code simply via FFI
- [An example extension is provided here](

Plans for Alpha-5
- Revamping and rethinking the Checkpointing API
- Revamping the Optim API to support things like per-layer learning rates and optimizing non-weights (like in NeuralStyle)
- RNN Examples, initially for PennTreeBank language modeling
- Better RNN support in general, improved error messages, multi-GPU etc.
- NCCL Integration for improved multi-GPU performance (already implemented at )
- Documentation / Reference manual for `torch.*` and `autograd`



We've added two tutorials to get you all started.
- Tutorial 1: [Introduction to PyTorch for former Torchies](
- In this tutorial we cover the torch, autograd and nn packages from a perspective of former Torch users.
- Going through this tutorial should get you started. Let us know how we can improve it.
- Tutorial 2: [Write your own C code that interfaces into PyTorch via FFI](
- In this tutorial, we showcase how you can call your own C code that takes torch tensors as inputs / outputs in a seamless way via FFI
- The tutorial showcases how you can write your own neural network Module that calls in C implementations


We've added a full imagenet example with ResNets that should be really suited towards “learning by example”.
It is located here: [](
The data for the example has to be preprocessed for now in the same way as is specified in [fb.resnet.torch](

The example has Multi-GPU support in a DataParallel fashion.

More improved error messages

We've gone through the TH and THNN C libraries and added much more intuitive error messages that report the mismatched shapes. We will continue to make improvements on this front.
If you have any unintuitive error messages that you encounter, please open an issue at

For example:

Old error message:

bad argument 2 to 'v' (3D or 4D (batch mode) tensor expected for input

New error message:

bad argument 2 to 'v' (3D or 4D (batch mode) tensor expected for input, but got: [100 x 100])

No more CamelCase for functions

All torch functions have been renamed from CamelCase to underscore_case. For example:
- indexAdd → index_add_
- getRNGState → get_rng_state
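A quick sketch of the new naming in action (the index and source tensors here are made-up illustrations; the trailing underscore marks an in-place operation):

```python
import torch

# underscore_case everywhere; a trailing underscore means "in place"
x = torch.zeros(5)
index = torch.LongTensor([0, 2])
src = torch.ones(2)
x.index_add_(0, index, src)    # adds src entries at the given indices, in place

state = torch.get_rng_state()  # formerly getRNGState
```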

New Features and modules

Multi-GPU primitives
- We've added efficient multi-GPU support in general for neural networks. Instead of building magic blocks that do opaque parallelization for you, we've broken them down into easy to use collectives.
- A pattern like DataParallel is implemented in terms of:
- replicate, scatter, gather, parallel_apply
- These are reusable collectives for implementing other multi-gpu patterns as well
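The DataParallel pattern can be sketched from the four collectives (using today's `torch.nn.parallel` names; this is a simplified sketch rather than the full implementation, and it needs two or more GPUs to actually run):

```python
import torch
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

def data_parallel(module, input, device_ids, output_device=None):
    """Sketch of DataParallel in terms of the four collectives.
    Assumes the input's first dimension is the batch dimension."""
    if output_device is None:
        output_device = device_ids[0]
    replicas = replicate(module, device_ids)    # copy the module onto each GPU
    inputs = scatter(input, device_ids)         # split the batch across GPUs
    outputs = parallel_apply(replicas, inputs)  # forward each replica (one thread per GPU)
    return gather(outputs, output_device)       # concatenate results on one device
```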


With Multi-GPU, we naturally overlap data transfers with compute across the whole graph. This makes multi-GPU much more efficient, and is done in a way that does not interfere with the imperativeness / error reporting.

Another important note is that we now dispatch parallel modules via python threads, which issues the CUDA kernel launches in a breadth-first fashion, getting rid of obvious kernel-launch latency bottlenecks.

Custom CUDA allocator to maximize autograd performance

In Torch, we had to write nn modules carefully to avoid CUDA synchronization points, which were a bottleneck for multi-GPU and general performance. This sometimes cost neural networks and autograd up to a 2x performance penalty.

In PyTorch (and Torch), Sam Gross has written a new Caching CUDA allocator that avoids cuda synchronization points while being really suited towards Tensor use-cases where we typically do short-term and long-term allocations of memory of the same tensor sizes.

This unblocks us from a lot of performance issues.

More autograd functions

Now the torch.* API should be pretty much ready for full autograd support (short of 3 functions).
Autograd has been enabled for all the functions with the exception of non-differentiable functions like torch.eq.

CuDNN Integration

We now fully integrate and support CuDNN version 5.1.3, and it is shipped in the binaries (just like CUDA), so you never have to worry about manually downloading and installing it from the NVIDIA website.

Generic Multiprocess DataLoader

We've added a flexible Data Loader that supports multiple data loading workers. This enables a lot of use-cases, and is first used in our Imagenet example.
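A minimal sketch of the multi-worker loader in use (written against today's `torch.utils.data` names, which have moved since this note; the toy dataset is made up for illustration):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# A toy dataset of 100 samples; num_workers > 0 spawns loading worker processes
dataset = TensorDataset(torch.randn(100, 3), torch.zeros(100).long())
loader = DataLoader(dataset, batch_size=10, shuffle=True, num_workers=2)

batches = list(loader)  # each batch is a (inputs, labels) pair
```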

C Extensions API

We added an easy to use extensions API and an example extension here:

You can call your C functions (that have TH*Tensor inputs / outputs and other fundamental types in the function signature) without writing any manual Python bindings.

One question you might have is: what call overhead do these auto-generated FFI bindings have? The answer is "None", as seen in the numbers at the beginning of the note.

The example extension also covers how you can define your autograd-ready nn module that calls your C function.


What's new?

- conda binaries for all Linux (as old as RHEL 6 and  Ubuntu 12.04) (we are working on OSX and pip binaries).
- Now installing pytorch is as simple as:
- `conda install pytorch -c`
- it links against MKL, ships the CUDA and MAGMA runtime with it, and just works
- Human-ready error messages
- Started working on documentation and an API Reference
- (
- Continuous integration with GPU support. Never have a broken master again
- (

New Features and modules
- The (new) neural network module now has 75% of the modules implemented (71 out of 93), and we are powering through the rest
- most of the modules in old-nn have been removed because we do not need Containers and many modules such as CAddTable are covered by Autograd
- autograd now supports all torch functions present in twitter-autograd and a lot more....
- Added Trainer and Dataset abstractions (like in TorchNet)

Plans for Alpha-4
- cudnn integration (and CUDA allocator).
- We have this implemented but are iterating over design
- Multi-GPU support in nn
- examples, examples, examples
- we will work on having examples across all domains (vision, NLP, RL, etc.)


Conda binaries for Linux

PyTorch will ship on Linux and OSX (and likely Windows) from day one, and we want the install process to be as simple and intuitive as possible.
We have versioned binaries that do not require the user to install anything (except an NVIDIA driver, if you intend to use the GPU; not even CUDA is a dependency).

For now, to get started on Linux:

conda install pytorch -c

We have built OSX binaries, but have some small bugs on OSX, and we'll fix the issues there over the week.
We are working on “pip install” for non Anaconda python installs.

Human-ready error messages

We've reworked how we report type errors and dispatch errors to make it easy for the user to understand what they did wrong. See this small example:

In [1]: import torch
In [2]: x = torch.FloatTensor(10)
In [3]: x.addmm(torch.ones(1), 1, 'str')
ValueError                                Traceback (most recent call last)
<ipython-input-3-90eb50ea2e35> in <module>()
----> 1 x.addmm(torch.ones(1), 1, 'str')

ValueError: addmm received an invalid combination of argument types - got (torch.DoubleTensor, int, str), but expected one of:
* (torch.FloatTensor mat1, torch.FloatTensor mat2)
* (float beta, torch.FloatTensor mat1, torch.FloatTensor mat2)
* (float beta, float alpha, torch.FloatTensor mat1, torch.FloatTensor mat2)

Continuous Builds with GPU support
- All pushes to the _master_ branch are fully built and unit tested
- All Pull Requests are fully built and unit tested
- On Titan-X GPUs in the NIMBIX cloud
- One can go checkout the builds details at: (

New Features and modules

Neural Network Modules
- Added fully functional and fully unit-tested nn modules and criterions for pretty much everything one would need for their current workflows.
- We have about 25% of the modules missing (mostly exotic and lightly used ones) but will get to those in the coming few days.
- nn modules have been renamed to be simplified in their naming. For example:
- SpatialConvolution → conv2d
- The new naming can be referenced at ( or via autocomplete.
- Full unit-test coverage for all implemented functions

- We've added autograd support for almost all the torch functions (and operators like +, - etc.)
- We have all the functions implemented that are presented in twitter-autograd, and we have many more.
- At this point we have about 75 to 80% of them covered (ball park).
- Full unit-test coverage for all implemented functions

Trainer & Dataset classes


We've added a TorchNet style _Trainer_ class that provides a convenient abstraction

trainer = Trainer(model, criterion, optimizer, dataset)
trainer.register_plugin(Logger(['progress', 'accuracy', 'loss'], interval=(5, 'iterations')))

progress: 180/60000 (0.30%)     accuracy: 0.00% (3.24%)         loss: 2.3051 (2.2116)
progress: 280/60000 (0.47%)     accuracy: 5.00% (4.84%)         loss: 2.3045 (2.2891)
progress: 380/60000 (0.63%)     accuracy: 25.00% (13.04%)       loss: 2.2974 (2.2992)


The data loading is implemented using three abstractions:
- _DataSource_ - a simple object that defines indexing and checking length. Indexing returns a tuple of (sample, label)
- _Sampler_ - an object that defines the data ordering. It has to be iterable, and its iterator should yield a stream of indices in the [0; len(data_source)-1] interval. The end of the iterator indicates completing the epoch.
- _Dataset_ - an object which wraps a DataSource and a Sampler. Defines all the data loading logic (e.g. all the multiprocessing code).

The Datasets will accept a list of transforms (like image augmentation), which will run on the data before it is given out.
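The first two abstractions can be sketched in a few lines (the class names below follow the text; they are illustrative, not today's torch.utils.data API):

```python
import torch

class TensorDataSource:
    """Defines indexing and length; indexing returns a (sample, label) tuple."""
    def __init__(self, samples, labels):
        self.samples, self.labels = samples, labels
    def __getitem__(self, i):
        return self.samples[i], self.labels[i]
    def __len__(self):
        return len(self.samples)

class SequentialSampler:
    """Defines the data ordering; one epoch = one pass over the indices."""
    def __init__(self, data_source):
        self.data_source = data_source
    def __iter__(self):
        return iter(range(len(self.data_source)))

source = TensorDataSource(torch.randn(4, 2), torch.zeros(4))
order = list(SequentialSampler(source))
```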


What's new?

- built seamless support for multiprocessing with Tensor sharing
- changed the API of the optim engine
- added a complete Hook system for nn and autograd
- added in-place ops to autograd and more neural network modules to nn

Multiprocessing with Tensor sharing

In Torch, or in general, one uses "threads" to build parallel data loaders, as well as to do Hogwild training.
Threads are powerful, as one can share Tensors between threads.
This allows you to:
- transfer data between threads efficiently, with zero memory copy and serialization overhead.
- share tensors among threads for parameter sharing models

Sharing Tensors among threads is very useful when you do Hogwild training, i.e. if you want to train several models in parallel, but want to share their underlying parameters.
This is often used in non-ConvNet models, like training word embeddings, RL for games, etc.

With Python, one cannot use threads because of a few technical issues.
Python has what is called [Global Interpreter Lock](, which does not allow threads to concurrently execute python code.

Hence, the most pythonic way to use multiple CPU cores is [multiprocessing](

We made PyTorch seamlessly integrate with python multiprocessing.
This involved solving some complex technical problems to make this an air-tight solution, and more can be read [in this in-depth technical discussion](

What this means for you as the end-user is that you can simply use multiprocessing in this way:

Functions from this file (`loaders.py`) run in the workers:

def fill(queue):
    # each worker repeatedly takes a shared tensor off the queue and writes into it
    while True:
        tensor = queue.get()
        tensor.fill_(10)

def fill_pool(tensor):
    tensor.fill_(10)

Example 1: Using multiple persistent processes and a Queue

import torch
import torch.multiprocessing as multiprocessing
from loaders import fill

# torch.multiprocessing.Queue automatically moves Tensor data to shared memory,
# so the main process and the workers share the data
queue = multiprocessing.Queue()
buffers = [torch.Tensor(2, 2) for i in range(4)]
for b in buffers:
    queue.put(b)
processes = [multiprocessing.Process(target=fill, args=(queue,)).start() for i in range(10)]

Example 2: Using a process pool

import torch
from torch.multiprocessing import Pool
from loaders import fill_pool

tensors = [torch.Tensor(2, 2) for i in range(100)]
pool = Pool(10)
pool.map(fill_pool, tensors)

Optim's API changes

Optimizer's step function now accepts a closure that should return a loss variable (similar to `legacy.optim`).

We've realized that to keep Optim flexible for multiple methods, like SGD with nesterov, Conjugate Gradient, LBFGS etc., we need to have the input to optim be a function that evaluates the model.
This is necessary because several optimization methods re-evaluate the function multiple times at different parameters.
To come to this necessary API change, we took into account complicated scenarios like Dynamic RNNs and complex ConvNet models with dynamic branching.

So the API now looks like this:

optimizer = optim.SGD(model, lr=1e-3, momentum=0.9)
input, target = ...
loss = optimizer.step(lambda: criterion(model(input), target))  # sufficient for simple models

To simplify things at the user end for simple or specific common models, we will introduce a Trainer class, that will take a (dataset, model, optim) triple and train the model. This trainer class is planned for alpha-3.

A complete Hook system for nn and autograd

Accessing intermediate values during the forward pass is straightforward, but during the backward pass the buffers can rapidly change their content (for example, when doing in-place optimizations).

If you want access to the gradients at a particular Op or Layer inside your model, you use the hook system.
Hooks can be attached to variables or to modules and are called as soon as the gradient is available:

Example in autograd:

a, b, c = [Variable(torch.Tensor(5, 5)) for i in range(3)]

def print_norm(grad):
    print(grad.norm())

y = b * c + a
y.register_hook(print_norm)  # called with the gradient w.r.t. y as soon as it is computed

z = y * y - b
z.backward(torch.ones(5, 5))

Example in nn:

model = ...

def inspect_forward(module, input, output):
    print(output)

model.register_forward_hook(inspect_forward)

def inspect_backward(module, grad_input, grad_output):
    print(grad_output)

model.register_backward_hook(inspect_backward)

We definitely look forward to comments about the Hook system. Let us know what you think.

Added in-place ops to autograd and more neural network modules to nn
- As part of porting fb.resnet.torch, we've added AveragePool2d and fixed BatchNorm2d
- Now, autograd fully supports in-place operations, with in-place variables immediately marked as dirty.
To illustrate this, let's look at a small example

x = Variable(torch.ones(5, 5))
y = Variable(torch.ones(5, 5) * 4)

z = x * y
q = z * y
r = z + y
z.add_(y)  # an in-place op: z is marked as dirty from here on
# z is the result of the last expression, so this should succeed
z.backward(torch.ones(5, 5))

# r doesn't use z in its backward, so it should succeed
r.backward(torch.ones(5, 5))

# however, q needs z in its backward, but z has now been
# marked as dirty (because it was used in an in-place operation);
# this line will hence raise an error
q.backward(torch.ones(5, 5))

Plans for alpha-3
- Unit tests for multiprocessing
- Add more nn modules and autograd functions ( we're porting fb.resnet.torch )
- New CUDA memory allocator (non-synchronizing CUDA tensors allocations)
- We've made progress on this, but it is not complete yet
- Trainer and Dataset classes
- Continuous builds for CUDA (using Nimbix)
- Binary packages (nightly and versioned)


It's been a week since pytorch alpha-0.
We're excited to now present alpha-1 :)

What's new?

We've built a working and unit-tested version of the new nn and autograd packages (torch.nn, torch.autograd) along with a basic draft optim package (torch.optim). The old packages will continue to be available at torch.legacy.*

We've also built fully working serialization (torch.save / torch.load) with features that one expects out of the box, like sharing staying intact.

At this point, you can play around with things and get a feel of the new design.

There's an MNIST example at

A concern raised about pytorch was that Python is a slow language.

It turns out that the MNIST example runs in exactly the same amount of time per epoch in both pytorch and (lua)Torch, and we haven't done any optimizations in pytorch yet.

Another notable thing is that pytorch uses 1500MB of system memory vs (lua)Torch's 2300MB. This is before we've added any in-place optimizations into pytorch. The design of the new nn allows us to add seamless memory optimizations without the user needing to mark things as in-place or out-of-place, which will bring even more memory savings in pytorch.



We've published an early version of the new nn package.
There are only a few modules right now, but we'll be adding more soon.

There are a couple of advantages over the old package:
- Modules no longer hold temporary buffers and short-lived state. This allows you to use the same module multiple times in the forward pass, and the gradients will be summed automatically. For example, see how we use the same nn.ReLU object multiple times over here:
- There's no longer any need for using rigid container modules. Your model is defined by your code. You can select a completely different path across your model just by adding a number of `if`s. Any crazy branching schemes inside your model are allowed by design.
- It's fully compatible with autograd. Instead of using `nn.Add` or `nn.Index` you can just write this in your model definition: `y = module1(x_1)[0] + module2(x_2)`.
- You can register both forward and backward hooks at each module, which allow you to inspect the intermediate outputs and gradients flowing through the network and the graph.
- [Not Yet Implemented] Safe in-place operations. Tensors used in in-place operations are marked as dirty, and trying to use them in any way raises an error.


Autograd is at the core of pytorch. Enabling it is just a matter of wrapping your tensors in `Variable` objects before starting the computation (`x = Variable(x)`). Then, when you have your output, you can either call `y.backward()` if it's a scalar, or provide the gradient w.r.t. the variable as an argument (`y.backward(grad_output)`). Gradients w.r.t. variables are then available in their `.grad` attributes. Please note that only gradients of leaf variables (i.e. those created by the user) are computed. If you want to access gradients of intermediate values, you'll have to use the hook system.

If you don't want to compute gradient for some variables, you can even mark them in a constructor with `requires_grad=False`, and they will be optimized out from the backward pass.
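A minimal sketch of both points (written with the `Variable` wrapper described here; in today's torch, Variable has been merged into Tensor, but the wrapper still works):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
w = Variable(torch.ones(2, 2) * 3, requires_grad=False)  # optimized out of backward
y = (x * w).sum()
y.backward()
# dy/dx = w, so x.grad is all 3s; w.grad stays None since w opted out
```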


_Please note that this api is still a bit experimental, and is likely to undergo changes soon._

optim has a different, more object-oriented API. First, you have to create an optimizer object `optimizer = optim.sgd(model, lr=1e-3, momentum=0.9)`. If you don't want to merge the model and criterion into a single object, it's also possible to pass a tuple of `(model, criterion)` as the first argument to the constructor. Then, in your training loop you just call `loss = optimizer.step(input)` (in the case of a separate model and criterion, input should be a tuple of `(input, target)`). This accumulates all the gradients and performs a single optimization step on the parameters.


Tensors supported `pickle` protocol since the beginning of alpha, but pickle can't handle storage/data sharing properly and requires all the data to be copied before serialization.
We've created `torch.save` and `torch.load`, which have the same interface and solve both of these problems.
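The sharing behavior can be demonstrated in a few lines (serializing to an in-memory buffer for illustration):

```python
import io
import torch

a = torch.ones(6)
b = a[:3]                  # b is a view sharing a's storage
buf = io.BytesIO()
torch.save((a, b), buf)    # a single save call preserves the sharing
buf.seek(0)
a2, b2 = torch.load(buf)

a2[0] = 5
# the storage sharing survived the round-trip: b2 sees the write to a2
```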

Tensor operators

Thanks to bart, we've added support for the `@` operator for matrix multiplication, and changed `*` to elementwise multiplication.

Plans for alpha-2:
- Hook system for nn and autograd (for accessing intermediate values)
- More nn modules, autograd options, and optim algorithms
- Inter-process sharing of tensors (for multiprocess data loading or hogwild training)
- New CUDA memory allocator (non-synchronizing CUDA tensors allocations)


Previously, optional arguments could only be given positionally. This is often unreadable, especially for LAPACK usage, where one declares booleans such as upper=True.

Now, one can simply do:

torch.clamp(x, min=-0.1, max=0.1)

We've also implemented ellipsis indexing similar to NumPy
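A small sketch of ellipsis indexing (the tensor here is made up for illustration):

```python
import torch

x = torch.randn(2, 3, 4)
# `...` stands for "all remaining dimensions", exactly as in NumPy
first_channel = x[..., 0]   # same as x[:, :, 0]
```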

Deterministic Data Loader

The data loader now generates indices on the main process and regardless of how many workers you use,
the order of data loading will remain consistent if you use the same random seed.

Fully tested LAPACK bindings

Unit tests on both the CPU and CUDA side.
On the CPU, we ship with MKL-integration, and on the GPU, LAPACK is powered by MAGMA


We are at a stage where we have converged to stable APIs.
Hence, documentation is going at a rapid pace, and we have covered:
- nn
- optim
- part of torch / Tensors

As always, you can check out the documentation here: [](


We added one new tutorial: **[Creating extensions using numpy and scipy](**
- This covers the case where you would want to quickly write some modules of your neural network using familiar scipy tools like scipy.sparse for example.

We improved the existing tutorials to cover more of the basics, and improved them.

New Features and modules

PyTorch Vision

A one-stop repository for all of your image (and soon, video) needs, whether that be data loaders, common neural network definitions (such as alexnet, inception, resnet, etc.), or data augmentation routines.
Our plan is to put some serious engineering firepower into this module, with GPU loaders and augmentation routines, especially for video processing. Contributions welcome :)

So far, we have:

Data loaders
- COCO (Captioning and Detection) (
- LSUN Classification (
- ImageFolder (
- Imagenet-12 (
- CIFAR10 and CIFAR100 (

All the data loaders are fully documented, and share a basic interface.
They are fully compatible with torch.utils.DataLoader to be parallelized in fetching.

Common Image Transforms
- Converters from PIL Image to Torch Tensors
- Random Cropping, Scaling, Normalization transforms
- Unit tested

**The Imagenet example has been updated to use this package**

Recurrent Neural Networks

One of the biggest strengths of PyTorch's new design is the ability to seamlessly share weights and build recurrent nets.
We've emphasized this, and also deeply integrated CuDNN in a way that, as a user, you do not notice a thing, while having full power and speed.

nn.RNN, nn.LSTM and nn.GRU are the stacked RecurrentNet modules that you would want to use, and for generally crazy research, we've also given implementations of individual cells: nn.LSTMCell and nn.GRUCell

A fully tested and verified example is provided in
This example does word-level language modeling on the PennTreeBank dataset.
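A minimal sketch of the stacked module in use (the sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

# 2-layer stacked LSTM: 10 input features, 20 hidden units per layer
rnn = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)
inp = torch.randn(5, 3, 10)   # (seq_len, batch, input_size)
h0 = torch.zeros(2, 3, 20)    # (num_layers, batch, hidden_size)
c0 = torch.zeros(2, 3, 20)
out, (hn, cn) = rnn(inp, (h0, c0))
# out holds the top layer's hidden state at every time step
```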

Adversarial Networks

A concise example of Generative Adversarial Networks for Image Generation is provided, integrating multiple datasets (showcasing the power of the vision package).
The example is < 250 lines of code, and gives a lot more clarity towards the usage of PyTorch.
Multiple data loader threads, checkpointing, saving generated images to disk and much more is showcased.

A stable and fleshed out Optim package

It took us some time to design a good and stable Optim API, but now we have converged to a clean design.
The Optim package is fully Multi-GPU and Multi-device ready out of the box.
Now we've implemented and unit tested the following algorithms:
- SGD, AdaDelta, Adagrad, Adam, AdaMax, Averaged SGD, RProp, RMSProp

Setting per-layer learning rates, or optimizing only part of your neural network is now very trivial.
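For instance, per-layer learning rates can be sketched with parameter groups (the two-layer model here is made up for illustration, and this uses the parameter-group interface as it exists in today's torch.optim):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
# One parameter group per layer, each with its own learning rate;
# momentum is shared by both groups as a default
optimizer = optim.SGD([
    {'params': model[0].parameters(), 'lr': 1e-2},
    {'params': model[1].parameters(), 'lr': 1e-4},
], momentum=0.9)
```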

It is fully documented here:
Its usage can be seen in both the DCGAN and Imagenet examples.

Improved Multi-GPU performance (and more is coming)

We've improved the Multi-GPU performance since alpha-4, and we are close to squeezing out full performance.
We are working closely with NVIDIA to squeeze out the last drops of performance and make PyTorch future-proof for the P100 and new cards.