Torchvision Changelog

New Datasets

* Add Caltech101, Caltech256, and CelebA (775)
* ImageNet dataset (764) (858) (870)
* Added Semantic Boundaries Dataset (808) (865)
* Add VisionDataset as a base class for all datasets (749) (859) (838) (876) (878)

New Models


* Add GoogLeNet (Inception v1) (678) (821) (828) (816)
* Add MobileNet V2 (818) (917)
* Add ShuffleNet v2 (849)  (886) (889) (892) (916)
* Add ResNeXt-50 32x4d and ResNeXt-101 32x8d (822) (852) (917)


Segmentation Models

* Fully-Convolutional Network (FCN) with ResNet 101 backbone
* DeepLabV3 with ResNet 101 backbone


Detection Models

* Faster R-CNN R-50 FPN trained on COCO train2017 (898) (921)
* Mask R-CNN R-50 FPN trained on COCO train2017 (898) (921)
* Keypoint R-CNN R-50 FPN trained on COCO train2017 (898) (921) (922)

Breaking changes

* Make `CocoDataset` ids deterministically ordered (868)

New Transforms

* Add bias vector to `LinearTransformation` (793) (843) (881)
* Add Random Perspective transform  (781) (879)


Bug Fixes

* Fix user warning when applying `normalize` (810)
* Fix logic error in `check_integrity` (871)


Improvements

* Fix mutation of 2d tensors in `to_pil_image` (762)
* Replace `tensor.view` with `tensor.unsqueeze(0)` in `make_grid` (765)
* Change usage of `view` to `reshape` in `resnet` to enable running with mkldnn (890)
* Improve `normalize` to work with tensors located on any device (787)
* Raise an `IndexError` for `FakeData.__getitem__()` if the index would be out of range (780)
* Aspect ratio is now sampled from a logarithmic distribution in `RandomResizedCrop`. (799)
* Modernize inception v3 weight initialization code (824)
* Remove duplicate code from densenet load_state_dict (827)
* Replace `endswith` calls in a loop with a single `endswith` call in `DatasetFolder` (832)
* Added missing dot in webp image extensions (836)
* fix inconsistent behavior for `~` expression (850)
* Minor Compressions in statements in `` (874)
* Minor fix to evaluation formula of `PILLOW_VERSION`  in `transforms.functional.affine` (895)
* added `is_valid_file` parameter to `DatasetFolder` (867)
* Add support for joint transformations in `VisionDataset` (872)
* Auto calculating return dimension of `squeezenet` forward method (884)
* Added `progress` flag to model getters (875) (910)
* Add support for other normalizations (i.e., `GroupNorm`) in `ResNet` (813)
* Add dilation option to `ResNet` (866)


Tests

* Add basic model testing. (811)
* Add test for `num_class` in `` (815)
* Added test for `normalize` functionality in `make_grid` function. (840)
* Added downloaded directory not empty check in `test_datasets_utils` (844)
* Added test for `save_image` in utils (847)
* Added tests for `check_md5` and `check_integrity` (873)


Misc

* Remove shebang in `` (773)
* configurable version and package names (842)
* More hub models (851)
* Update travis to use more recent GCC (891)


Documentation

* Add comments regarding downsampling layers of resnet (794)
* Remove unnecessary bullet point in InceptionV3 doc (814)
* Fix `crop` and `resized_crop` docs in `` (817)
* Added dimensions in the comments of googlenet (788)
* Update transform doc with random offset of padding due to `pad_if_needed` (791)
* Added the argument `transform_input` in docs of InceptionV3 (789)
* Update documentation for MNIST datasets (778)
* Fixed typo in `normalize()` function. (823)
* Fix typo in squeezenet (841)
* Fix typo in DenseNet comment (857)
* Typo and syntax fixes to transform docstrings (887)






Torchscript support for torchvision.ops

torchvision ops are now natively supported by torchscript. This includes operators such as `nms`, `roi_align` and `roi_pool`; for the ops that support backpropagation, both eager and torchscript modes are supported in autograd.
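For example, `nms` can now be called from inside a scripted function; a minimal sketch (the `filter_boxes` wrapper is illustrative, not a library API):

```python
import torch
from torchvision import ops

@torch.jit.script
def filter_boxes(boxes: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    # torchvision ops can be used directly in torchscript
    keep = ops.nms(boxes, scores, 0.5)
    return boxes[keep]

boxes = torch.rand(10, 4) * 100
boxes[:, 2:] += boxes[:, :2]  # convert to (x1, y1, x2, y2) format
print(filter_boxes(boxes, torch.rand(10)).shape)
```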

New operators

Deformable Convolution (1586) (1660) (1637)

As described in Deformable Convolutional Networks, torchvision now supports deformable convolutions. The module expects both the input tensor and the offsets as input, and can be used as follows:

```python
import torch
from torchvision import ops

module = ops.DeformConv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
x = torch.rand(1, 1, 10, 10)

# the number of channels for the offset should be a multiple of
# 2 * module.weight.size(2) * module.weight.size(3), which corresponds
# to the kernel size
offset = torch.rand(1, 2 * 3 * 3, 10, 10)

# the output requires both the input and the offsets
out = module(x, offset)
```

If needed, the user can create their own wrapper module that imposes constraints on the offset. Here is an example, using a single convolution layer to compute the offset (a minimal sketch; the convolution arguments are representative choices):

```python
import torch
from torch import nn
from torchvision import ops

class BasicDeformConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1,
                 dilation=1, groups=1, offset_groups=1):
        super().__init__()
        offset_channels = 2 * kernel_size * kernel_size
        pad = dilation * (kernel_size - 1) // 2  # "same" padding for odd kernels
        # a plain convolution predicts the offsets from the input
        self.conv2d_offset = nn.Conv2d(in_channels, offset_channels * offset_groups,
                                       kernel_size=kernel_size, stride=stride,
                                       padding=pad, dilation=dilation)
        self.conv2d = ops.DeformConv2d(in_channels, out_channels, kernel_size,
                                       stride=stride, padding=pad, dilation=dilation,
                                       groups=groups, bias=False)

    def forward(self, x):
        offset = self.conv2d_offset(x)
        return self.conv2d(x, offset)
```

Position-sensitive RoI Pool / Align (1410)

torchvision now supports the Position-Sensitive Region of Interest (RoI) Pool and Align operators mentioned in Light-Head R-CNN. These are available under `ops.ps_roi_align`, `ops.ps_roi_pool` and the module equivalents `ops.PSRoIAlign` and `ops.PSRoIPool`, and have the same interface as `roi_align` / `roi_pool`.
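A minimal sketch of `ps_roi_align` (shapes chosen for illustration): the number of input channels must be a multiple of `output_size[0] * output_size[1]`, and the channels are folded into the output bins:

```python
import torch
from torchvision import ops

# 2 * 3 * 3 = 18 input channels fold into 2 output channels for a 3x3 bin grid
x = torch.rand(1, 18, 32, 32)
# one box in (batch_index, x1, y1, x2, y2) format
rois = torch.tensor([[0.0, 4.0, 4.0, 20.0, 20.0]])
out = ops.ps_roi_align(x, rois, output_size=(3, 3))
print(out.shape)  # torch.Size([1, 2, 3, 3])
```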

New Features

TorchScript support

* Bugfix in BalancedPositiveNegativeSampler introduced during torchscript support (1670)
* Make R-CNN models less verbose in script mode (1671)
* Minor torchscript fixes for Mask R-CNN (1639)
* remove BC-breaking changes (1560)
* Make maskrcnn scriptable (1407)
* Add Script Support for Video Resnet Models (1393)
* fix ASPPPooling (1575)
* Test that torchhub models are scriptable (1242)
* Make GoogLeNet & InceptionNet scriptable (1349)
* Make fcn_resnet Scriptable (1352)
* Make Densenet Scriptable (1342)
* make resnext scriptable (1343)
* make shufflenet and resnet scriptable (1270)


ONNX

* Enable KeypointRCNN test (1673)
* enable mask rcnn test (1613)
* Changes to Enable KeypointRCNN ONNX Export (1593)
* Disable Profiling in Failing Test (1585)
* Enable ONNX Test for FasterRcnn (1555)
* Support Exporting Mask Rcnn to ONNX (1461)
* Lahaidar/export faster rcnn (1401)
* Support Exporting RPN to ONNX (1329)
* Support Exporting MultiScaleRoiAlign to ONNX (1324)
* Support Exporting GeneralizedRCNNTransform to ONNX (1325)


Quantization

* Update quantized shufflenet weights (1715)
* Add commands to run quantized model with pretrained weights (1547)
* Quantizable googlenet, inceptionv3 and shufflenetv2 models (1503)
* Quantizable resnet and mobilenet models (1471)
* Remove model download from test_quantized_models (1526)



Bug Fixes

* Bugfix on GroupedBatchSampler for corner case where there are not enough examples in a category to form a batch (1677)
* Fix rpn memory leak and dataType errors. (1657)
* Fix torchvision install due to zippeg egg (1536)


Transforms

* Make shear operation area preserving (1529)
* PILLOW_VERSION deprecation updates (1501)
* Adds optional fill colour to rotate (1280)


Ops

* Add Deformable Convolution operation. (1586) (1660) (1637)
* Fix inconsistent NMS implementation between CPU and CUDA (1556)
* Speed up nms_cuda (1704)
* Implementation for Position-sensitive ROI Pool/Align (1410)
* Remove cpp extensions in favor of torch ops (1348)
* Make custom ops differentiable (1314)
* Fix Windows build in Torchvision Custom op Registration (1320)
* Revert "Register Torchvision Ops as Custom Ops (1267)" (1316)
* Register Torchvision Ops as Custom Ops (1267)
* Use Tensor.data_ptr instead of .data (1262)
* Fix header includes for cpu (1644)


Datasets

* fixed test for windows by closing the created temporary files (1662)
* VideoClips windows fixes (1661)
* Fix VOC on Windows (1641)
* update dead LSUN link (1626)
* DatasetFolder should follow links when searching for data (1580)
* add .tgz support to extract_archive (1650)
* expose audio_channels as a parameter to kinetics dataset (1559)
* Implemented integrity check (md5 hash) after dataset download (1456)
* Move VideoClips dummy dataset to top level for pickling (1649)
* Remove download for ImageNet (1457)
* add tar.xz archive handler (1361)
* Fix DeprecationWarning for collections.Iterable import in LSUN (1417)
* Support empty target_type for CelebA dataset (1351)
* VOC2007 support test set (1340)
* Fix EMNIST download URL (1297) (1318)
* Refactored clip_sampler (1562)


Documentation

* Fix documentation for NMS (1614)
* More examples of functional transforms (1402)
* Fixed doc of crop functionals (1388)
* Added Training Sample code for fasterrcnn_resnet50_fpn (1695)
* Fix typo (1276)
* Update README with minimum required version of PyTorch (1272)
* fix alignment of README (1396)
* fixed typo in DatasetFolder and ImageFolder (1284)


Models

* Bugfix for MNASNet (1224)
* Fix anchor dtype in AnchorGenerator (1341)


Utils

* Adding File object option to utils.save_image (1301)
* Fix make_grid: support any number of channels in tensor (1300)
* Fix bug of changing input tensor in utils.save_image (1244)

Reference scripts

* add a README for training object detection models (1612)
* Adding args for names of train and val directories (1544)
* Fix broken bitwise operation in Similarity Reference loss (1604)
* Fixing issue 1530 by starting ann_id to 1 in convert_to_coco_api (1531)
* Add commands for model training (1203)
* adding documentation for automatic mixed precision training (1533)
* Fix reference training script for Mask R-CNN for PyTorch 1.2 (during evaluation after epoch, mask datatype became bool, pycocotools expects uint8) (1413)
* fix a little bug about resume (1628)
* Better explain lr and batch size in references/detection/ (1233)
* update default parameters in references/detection (1611)
* Removed code redundancy/refactored in video_classification (1549)
* Fix comment in default arguments in references/detection (1243)


Tests

* Correctness test implemented with old test architecture (1511)
* Simplify and organize test_ops. (1551)
* Replace asserts with assertEqual (1488)(1499)(1497)(1496)(1498)(1494)(1487)(1495)
* Add expected result tests (1377)
* Add TorchHub tests to torchvision (1319)
* Scriptability checks for Tensor Transforms (1690)
* Add tests for results in script vs eager mode (1430)
* Test for checking non mutating behaviour of tensor transforms (1656)
* Disable download tests for Python2 (1269)
* Fix randomresized params flaky (1282)


CI

* Disable C++ models from being compiled without explicit request (1535)
* Fix discrepancy in (1583)
* soumith -> pytorch for docker images (1577)
* [wip] try vs2019 toolchain (1509)
* Make CI use PyTorch nightly (1492)
* Try enabling Windows CUDA CI (1486)
* Fix CUDA builds on Windows (1485)
* Try fix Windows CircleCI (1433)
* Fix CUDA CI (1464)
* Change approach for rebase to master (1427)
* Temporary fix for CI (1411)
* Use PyTorch 1.3 for CI (1467)
* Use links from S3 to install CUDA (1472)
* Enable CUDA 9.2 builds for Windows (1381)
* Fix nightly builds (1374)
* Fix Windows CI after 1301 (1368)
* Retry `anaconda login` for Windows builds (1366)
* Fix nightly wheels builds for Windows (1358)
* Fix CI for py2.7 cu100 wheels (1354)
* Fix Windows CI (1347)
* Windows build scripts (1241)
* Make CircleCI checkout merge commit (1344)
* use native python code generation logic (1321)
* Add CircleCI (v2) (1298)






Network | box AP | mask AP | keypoint AP
-- | -- | -- | --
Faster R-CNN ResNet-50 FPN | 37.0 |   |
Mask R-CNN ResNet-50 FPN | 37.9 | 34.6 |
Keypoint R-CNN ResNet-50 FPN | 54.6 |   | 65.0

The implementations of the models for object detection, instance segmentation and keypoint detection are fast, especially during training.

In the following table, we report results using 8 V100 GPUs, with CUDA 10.0 and CUDNN 7.4. During training, we use a batch size of 2 per GPU; during testing, a batch size of 1 is used.

For test time, we report the time for the model evaluation and post-processing (including mask pasting in image), but not the time for computing the precision-recall.

Network | train time (s / it) | test time (s / it) | memory (GB)
-- | -- | -- | --
Faster R-CNN ResNet-50 FPN | 0.2288 | 0.0590 | 5.2
Mask R-CNN ResNet-50 FPN | 0.2728 | 0.0903 | 5.4
Keypoint R-CNN ResNet-50 FPN | 0.3789 | 0.1242 | 6.8

You can load and use pre-trained detection and segmentation models with a few lines of code:

```python
import torchvision
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
# set it to evaluation mode, as the model behaves differently
# during training and during evaluation
model.eval()

image ='/path/to/an/image.jpg')
image_tensor = torchvision.transforms.functional.to_tensor(image)

# pass a list of (potentially different sized) tensors
# to the model, in 0-1 range. The model will take care of
# batching them together and normalizing
output = model([image_tensor])
# output is a list of dict, containing the postprocessed predictions
```

Pixelwise Semantic Segmentation models

**Warning: The API is currently experimental and might change in future versions of torchvision**

The 0.3 release also contains models for dense pixelwise prediction on images.
It adds FCN and DeepLabV3 segmentation models, using ResNet50 and ResNet101 backbones.
Pre-trained weights for the ResNet101 backbone are available, and have been trained on a subset of COCO train2017 that contains the same 20 categories as Pascal VOC.

The pre-trained models give the following results on the subset of COCO val2017 that contains the same 20 categories as Pascal VOC:

Network | mean IoU | global pixelwise acc
-- | -- | --
FCN ResNet101 | 63.7 | 91.9
DeepLabV3 ResNet101 | 67.4 | 92.4
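Loading a segmentation model follows the same pattern as the classification models; a minimal sketch (the input shape is illustrative):

```python
import torch
import torchvision

model = torchvision.models.segmentation.fcn_resnet101(pretrained=True)
model.eval()

# the model returns a dict; 'out' holds the per-pixel class scores
with torch.no_grad():
    out = model(torch.rand(1, 3, 320, 320))['out']
print(out.shape)  # torch.Size([1, 21, 320, 320]), one channel per class
```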


This release is the first one that officially drops support for Python 2.
It contains a number of improvements and bugfixes.


Faster/Mask/Keypoint RCNN supports negative samples

It is now possible to feed training images that do not contain any positive annotations to Faster / Mask / Keypoint R-CNN.
This enables increasing the number of negative samples during training. For those images, the annotations are expected to contain tensors with 0 in the number-of-objects dimension, as follows:

```python
target = {"boxes": torch.zeros((0, 4), dtype=torch.float32),
          "labels": torch.zeros(0, dtype=torch.int64),
          "image_id": 4,
          "area": torch.zeros(0, dtype=torch.float32),
          "masks": torch.zeros((0, image_height, image_width), dtype=torch.uint8),
          "keypoints": torch.zeros((17, 0, 3), dtype=torch.float32),
          "iscrowd": torch.zeros((0,), dtype=torch.int64)}
```

Aligned flag for RoIAlign

`RoIAlign` now supports an `aligned` flag, which aligns two neighboring pixel indices more precisely.
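A minimal sketch of the flag in use (box coordinates are illustrative):

```python
import torch
from torchvision import ops

x = torch.rand(1, 3, 32, 32)
# one box in (batch_index, x1, y1, x2, y2) format
rois = torch.tensor([[0.0, 4.0, 4.0, 20.0, 20.0]])
# aligned=True shifts the sampling points by half a pixel, giving a more
# exact correspondence between box coordinates and pixel indices
out = ops.roi_align(x, rois, output_size=(7, 7), aligned=True)
print(out.shape)  # torch.Size([1, 3, 7, 7])
```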

Refactored abstractions for C++ video decoder

This change is transparent to Python users, but the whole C++ backend for video reading (which requires torchvision to be compiled from source to be enabled, for now) has been refactored into more modular abstractions.
By leveraging those core abstractions, the video reader functions exposed to Python can be written in a much more concise way.

Backwards Incompatible Changes

* Dropping Python2 support (1761, 1792, 1984, 1976, 2037, 2033, 2017)
* [Models] Fix inception quantized pre-trained model (1954, 1969, 1975)
* ONNX support for Mask R-CNN and Keypoint R-CNN has been temporarily dropped, but will be fixed in next releases

New Features

* [Transforms] Add Perspective fill option (1973)
* [Ops]  `aligned` flag in ROIAlign (1908)
* [IO] Update video reader to use new decoder (1978)
* [IO] torchscriptable functions for video io (1653, 1794)
* [Models] Support negative samples in Faster R-CNN, Mask R-CNN and Keypoint R-CNN (1911, 2069)



Improvements

Datasets

* STL10: don't check integrity twice when download=True (1787)
* Improve code readability and docstring of video datasets(2020)
* [DOC] Fixed typo in Cityscapes docs (1851)


Transforms

* Allow passing list to the input argument 'scale' of RandomResizedCrop (1997) (2008)
* F.normalize unsqueeze mean & std only for 1-d arrays (2002)
* Improved error messages for transforms.functional.normalize(). (1915)
* generalize number of bands calculation in to_tensor (1781)
* Replace 2 transpose ops with 1 permute in ToTensor (2018)
* Fixed Pillow version check for Pillow >= 10 (2039)
* [DOC]: Improve transforms.Normalize docs (1784, 1858)
* [DOC] Fixed missing new line in transforms.Crop docstring (1922)


Ops

* Check boxes shape in RoIPool / Align (1968)
* [ONNX] Export new_empty_tensor (1733)
* Fix Tensor::data<> deprecation. (2028)
* Fix deprecation warnings (2055)


Models

* Add warning and note docs for scipy (1842) (1966)
* Added __repr__ attribute to GeneralizedRCNNTransform (1834)
* Replace mean on dimensions 2,3 by adaptive_avg_pooling2d in mobilenet (1838)
* Add init_weights keyword argument to Inception3 (1832)
* Add device to torch.tensor. (1979)
* ONNX export for variable input sizes in Faster R-CNN (1840)
* [JIT] Cleanup torchscript constant annotations (1721, 1923, 1907, 1727)
* [JIT] use // now that it is supported (1658)
* [JIT] add torch.jit.script to ImageList (1919)
* [DOC] Improved docs for Faster R-CNN (1886, 1868, 1768, 1763)
* [DOC] add comments for the modified implementation of ResNet (1983)
* [DOC] Add comments to AnchorGenerator (1941)
* [DOC] Add comment in GoogleNet (1932)


Documentation

* Document int8 quantization model (1951)
* Update Doc with ONNX support (1752)
* Update README to reflect strict dependency on torch==1.4.0 (1767)
* Update sphinx theme (2031)
* Document origin of preprocessing mean / std (1965)
* Fix docstring formatting issues (2049)

Reference scripts

* Add return statement in evaluate function of detection reference script (2029)
* [DOC]Add default training parameters to classification reference README (1998)
* [DOC] Add README to references/segmentation (1864)


Tests

* Improve stability of test_nms_cuda (2044)
* [ONNX] Disable model tests since export of interpolate script module is broken (1989)
* Skip inception v3 in test/test_quantized_models (1885)
* [LINT] Small indentation fix (1831)


Misc

* Remove unintentional -O0 option in (1770)
* Create
* Update issue templates (1913, 1914)
* master version bump 0.5 → 0.6
* replace torch 1.5.0 items flagged with deprecation warnings (fix 1906) (1918)


CI

* Remove av from the binary requirements (2006)
* ci: Add cu102 to CI and packaging, remove cu100 (1980)
* .circleci: Switch to use token for conda uploads (1960)
* Improvements to CI infra (2051, 2032, 2046, 1735, 2048, 1789, 1731, 1961)
* typing only needed for python 3.5 and previous (1778)
* Move C++ and Python linter to CircleCI (2056, 2057)

Bug Fixes


Datasets

* bug fix on downloading voc2007 test dataset (1991)
* fix lsun docstring example (1935)
* Fixes EMNIST classes attribute is wrong 1716 (1736)
* Force object annotation to be a list in VOC (1790)


Models

* Fix for AnchorGenerator when device switch happen (1745)
* [JIT] fix len error (1981)
* [JIT] fix googlenet no aux logits (1949)
* [JIT] Fix quantized googlenet (1974)


Transforms

* Fix for rotate fill with Images of type F (1828)
* Fix fill in rotate (1760)


Ops

* Fix bug in DeformConv2d for batch sizes > 32 (2027, 2040)
* Fix for roi_align ONNX export (1988)
* Fix torchscript issue in ConvTranspose2d (1917)
* Fix interpolate when no scale_factor is passed (1785)
* Fix Windows build by renaming Python init functions (1779)
* fix for loading models with num_batches_tracked in frozen bn (1728)


Deprecations

* The `pts_unit` value of `'pts'` in `read_video` and `read_video_timestamps` is deprecated, and will be replaced in upcoming releases with `'sec'`.


This release brings several new additions to torchvision that improve support for deployment. Most notably, all models in torchvision are torchscript-compatible and can be exported to ONNX. Additionally, a few classification models have quantized weights.

**Note: this is the last version of torchvision that officially supports Python 2.**

Breaking changes

Updated KeypointRCNN pre-trained weights

The pre-trained weights for keypointrcnn_resnet50_fpn have been updated and now correspond to the results reported in the documentation. The previous weights corresponded to an intermediate training checkpoint. (1609)

Corrected the implementation for MNASNet

The previous implementation contained a bug which affects all MNASNet variants other than mnasnet1_0. The bug was that the first few layers needed to also be scaled in terms of width multiplier, along with all the rest. We now provide a new checkpoint for mnasnet0_5, which gives 32.17 top1 error. (1224)


TorchScript support for all models

All models in torchvision have native support for torchscript, for both training and testing. This includes complex models such as DeepLabV3, Mask R-CNN and Keypoint R-CNN.
Using torchscript with torchvision models is easy:

```python
import torch
import torchvision

# get a pre-trained model
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# convert to torchscript
model_script = torch.jit.script(model)

# compute predictions
predictions = model_script([torch.rand(3, 300, 300)])
```

**Warning: the return type for the scripted version of Faster R-CNN, Mask R-CNN and Keypoint R-CNN is different from its eager counterpart, and it always returns a tuple of losses, detections. This discrepancy will be addressed in a future release.**
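Concretely, the scripted model's output is unpacked as a pair; a short sketch reusing `model_script` from above:

```python
# scripted R-CNN models always return (losses, detections),
# even in eval mode, unlike their eager counterparts
losses, detections = model_script([torch.rand(3, 300, 300)])
```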


ONNX

All models in torchvision can now be exported to ONNX for deployment. This includes models such as Mask R-CNN.

```python
import torch
import torchvision

# get a pre-trained model
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()
inputs = [torch.rand(3, 300, 300)]
predictions = model(inputs)

# convert to ONNX (opset_version 11 is required for Mask R-CNN)
torch.onnx.export(model, inputs, "model.onnx", opset_version=11)
```

**Warning: for Faster R-CNN / Mask R-CNN / Keypoint R-CNN, the current exported model is dependent on the input shape during export. As such, make sure that once the model has been exported to ONNX that all images that are fed to it have the same shape as the shape used to export the model to ONNX. This behavior will be made more general in a future release.**

Quantized models

torchvision now provides quantized models for ResNet, ResNeXt, MobileNetV2, GoogleNet, InceptionV3 and ShuffleNetV2, as well as reference scripts for quantizing your own model in references/classification. A pre-trained quantized model can be obtained with a few lines of code:
```python
import torch
import torchvision

model = torchvision.models.quantization.mobilenet_v2(pretrained=True, quantize=True)

# run the model with quantized inputs and weights
out = model(torch.rand(1, 3, 224, 224))
```

We provide pre-trained quantized weights for the following models:

| Model | Acc1 | Acc5 |
| --- | --- | --- |


This minor release introduces an optimized `video_reader` backend for torchvision. It is implemented in C++, and uses FFmpeg internally.

The new `video_reader` backend can be up to 6 times faster compared to the `pyav` backend.
- When decoding all video/audio frames in the video, the new `video_reader` is 1.2x - 6x faster depending on the codec and video length.
- When decoding a fixed number of video frames (e.g. [4, 8, 16, 32, 64, 128]), `video_reader` runs equally fast for small values (i.e. [4, 8, 16]) and runs up to 3x faster for large values (e.g. [32, 64, 128]).

Using the optimized video backend

Switching to the new backend can be done via the `torchvision.set_video_backend('video_reader')` function. By default, we use a backend based on top of PyAV.

Due to packaging issues with FFmpeg, in order to use the `video_reader` backend one needs to first have `ffmpeg` available on the system, and then compile torchvision from source following the build instructions.

In torchvision 0.4.0, the `read_video` and `read_video_timestamps` functions used `pts` relative to the video stream. This could lead to unaligned video-audio being returned in some cases.

torchvision now allows specifying a `pts_unit` argument in those functions. The default value is `'pts'` (with the same behavior as before), and the user can now specify `pts_unit='sec'`, which produces consistently aligned results for both video and audio. The `'pts'` value is deprecated for now, and kept for backwards-compatibility.

In the next release, the default value of `pts_unit` will change to `'sec'`, so that calling `read_video` without specifying `pts_unit` returns consistently aligned audio-video results. This will require users to update their `VideoClips` checkpoints, which used to store the information in `pts` by default.
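A minimal sketch of switching backends and reading with second-based timestamps (the file path is a placeholder):

```python
import torchvision

# switch from the default pyav backend to the optimized C++ backend
torchvision.set_video_backend('video_reader')

# read the first two seconds; pts_unit='sec' gives aligned audio and video
video, audio, info =
    '/path/to/video.mp4', start_pts=0, end_pts=2, pts_unit='sec')
```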

- [video reader] inception commit (1303) 31fad34
- Expose frame-rate and cache to video datasets (1356) 85ffd93
- Expose num_workers in VideoClips (1359) 02a8c0a
- Fix randomresized params flaky (1282) 7c9bbf5
- Video transforms (1353) 64917bc
- add _backend argument to init() of class VideoClips (1363) 7874374
- Video clips workers (1369) 0982395
- modified code of io.read_video and io.read_video_timestamps to interpret pts values in seconds (1331) 17e355f
- add metadata to video dataset classes. bug fix. more robustness (1376) 49b01e3
- move sampler into TV core. Update UniformClipSampler (1408) f0d3daa
- remove hardcoded video extension in kinetics400 dataset (1418) 929c81d
- Fix hmdb51 and ucf101 typo (1420) b13931a
- fix a bug related to audio_end_pts (1431) 1258bb7
- expose more io api (1423) e48b958
- Make video transforms private (1429) 79daca1
- extend video reader to support fast video probing (1437) ed5b2dc
- Better handle corrupted videos (1463) da89dad
- Temporary fix to remove ffmpeg from build time (1475) ed04dee
- fix a bug when video decoding fails and empty frames are returned (1506) 2804c12
- extend DistributedSampler to support group_size (1512) 355e9d2
- Unify video backend (1514) 97b53f9
- Unify video metadata in VideoClips (1527) 7d509c5
- Fixed compute_clips docstring (1543) b438d32


This minor release provides binaries compatible with PyTorch 1.3.

Compared to version 0.4.0, it contains a single bugfix for the `HMDB51` and `UCF101` datasets.


This release adds support for video models and datasets, and brings several improvements.

**Note**: torchvision 0.4 requires PyTorch 1.2 or newer


Video and IO

Video is now a first-class citizen in torchvision. The 0.4 release includes:

* efficient IO primitives for reading and writing video files
* Kinetics-400, HMDB51 and UCF101 datasets for action recognition, which are compatible with ``
* Pre-trained models for action recognition, trained on Kinetics-400
*  Training and evaluation scripts for reproducing the training results.

Writing your own video dataset is easy. We provide a utility class `VideoClips` that simplifies the task of enumerating all possible clips of fixed size in a list of video files, by creating an index of all clips in a set of videos. It additionally allows specifying a fixed frame-rate for the videos.

```python
from torchvision.datasets.video_utils import VideoClips

class MyVideoDataset(object):
    def __init__(self, video_paths):
        # clip length / stride / frame rate below are example values
        self.video_clips = VideoClips(video_paths,
                                      clip_length_in_frames=16,
                                      frames_between_clips=1,
                                      frame_rate=15)

    def __getitem__(self, idx):
        video, audio, info, video_idx = self.video_clips.get_clip(idx)
        return video, audio

    def __len__(self):
        return self.video_clips.num_clips()
```

We provide pre-trained models for action recognition, trained on Kinetics-400, which reproduce the results of the original papers in which they were first introduced, along with the corresponding training scripts.

| model | clip acc@1 |
| --- | --- |
| r3d_18 | 52.748 |
| mc3_18 | 53.898 |
| r2plus1d_18 | 57.498 |
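Loading one of these models follows the usual torchvision pattern; a minimal sketch (the clip shape is illustrative):

```python
import torch
import torchvision

model =, pretrained=True)
model.eval()

# video models expect input of shape (batch, channels, frames, height, width)
clip = torch.rand(1, 3, 16, 112, 112)
out = model(clip)
print(out.shape)  # torch.Size([1, 400]), one score per Kinetics-400 class
```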


Bug Fixes

* change aspect ratio calculation formula in `references/detection` (1194)
* bug fixes in ImageNet (1149)
* fix save_image when height or width equals 1 (1059)
* Fix STL10 `__repr__` (969)
* Fix wrong behavior of `GeneralizedRCNNTransform` in Python2. (960)



Datasets

* Add USPS dataset (961)(1117)
* Added support for the QMNIST dataset (995)
* Add HMDB51 and UCF101 datasets (1156)
* Add Kinetics400 dataset (1077)


* Miscellaneous dataset fixes (1174)
* Standardize str argument verification in datasets (1167)
* Always pass `transform` and `target_transform` to abstract dataset (1126)
* Remove duplicate transform assignment in FakeDataset (1125)
* Automatic extraction for Cityscapes Dataset (1066) (1068)
* Use joint transform in Cityscapes (1024)(1045)
* CelebA: track attr names, support split="all", code cleanup (1008)
* Add folds option to STL10 (914)



Models

* Add pretrained Wide ResNet (912)
* Memory efficient densenet (1003) (1090)
* Implementation of the MNASNet family of models (829)(1043)(1092)
* Add VideoModelZoo models (1130)


* Fix resnet fpn backbone for resnet18 and resnet34 (1147)
* Add checks to `roi_heads` in detection module (1091)
* Make shallow copy of input list in `GeneralizedRCNNTransform` (1085)(1111)(1084)
* Make MobileNetV2 number of channel divisible by 8 (1005)
* typo fix: ouput -> output in Inception and GoogleNet (1034)
* Remove empty proposals from the RPN (1026)
* Remove empty boxes before NMS (1019)
* Reduce code duplication in segmentation models (1009)
* allow user to define residual settings in MobileNetV2 (965)
* Use `flatten` instead of `view` (1134)


Documentation

* Consistency in detection box format (1110)
* Fix Mask R-CNN docs (1089)
* Add paper references to VGG and Resnet variants (1088)
* Doc, Test Fixes in `Normalize` (1063)
* Add transforms doc to more datasets (1038)
* Corrected typo: 5 to 0.5 (1041)
* Update doc for `torchvision.transforms.functional.perspective` (1017)
* Improve documentation for `fillcolor` option in `RandomAffine` (994)
* Added models information to documentation. (985)
* Add missing import in `` documentation (979)
* Improve `make_grid` docs (964)


Tests

* Add test for SVHN (1086)
* Add tests for Cityscapes Dataset (1079)
* Update CI to Python 3.6 (1044)
* Make `test_save_image` more robust (1037)
* Add a generic test for the datasets (1015)
* moved fakedata generation to separate module (1014)
* Create imagenet fakedata on-the-fly (1012)
* Minor test refactorings (1011)
* Add test for CIFAR10(0) (1010)
* Mock MNIST download for less flaky tests (1004)
* Add test for ImageNet (976)(1006)
* Add tests for datasets (966)



Transforms

* Add Random Erasing for image augmentation (909) (1060) (1087) (1095)


* Allowing 'F' mode for 1 channel FloatTensor in `ToPILImage` (1100)
* Add shear parallel to y-axis (1070)
* fix error message in `to_tensor` (1000)
* Fix TypeError in `RandomResizedCrop.get_params` (1036)
* Fix `normalize` for different `dtype` than `float32` (1021)


Ops

* Renamed `vision.h` files to `vision_cpu.h` and `vision_cuda.h` (1051)(1052)
* Optimize `nms_cuda` by avoiding extra `` call (945)

Reference scripts

* Expose data-path in the detection reference scripts (1109)
* Make `` work with pytorch-cpu (1023)
* Add mixed precision training with Apex (972)(1124)
* Add reference code for similarity learning (1101)


Build

* Add windows build steps and wheel build scripts (998)
* add packaging scripts (996)
* Allow forcing GPU build with `FORCE_CUDA=1` (927)


Misc

* Misc lint fixes (1020)
* Reraise error on failed downloading (1013)
* add more hub models (974)
* make C extension lazy-import (971)


This release brings several new features to torchvision, including models for semantic segmentation, object detection, instance segmentation and person keypoint detection, and custom C++ / CUDA ops specific to computer vision.

**Note: torchvision 0.3 requires PyTorch 1.1 or newer**


Reference training / evaluation scripts

We now provide under the `references/` folder scripts for training and evaluation of the following tasks: classification, semantic segmentation, object detection, instance segmentation and person keypoint detection.
Their purpose is twofold:

* serve as a log of how to train a specific model.
* provide baseline training and evaluation scripts to bootstrap research

They all have an entry-point `` which performs both training and evaluation for a particular task. Other helper files, specific to each training script, are also present in the folder, and they might get integrated into the torchvision library in the future.

We expect users to copy-paste and modify those reference scripts for their own needs.

TorchVision Ops

TorchVision now contains custom C++ / CUDA operators in `torchvision.ops`. Those operators are specific to computer vision, and make it easier to build object detection models.
Those operators currently do not support PyTorch script mode, but support for it is planned for future releases.

List of supported ops

* `roi_pool` (and the module version `RoIPool`)
* `roi_align` (and the module version `RoIAlign`)
* `nms`, for non-maximum suppression of bounding boxes
* `box_iou`, for computing the intersection over union metric between two sets of bounding boxes

All the other ops present in `torchvision.ops` and its subfolders are experimental, in particular:

* `FeaturePyramidNetwork` is a module that adds a FPN on top of a module that returns a set of feature maps.
* `MultiScaleRoIAlign` is a wrapper around `roi_align` that works with multiple feature map scales

Here are a few examples on using torchvision ops:

```python
import torch
import torchvision

# create 10 random boxes
boxes = torch.rand(10, 4) * 100
# they need to be in [x0, y0, x1, y1] format
boxes[:, 2:] += boxes[:, :2]
# create a random image
image = torch.rand(1, 3, 200, 200)
# extract regions in `image` defined in `boxes`, rescaling
# them to have a size of 3x3
pooled_regions = torchvision.ops.roi_align(image, [boxes], output_size=(3, 3))
# check the size
print(pooled_regions.shape)
# torch.Size([10, 3, 3, 3])

# or compute the intersection over union between
# all pairs of boxes
print(torchvision.ops.box_iou(boxes, boxes).shape)
# torch.Size([10, 10])
```
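The experimental `FeaturePyramidNetwork` can be exercised in a similar way; a minimal sketch (the feature map names and shapes are illustrative):

```python
import torch
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork

# build an FPN on top of two feature maps with 10 and 20 channels
fpn = FeaturePyramidNetwork(in_channels_list=[10, 20], out_channels=5)
x = OrderedDict([('feat0', torch.rand(1, 10, 64, 64)),
                 ('feat1', torch.rand(1, 20, 16, 16))])
out = fpn(x)
# both outputs now have 5 channels, at their original resolutions
print([(name, feat.shape) for name, feat in out.items()])
```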

Models for more tasks

The 0.3 release of torchvision includes pre-trained models for tasks other than image classification on ImageNet.
We include two new categories of models: region-based models, like Faster R-CNN, and dense pixelwise prediction models, like DeepLabV3.

Object Detection, Instance Segmentation and Person Keypoint Detection models

**Warning: The API is currently experimental and might change in future versions of torchvision**

The 0.3 release contains pre-trained models for Faster R-CNN, Mask R-CNN and Keypoint R-CNN, all of them using ResNet-50 backbone with FPN.
They have been trained on COCO train2017 following the reference scripts in `references/`, and give the following results on COCO val2017:

Network | box AP | mask AP | keypoint AP
-- | -- | -- | --
Faster R-CNN ResNet-50 FPN | 37.0 |   |
Mask R-CNN ResNet-50 FPN | 37.9 | 34.6 |
Keypoint R-CNN ResNet-50 FPN | 54.6 |   | 65.0


This version introduces several improvements and fixes.

Support for arbitrary input sizes for models

It is now possible to feed images larger than 224x224 into the models in torchvision.
We added an adaptive pooling just before the classifier, which adapts the size of the feature maps before the last layer, allowing for larger input images.
Relevant PRs: 744 747 746 672 643
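A minimal sketch of what this enables (the model and input size are illustrative):

```python
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)
model.eval()

# inputs larger than 224x224 now work, thanks to the adaptive
# pooling inserted just before the classifier
out = model(torch.rand(1, 3, 320, 480))
print(out.shape)  # torch.Size([1, 1000])
```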


Bug Fixes

* Fix invalid argument error when using lsun method in windows (508)
* Fix FashionMNIST loading MNIST (640)
* Fix inception v3 input transform for trace & onnx (621)


Datasets

* Add support for webp and tiff images in ImageFolder 736 724
* Add K-MNIST dataset 687
* Add Cityscapes dataset 695 725 739 700
* Add Flicker 8k and 30k datasets 674
* Add VOCDetection and VOCSegmentation datasets 663
* Add SBU Captioned Photo Dataset (665)
* Updated URLs for EMNIST 726
* MNIST and FashionMNIST now have their own 'raw' and 'processed' folder 601
* Add metadata to some datasets (501)


Improvements

* Allow RandomCrop to crop in the padded region 564
* ColorJitter now supports min/max values 548
* Generalize resnet to use block.extension 487
* Move area calculation out of for loop in RandomResizedCrop 641
* Add option to zero-init the residual branch in resnet (498)
* Improve error messages in to_pil_image 673
* Added the option of converting to tensor for numpy arrays having only two dimensions in to_tensor (686)
* Optimize _find_classes in DatasetFolder via scandir in Python3 (559)
* Add padding_mode to RandomCrop (489 512)
* Make DatasetFolder more generic (527)
* Add in-place option to normalize (699)
* Add Hamming and Box interpolations to (693)
* Added support for 2-channel Image modes such as 'LA', and for specifying a mode for 4-channel images (688)
* Improve support for 'P' image mode in pad (683)
* Make torchvision depend on pillow-simd if already installed (522)
* Make tests run faster (745)
* Add support for non-square crops in RandomResizedCrop (715)

Breaking changes

* `save_image` now rounds to the nearest integer (754)


Misc

* Added code coverage to travis 703
* Add downloads and docs badge to README (702)
* Add progress to download_url 497 524 535
* Replace 'residual' with 'identity' in (679)
* Consistency changes in the models
* Refactored MNIST and CIFAR to have data and target fields 578 594
* Update torchvision to newer versions of PyTorch
* Relax assertion in `transforms.Lambda.__init__` (637)
* Cast MNIST target to int (605)
* Change default target type of FakedDataset to long (581)
* Improve docs of functional transforms (602)
* Docstring improvements
* Add is_image_file to folder_dataset (507)
* Add deprecation warning in MNIST train[test]_labels[data] (742)
* Mention TORCH_MODEL_ZOO in models documentation. (624)
* Add scipy as a dependency to (675)
* Added size information for inception v3 (719)


This version introduces several fixes and improvements to the previous version.

Better printing of Datasets and Transforms

* Add descriptions to Transform objects.
Now `T.Compose([T.RandomHorizontalFlip(), T.RandomCrop(224), T.ToTensor()])` prints

```
Compose(
    RandomHorizontalFlip(p=0.5)
    RandomCrop(size=(224, 224), padding=0)
    ToTensor()
)
```

* Add descriptions to Datasets.
Now `torchvision.datasets.MNIST('~')` prints

```
Dataset MNIST
    Number of datapoints: 60000
    Split: train
    Root Location: /private/home/fmassa
    Transforms (if any): None
    Target Transforms (if any): None
```

New transforms

* Add RandomApply, RandomChoice, RandomOrder transformations (402); see the sketch after this list
  * RandomApply: applies a list of transformations with a given probability
  * RandomChoice: chooses a single transformation at random from a list
  * RandomOrder: applies transformations in a random order
* Add random affine transformation (411)

* Add reflect, symmetric and edge padding to `transforms.pad` 460
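A minimal sketch combining these transformations (the parameters are illustrative):

```python
from torchvision import transforms as T

transform = T.Compose([
    # apply color jitter with probability 0.3
    T.RandomApply([T.ColorJitter(brightness=0.2)], p=0.3),
    # pick one of two crops at random
    T.RandomChoice([T.RandomCrop(224), T.CenterCrop(224)]),
    T.ToTensor(),
])
```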

Performance improvements

* Speedup MNIST preprocessing by a factor of 1000x
* make weight initialization optional to speed up VGG construction. This makes loading pre-trained VGG models much faster
* Accelerate `transforms.adjust_gamma` by using PIL's point function instead of custom numpy-based implementation

New Datasets

* EMNIST - an extension of MNIST for hand-written letters
* OMNIGLOT - a dataset for one-shot learning, with 1623 different handwritten characters from 50 different alphabets
* Add a DatasetFolder class - generalization of ImageFolder

Miscellaneous improvements

* FakeData accepts a seed argument, so having multiple different FakeData instances is now possible
* Use consistent datatypes in Dataset targets. Now all datasets that return labels will have them as int
* Add probability parameter in `RandomHorizontalFlip` and `RandomVerticalFlip`
* Replace `np.random` by `random` in transforms - improves reproducibility in multi-threaded environments with default arguments
* Detect tif images in ImageFolder
* Add `pad_if_needed` to `RandomCrop`, so that if the crop size is larger than the image, the image is automatically padded
* Add support in `transforms.ToTensor` for PIL Images with mode '1'


Bug Fixes

* Fix passing list of tensors to `utils.save_image`
* Single images passed to `make_grid` are now also normalized
* Fix PIL img close warnings
* Added missing weight initializations to densenet
* Avoid division by zero in `make_grid` when the image is constant
* Fix `ToTensor` when PIL Image has mode F
* Fix bug with `to_tensor` when the input is a numpy array of type np.float32.


This version introduced a functional interface to the transforms, allowing for joint random transformation of inputs and targets. We also introduced a few breaking changes to some datasets and transforms (see below for more details).

We have introduced a functional interface for the torchvision transforms, available under `torchvision.transforms.functional`. This now makes it possible to do joint random transformations on inputs and targets, which is especially useful in tasks like object detection, segmentation and super resolution. For example, you can now do the following:

```python
from torchvision import transforms
import torchvision.transforms.functional as F
import random

def my_segmentation_transform(input, target):
    i, j, h, w = transforms.RandomCrop.get_params(input, (100, 100))
    input = F.crop(input, i, j, h, w)
    target = F.crop(target, i, j, h, w)
    if random.random() > 0.5:
        input = F.hflip(input)
        target = F.hflip(target)
    return F.to_tensor(input), F.to_tensor(target)
```

The following transforms have also been added:
- `F.vflip` and `RandomVerticalFlip`
- `FiveCrop` and `TenCrop`
- Various color transformations:
  - `ColorJitter`
  - `F.adjust_brightness`
  - `F.adjust_contrast`
  - `F.adjust_saturation`
  - `F.adjust_hue`
- `LinearTransformation` for applications such as whitening
- `Grayscale` and `RandomGrayscale`
- `Rotate` and `RandomRotation`
- `ToPILImage` now supports `RGBA` images
- `ToPILImage` now accepts a `mode` argument so you can specify which colorspace the image should be in
- `RandomResizedCrop` now accepts `scale` and `ratio` ranges as input parameters

Documentation is now auto-generated and published to

- SEMEION dataset of handwritten digits added
- Phototour dataset patches computed via multi-scale Harris corners now available by setting `name` equal to `notredame_harris`, `yosemite_harris` or `liberty_harris` in the `Phototour` dataset

Bug fixes:
- Pre-trained densenet models are now CPU compatible (251)

Breaking changes:
This version also introduced some breaking changes:
- The `SVHN` dataset has now been made consistent with other datasets by making the label for the digit 0 be 0, instead of 10 (as it was previously) (see 194 for more details)
- the `labels` for the unlabelled `STL10` dataset are now an array filled with `-1`
- the order of the input args to the deprecated `Scale` transform has changed from `(width, height)` to `(height, width)` to be consistent with other transforms


- Ability to switch image backends between PIL and accimage
- Added more tests
- Various bug fixes and doc improvements


- Fix for inception v3 input transform bug
- Added pretrained VGG models with batch norm


- Fix indexing bug in LSUN dataset
- enable `~` to be used in dataset paths
- `ImageFolder` now returns the same (sorted) file order on different machines


- transforms.Scale now accepts a tuple as the new size, or a single integer


- can now pass a pad value to make_grid and save_image


New Features
- SqueezeNet 1.0 and 1.1 models added, along with pre-trained weights
- Add pre-trained weights for VGG models
- Fix location of dropout in VGG
- `torchvision.models` now expose `num_classes` as a constructor argument
- Add InceptionV3 model and pre-trained weights
- Add DenseNet models and pre-trained weights


- Add STL10 dataset
- Add SVHN dataset
- Add PhotoTour dataset

Transforms and Utilities
- `transforms.Pad` now allows fill colors of either number tuples, or named colors like `"white"`
- add normalization options to `make_grid` and `save_image`
- `ToTensor` now supports more input types

Performance Improvements

Bug Fixes
- ToPILImage now supports a single image
- Python3 compatibility bug fixes
- `ToTensor` now copes with all PIL Image types, not just RGB images
- ImageFolder now only scans subdirectories.
- Having files like `.DS_Store` is no longer a blocking hindrance
- Check for non-zero number of images in ImageFolder
- Subdirectories of classes have recursive scans for images
- LSUN test set loads now


A small release, just needed a version bump because of PyPI.


New Features
- Add `torchvision.models`: Definitions and pre-trained models for common vision models
- ResNet, AlexNet, VGG models added with downloadable pre-trained weights
- Add padding to RandomCrop; also add `transforms.Pad`
- Add MNIST dataset

Performance Fixes
- Fixing performance of LSUN Dataset

Bug Fixes
- Some Python3 fixes
- Bug fixes in save_image, add single channel support


Introduced Datasets and Transforms.

Added common datasets

- COCO (Captioning and Detection)
- LSUN Classification
- ImageFolder
- Imagenet-12
- CIFAR10 and CIFAR100

- Added utilities for saving images from Tensors.