## Highlights
The CuPy v8.0.0 release includes a number of new features, as well as enhanced NumPy/SciPy functionality coverage.
* **TensorFloat-32 (TF32) Support**
  * CuPy now supports [TensorFloat-32](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/), a new math mode available on NVIDIA Ampere GPUs with CUDA 11. Set the `CUPY_TF32=1` environment variable to boost the performance of matrix multiplications in routines such as `cupy.matmul` or `cupy.tensordot`.
* **Official support for NVIDIA cuTENSOR and CUB libraries**
  * Several routines in CuPy now support using the [cuTENSOR](https://docs.nvidia.com/cuda/cutensor/index.html) and [CUB](https://nvlabs.github.io/cub/) libraries to further improve performance. Set the `CUPY_ACCELERATORS=cub,cutensor` environment variable to benefit from these libraries.
* **Enhanced kernel fusion**
  * Previously, when combining multiple kernels into a single one with `cupy.fuse`, only a single reduction operation (`cupy.sum`, etc.) could be used, and only at the end. With the new kernel fusion mechanism available in CuPy v8, it is now possible to combine multiple element-wise operations with interleaved reductions.
* **Automatic tuning of kernel launch parameters**
  * CuPy now supports discovering the optimal CUDA kernel launch parameters based on the input data and device properties for better performance. See the API reference ([`cupyx.optimizing.optimize`](https://docs.cupy.dev/en/latest/reference/generated/cupyx.optimizing.optimize.html)) for details.
* **Memory pool sharing with external libraries**
  * With the new `PythonFunctionAllocator` API, you can let CuPy use arbitrary Python functions instead of the built-in memory pool for managing GPU memory. This improves interoperability with external libraries; for example, you can flexibly use CuPy to preprocess data or run its custom CUDA kernel features inside PyTorch. With the allocator bundled in [pytorch-pfn-extras](https://github.com/pfnet/pytorch-pfn-extras), it is possible to [easily use the PyTorch memory pool from CuPy](https://github.com/pfnet/pytorch-pfn-extras/blob/master/docs/cuda.md).
* **Improved NumPy/SciPy function coverage**
  * Many functions have been added, including the NumPy polynomials package (a result of [Google Summer of Code 2020](https://summerofcode.withgoogle.com/archive/2020/projects/5856911817179136/), thanks @Dahlia-Chehata!), the SciPy image processing package (`cupyx.scipy.ndimage`), and extended support for the SciPy sparse matrices package.
For the list of all backward-incompatible changes in v8, please refer to the [Upgrade Guide](https://docs.cupy.dev/en/latest/upgrade.html#cupy-v8).
## Notes on Wheel Packages
* CuPy for CUDA 10.1 (`cupy-cuda101`), 10.2 (`cupy-cuda102`), and 11.0 (`cupy-cuda110`) packages are built with cuDNN v8 support but without bundled cuDNN shared libraries (see #3724 for the discussion). To use cuDNN features, you need to download the cuDNN library using the following command: `python -m cupyx.tools.install_library --library cudnn --cuda X.X`. Alternatively, you can install cuDNN v8.0.x via the system package manager (e.g., `apt install libcudnn8` or `yum install libcudnn8`), or install it manually and set the `LD_LIBRARY_PATH` environment variable.