TorchServe


0.8.2

This is the release of TorchServe v0.8.2.

Security
+ Updated snakeyaml version to v2 2523 nskool
+ Added a warning about model allowed URLs when the default value is applied 2534 namannandan

Custom metrics backwards compatibility
+ `add_metric` is now backwards compatible with versions [< v0.6.1] but the default metric type is inferred to be `COUNTER`. If the metric is of a different type, it will need to be specified in the call to `add_metric` as follows:
`metrics.add_metric(name='GenericMetric', value=10, unit='count', dimensions=[...], metric_type=MetricTypes.GAUGE)`
+ When upgrading from versions [v0.6.1 - v0.8.1] to v0.8.2, replace the call to `add_metric` with `add_metric_to_cache`.
+ All custom metrics updated in the custom handler will need to be included in the metrics configuration file for them to be emitted by TorchServe. This is shown [here](https://github.com/pytorch/serve/blob/58eb2d2c79cf1cf711abf9ffea5678420a5ff65a/docs/metrics.md#central-metrics-yaml-file-definition).
+ A detailed [upgrade guide](https://github.com/pytorch/serve/blob/04e0b37dafbd9f98a60d040bbc36f64016fc2c8d/docs/metrics.md#backwards-compatibility-warnings-and-upgrade-guide) is included in the [metrics documentation](https://github.com/pytorch/serve/blob/04e0b37dafbd9f98a60d040bbc36f64016fc2c8d/docs/metrics.md).
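For illustration, a minimal custom-handler sketch using the backwards-compatible `add_metric` API is shown below; the handler class and dimension values are hypothetical, and the imports follow the metrics documentation linked above.

```python
# Hypothetical handler emitting a custom GAUGE metric via add_metric.
from ts.metrics.dimension import Dimension
from ts.metrics.metric_type_enum import MetricTypes
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):  # illustrative handler name
    def preprocess(self, data):
        metrics = self.context.metrics
        # COUNTER is the inferred default type; any other type must be
        # passed explicitly via metric_type.
        metrics.add_metric(
            name="GenericMetric",
            value=10,
            unit="count",
            dimensions=[Dimension("ModelName", self.context.model_name)],
            metric_type=MetricTypes.GAUGE,
        )
        return super().preprocess(data)
```

Remember that `GenericMetric` must also appear in the central metrics YAML file for TorchServe to emit it.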

New Features
+ Supported KServe gRPC v2 2176 jagadeeshi2i
+ Supported K8S session affinity 2519 jagadeeshi2i

New Examples
1. Example Llama v2 70B chat using HuggingFace Accelerate 2494 lxning HamidShojanazeri agunapal

2. Large model example OPT-6.7B on Inferentia2 2399 namannandan
- This example demonstrates how NeuronX compiles the model, detects Neuron core availability, and runs inference.

3. DeepSpeed deferred init with OPT-30B 2419 agunapal
- This PR adds a deferred model init feature to the [OPT-30B example](https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed) by leveraging a new DeepSpeed version. This feature can significantly reduce model loading latency.

4. Torch TensorRT example 2483 agunapal
- This PR uses ResNet-50 as an example to demonstrate Torch TensorRT.

5. K8S mnist example using minikube 2323 agunapal
- This example shows how to use a pre-trained custom MNIST model to perform real-time digit recognition via K8S.

6. Example for custom metrics 2516 namannandan

7. Example for object detection with ultralytics YOLO v8 model 2508 agunapal

Improvements
+ Migrated publishing torchserve-plugins-sdk from Maven JCenter to Maven Central 2429 2422 namannandan
+ Fixed download model from S3 presigned URL 2416 namannandan
+ Enabled opt-6.7b benchmark on inf2 2400 namannandan
+ Added job queue status in the describe API 2464 namannandan
+ Made the add_metric API backward compatible 2525 namannandan
+ Upgraded nvidia base image version to `nvidia/cuda:11.7.1-base-ubuntu20.04` in GPU docker image 2442 agunapal
+ Added Docker regression tests in CI 2403 agunapal
+ Updated release version 2533 agunapal
+ Upgraded default cuda to 11.8 in docker image build 2489 agunapal
+ Updated docker nightly build parameters 2493 agunapal
+ Added path to save ab benchmark profile graph in benchmark report 2451 agunapal
+ Added profile information for benchmark 2470 agunapal
+ Fixed manifest null in base handler 2488 pedrogengo
+ Fixed batching input in DALI example 2455 jagadeeshi2i
+ Fixed metrics for K8S setup 2473 jagadeeshi2i
+ Fixed kserve storage optional package in Dockerfile 2537 jagadeeshi2i
+ Fixed typo in ModelConfig.java comments 2506 arnavmehta7
+ Fixed netty direct buffer issues in torchserve-plugins-sdk 2511 marrodion
+ Fixed typo in ts/context.py comments 2536 ethankim00
+ Fixed server error when a gRPC client closes the connection unexpectedly 2420 lxning

Documentation
+ Updated large model documentation 2468 sekyondaMeta
+ Updated Sphinx landing page and requirements 2428 2520 sekyondaMeta
+ Updated Google Analytics in docs 2449 sekyondaMeta
+ Added performance checklist in docs 2526 sekyondaMeta
+ Added performance guidance in FAQ 2524 sekyondaMeta
+ Added instruction for embedding handler examples 2431 sidharthrajaram
+ Updated PyPi description 2445 bryanwweber agunapal
+ Updated Better Transformer README 2474 HamidShojanazeri
+ Fixed typo in microbatching README 2484 InakiRaba91
+ Fixed broken link in kubernetes AKS README 2490 agunapal
+ Fixed lint error 2497 ankithagunapal
+ Updated instructions for building GPU docker image for ONNX 2435 agunapal


Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.

GPU Support

0.8.1

Improvements
+ **Fixed GPU memory high usage issue and updated model zoo** - Fixed [duplicate process on GPU device ](https://github.com/pytorch/serve/issues/1037).
+ **gRPC max_request_size support** - Added support for gRPC max_request_size configuration in config.properties (see the sketch after this list).
+ **Non SSL request support** - Added [support for non SSL request](https://github.com/pytorch/serve/issues/202).
+ **Benchmark automation support** - Added [support](https://github.com/pytorch/serve/blob/release_0.4.0/docs/management_api.md#register-a-model) for benchmark automation.
+ **Support mar file generation automation** - Added [mar file generation automation](https://github.com/pytorch/serve/issues/1050).
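For reference, a `config.properties` sketch for the max_request_size setting mentioned above; the byte value here is illustrative, not a recommendation.

```properties
# config.properties -- request size limit in bytes (value is illustrative)
max_request_size=6553500
```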

Community Contributions
+ **Fairseq NMT example** - Added [Fairseq Neural Machine Translation example](https://github.com/pytorch/serve/blob/master/examples/nmt_transformer/README.md) (contributed by AshwinChafale)
+ **DeepLabV3 Image Segmentation example** - Added [DeepLabV3 Image Segmentation example](https://github.com/pytorch/serve/blob/master/examples/image_segmenter/deeplabv3/README.md) (contributed by alvarobartt)

Bug Fixes
+ **Huggingface_Transformers model example** - Fixed [Captum explanations fails with HF models](https://github.com/pytorch/serve/issues/934).



Platform Support

0.8.0

This is the release of TorchServe v0.8.0.

New Features

1. **Supported [large model inference](https://github.com/pytorch/serve/blob/614bfc0a382d809d5fd59d0dbc130c57e67c3332/docs/large_model_inference.md?plain=1#L1) in a distributed environment 2193 2320 2209 2215 2310 2218 lxning HamidShojanazeri**

TorchServe added deep integration to support large model inference. It provides a PyTorch-native large model inference solution by integrating [PiPPy](https://github.com/pytorch/tau/tree/main/pippy), and offers the flexibility and extensibility to support other popular libraries such as Microsoft DeepSpeed and HuggingFace Accelerate.


2. **Supported streaming response for gRPC 2186 and HTTP 2233 lxning**

To improve UX in Generative AI inference, TorchServe allows sending intermediate token responses to the client side by supporting [gRPC server-side streaming](https://github.com/pytorch/serve/blob/614bfc0a382d809d5fd59d0dbc130c57e67c3332/docs/grpc_api.md?plain=1#L74) and [HTTP 1.1 chunked encoding](https://github.com/pytorch/serve/blob/614bfc0a382d809d5fd59d0dbc130c57e67c3332/docs/inference_api.md?plain=1#L103).
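A minimal handler sketch, assuming the streaming API described in the linked docs; the token loop and messages are illustrative.

```python
# Sketch: push intermediate chunks to the client before the final response.
from ts.protocol.otf_message_handler import send_intermediate_predict_response


def handle(data, context):
    for chunk in ["intermediate-1", "intermediate-2"]:  # illustrative tokens
        # Each call streams one chunk to the client ahead of the final return.
        send_intermediate_predict_response(
            [chunk], context.request_ids, "Intermediate response", 200, context
        )
    return ["final response"]
```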

3. **Supported PyTorch/XLA on GPU and TPU 2182 morgandu**

By leveraging `torch.compile`, it's now possible to run TorchServe using XLA, which is optimized for both GPU and TPU deployments.

4. **Implemented [New Metrics platform](https://github.com/pytorch/serve/issues/1492) #2199 2190 2165 namannandan lxning**

TorchServe fully supports [metrics](https://github.com/pytorch/serve/blob/master/docs/metrics.md#introduction) in Prometheus mode or Log mode. Both frontend and backend metrics can be configured in a [central metrics YAML file](https://github.com/pytorch/serve/blob/master/docs/metrics.md#central-metrics-yaml-file-definition).
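A minimal sketch of such a central metrics YAML file; the metric names and dimensions below are illustrative, and the full schema is in the linked definition.

```yaml
# metrics.yaml -- illustrative entries only
dimensions:
  - &model_name "ModelName"
ts_metrics:
  counter:
    - name: Requests2XX
      unit: Count
      dimensions: [*model_name]
model_metrics:
  gauge:
    - name: GenericMetric
      unit: count
      dimensions: [*model_name]
```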

5. **Supported map based model config YAML file. 2193 lxning**

Added a [config-file](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/model-archiver/README.md?plain=1#L119) option for model config to the model archiver tool. Users are able to flexibly define customized parameters in this YAML file and easily access them in the backend handler via the [context.model_yaml_config](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/docs/configuration.md?plain=1#L267) variable. This new feature also makes it easier for TorchServe to support other new features and enhancements.
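For example, a handler might read user-defined keys like this; the `handler` section and `max_length` key below are hypothetical values a user could put in model_config.yaml, not a fixed schema.

```python
# Sketch: accessing custom model_config.yaml keys from a handler, e.g. for
#   handler:
#     max_length: 128
from ts.torch_handler.base_handler import BaseHandler


class MyHandler(BaseHandler):  # illustrative handler name
    def initialize(self, context):
        super().initialize(context)
        # context.model_yaml_config exposes the parsed YAML as a dict
        handler_cfg = context.model_yaml_config.get("handler", {})
        self.max_length = handler_cfg.get("max_length", 128)
```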

6. **Refactored PT2.0 support 2222 msaroufim**

We've refactored our model optimization utilities and improved logging to help debug compilation issues. We've also deprecated `compile.json` in favor of the new YAML config format; follow our guide at https://github.com/pytorch/serve/blob/master/examples/pt2/README.md to learn more. The main difference is that when archiving a model, instead of passing in `compile.json` via `--extra-files`, we can pass in `--config-file model_config.yaml`.
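A minimal sketch of the new format, assuming the `pt2` key shown in the PT2 README; the backend choice is illustrative.

```yaml
# model_config.yaml -- passed via --config-file instead of compile.json
pt2: "inductor"
```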

7. **Supported user specified gpu deviceIds for a model 2193 lxning**

By default, TorchServe uses a round-robin algorithm to assign GPUs to a worker on a host. Starting from v0.8.0, TorchServe allows users to define [deviceIds](https://github.com/pytorch/serve/blob/614bfc0a382d809d5fd59d0dbc130c57e67c3332/model-archiver/README.md?plain=1#L175) in [model_config.yaml](https://github.com/pytorch/serve/blob/614bfc0a382d809d5fd59d0dbc130c57e67c3332/model-archiver/README.md?plain=1#L162) to assign GPUs to a model.

8. **Supported cpu model on a GPU host 2193 lxning**

TorchServe supports hybrid mode on a GPU host. Users are able to define [deviceType](https://github.com/pytorch/serve/blob/614bfc0a382d809d5fd59d0dbc130c57e67c3332/model-archiver/README.md?plain=1#L174) in the model config YAML file to deploy a model on the CPU of a GPU host.
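A model config sketch combining the two placement options above; the keys follow the model-archiver README linked above, and the values are illustrative.

```yaml
# model_config.yaml -- illustrative device placement
deviceType: "gpu"    # or "cpu" to run this model on CPU even on a GPU host
deviceIds: [0, 1]    # pin this model's workers to GPUs 0 and 1
```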

9. **Supported Client Timeout 2267 lxning**

TorchServe allows users to define [clientTimeoutInMills](https://github.com/pytorch/serve/blob/614bfc0a382d809d5fd59d0dbc130c57e67c3332/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L40) in a model config YAML file. If [clientTimeoutInMills](https://github.com/pytorch/serve/blob/614bfc0a382d809d5fd59d0dbc130c57e67c3332/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L40) is set, TorchServe calculates the expiration timestamp of an incoming inference request and drops the request once it expires.

10. **Updated ping endpoint default behavior 2254 lxning**

Supported [maxRetryTimeoutInSec](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/frontend/archive/src/main/java/org/pytorch/serve/archive/model/ModelConfig.java#L35) in the model config YAML file, which defines the maximum time window for recovering a dead backend worker of a model. The default value is 5 minutes, and users can adjust it in the model config YAML file. The [ping endpoint](https://github.com/pytorch/serve/blob/2f1f52f553e83703b5c380c2570a36708ee5cafa/docs/inference_api.md?plain=1#L26) returns 200 if all models have enough healthy workers (i.e., equal to or more than minWorkers); otherwise it returns 500.
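Both settings live in the model config YAML file; a sketch with illustrative values:

```yaml
# model_config.yaml -- illustrative timeout settings
clientTimeoutInMills: 3000   # drop requests older than 3 seconds
maxRetryTimeoutInSec: 300    # recovery window for dead workers (default 5 min)
```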

New Examples

+ **[Example of Pippy](https://github.com/pytorch/serve/tree/master/examples/large_models/Huggingface_pippy) onboarding Open platform framework for distributed model inference #2215 HamidShojanazeri**

+ **[Example of DeepSpeed](https://github.com/pytorch/serve/tree/master/examples/large_models/deepspeed/opt) onboarding Open platform framework for distributed model inference #2218 lxning**

+ **[Example of Stable diffusion v2](https://github.com/pytorch/serve/tree/master/examples/diffusers) #2009 jagadeeshi2i**

Improvements
+ Upgraded to PyTorch 2.0 2194 agunapal

+ Enabled Core pinning in CPU nightly benchmark 2166 2237 min-jean-cho

TorchServe can be used with [Intel® Extension for PyTorch*](https://github.com/intel/intel-extension-for-pytorch) to give a performance boost on Intel hardware. Intel® Extension for PyTorch* is a Python package extending PyTorch with up-to-date features and optimizations that take advantage of AVX-512 Vector Neural Network Instructions (AVX512 VNNI), Intel® Advanced Matrix Extensions (Intel® AMX), and more.


![dashboard](https://github.com/pytorch/serve/assets/93151422/164b5d16-4edc-4575-b0b4-e2b2306076c3)

Enabling core pinning in the TorchServe CPU nightly benchmark shows significant performance speedup. This feature is implemented via a script under the PyTorch Xeon backend, initiated from Intel® Extension for PyTorch*. To try out core pinning on your workload, add `cpu_launcher_enable=true` in `config.properties`.

To try out more optimizations with Intel® Extension for PyTorch*, install Intel® Extension for PyTorch* and add `ipex_enable=true` in `config.properties`.
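Putting the two flags mentioned above together in `config.properties`:

```properties
# config.properties -- enable core pinning and IPEX optimizations
cpu_launcher_enable=true
ipex_enable=true
```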

+ Added Neuron nightly benchmark dashboard 2171 2167 namannandan
+ Enabled torch.compile support for torch 2.0.0 pre-release 2256 morgandu
+ Fixed torch.compile mac regression test 2250 msaroufim
+ Added configuration option to disable system metrics 2104 namannandan
+ Added regression test cases for SageMaker MME contract 2200 agunapal

In case of OOM, TorchServe returns error code 507 instead of the generic code 503.

+ Fixed error thrown in KServe while loading multiple models 2235 jagadeeshi2i
+ Added Docker CI for TorchServe 2226 fabridamicelli
+ Changed docker image release from dev to production 2227 agunapal



+ Supported building docker images with specified Python version 2154 agunapal
+ Model archiver optimizations:

a) Added wildcard file search in model archiver --extra-file 2142 gustavhartz
b) Added zip-store option to model archiver tool 2196 mreso
c) Made model archiver tests runnable from any directory 2191 mreso
d) Supported tgz format model decompression in TorchServe frontend 2214 lxning

+ Enabled batch processing in example scripted tokenizer 2130 mreso
+ Made handler tests callable with pytest 2173 mreso
+ Refactored sanity tests 2219 mreso
+ Improved benchmark tool 2228 and added auto-validation 2144 2157 agunapal

Automatically flags deviations of metrics from the average of the last 30 runs.

+ Added notifications for CI job (benchmark, regression test) failures agunapal
+ Updated CI to run on ubuntu 20.04 2153 agunapal
+ Added github code scanning codeql.yml 2149 msaroufim
+ Froze pynvml version to avoid a crash in nvgpu 2138 mreso
+ Made pre-commit usage clearer in error message 2241 and upgraded isort version 2132 msaroufim


Documentation
+ Nvidia MPS integration study 2205 mreso

This study compares TPS between TorchServe with Nvidia MPS enabled and TorchServe without Nvidia MPS enabled on P3 and G4 instances. It can help you decide whether or not to enable MPS for your deployment.

+ Updated TorchServe page on pytorch.org 2243 agunapal
+ Lint: fixed broken Windows Conda link 2240 msaroufim
+ Corrected example PT2 doc 2244 samils7
+ Fixed regex error in Configuration.md 2172 mpoemsl
+ Fixed dead Kubectl links 2160 msaroufim
+ Updated model file docs in example doc 2148 tmc
+ Example for serving TorchServe using docker 2118 agunapal
+ Updated walmart blog link 2117 agunapal

Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.

GPU Support

0.7.2

+ **TorchServe Profiling** - Added end-to-end profiling of inference requests. The time taken for different events by TorchServe for an inference request is captured in TorchServe metrics logs
+ **Serving SDK** - Released TorchServe Serving SDK 0.4.0 on Maven with contracts/interfaces for Metric Endpoint plugin and Snapshot plugins
+ **Naked DIR support** - Added support for Model Archives as naked DIRs with the `--archive-format no-archive` option
+ **Local file URL support** - Added support for registering model through local file (`file:///`) URLs
+ **Install dependencies** - Added a more robust install dependency script certified across different OS platforms (Ubuntu 18.04, MacOS, Windows 10 Pro, Windows Server 2019)
+ **Link Checker** - Added link checker in sanity script to report any broken links in documentation
+ **Enhanced model description** - Added GPU usage info and worker PID in model description
+ **FAQ guides** - Added most [frequently asked questions](https://github.com/pytorch/serve/blob/master/docs/FAQs.md) by community users
+ **Troubleshooting guide** - Added documentation for [troubleshooting common problems](https://github.com/pytorch/serve/blob/master/docs/Troubleshooting.md) related to model serving by TorchServe
+ **Use case guide** - Provides the reference [use cases](https://github.com/pytorch/serve/blob/master/docs/use_cases.md) i.e. different ways in which TorchServe can be deployed for serving different types of PyTorch models

0.7.1

This is the release of TorchServe v0.7.1.
Security
+ Upgraded com.google.code.gson:gson from 2.10 to 2.10.1 in serving sdk - https://github.com/pytorch/serve/pull/2096 snyk-bot
+ Upgraded ubuntu from 20.04 to rolling in Dockerfile files - https://github.com/pytorch/serve/pull/2066, https://github.com/pytorch/serve/pull/2065, https://github.com/pytorch/serve/pull/2064 msaroufim
+ Updated to safe snakeyaml, grpc and gradle - https://github.com/pytorch/serve/pull/2081 jack-gits
+ Updated Dockerfile.dev to install gnupg before calling apt-key del 7fa2af80 - https://github.com/pytorch/serve/pull/2076 yeahdongcn

Dependency Upgrades
+ Support PyTorch 1.13.1 - https://github.com/pytorch/serve/pull/2078 agunapal

Improvements
+ Removed bad eval when onnx session used - https://github.com/pytorch/serve/pull/2034 msaroufim
+ Updated runner label in regression_tests_gpu.yml - https://github.com/pytorch/serve/pull/2080 lxning
+ Updated nightly benchmark config - https://github.com/pytorch/serve/pull/2092 lxning

Documentation
+ Added TorchServe 2022 blogs in Readme - https://github.com/pytorch/serve/pull/2060 msaroufim
The blogs are [Torchserve Performance Tuning, Animated Drawings Case-Study](https://pytorch.org/blog/torchserve-performance-tuning/), [Walmart Search: Serving Models at a Scale on TorchServe](https://medium.com/walmartglobaltech/search-model-serving-using-pytorch-and-torchserve-6caf9d1c5f4d), [Scaling inference on CPU with TorchServe](https://www.youtube.com/watch?v=066_Jd6cwZg), and [TorchServe C++ backend](https://www.youtube.com/watch?v=OSmGGDpaesc).
+ Fixed HuggingFace large model instruction - https://github.com/pytorch/serve/pull/2087 HamidShojanazeri
+ Reworded examples Readme to highlight examples - https://github.com/pytorch/serve/pull/2086 agunapal
+ Updated torchserve_on_win_native.md - https://github.com/pytorch/serve/pull/2050 blackrabbit
+ Fixed typo in batch inference md - https://github.com/pytorch/serve/pull/2049 MasoudKaviani

Deprecation
+ Deprecated the future package and dropped Python 2 support - https://github.com/pytorch/serve/pull/2082 namannandan

Platform Support
Ubuntu 16.04, Ubuntu 18.04, Ubuntu 20.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.

GPU Support

0.7.0

This is the release of TorchServe v0.7.0.

New Examples
+ HF + Better Transformer integration https://github.com/pytorch/serve/pull/2002 HamidShojanazeri

Better Transformer / Flash Attention & Xformer memory-efficient attention provide out-of-the-box performance with major speedups for [PyTorch Transformer encoders](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/). This has been integrated into the TorchServe HF Transformer example; please read more about this integration [here](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2).

The main speedups in Better Transformer come from exploiting sparsity on padded inputs and kernel fusions. As a result, you would see the biggest gains when dealing with larger workloads, such as sequences with longer padding and larger batch sizes.

In our benchmarks on P3 instances with 4 V100 GPUs, using TorchServe benchmarking workloads, throughput showed significant improvement with large batch sizes: a 45.5% increase with batch size 8, 50.8% with batch size 16, 45.2% with batch size 32, 47.2% with batch size 64, and a 17.2% increase with batch size 4. These numbers can vary based on your workload (batch size, padding percentage) and your hardware. Please look up some other benchmarks in the [blog post](https://medium.com/pytorch/bettertransformer-out-of-the-box-performance-for-huggingface-transformers-3fbe27d50ab2).

+ `torch.compile()` support https://github.com/pytorch/serve/pull/1960 msaroufim

We've added experimental support for PT 2.0, i.e. `torch.compile()` support, within TorchServe. To use it, you need to supply a `compile.json` file when archiving your model to specify which backend you want (a sketch of this file appears at the end of this section). We've also enabled `mode=reduce-overhead` by default, which is ideally suited for the smaller batch sizes that are more common in inference. For now we recommend leveraging GPUs with tensor cores available, like A10G or A100, since you're likely to see the greatest speedups there.

In training we've seen speedups ranging from 30% to 2x (https://pytorch.org/get-started/pytorch-2.0/), but we haven't run any performance benchmarks for inference yet. Until then we recommend you continue leveraging other runtimes like TensorRT or IPEX for accelerated inference, which we highlight in our `performance_guide.md`. There are a few important caveats to consider when using torch.compile: changes in batch size will cause recompilations, so make sure to use a small, fixed batch size; there is additional overhead when starting a model, since it must be compiled first; and you'll likely still see the largest speedups with TensorRT.

However, we hope that adding this support will make it easier for you to benchmark and try out PT 2.0. Learn more here https://github.com/pytorch/serve/tree/master/examples/pt2
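As a hedged sketch, a `compile.json` for this flow might simply name the desired backend; the exact schema is defined in the pt2 example linked above, and the value here is illustrative.

```json
{"pt2": "inductor"}
```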

Dependency Upgrades
+ Support Python 3.10 https://github.com/pytorch/serve/pull/2031 agunapal
+ Support PyTorch 1.13 and Cuda 11.7 https://github.com/pytorch/serve/pull/1980 agunapal
+ Update docker default from Ubuntu 18.04 to Ubuntu 20.04 (LTS) https://github.com/pytorch/serve/pull/1970 LuigiCerone

Improvements
+ KFServe upgrade to 0.9 - https://github.com/pytorch/serve/issues/1860 Jagadeesh
+ Added pyyaml for python venv https://github.com/pytorch/serve/pull/2014 lxning
+ Added HF BERT Better Transformer benchmark https://github.com/pytorch/serve/issues/2024 lxning

Documentation
+ Fixed response time unit https://github.com/pytorch/serve/pull/2015 lxning

Platform Support
Ubuntu 16.04, Ubuntu 18.04, MacOS 10.14+, Windows 10 Pro, Windows Server 2019, Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe now requires Python 3.8 and above, and JDK 17.

GPU Support
