Ray

Latest version: v2.22.0

2.1.0

Not secure
Release Highlights

* Ray AI Runtime (AIR)
* Better support for Image-based workloads.
* Ray Datasets `read_images()` API for loading data.
* Numpy-based API for user-defined functions in Preprocessor.
* Ability to read TFRecord input.
* Ray Datasets `read_tfrecords()` API to read TFRecord files.
* Ray Serve:
* Add support for gRPC endpoint (alpha release). Instead of using an HTTP server, Ray Serve can serve over the gRPC protocol, and users can bring their own schema for their use case.
* RLlib:
* Introduce decision transformer (DT) algorithm.
* New hook for callbacks with `on_episode_created()`.
* Learning rate schedule to SimpleQ and PG.
* Ray Core:
* Ray [OOM prevention](https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html) (alpha release).
* Support dynamic generators as task return values.
* Dashboard:
* Time series metrics support.
* Export configuration files can be used in Prometheus or Grafana instances.
* New progress bar in job detail view.

Ray Libraries

Ray AIR

💫Enhancements:
* Improve readability of training failure output (27946, 28333, 29143)
* Auto-enable GPU for Predictors (26549)
* Add ability to create TorchCheckpoint from state dict (27970)
* Add ability to create TensorflowCheckpoint from saved model/h5 format (28474)
* Add attribute to retrieve URI from Checkpoint (28731)
* Add all allowable types to WandB Callback (28888)

🔨 Fixes:
* Handle nested metrics properly as scoring attribute (27715)
* Fix serializability of Checkpoints (28387, 28895, 28935)

📖Documentation:
* Miscellaneous updates to documentation and examples (28067, 28002, 28189, 28306, 28361, 28364, 28631, 28800)

🏗 Architecture refactoring:
* Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (28318)
* Deprecate legacy train/tune functions in favor of Session (28856)

Ray Data Processing

🎉 New Features:
* Add read_images (29177)
* Add read_tfrecords (28430)
* Add NumPy batch format to Preprocessor and `BatchMapper` (28418)
* Ragged tensor extension type (27625)
* Add KBinsDiscretizer Preprocessor (28389)
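To illustrate what a bin-based discretizer does, here is a minimal pure-Python sketch of equal-width binning. It is not the `KBinsDiscretizer` implementation; the helper names `uniform_bin_edges` and `discretize` are hypothetical and chosen only for this example.

```python
from bisect import bisect_right


def uniform_bin_edges(values, n_bins):
    """Return the n_bins - 1 interior edges of equal-width bins over values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def discretize(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


ages = [0, 10, 20, 30, 40]
edges = uniform_bin_edges(ages, n_bins=4)
bins = discretize(ages, edges)
```

Values equal to the maximum land in the last bin because only interior edges are used for the lookup.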

💫Enhancements:
* Simplify to_tf interface (29028)
* Add metadata override and inference in `Dataset.to_dask()` (28625)
* Prune unused columns before aggregate (28556)
* Add Dataset.default_batch_format (28434)
* Add partitioning parameter to read_ functions (28413)
* Deprecate "native" batch format in favor of "default" (28489)
* Support None partition field name (28417)
* Re-enable Parquet sampling and add progress bar (28021)
* Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded (27964)
* Customized serializer for Arrow JSON ParseOptions in read_json (27911)
* Optimize groupby/mapgroups performance (27805)
* Improve size estimation of image folder data source (27219)
* Use detached lifetime for stats actor (25271)
* Pin _StatsActor to the driver node (27765)
* Better error message for partition filtering if no file found (27353)
* Make Concatenator deterministic (27575)
* Change FeatureHasher input schema to expect token counts (27523)
* Avoid unnecessary reads when truncating a dataset with `ds.limit()` (27343)
* Hide tensor extension from UDFs (27019)
* Add __repr__ to AIR classes (27006)
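The FIFO cap on stored stats mentioned above can be pictured with a small sketch. This is an illustration of the eviction policy only, assuming a simple key/value store; `CappedStatsStore` is a hypothetical class, not Ray's `StatsActor`.

```python
from collections import OrderedDict


class CappedStatsStore:
    """Keep at most max_entries stats; evict the oldest entry first (FIFO)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._stats = OrderedDict()

    def record(self, key, value):
        if key in self._stats:
            # Updating an existing key keeps its original insertion position.
            self._stats[key] = value
            return
        self._stats[key] = value
        while len(self._stats) > self.max_entries:
            # popitem(last=False) removes the oldest inserted entry.
            self._stats.popitem(last=False)

    def get(self, key):
        return self._stats.get(key)
```

Because eviction depends on insertion order rather than access order, a frequently read entry is still purged once enough newer entries arrive.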

🔨 Fixes:
* Add upper bound to pyarrow version check (29674) (29744)
* Fix map_groups to work with different output type (29184)
* read_csv no longer filters out files by default (29032)
* Check columns when adding rows to TableBlockBuilder (29020)
* Fix the peak memory usage calculation (28419)
* Change sampling to use same API as read Parquet (28258)
* Fix column assignment in Concatenator for Pandas 1.2. (27531)
* Do partition filtering in reader constructor (27156)
* Fix split ownership (27149)

📖Documentation:
* Clarify dataset transformation. (28482)
* Update map_batches documentation (28435)
* Improve docstring and doctest for read_parquet (28488)
* Activate dataset doctests (28395)
* Document using a different separator for read_csv (27850)
* Convert custom datetime column when reading a CSV file (27854)
* Improve preprocessor documentation (27215)
* Improve `limit()` and `take()` docstrings (27367)
* Reorganize the tensor data support docs (26952)
* Fix nyc_taxi_basic_processing notebook (26983)

Ray Train

🎉 New Features:
* Add FullyShardedDataParallel support to TorchTrainer (28096)

💫Enhancements:
* Add rich notebook repr for DataParallelTrainer (26335)
* Fast fail if training loop raises an error on any worker (28314)
* Use torch.encode_data with HorovodTrainer when torch is imported (28440)
* Automatically set NCCL_SOCKET_IFNAME to use ethernet (28633)
* Don't add Trainer resources when running on Colab (28822)
* Support large checkpoints and other arguments (28826)

🔨 Fixes:
* Fix and improve HuggingFaceTrainer (27875, 28154, 28170, 28308, 28052)
* Maintain dtype info in LightGBMPredictor (28673)
* Fix prepare_model (29104)
* Fix `train.torch.get_device()` (28659)

📖Documentation:
* Clarify LGBM/XGB Trainer documentation (28122)
* Improve Hugging Face notebook example (28121)
* Update Train API reference and docs (28192)
* Mention FSDP in HuggingFaceTrainer docs (28217)

🏗 Architecture refactoring:
* Improve Trainer modularity for extensibility (28650)

Ray Tune

🎉 New Features:
* Add `Tuner.get_results()` to retrieve results after restore (29083)

💫Enhancements:
* Exclude files in sync_dir_between_nodes, exclude temporary checkpoints (27174)
* Add rich notebook output for Tune progress updates (26263)
* Add logdir to W&B run config (28454)
* Improve readability for long column names in table output (28764)
* Add functionality to recover from latest available checkpoint (29099)
* Add retry logic for restoring trials (29086)

🔨 Fixes:
* Re-enable progress metric detection (28130)
* Add timeout to retry_fn to catch hanging syncs (28155)
* Correct PB2’s beta_t parameter implementation (28342)
* Ignore directory exists errors to tackle race conditions (28401)
* Correctly overwrite files on restore (28404)
* Disable pytorch-lightning multiprocessing per default (28335)
* Raise error if scheduling an empty PlacementGroupFactory (28445)
* Fix trial cleanup after x seconds, set default to 600 (28449)
* Fix trial checkpoint syncing after recovery from other node (28470)
* Catch empty hyperopt search space, raise better Tuner error message (28503)
* Fix and optimize sample search algorithm quantization logic (28187)
* Support tune.with_resources for class methods (28596)
* Maintain consistent Trial/TrialRunner state when pausing and resuming trial with PBT (28511)
* Raise better error for incompatible gcsfs version (28772)
* Ensure that exploited in-memory checkpoint is used by trial with PBT (28509)
* Fix Tune checkpoint tracking for minimizing metrics (29145)
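Several of the fixes above concern retries and hanging syncs. The general idea of bounding each retry attempt with a timeout can be sketched as follows. This is a simplified illustration, not Tune's `retry_fn`; the function name `retry_with_timeout` and its parameters are assumptions made for this example.

```python
import concurrent.futures
import time


def retry_with_timeout(fn, num_retries=3, timeout_s=1.0, sleep_s=0.0):
    """Call fn up to num_retries times; treat errors and hangs as failures.

    Returns True on the first successful attempt, False if all attempts fail.
    """
    for _ in range(num_retries):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(fn)
        try:
            # Raises concurrent.futures.TimeoutError if fn has not finished.
            future.result(timeout=timeout_s)
            return True
        except Exception:
            # Covers both an error raised by fn and a timed-out attempt.
            pass
        finally:
            # Do not block on a hung attempt; its thread keeps running.
            pool.shutdown(wait=False)
        time.sleep(sleep_s)
    return False
```

Note that a timed-out attempt is abandoned rather than cancelled: Python threads cannot be killed, so a truly stuck call still occupies its thread until it returns.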

📖Documentation:
* Miscellaneous documentation fixes (27117, 28131, 28210, 28400, 28068, 28809)
* Add documentation around trial/experiment checkpoint (28303)
* Add basic parallel execution guide for Tune (28677)
* Add example PBT notebook (28519)

🏗 Architecture refactoring:
* Store SyncConfig and CheckpointConfig in Experiment and Trial (29019)

Ray Serve

🎉 New Features:
* Added gRPC direct ingress support [alpha version] (28175)
* Serve cli can provide kubernetes formatted output (28918)
* Serve cli can provide user config output without default value (28313)

💫Enhancements:
* Enrich more benchmarks
* image objection with resnet50 mode with image preprocessing (29096)
* gRPC vs HTTP inference performance (28175)
* Add health check metrics to reflect the replica health status (29154)

🔨 Fixes:
* Fix memory leak issues during inference (29187)
* Fix unexpected warning about omitted HTTP options when starting Ray Serve with the Serve CLI (28257)
* Fix unexpected long poll exceptions (28612)

📖Documentation:
* Add e2e fault tolerance instructions (28721)
* Add Direct Ingress instructions (29149)
* Various doc improvements on “dev workflow”, “custom resources”, “serve cli” etc (29147, 28708, 28529, 28527)

RLlib

🎉 New Features:
* Decision Transformer (DT) Algorithm added (27890, 27889, 27872, 27829).
* Callbacks now have a new hook `on_episode_created()`. (28600)
* Added learning rate schedule to SimpleQ and PG. (28381)
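A learning rate schedule is commonly given as a list of `[timestep, value]` pairs that are linearly interpolated. The sketch below shows that interpolation in plain Python. The pair format is an assumption about how such a schedule is configured, and `piecewise_lr` is a hypothetical helper, not an RLlib function.

```python
def piecewise_lr(schedule, timestep):
    """Linearly interpolate a [[timestep, lr], ...] schedule at `timestep`."""
    if timestep <= schedule[0][0]:
        return schedule[0][1]
    for (t0, lr0), (t1, lr1) in zip(schedule, schedule[1:]):
        if t0 <= timestep < t1:
            frac = (timestep - t0) / (t1 - t0)
            return lr0 + frac * (lr1 - lr0)
    # Past the last breakpoint, hold the final value.
    return schedule[-1][1]


# Decay from 1e-3 to 1e-4 over the first 10,000 timesteps.
lr_schedule = [[0, 1e-3], [10_000, 1e-4]]
```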

💫Enhancements:
* Soft target network update is now supported by all off-policy algorithms (e.g., DQN, DDPG) (28135)
* Stop RLlib from "silently" selecting atari preprocessors. (29011)
* Improved offline RL and off-policy evaluation performance (28837, 28834, 28593, 28420, 28136, 28013, 27356, 27161, 27451).
* Escalated old deprecation warnings to errors (28807, 28795, 28733, 28697).
* Others: 27619, 27087.
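For readers unfamiliar with soft target network updates: instead of copying the online network into the target network every N steps, the target is nudged toward the online weights by a factor `tau` on each update. A minimal sketch over flat weight lists, where `soft_update` and `tau` are illustrative names rather than RLlib identifiers:

```python
def soft_update(target_weights, online_weights, tau):
    """Return target weights moved a fraction tau toward the online weights."""
    return [
        (1.0 - tau) * t + tau * w
        for t, w in zip(target_weights, online_weights)
    ]
```

With `tau=1.0` this reduces to a hard copy of the online weights; small values such as `0.005` give a slowly moving target.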

🔨 Fixes:
* Various bug fixes: 29077, 28811, 28637, 27785, 28703, 28422, 28405, 28358, 27540, 28325, 28357, 28334, 27090, 28133, 27981, 27980, 26666, 27390, 27791, 27741, 27424, 27544, 27459, 27572, 27255, 27304, 26629, 28166, 27864, 28938, 28845, 28588, 28202, 28201, 27806

📖Documentation:
* Connectors. (27528)
* Training step API. (27344)
* Others: 28299, 27460

Ray Workflows

🔨 Fixes:
* Fixed the object loss due to driver exit (29092)
* Change the name in step to task_id (28151)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:
* Ray [OOM prevention](https://docs.ray.io/en/master/ray-core/scheduling/ray-oom-prevention.html) feature alpha release! If your Ray jobs suffer from OOM issues, please give it a try.
* Support dynamic generators as task return values. (29082 28864 28291)

💫Enhancements:
* Fix spread scheduling imbalance issues (28804 28551 28551)
* Widening range of grpcio versions allowed (28623)
* Support encrypted redis connection. (29109)
* Upgrade redis from 6.x to 7.0.5. (28936)
* Batch ScheduleAndDispatchTasks calls (28740)

🔨 Fixes:
* More robust spilled object deletion (29014)
* Fix the initialization/destruction order between reference_counter_ and node change subscription (29108)
* Suppress the logging error when python exits and actor not deleted (27300)
* Mark `run_function_on_all_workers` as deprecated until we get rid of this (29062)
* Remove unused args for default_worker.py (28177)
* Don't include script directory in sys.path if it's started via python -m (28140)
* Handling edge cases of max_cpu_fraction argument (27035)
* Fix out-of-band deserialization of actor handle (27700)
* Allow reuse of cluster address if Ray is not running (27666)
* Fix an uncaught exception upon deallocation for actors (27637)
* Support placement_group=None in PlacementGroupSchedulingStrategy (27370)

📖Documentation:
* [Ray 2.0 white paper](https://docs.google.com/document/d/1tBw9A4j62ruI5omIJbMxly-la5w4q_TjyJgJL_jN2fI/preview) is published.
* Revamp ray core docs (29124 29046 28953 28840 28784 28644 28345 28113 27323 27303)
* Fix cluster docs (28056 27062)
* CLI Reference Documentation Revamp (27862)

Ray Clusters

💫Enhancements:
* Distinguish Kubernetes deployment stacks (28490)

📖Documentation:
* State intent to remove legacy Ray Operator (29178)
* Improve KubeRay migration notes (28672)
* Add FAQ for cluster multi-tenancy support (29279)

Dashboard

🎉 New Features:
* Time series metrics are now built into the dashboard
* Ray now exports some default configuration files which can be used for your Prometheus or Grafana instances. This includes default metrics which show common information important to your Ray application.
* New progress bar is shown in the job detail view. You can see how far along your Ray job is.

🔨 Fixes:
* Fix Prometheus exporter producing a slightly incorrect format.
* Fix several performance issues and memory leaks

📖Documentation:
* Added additional documentation on the new time series and the metrics page

Many thanks to all those who contributed to this release!

sihanwang41, simon-mo, avnishn, MyeongKim, markrogersjr, christy, xwjiang2010, kouroshHakha, zoltan-fedor, wumuzi520, alanwguo, Yard1, liuyang-my, charlesjsun, DevJake, matteobettini, jonathan-conder-sm, mgerstgrasser, guidj, JiahaoYao, Zyiqin-Miranda, jvanheugten, aallahyar, SongGuyang, clarng, architkulkarni, Rohan138, heyitsmui, mattip, ArturNiederfahrenhorst, maxpumperla, vale981, krfricke, DmitriGekhtman, amogkam, richardliaw, maldil, zcin, jianoaix, cool-RR, kira-lin, gramhagen, c21, jiaodong, sijieamoy, tupui, ericl, anabranch, se4ml, suquark, dmatrix, jjyao, clarkzinzow, smorad, rkooo567, jovany-wang, edoakes, XiaodongLv, klieret, rozsasarpi, scottsun94, ijrsvt, bveeramani, chengscott, jbedorf, kevin85421, nikitavemuri, sven1977, acxz, stephanie-wang, PaulFenton, WangTaoTheTonic, cadedaniel, nthai, wuisawesome, rickyyx, artemisart, peytondmurray, pingsutw, olipinski, davidxia, stestagg, yaxife, scv119, mwtian, yuanchi2807, ntlm1686, shrekris-anyscale, cassidylaidlaw, gjoliver, ckw017, hakeemta, ilee300a, avivhaber, matthewdeng, afarid, pcmoritz, Chong-Li, Catch-Bull, justinvyu, iycheng

2.0.1

Not secure
The Ray 2.0.1 patch release contains dependency upgrades and fixes for multiple components:

- Upgrade grpcio version to 1.32 ([28025](https://github.com/ray-project/ray/pull/28025))
- Upgrade redis version to 7.0.5 ([28936](https://github.com/ray-project/ray/pull/28936))
- Fix segfault when using runtime environments ([28409](https://github.com/ray-project/ray/pull/28409))
- Increase RPC timeout for dashboard ([28330](https://github.com/ray-project/ray/pull/28330))
- Set correct path when using `python -m` ([28140](https://github.com/ray-project/ray/pull/28140))
- [Autoscaler] Fix autoscaling for 0 CPU head node ([26813](https://github.com/ray-project/ray/pull/26813))
- [Serve] Allow code in private remote Git URIs to be imported ([28250](https://github.com/ray-project/ray/pull/28250))
- [Serve] Allow `host` and `port` in Serve config ([27026](https://github.com/ray-project/ray/pull/27026))
- [RLlib] Evaluation supports asynchronous rollout (single slow eval worker will not block the overall evaluation progress). ([27390](https://github.com/ray-project/ray/pull/27390))
- [Tune] Fix hang during checkpoint synchronization ([28155](https://github.com/ray-project/ray/pull/28155))
- [Tune] Fix trial restoration from different IP ([28470](https://github.com/ray-project/ray/pull/28470))
- [Tune] Fix custom synchronizer serialization ([28699](https://github.com/ray-project/ray/pull/28699))
- [Workflows] Replace deprecated `name` option with `task_id` ([28151](https://github.com/ray-project/ray/pull/28151))

2.0.0

Not secure
Release Highlights

Ray 2.0 is an exciting release with enhancements to all libraries in the Ray ecosystem. With this major release, we take strides towards our goal of making distributed computing scalable, unified, and open.

Towards these goals, Ray 2.0 features new capabilities for unifying the machine learning (ML) ecosystem, improving Ray's production support, and making it easier than ever for ML practitioners to use Ray's libraries.

**Highlights:**



* [Ray AIR](https://docs.ray.io/en/releases-2.0.0/ray-air/getting-started.html), a scalable and unified toolkit for ML applications, is now in Beta.
* Ray now supports [natively shuffling 100TB or more of data](https://docs.ray.io/en/releases-2.0.0/data/performance-tips.html#enabling-push-based-shuffle) with the Ray Datasets library.
* [KubeRay](https://docs.ray.io/en/releases-2.0.0/cluster/kubernetes/index.html), a toolkit for running Ray on Kubernetes, is now in Beta. This replaces the legacy Python-based Ray operator.
* [Ray Serve’s Deployment Graph API](https://docs.ray.io/en/releases-2.0.0/serve/model_composition.html#serve-model-composition) is a new and easier way to build, test, and deploy an inference graph of deployments. This is released as Beta in 2.0.

A migration guide for all the different libraries can be found here: [Ray 2.0 Migration Guide](https://docs.google.com/document/d/12ODPbhEzeyDRUt8ehHDiKCFoxksPWJOUNEicGhNxtRg/edit#).


Ray Libraries


Ray AIR

Ray AIR is now in beta. Ray AIR builds upon Ray’s libraries to enable end-to-end machine learning workflows and applications on Ray. You can install all dependencies needed for Ray AIR via `pip install -U "ray[air]"`.

🎉 **New Features:**


* Predictors:
* BatchPredictors now have support for scalable inference on GPUs.
* All Predictors can now be constructed from pre-trained models, allowing you to easily scale batch inference with trained models from common ML frameworks.
* ray.ml.predictors has been moved to the Ray Train namespace (ray.train).
* Preprocessing: New preprocessors and API changes on Ray Datasets now make feature processing easier to do on AIR. See the Ray Data release notes for more details.
* New features for Datasets/Train/Tune/Serve can be found in the corresponding library release notes.

💫 **Enhancements:**


* Major package refactoring is included in this release.
* ray.ml is renamed to ray.air.
* ray.ml.preprocessors have been moved to ray.data.
* train_test_split is now a new method of ray.data.Dataset (27065)
* ray.ml.trainers have been moved to ray.train (25570)
* ray.ml.predictors has been moved to ray.train.
* ray.ml.config has been moved to ray.air.config (25712).
* Checkpoints are now framework-specific -- meaning that each Trainer generates its own Framework-specific Checkpoint class. See Ray Train for more details.
* ModelWrappers have been renamed to PredictorDeployments.
* API stability annotations have been added (25485)
* Train/Tune now have the same reporting and checkpointing API -- see the Train notes for more details (26303)
* ScalingConfigs are now Dataclasses not Dict types
* Many AIR examples, benchmarks, and documentation pages were added in this release. The Ray AIR documentation will cover breadth of usage (end to end workflows across different libraries) while library-specific documentation will cover depth (specific features of a specific library).
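The package moves listed above can be summarized as a lookup table. The sketch below rewrites an old dotted module path using only the renames stated in these notes; the final submodule locations may be more specific than the notes say, so treat the mapping as a starting point and check the migration guide. `RENAMES` and `migrate_module_path` are hypothetical names for this example.

```python
# Old prefix -> new prefix, taken from the bullets above.
RENAMES = {
    "ray.ml.preprocessors": "ray.data",
    "ray.ml.trainers": "ray.train",
    "ray.ml.predictors": "ray.train",
    "ray.ml.config": "ray.air.config",
    "ray.ml": "ray.air",
}


def migrate_module_path(path):
    """Rewrite a dotted module path using the longest matching old prefix."""
    for old in sorted(RENAMES, key=len, reverse=True):
        if path == old or path.startswith(old + "."):
            return RENAMES[old] + path[len(old):]
    return path
```

Matching the longest prefix first ensures that `ray.ml.config` maps to `ray.air.config` through its own entry rather than through the generic `ray.ml` rule.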

🔨 **Fixes:**


* Many documentation examples were previously untested. This release fixes those examples and adds them to the CI.
* Predictors:
* Torch/Tensorflow Predictors have correctness fixes (25199, 25190, 25138, 25136)
* Update `KerasCallback` to work with `TensorflowPredictor` (26089)
* Add streaming BatchPredictor support (25693)
* Add `predict_pandas` implementation (25534)
* Add `_predict_arrow` interface for Predictor (25579)
* Allow creating Predictor directly from a UDF (26603)
* Execute GPU inference in a separate stage in BatchPredictor (26616, 27232, 27398)
* Accessors for preprocessor in Predictor class (26600)
* [AIR] Predictor `call_model` API for unsupported output types (26845)


Ray Data Processing

🎉 **New Features:**


* Add ImageFolderDatasource (24641)
* Add the NumPy batch format for batch mapping and batch consumption (24870)
* Add iter_torch_batches() and iter_tf_batches() APIs (26689)
* Add local shuffling API to iterators (26094)
* Add drop_columns() API (26200)
* Add randomize_block_order() API (25568)
* Add random_sample() API (24492)
* Add support for len(Dataset) (25152)
* Add UDF passthrough args to map_batches() (25613)
* Add Concatenator preprocessor (26526)
* Change range_arrow() API to range_table() (24704)
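Local shuffling in an iterator is typically done with a bounded buffer: rows are collected until the buffer is full, then one random row is emitted for each new row that arrives. The following is a conceptual sketch of that technique, not the Ray Datasets implementation; `local_shuffle` and `buffer_size` are names assumed for this example.

```python
import random


def local_shuffle(rows, buffer_size, seed=None):
    """Yield rows in a locally shuffled order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end, then emit it in O(1).
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left once the input is exhausted.
    rng.shuffle(buffer)
    yield from buffer
```

A larger buffer gives a shuffle closer to a global one at the cost of more memory; a buffer of size 1 leaves the order unchanged.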

💫 **Enhancements:**


* Autodetect dataset parallelism based on available resources and data size (25883)
* Use polars for sorting (25454)
* Support tensor columns in to_tf() and to_torch() (24752)
* Add explicit resource allocation option via a top-level scheduling strategy (24438)
* Spread actor pool actors evenly across the cluster by default (25705)
* Add ray_remote_args to read_text() (23764)
* Add max_epoch argument to iter_epochs() (25263)
* Add Pandas-native groupby and sorting (26313)
* Support push-based shuffle in groupby operations (25910)
* More aggressive memory releasing for Dataset and DatasetPipeline (25461, 25820, 26902, 26650)
* Automatically cast tensor columns on Pandas UDF outputs (26924)
* Better error messages when reading from S3 (26619, 26669, 26789)
* Make dataset splitting more efficient and stable (26641, 26768, 26778)
* Use sampling to estimate in-memory data size for Parquet data source (26868)
* De-experimentalized lazy execution mode (26934)

🔨 **Fixes:**


* Fix pipeline pre-repeat caching (25265)
* Fix stats construction for from_*() APIs (25601)
* Fixes label tensor squeezing in to_tf() (25553)
* Fix stage fusion between equivalent resource args (fixes BatchPredictor) (25706)
* Fix tensor extension string formatting (repr) (25768)
* Workaround for unserializable Arrow JSON ReadOptions (25821)
* Make ActorPoolStrategy kill pool of actors if exception is raised (25803)
* Fix max number of actors for default actor pool strategy (26266)
* Fix byte size calculation for non-trivial tensors (25264)


Ray Train

Ray Train has received a major expansion of scope with Ray 2.0.

In particular, the Ray Train module now contains:

1. Trainers
2. Predictors
3. Checkpoints

for common ML frameworks including PyTorch, TensorFlow, XGBoost, LightGBM, HuggingFace, and Scikit-Learn. These APIs help provide end-to-end usage of Ray libraries in Ray AIR workflows.

🎉 **New Features:**


* The Trainer API is now deprecated in favor of the new Ray AIR Trainers API. Trainers for Pytorch, Tensorflow, Horovod, XGBoost, and LightGBM are now in Beta. (25570)
* ML framework-specific Predictors have been moved into the `ray.train` namespace. This provides a streamlined API for offline and online inference of Pytorch, Tensorflow, XGBoost models and more. (25769, 26215, 26251, 26451, 26531, 26600, 26603, 26616, 26845)
* ML framework-specific checkpoints are introduced. Checkpoints are consumed by Predictors to load model weights and information. (26777, 25940, 26532, 26534)

💫 **Enhancements:**


* Train and Tune now use the same reporting and checkpointing API (24772, 25558)
* Add tunable ScalingConfig dataclass (25712)
* Randomize block order by default to avoid hotspots (25870)
* Improve checkpoint configurability and extend results (25943)
* Improve prepare_data_loader to support multiple batch data types (26386)
* Discard returns of train loops in Trainers (26448)
* Clean up logs, reprs, warnings (26259, 26906, 26988, 27228, 27519)

📖 **Documentation:**


* Update documentation to use new Train API (25735)
* Update documentation to use session API (26051, 26303)
* Add Trainer user guide and update Trainer docs (27570, 27644, 27685)
* Add Predictor documentation (25833)
* Replace to_torch with iter_torch_batches (27656)
* Replace to_tf with iter_tf_batches (27768)
* Minor doc fixes (25773, 27955)

🏗 **Architecture refactoring:**


* Clean up ray.train package (25566)
* Mark Trainer interfaces as Deprecated (25573)

🔨 **Fixes:**


* An issue with GPU ID detection and assignment was fixed. (26493)
* Fix AMP for models with a custom `__getstate__` method (25335)
* Fix transformers example for multi-gpu (24832)
* Fix ScalingConfig key validation (25549)
* Fix ResourceChangingScheduler integration (26307)
* Fix auto_transfer cuda device (26819)
* Fix BatchPredictor.predict_pipelined not working with GPU stage (27398)
* Remove rllib dependency from tensorflow_predictor (27688)


Ray Tune

🎉 **New Features:**


* The Tuner API is the new way of running Ray Tune experiments. (26987, 26987, 26961, 26931, 26884, 26930)
* Ray Tune and Ray Train now have the same API for reporting (25558)
* Introduce tune.with_resources() to specify function trainable resources (26830)
* Add Tune benchmark for AIR (26763, 26564)
* Allow Tuner().restore() from cloud URIs (26963)
* Add top-level imports for Tuner, TuneConfig, move CheckpointConfig (26882)
* Add resume experiment options to Tuner.restore() (26826)
* Add checkpoint_frequency/checkpoint_at_end arguments to CheckpointConfig (26661)
* Add more config arguments to Tuner (26656)
* Better error message for Tune nested tasks / actors (25241)
* Allow iterators in tune.grid_search (25220)
* Add `get_dataframe()` method to result grid, fix config flattening (24686)
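Conceptually, a grid search expands every combination of the listed values, and accepting iterators means each value source is materialized once before expansion. A minimal sketch of that expansion, assuming a flat parameter space; `expand_grid` is a hypothetical helper and does not call `tune.grid_search`.

```python
import itertools


def expand_grid(param_space):
    """Return one config dict per combination of the given parameter values."""
    keys = list(param_space)
    # list() materializes iterators and generators as well as lists and ranges.
    value_lists = [list(param_space[key]) for key in keys]
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


configs = expand_grid({"lr": iter([0.1, 0.01]), "batch_size": range(2)})
```

Materializing up front matters because an iterator can only be consumed once, while the Cartesian product needs to revisit each value list repeatedly.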

💫 **Enhancements:**


* Expose number of errored/terminated trials in ResultGrid (26655)
* Remove fully_executed from Tune. (25750)
* Exclude in remote storage upload (25544)
* Add `TempFileLock` (25408)
* Add annotations/set scope for Tune classes (25077)

📖 **Documentation:**


* Improve Tune + Datasets documentation (25389)
* Tune examples better navigation, minor fixes (24733)

🏗 **Architecture refactoring:**


* Consolidate checkpoint manager 3: Ray Tune (24430)
* Clean up ray.tune scope (remove stale objects in __all__) (26829)

🔨 **Fixes:**


* Fix k8s release test + node-to-node syncing (27365)
* Fix Tune custom syncer example (27253)
* Fix tune_cloud_aws_durable_upload_rllib_* release tests (27180)
* Fix test_tune (26721)
* Larger head node for tune_scalability_network_overhead weekly test (26742)
* Fix tune-sklearn notebook example (26470)
* Fix reference to `dataset_tune` (25402)
* Fix Tune-Pytorch-CIFAR notebook example (26474)
* Fix documentation testing (26409)
* Fix `set_tune_experiment` (26298)
* Fix GRPC resource exhausted test for tune trainables (24467)


Ray Serve

🎉 **New Features:**


* We are excited to introduce you to the 2.0 API centered around multi-model composition API, operation API, and production stability. (26310,26507,26217,25932,26374,26901,27058,24549,24616,27479,27576,27433,24306,25651,26682,26521,27194,27206,26804,25575,26574)
* Deployment Graph API is the new API for model composition. It provides a declarative layer on top of the 1.x deployment API to help you author performant inference pipelines easily. (27417,27420,24754,24435,24630,26573,27349,24404,25424,24418,27815,27844,25453,24629)
* We introduced a new [K8s native way to deploy Ray Serve](https://ray-project.github.io/kuberay/guidance/rayservice/), along with a brand new REST API to perform deployment, updates, and configuration. (25935,27063,24814,26093,25213,26588,25073,27000,27444,26578,26652,25610,25502,26096,24265,26177,25861,25691,24839,27498,27561,25862,26347)
* Serve can now survive Ray GCS failure. This used to be a single point of failure in Ray Serve's architecture. Now, when the GCS goes down, Serve can continue to serve traffic. We recommend that you [try out this feature](https://ray-project.github.io/kuberay/guidance/gcs-ft/) and give us feedback! (25633,26107,27608,27763,27771,25478,25637,27526,27674,26753,26797,24560,26685,26734,25987,25091,24934)
* Autoscaling has been promoted to stable. Additionally, we added scale-to-zero support. (25770,25733,24892,26393)
* The documentation has been revamped. Check it out at rayserve.org (24414,26211,25786,25936,26029,25830,24760,24871,25243,25390,25646,24657,24713,25270,25808,24693,24736,24524,24690,25494)
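To make the scale-to-zero behavior concrete, here is a deliberately simplified sketch of a request-based autoscaling decision. It assumes the replica count is derived from in-flight requests and a per-replica target, then clamped to configured bounds; the function and parameter names are hypothetical and the real Serve policy includes details not modeled here.

```python
import math


def desired_replicas(ongoing_requests, target_per_replica, min_replicas, max_replicas):
    """Return the replica count needed to keep load near the per-replica target."""
    raw = math.ceil(ongoing_requests / target_per_replica)
    # Clamp to the configured bounds; min_replicas=0 permits scaling to zero.
    return max(min_replicas, min(max_replicas, raw))
```

With `min_replicas=0`, an idle deployment (no ongoing requests) resolves to zero replicas, whereas `min_replicas=1` keeps one replica warm.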

💫 **Enhancements:**


* Serve natively supports deploying predictor and checkpoints from Ray AI Runtime (26026,25003,25537,25609,25962,26494,25688,24512,24417)
* Serve now supports scaling Gradio application (27560)
* Add Java Client API, completing the alpha release of the Java API (22726)
* Improved out-of-box performance by using uvicorn with uvloop (25027)


RLlib

🎉 **New Features:**


* In 2.0, RLlib is introducing an object-oriented configuration API instead of using a python dict for algorithm configuration (24332, 24374, 24375, 24376, 24433, 24576, 24650, 24577, 24339, 24687, 24775, 24584, 24583, 24853, 25028, 25059, 25065, 25066, 25067, 25256, 25255, 25278, 25279)
* RLlib is introducing a Connectors API (alpha). Connectors are a new component that handles transformations on inputs and outputs of a given RL policy. (25311, 25007, 25923, 25922, 25954, 26253, 26510, 26645, 26836, 26803, 26998, 27016)
* New improvements to off-policy estimators, including a new Doubly-Robust Off-Policy Estimator implementation (24384, 25107, 25056, 25899, 25911, 26279, 26893)
* CRR Algorithm (25459, 25667, 25905, 26142, 26304, 26770, 27161)
* Feature importance evaluation for offline RL (26412)
* RE3 exploration algorithm TF2 framework support (25221)
* Unified replay Buffer API (24212, 24156, 24473, 24506, 24866, 24683, 25841, 25560, 26428)
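For context on the doubly-robust estimator mentioned above, the textbook per-step recursion combines a learned value model with importance-weighted corrections. The sketch below implements that generic recursion for one episode; it is not RLlib's implementation, and the tuple layout and function name are assumptions made for this example.

```python
def doubly_robust_value(steps, gamma):
    """Estimate the target policy's value for one logged episode.

    steps: list of (rho, reward, q_hat, v_hat) per timestep, in order, where
      rho   = target-policy probability / behavior-policy probability of the action,
      q_hat = model estimate of Q(s_t, a_t),
      v_hat = model estimate of V(s_t) under the target policy.
    """
    v_dr = 0.0
    # Work backwards so each step can use the estimate of the following step.
    for rho, reward, q_hat, v_hat in reversed(steps):
        v_dr = v_hat + rho * (reward + gamma * v_dr - q_hat)
    return v_dr
```

If the model estimates are all zero, this reduces to ordinary per-step importance sampling; if the importance ratios are all zero, it falls back to the model's value estimate for the first state.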

💫 **Enhancements:**


* Improvements to RolloutWorker / Env fault tolerance (24967, 26134, 26276, 26809)
* Upgrade gym to 0.23 (24171), Bump gym dep to 0.24 (26190)
* Agents has been renamed to Algorithms (24511, 24516, 24739, 24797, 24841, 24896, 25014, 24579, 25314, 25346, 25366, 25539, 25869)
* Execution Plan API is now deprecated. Training step function API is the new way of specifying RLlib algorithms (23454, 24488, 2450, 24212, 24165, 24545, 24507, 25076, 25624, 25924, 25856, 25851, 27344, 24423)
* Policy V2 subclassing implementation migration (24742, 24746, 24914, 25117, 25203, 25078, 25254, 25384, 25585, 25871, 25956, 26054)
* Allow passing **kwargs to action distribution. (24692)
* Deprecation: Replace remaining evaluation_num_episodes with `evaluation_duration`. (26000)

🔨 **Fixes:**


* Multi-GPU learner thread key error in MA-scenarios (24382)
* Add release learning tests for SlateQ (24429)
* APEX-DQN replay buffer config validation fix. (24588)
* Automatic sequencing in function timeslice_along_seq_lens_with_overlap (24561)
* Policy Server/Client metrics reporting fix (24783)
* Re-establish dashboard performance tests. (24728)
* Bandit tf2 fix (+ add tf2 to test cases). (24908)
* Fix estimated buffer size in replay buffers. (24848)
* Fix RNNSAC example failing on CI + fixes for recurrent models for other Q Learning Algos. (24923)
* Curiosity bug fix. (24880)
* Auto-infer different agents' spaces in multi-agent env. (24649)
* Fix the bug “WorkerSet.stop() will raise error if `self._local_worker` is None (e.g. in evaluation worker sets)”. (25332)
* Fix Policy global timesteps being off by init sample batch size. (25349)
* Disambiguate timestep fragment storage unit in replay buffers. (25242)
* Fix the bug where on GPU, sample_batch.to_device() only converts the device and does not convert float64 to float32. (25460)
* Fix faulty usage of get_filter_config in `ComplexInputNetwork` (25493)
* Custom resources per worker should get added to default_resource_request (24463)
* Better default values for training_intensity and `target_network_update_freq` for R2D2. (25510)
* Fix multi agent environment checks for observations that contain only some agents' obs each step. (25506)
* Fixes PyTorch grad clipping logic and adds grad clipping to QMIX. (25584)
* Discussion 6432: Automatic train_batch_size calculation fix. (25621)
* Added meaningful error for multi-agent failure of SampleCollector in case no agent steps in episode. (25596)
* Replace torch.range with torch.arange. (25640)
* Fix the bug where there is no gradient clipping in QMix. ([25656](https://github.com/ray-project/ray/commit/c3645928caf8495a2849e21f1bf0e131409d9f99))
* Fix sample batch concatenation. (25572)
* Fix action_sampler_fn call in TorchPolicyV2 (obs_batch instead of `input_dict` arg). (25877)
* Fixes logging of all of RLlib's Algorithm names as warning messages. (25840)
* IMPALA/APPO multi-agent mix-in-buffer fixes (plus MA learning tests). (25848)
* Move offline input into replay buffer using rollout ops in CQL. (25629)
* Include SampleBatch.T column in all collected batches. (25926)
* Add timeout to filter synchronization. (25959)
* SimpleQ PyTorch Multi GPU fix (26109)
* IMPALA and APPO metrics fixes; remove deprecated `async_parallel_requests` utility. (26117)
* Added 'episode.hist_data' to the 'atari_metrics' to ensure that custom metrics of the user are kept in postprocessing when using Atari environments. (25292)
* Make the dataset and json readers batchable (26055)
* Fix Issue 25696: Output writers not working w/ multiple workers. (25722)
* Fix all the erroneous on_trainer_init warnings. (26433)
* In env check, step only expected agents. (26425)
* Make DQN update_target use only trainable variables. (25226)
* Fix FQE Policy call (26671)
* Make queue placement ops blocking (26581)
* Fix memory leak in APEX_DQN (26691)
* Fix MultiDiscrete not being one-hotted correctly (26558)
* Make IOContext optional for DatasetReader (26694)
* Make sure we step() after adding init_obs. (26827)
* Fix ModelCatalog for nested complex inputs (25620)
* Use compress observations where replay buffers and image obs are used in tuned examples (26735)
* Fix SampleBatch.split_by_episode to use dones if episode id is not available (26492)
* Fix torch None conversion in `torch_utils.py::convert_to_torch_tensor`. (26863)
* Unify gnorm mixin for tf and torch policies. (26102)


Ray Workflows

🎉 **New Features:**


* Support ray client (26702)
* HTTP event is supported (26010)
* Support retry_exceptions (26913)
* Support queuing in workflow (24697)
* Make status indexed (24767)

🔨 **Fixes:**


* Push logs to drivers correctly (24490)
* Make resume no side effect (26918)
* Make the max_retries aligned with ray (26350)

🏗 **Architecture refactoring:**


* Rewrite workflow execution engine (25618)
* Simplify the resume flow (24594)
* Deprecate step and use bind (26232)
* Deprecate virtual actor (25394)
* Refactor the exception processing (26398)


Ray Core and Ray Clusters


Ray Core

🎉 **New Features:**


* Ray State API is now in alpha. You can access live information about tasks, actors, objects, placement groups, etc. through the Ray CLI (summary / list / get) and the Python SDK. See the [Ray State API documentation](https://docs.ray.io/en/master/ray-observability/state/state-api.html) for more information.
* Support generators for tasks with multiple return values (25247)
* Support GCS fault tolerance. (24764, 24813, 24887, 25131, 25126, 24747, 25789, 25975, 25994, 26405, 26421, 26919)

💫 **Enhancements:**


* Allow failing new tasks immediately while the actor is restarting (22818)
* Add more accurate worker exit (24468)
* Allow user to override global default for max_retries (25189)
* Export additional metrics for workers and Raylet memory (25418)
* Push message to driver when a Raylet dies (25516)
* Out of Disk prevention (25370)
* ray.init defaults to an existing Ray instance if there is one (26678)
* Reconstruct manually freed objects (27567)

🔨 **Fixes:**



* Fix a task cancel hanging bug (24369)
* Adjust worker OOM scores to prioritize the raylet during memory pressure (24623)
* Fix pull manager deadlock due to object reconstruction (24791)
* Fix bugs in data locality aware scheduling (25092)
* Fix node affinity strategy when resource is empty (25344)
* Fix object transfer resend protocol (26349)

🏗 **Architecture refactoring:**



* Raylet and GCS schedulers share the same code (23829)
* Remove multiple core workers in one process (24147, 25159)


Ray Clusters

🎉 **New Features:**


* The KubeRay operator is now the preferred tool to run Ray on Kubernetes.
* Ray Autoscaler + KubeRay operator integration is now beta.

💫 **Enhancements:**


* Check out the [newly revamped docs](https://docs.ray.io/en/releases-2.0.0/cluster/getting-started.html)!

🔨 **Fixes:**



* Previously deprecated fields, `head_node`, `worker_nodes`, `head_node_type`, `default_worker_node_type`, `autoscaling_mode`, `target_utilization_fraction` are removed. Check out the [migration guide](https://docs.google.com/document/d/1Rz-UGz-RHK6iKSX3xt2VioQnX-kC20d5n8pbRN7KGME/edit#) to learn how to migrate to the new versions.


Ray Client

🎉 **New Features:**


* Support for configuring request metadata for client gRPC (24946)

💫 **Enhancements:**


* Remove 2 GiB size limit on remote function arguments (24555)

🔨 **Fixes:**


* Fix excessive memory usage when submitting large remote arguments (24477)


Dashboard

🎉 **New Features:**



* The new dashboard UI is now the default dashboard. Please leave any feedback about the dashboard on GitHub Issues or Discourse! You can still go to the legacy dashboard UI by clicking “Back to legacy dashboard”.
* New Dashboard UI now shows all ray jobs. This includes jobs submitted via the job submission API and jobs launched from python scripts via ray.init().
* New Dashboard UI now shows worker nodes in the main node tab
* New Dashboard UI now shows more information in the actors tab

Breaking changes:



* The job submission list_jobs API endpoint, CLI command, and SDK function now return a list of jobs instead of a dictionary from id to job.
* The Tune tab is no longer in the new dashboard UI. It is still available in the legacy dashboard UI but will be removed.
* The memory tab is no longer in the new dashboard UI. It is still available in the legacy dashboard UI but will be removed.
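
Code that relied on the old id-to-job dictionary can rebuild that mapping from the new list shape. The following is a minimal sketch only: it models job records as plain dictionaries, and the `job_id` field name and the sample records are assumptions made for illustration, not the SDK's actual return type.

```python
from typing import Any, Dict, List


def index_jobs_by_id(
    jobs: List[Dict[str, Any]], id_field: str = "job_id"
) -> Dict[str, Dict[str, Any]]:
    """Rebuild an id -> job mapping from a list of job records."""
    return {job[id_field]: job for job in jobs}


# Hypothetical records standing in for a list-style response.
jobs = [
    {"job_id": "raysubmit_a", "status": "SUCCEEDED"},
    {"job_id": "raysubmit_b", "status": "RUNNING"},
]
by_id = index_jobs_by_id(jobs)
```

Keeping the conversion in one helper means the rest of a caller's code can continue to look jobs up by id.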

🔨 **Fixes:**


* We reduced the memory usage of the dashboard. We are no longer caching logs and we cache a maximum of 1000 actors. As a result of this change, node level logs can no longer be accessed in the legacy dashboard.
* Job status error messages now properly truncate logs to 10 lines. We also added a limit of 20000 characters to avoid passing too much data.
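
The two limits on job error logs (10 lines, 20000 characters) can be pictured with a small helper. This is a sketch under stated assumptions, not the dashboard's code: the function name is hypothetical, and keeping the *last* lines rather than the first is an assumption.

```python
def truncate_logs(logs: str, max_lines: int = 10, max_chars: int = 20000) -> str:
    """Keep the last `max_lines` lines, then cap the result at `max_chars` characters."""
    if max_lines < 1 or max_chars < 1:
        raise ValueError("max_lines and max_chars must be positive")
    # Assumption: the most recent lines are the most useful in an error message.
    tail = "\n".join(logs.splitlines()[-max_lines:])
    # A single very long line can still exceed the cap, so trim by characters too.
    return tail[-max_chars:]


sample = "\n".join(f"line {i}" for i in range(25))
short = truncate_logs(sample)
```

The character cap matters because a line limit alone does not bound the payload when individual lines are very long.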

Many thanks to all those who contributed to this release!

ujvl, xwjiang2010, EricCousineau-TRI, ijrsvt, waleedkadous, captain-pool, olipinski, danielwen002, amogkam, bveeramani, kouroshHakha, jjyao, larrylian, goswamig, hanming-lu, edoakes, nikitavemuri, enori, grechaw, truelegion47, alanwguo, sychen52, ArturNiederfahrenhorst, pcmoritz, mwtian, vakker, c21, rberenguel, mattip, robertnishihara, cool-RR, iamhatesz, ofey404, raulchen, nmatare, peterghaddad, n30111, fkaleo, Riatre, zhe-thoughts, lchu-ibm, YoelShoshan, Catch-Bull, matthewdeng, VishDev12, valtab, maxpumperla, tomsunelite, fwitter, liuyang-my, peytondmurray, clarkzinzow, VeronikaPolakova, sven1977, stephanie-wang, emjames, Nintorac, suquark, javi-redondo, xiurobert, smorad, brucez-anyscale, pdames, jjyyxx, dmatrix, nakamasato, richardliaw, juliusfrost, anabranch, christy, Rohan138, cadedaniel, simon-mo, mavroudisv, guidj, rkooo567, orcahmlee, lixin-wei, neigh80, yuduber, JiahaoYao, simonsays1980, gjoliver, jimthompson5802, lucasalavapena, zcin, clarng, jbn, DmitriGekhtman, timgates42, charlesjsun, Yard1, mgelbart, wumuzi520, sihanwang41, ghost, jovany-wang, siavash119, yuanchi2807, tupui, jianoaix, sumanthratna, code-review-doctor, Chong-Li, FedericoGarza, ckw017, Makan-Ar, kfstorm, flanaman, WangTaoTheTonic, franklsf95, scv119, kvaithin, wuisawesome, jiaodong, mgerstgrasser, tiangolo, architkulkarni, MyeongKim, ericl, SongGuyang, avnishn, chengscott, shrekris-anyscale, Alyetama, iycheng, rickyyx, krfricke, sijieamoy, kimikuri, czgdp1807, michalsustr

1.13.0

Not secure
Highlights:

- Python 3.10 support is now in alpha.
- Ray [usage stats collection](https://docs.ray.io/en/master/cluster/usage-stats.html) is now on by default (guarded by an opt-out prompt).
- Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)
- Ray Workflow comes with a new API and is integrated with Ray DAG.

Ray Autoscaler
💫Enhancements:

- CI tests for KubeRay autoscaler integration (23365, 23383, 24195)
- Stability enhancements for KubeRay autoscaler integration (23428)

🔨 Fixes:

- Improved GPU support in KubeRay autoscaler integration (23383)
- Resources scheduled with the node affinity strategy are not reported to the autoscaler (24250)

Ray Client
💫Enhancements:

- Add option to configure ray.get with >2 sec timeout (22165)
- Return `None` from internal KV for non-existent keys (24058)

🔨 Fixes:

- Fix deadlock by switching to `SimpleQueue` on Python 3.7 and newer in async `dataclient` (23995)

Ray Core
🎉 New Features:

- Ray [usage stats collection](https://docs.ray.io/en/master/cluster/usage-stats.html) is now on by default (guarded by an opt-out prompt)
- Alpha support for Python 3.10 (on Linux and Mac)
- Node affinity scheduling strategy (23381)
- Add metrics for disk and network I/O (23546)
- Improve exponential backoff when connecting to Redis (24150)
- Add the ability to inject a setup hook for customization of runtime_env on init (24036)
- Add a utility to check GCS / Ray cluster health (23382)
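
For readers unfamiliar with the exponential backoff mentioned for Redis connections, the general pattern looks roughly like the sketch below. It is a generic illustration, not Ray's implementation: the function name, the defaults, and the injected `sleep` parameter are all assumptions chosen to keep the example self-contained.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            attempt += 1
            if attempt >= max_attempts:
                raise  # Out of attempts: surface the last error.
            # Delays grow as base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


# Demonstration with a fake connection that fails three times.
delays: List[float] = []
calls = {"n": 0}


def flaky_connect() -> str:
    calls["n"] += 1
    if calls["n"] <= 3:
        raise ConnectionError("not ready yet")
    return "connected"


result = connect_with_backoff(flaky_connect, sleep=delays.append)
```

Passing `sleep` as a parameter lets the demonstration record the delays instead of actually waiting.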

🔨 Fixes:

- Fixed internal storage S3 bugs (24167)
- Ensure "get_if_exists" takes effect in the decorator. (24287)
- Reduce memory usage for Pubsub channels that do not require total memory cap (23985)
- Add memory buffer limit in publisher for each subscribed entity (23707)
- Use gRPC instead of socket for GCS client health check (23939)
- Trim size of Reference struct (23853)
- Enable debugging into pickle backend (23854)

🏗 Architecture refactoring:

- GCS storage interfaces unification (24211)
- Cleanup pickle5 version check (23885)
- Simplify options handling (23882)
- Moved function and actor importer away from pubsub (24132)
- Replace the legacy ResourceSet & SchedulingResources at Raylet (23173)
- Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (23872)
- Save task spec in separate table (22650)


Ray Datasets
🎉 New Features:

- Performance improvement: the aggregation computation is vectorized (23478)
- Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (23179)
- Performance improvement: more efficient move semantics for Datasets block processing (24127)
- Supports Datasets lineage serialization (aka out-of-band serialization) (23821, 23931, 23932)
- Supports native Tensor views in map processing for pure-tensor datasets (24812)
- Implemented push-based shuffle (24281)

🔨 Fixes:

- Documentation improvement: Getting Started page (24860)
- Documentation improvement: FAQ (24932)
- Documentation improvement: End to end examples (24874)
- Documentation improvement: Feature guide - Creating Datasets (24831)
- Documentation improvement: Feature guide - Saving Datasets (24987)
- Documentation improvement: Feature guide - Transforming Datasets (25033)
- Documentation improvement: Datasets APIs docstrings (24949)
- Performance: fixed block prefetching (23952)
- Fixed zip() for Pandas dataset (23532)

🏗 Architecture refactoring:

- Refactored LazyBlockList (23624)
- Added path-partitioning support for all content types (23624)
- Added fast metadata provider and refactored Parquet datasource (24094)

RLlib
🎉 New Features:

- Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (24164, 22842, 23523, 23586)

🏗 Architecture refactoring:

- More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX, SAC, SimpleQ, SlateQ, and Trainers defined in the examples folder. (22937, 23420, 23673, 24164, 24151, 23735, 24157, 23798, 23906, 24118, 22842, 24166, 23712). This will be fully completed and documented with Ray 2.0.
- Make RolloutWorkers (optionally) recoverable after failure via the new `recreate_failed_workers=True` config flag. (23739)
- POC for new TrainerConfig objects (instead of python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (24295, 23491)
- Hard-deprecate `build_trainer()` (trainer_templates.py): All custom Trainers should now sub-class from any existing `Trainer` class. (23488)

💫Enhancements:

- Add support for complex observations in CQL. (23332)
- Bandit support for tf2. (22838)
- Make actions sent by RLlib to the env immutable. (24262)
- Memory leak finding toolset using tracemalloc + CI memory leak tests. (15412)
- Enable DD-PPO to run on Windows. (23673)

🔨 Fixes:

- APPO eager fix (APPOTFPolicy gets wrapped `as_eager()` twice by mistake). (24268)
- CQL gets stuck when deprecated `timesteps_per_iteration` is used (use `min_train_timesteps_per_reporting` instead). (24345)
- SlateQ runs on GPU (torch). (23464)
- Other bug fixes: 24016, 22050, 23814, 24025, 23740, 23741, 24006, 24005, 24273, 22010, 24271, 23690, 24343, 23419, 23830, 24335, 24148, 21735, 24214, 23818, 24429

Ray Workflow
🎉 New Features:

- Workflow step is deprecated (23796, 23728, 23456, 24210)

🔨 Fixes:

- Fix one bug where max_retries is not aligned with ray core’s max_retries. (22903)

🏗 Architecture refactoring:

- Integrate ray storage in workflow (24120)

Tune
🎉 New Features:

- Add RemoteTask based sync client (23605) (rsync not required anymore!)
- Chunk file transfers in cross-node checkpoint syncing (23804)
- Also interrupt training when SIGUSR1 received (24015)
- reuse_actors per default for function trainables (24040)
- Enable AsyncHyperband to continue training for last trials after max_t (24222)

💫Enhancements:

- Improve testing (23229)
- Improve docstrings (23375)
- Improve documentation (23477, 23924)
- Simplify trial executor logic (23396)
- Make `MLflowLoggerUtil` copyable (23333)
- Use new Checkpoint interface internally (22801)
- Beautify Optional typehints (23692)
- Improve missing search dependency info (23691)
- Skip tmp checkpoints in analysis and read iteration from metadata (23859)
- Treat checkpoints with nan value as worst (23862)
- Clean up base ProgressReporter API (24010)
- De-clutter log outputs in trial runner (24257)
- hyperopt searcher to support tune.choice([[1,2],[3,4]]). (24181)

🔨Fixes:

- Optuna should ignore additional results after trial termination (23495)
- Fix PTL multi GPU link (23589)
- Improve Tune cloud release tests for durable storage (23277)
- Fix tensorflow distributed trainable docstring (23590)
- Simplify experiment tag formatting, clean directory names (23672)
- Don't include nan metrics for best checkpoint (23820)
- Fix syncing between nodes in placement groups (23864)
- Fix memory resources for head bundle (23861)
- Fix empty CSV headers on trial restart (23860)
- Fix checkpoint sorting with nan values (23909)
- Make Timeout stopper work after restoring in the future (24217)
- Small fixes to tune-distributed for new restore modes (24220)
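
Several of the Tune items above concern NaN metrics when ranking checkpoints. The sketch below shows one way to sort so that NaN always ranks worst. It is an illustration under assumptions, not Tune's internal code: the function name and the `(path, metric)` tuple layout are hypothetical.

```python
import math
from typing import List, Tuple


def rank_checkpoints(
    checkpoints: List[Tuple[str, float]], mode: str = "max"
) -> List[Tuple[str, float]]:
    """Sort (path, metric) pairs best-first, placing NaN metrics last."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    sign = -1.0 if mode == "max" else 1.0

    def key(item: Tuple[str, float]) -> Tuple[bool, float]:
        _, value = item
        # NaN compares False against every number, which makes a naive sort
        # order-dependent; sort on an explicit (is_nan, signed_value) pair.
        if math.isnan(value):
            return (True, 0.0)
        return (False, sign * value)

    return sorted(checkpoints, key=key)


ckpts = [
    ("ckpt_1", 0.71),
    ("ckpt_2", float("nan")),
    ("ckpt_3", 0.93),
    ("ckpt_4", 0.88),
]
ranked = rank_checkpoints(ckpts)
```

With `mode="min"` the same key places the smallest metric first and still keeps NaN entries at the end.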

Train
**Most distributed training enhancements will be captured in the new Ray AIR category!**

🔨Fixes:

- Copy resources_per_worker to avoid modifying user input
- Fix `train.torch.get_device()` for fractional GPU or multiple GPU per worker case (23763)
- Fix multi node horovod bug (22564)
- Fully deprecate Ray SGD v1 (24038)
- Improvements to fault tolerance (22511)
- MLflow start run under correct experiment (23662)
- Raise helpful error when required backend isn't installed (23583)
- Warn pending deprecation for `ray.train.Trainer` and `ray.tune` DistributedTrainableCreators (24056)

📖Documentation:

- add FAQ (22757)

Ray AIR
🎉 New Features:

- `HuggingFaceTrainer` & `HuggingFacePredictor` (23615, 23876)
- `SklearnTrainer` & `SklearnPredictor` (23803, 23850)
- `HorovodTrainer` (23437)
- `RLTrainer` & `RLPredictor` (23465, 24172)
- `BatchMapper` preprocessor (23700)
- `Categorizer` preprocessor (24180)
- `BatchPredictor` (23808)

💫Enhancements:

- Add `Checkpoint.as_directory()` for efficient checkpoint fs processing (23908)
- Add `config` to `Result`, extend `ResultGrid.get_best_config` (23698)
- Add Scaling Config validation (23889)
- Add tuner test. (23364)
- Move storage handling to pyarrow.fs.FileSystem (23370)
- Refactor `_get_unique_value_indices` (24144)
- Refactor `most_frequent` `SimpleImputer` (23706)
- Set name of Trainable to match with Trainer (23697)
- Use checkpoint.as_directory() instead of cleaning up manually (24113)
- Improve file packing/unpacking (23621)
- Make Dataset ingest configurable (24066)
- Remove postprocess_checkpoint (24297)

🔨Fixes:

- Better exception handling (23695)
- Do not deepcopy RunConfig (23499)
- Reduce unnecessary stacktrace (23475)
- Tuner should use `run_config` from Trainer per default (24079)
- Use custom fsspec handler for GS (24008)

📖Documentation:

- Add distributed `torch_geometric` example (23580)
- GNN example cleanup (24080)

Serve
🎉 New Features:

- Serve logging system was revamped! Access log is now turned on by default. (23558)
- New Gradio notebook example for Ray Serve deployments (23494)
- Serve now includes full traceback in deployment update error message (23752)

💫Enhancements:

- Serve Deployment Graph was enhanced with performance fixes and structural clean up. (24199, 24026, 24065, 23984)
- End to end tutorial for deployment graph (23512, 22771, 23536)
- `input_schema` is now renamed as `http_adapter` for usability (24353, 24191)
- Progress towards a declarative REST API (23232, 23481)
- Code cleanup and refactoring (24067, 23578, 23934, 23759)
- Protobuf based controller API for cross language client (23004)

🔨Fixes:

- Handle `None` in `ReplicaConfig`'s `resource_dict` (23851)
- Set `"memory"` to `None` in `ray_actor_options` by default (23619)
- Make `serve.shutdown()` shutdown remote Serve applications (23476)
- Ensure replica reconfigure runs after allocation check (24052)
- Allow cloudpickle serializable objects as init args/kwargs (24034)
- Use controller namespace when getting actors (23896)

Dashboard
🔨Fixes:

- Add toggle to enable showing node disk usage on K8s (24416, 24440)
- Add job submission id as field to job snapshot (24303)


Thanks
Many thanks to all those who contributed to this release!
matthewdeng, scv119, xychu, iycheng, takeshi-yoshimura, iasoon, wumuzi520, thetwotravelers, maxpumperla, krfricke, jgiannuzzi, kinalmehta, avnishn, dependabot[bot], sven1977, raulchen, acxz, stephanie-wang, mgelbart, xwjiang2010, jon-chuang, pdames, ericl, edoakes, gjoseph92, ddelange, bkasper, sriram-anyscale, Zyiqin-Miranda, rkooo567, jbedorf, architkulkarni, osanseviero, simonsays1980, clarkzinzow, DmitriGekhtman, ashione, smorad, andenrx, mattip, bveeramani, chaokunyang, richardliaw, larrylian, Chong-Li, fwitter, shrekris-anyscale, gjoliver, simontindemans, silky, grypesc, ijrsvt, daikeshi, kouroshHakha, mwtian, mesjou, sihanwang41, PavelCz, czgdp1807, jianoaix, GuillaumeDesforges, pcmoritz, arsedler9, n30111, kira-lin, ckw017, max0x7ba, Yard1, XuehaiPan, lchu-ibm, HJasperson, SongGuyang, amogkam, liuyang-my, WangTaoTheTonic, jovany-wang, simon-mo, dynamicwebpaige, suquark, ArturNiederfahrenhorst, jjyao, KepingYan, jiaodong, frosk1

1.12.1

Not secure
Patch release with the following fixes:

- **Ray now works on Google Colab again!** The bug with memory limit fetching when running Ray in a container is now fixed (https://github.com/ray-project/ray/pull/23922).
- `ray-ml` Docker images for CPU will start being built again after they were stopped in Ray 1.9 (https://github.com/ray-project/ray/pull/24266).
- [Train/Tune] Start MLflow run under the correct experiment for Ray Train and Ray Tune integrations (https://github.com/ray-project/ray/pull/23662).
- [RLlib] Fix for APPO in eager mode (https://github.com/ray-project/ray/pull/24268).
- [RLlib] Fix Alphastar for TF2 and tracing enabled (https://github.com/ray-project/ray/commit/c5502b2aa57376b26408bb297ff68696c16f48f1).
- [Serve] Fix replica leak in anonymous namespaces (https://github.com/ray-project/ray/pull/24311).

1.12.0

Not secure
Highlights
- Ray AI Runtime (AIR), an open-source toolkit for building end-to-end ML applications on Ray, is now in Alpha. AIR is an effort to unify the experience of using different Ray libraries (Ray Data, Train, Tune, Serve, RLlib). You can find more information on the [docs](https://docs.ray.io/en/master/ray-air/getting-started.html) or on the [public RFC](https://github.com/ray-project/ray/issues/22488).
- Getting involved with Ray AIR. We’ll be holding office hours, development sprints, and other activities as we get closer to the Ray AIR Beta/GA release. Want to join us? Fill out this [short form](https://forms.gle/wCCdbaQDtgErYycT6)!
- Ray [usage data collection](https://github.com/ray-project/ray/issues/20857) is now off by default. If you have any questions or concerns, please comment [on the RFC](https://github.com/ray-project/ray/issues/20857).
- New algorithms are added to RLlib: SlateQ & Bandits (for recommender systems use cases) and AlphaStar (multi-agent, multi-GPU w/ league-based self-play)
- Ray Datasets: new lazy execution model with automatic task fusion and memory-optimizing move semantics; first-class support for Pandas DataFrame blocks; efficient random access datasets.


Ray Autoscaler

🎉 New Features
- Support cache_stopped_nodes on Azure (21747)
- AWS Cloudwatch support (21523)

💫 Enhancements
- Improved documentation and standards around built in autoscaler node providers. (22236, 22237)
- Improved KubeRay support (22987, 22847, 22348, 22188)
- Remove redis requirement (22083)

🔨 Fixes
- No longer print infeasible warnings for internal placement group resources. Placement groups which cannot be satisfied by the autoscaler still trigger warnings. (22235)
- Default AMIs per AWS region are updated/fixed. (22506)
- GCP node termination updated (23101)
- Retry legacy k8s operator on monitor failure (22792)
- Cap min and max workers for manually managed on-prem clusters (21710)
- Fix initialization artifacts (22570)
- Ensure initial scaleup with high upscaling_speed isn't limited. (21953)

Ray Client

🎉 New Features:
- ray.init has consistent return value in client mode and driver mode (21355)

💫Enhancements:
- Gets and puts are streamed to support arbitrary object sizes (22100, 22327)

🔨 Fixes:
- Fix ray client object ref releasing in wrong context (22025)


Ray Core

🎉 New Features
- RuntimeEnv:
- Support setting timeout for runtime_env setup. (23082)
- Support setting pip_check and pip_version for runtime_env. (22826, 23306)
- env_vars will take effect when the pip install command is executed. (temporarily ineffective in conda) (22730)
- Support strongly-typed API ray.runtime.RuntimeEnv to define runtime env. (22522)
- Introduce [virtualenv](https://github.com/pypa/virtualenv) to isolate the pip type runtime env. (21801, 22309)
- Raylet shares fate with the dashboard agent, and the dashboard agent stays alive when it detects port conflicts. (22382, 23024)
- Enable dashboard in the minimal ray installation (21896)
- Add task and object reconstruction status to ray memory CLI tools (22317)

🔨 Fixes
- Report only memory usage of pinned object copies to improve scaledown. (22020)
- Scheduler:
- No spreading if a node is selected for lease request due to locality. (22015)
- Placement group scheduling: Non-STRICT_PACK PGs should be sorted by resource priority, size (22762)
- Round robin during spread scheduling (21303)
- Object store:
- Increment ref count when creating an ObjectRef to prevent object from going out of scope (22120)
- Cleanup handling for nondeterministic object size during transfer (22639)
- Fix bug in fusion for spilled objects (22571)
- Handle IO worker failures correctly (20752)
- Improve ray stop behavior (22159)
- Avoid warning when receiving too many logs from a different job (22102)
- GCS resource manager bug fix and clean up. (22462, 22459)
- Release GIL when running `parallel_memcopy()` / `memcpy()` during serializations. (22492)
- Fix registering serializer before initializing Ray. (23031)

🏗 Architecture refactoring
- Ray distributed scheduler refactoring: (21927, 21992, 22160, 22359, 22722, 22817, 22880, 22893, 22885, 22597, 22857, 23124)
- Removed support for bootstrapping with Redis.


Ray Data Processing
🎉 New Features
- Big Performance and Stability Improvements:
- Add lazy execution mode with automatic stage fusion and optimized memory reclamation via block move semantics (22233, 22374, 22373, 22476)
- Support for random access datasets, providing efficient random access to rows via binary search (22749)
- Add automatic round-robin load balancing for reading and shuffle reduce tasks, obviating the need for the `_spread_resource_prefix` hack (21303)
- More Efficient Tabular Data Wrangling:
- Add first-class support for Pandas blocks, removing expensive Arrow <-> Pandas conversion costs (21894)
- Expose `TableRow` API + minimize copies/type-conversions on row-based ops (22305)
- Groupby + Aggregations Improvements:
- Support mapping over groupby groups (22715)
- Support ignoring nulls in aggregations (20787)
- Improved Dataset Windowing:
- Support windowing a dataset by bytes instead of number of blocks (22577)
- Batch across windows in `DatasetPipeline`s (22830)
- Better Text I/O:
- Support streaming snappy compression for text files (22486)
- Allow for custom decoding error handling in `read_text()` (21967)
- Add option for dropping empty lines in `read_text()` (22298)
- New Operations:
- Add `add_column()` utility for adding derived columns (21967)
- Support for metadata provider callback for read APIs (22896)
- Support configuring autoscaling actor pool size (22574)
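
The random access item above mentions looking up rows via binary search. The idea can be sketched as a toy index over key-sorted blocks. This is not the Ray Datasets implementation: the class name, the in-memory list-of-dicts block layout, and the requirement that blocks are already globally sorted by the key are all assumptions of this example.

```python
import bisect
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class SortedBlockIndex:
    """Toy random-access index over blocks of rows sorted by `key`."""

    def __init__(self, blocks: List[List[Row]], key: str) -> None:
        # Assumption: rows are sorted within each block and across blocks.
        self._blocks = [block for block in blocks if block]
        self._block_keys = [[row[key] for row in block] for block in self._blocks]
        # The largest key in each block lets us pick the right block quickly.
        self._upper_bounds = [keys[-1] for keys in self._block_keys]

    def get(self, value: Any) -> Optional[Row]:
        # First binary search: find the block whose key range can hold `value`.
        i = bisect.bisect_left(self._upper_bounds, value)
        if i == len(self._blocks):
            return None
        # Second binary search: find the row inside that block.
        keys = self._block_keys[i]
        j = bisect.bisect_left(keys, value)
        if j < len(keys) and keys[j] == value:
            return self._blocks[i][j]
        return None


blocks = [
    [{"id": 1, "v": "a"}, {"id": 3, "v": "b"}],
    [{"id": 5, "v": "c"}, {"id": 8, "v": "d"}],
]
index = SortedBlockIndex(blocks, key="id")
row = index.get(5)
```

Both lookups are logarithmic, so a single `get` costs O(log B + log R) for B blocks of at most R rows each.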

🔨 Fixes
- Force lazy datasource materialization in order to respect `DatasetPipeline` stage boundaries (21970)
- Simplify lifetime of designated block owner actor, and don’t create it if dynamic block splitting is disabled (22007)
- Respect 0 CPU resource request when using manual resource-based load balancing (22017)
- Remove batch format ambiguity by always converting Arrow batches to Pandas when `batch_format="native"` is given (21566)
- Fix leaked stats actor handle due to closure capture reference counting bug (22156)
- Fix boolean tensor column representation and slicing (22323)
- Fix unhandled empty block edge case in shuffle (22367)
- Fix unserializable Arrow Partitioning spec (22477)
- Fix incorrect `iter_epochs()` batch format (22550)
- Fix infinite `iter_epochs()` loop on unconsumed epochs (22572)
- Fix infinite hang on `split()` when `num_shards < num_rows` (22559)
- Patch Parquet file fragment serialization to prevent metadata fetching (22665)
- Don’t reuse task workers for actors or GPU tasks (22482)
- Pin pipeline executor actors to driver node to allow for lineage-based fault tolerance for pipelines (22715)
- Always use non-empty blocks to determine schema (22834)
- API fix bash (22886)
- Make label_column optional for `to_tf()` so it can be used for inference (22916)
- Fix `schema()` for `DatasetPipeline`s (23032)
- Fix equalized split when `num_splits == num_blocks` (23191)

💫 Enhancements
- Optimize Parquet metadata serialization via batching (21963)
- Optimize metadata read/write for Ray Client (21939)
- Add sanity checks for memory utilization (22642)

🏗 Architecture refactoring
- Use threadpool to submit `DatasetPipeline` stages (22912)

RLlib

🎉 New Features
- New “AlphaStar” algorithm: A parallelized, multi-agent/multi-GPU learning algorithm, implementing league-based self-play. (21356, 21649)
- SlateQ algorithm has been re-tested, upgraded (multi-GPU capable, TensorFlow version), and bug-fixed (added to weekly learning tests). (22389, 23276, 22544, 22543, 23168, 21827, 22738)
- Bandit algorithms: Moved into `agents` folder as first-class citizens, TensorFlow-Version, unified w/ other agents’ APIs. (22821, 22028, 22427, 22465, 21949, 21773, 21932, 22421)
- ReplayBuffer API (in progress): Allow users to customize and configure their own replay buffers and use these inside custom or built-in algorithms. (22114, 22390, 21808)
- Datasets support for RLlib: Dataset Reader/Writer and documentation. (21808, 22239, 21948)

🔨 Fixes
- Fixed memory leak in SimpleReplayBuffer. (22678)
- Fixed Unity3D built-in examples: Action bounds from -inf/inf to -1.0/1.0. (22247)
- Various bug fixes. (22350, 22245, 22171, 21697, 21855, 22076, 22590, 22587, 22657, 22428, 23063, 22619, 22731, 22534, 22074, 22078, 22641, 22684, 22398, 21685)

🏗 Architecture refactoring
- A3C: Moved into new `training_iteration` API (from `execution_plan` API). Led to a ~2.7x performance increase on an Atari + CNN + LSTM benchmark. (22126, 22316)
- Make `multiagent->policies_to_train` more flexible via callable option (alternative to providing a list of policy IDs). (20735)
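
To illustrate what accepting either a list or a callable for `policies_to_train` means in practice, here is a hedged sketch of the dispatch logic. The helper name, the `(policy_id, batch)` callable signature, and the treatment of `None` as "train everything" are assumptions made for this example rather than a copy of RLlib's code.

```python
from typing import Any, Callable, Collection, Optional, Union

PoliciesToTrain = Optional[Union[Collection[str], Callable[[str, Any], bool]]]


def is_policy_trainable(
    policies_to_train: PoliciesToTrain, policy_id: str, batch: Any = None
) -> bool:
    """Decide whether `policy_id` should be updated in this training step."""
    if policies_to_train is None:
        # Assumption: no restriction means every policy is trained.
        return True
    if callable(policies_to_train):
        # Callable form: the decision can depend on the id and the batch.
        return bool(policies_to_train(policy_id, batch))
    # Collection form: a fixed set of policy ids.
    return policy_id in policies_to_train


as_list = ["learner_0"]
as_callable = lambda policy_id, batch: policy_id.startswith("learner_")

list_result = is_policy_trainable(as_list, "learner_1")
callable_result = is_policy_trainable(as_callable, "learner_1")
```

The callable form is useful when the set of trainable policies changes over time, for example in league-based self-play where new policy ids appear during training.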

💫Enhancements:
- Env pre-checking module now active by default. (22191)
- Callbacks: Added `on_sub_environment_created` and `on_trainer_init` callback options. (21893, 22493)
- RecSim environment wrappers: Ability to use google’s RecSim for recommender systems more easily w/ RLlib algorithms (3 RLlib-ready example environments). (22028, 21773, 22211)
- MARWIL loss function enhancement (exploratory term for stddev). (21493)

📖Documentation:
- Docs enhancements: Setup-dev instructions; Ray datasets integration. (22239)
- Other doc enhancements and fixes. (23160, 23226, 22496, 22489, 22380)


Ray Workflow
🎉 New Features:
- Support skip checkpointing.

🔨 Fixes:
- Fix an issue where the event loop is not set.


Tune
🎉 New Features:
- Expose new checkpoint interface to users ([22741](https://github.com/ray-project/ray/pull/22741))

💫Enhancements:
- Better error message for gRPC resource exhausted error. (22806)
- Add CV support for XGB/LGBM Tune callbacks (22882)
- [Make sure tune.run can run inside worker thread](https://github.com/ray-project/ray/commit/b8c28d1f2beb7a141f80a5fd6053c8e8520718b9) ([22566](https://github.com/ray-project/ray/pull/22566))
- Add Trainable.postprocess_checkpoint (22973)
- [Trainables will now know TUNE_ORIG_WORKING_DIR](https://github.com/ray-project/ray/commit/f5995dccdf0ab4012e511c3379b19f06f1d307b5) ([22803](https://github.com/ray-project/ray/pull/22803))
- Retry cloud sync up/down/delete on fail (22029)
- Support functools.partial names and treat as function in registry (21518)

🔨Fixes:
- [Cleanup incorrectly formatted strings (Part 2: Tune)](https://github.com/ray-project/ray/commit/761f927720586403c642dc62fa510c033fd7ffd5) ([23129](https://github.com/ray-project/ray/pull/23129))
- Fix error handling for fail_fast case. (22982)
- Remove Trainable.update_resources (22471)
- Bump flaml from 0.6.7 to 0.9.7 in /python/requirements/ml (22071)
- Fix analysis without registered trainable (21475)
- Update Lightning examples to support PTL 1.5 (20562)
- Fix WandbTrainableMixin config for rllib trainables (22063)
- [wandb] Use resume=False per default ([21892](https://github.com/ray-project/ray/pull/21892))

🏗 Refactoring:
- [Move resource updater out of trial executor](https://github.com/ray-project/ray/commit/cc1728120f7d49b0016d190971bc8056d3245c5d) ([23178](https://github.com/ray-project/ray/pull/23178))
- [Preparation for deadline schedulers](https://github.com/ray-project/ray/commit/4a15c6f8f3ebb634f7cef967a097f621462e4e50) ([22006](https://github.com/ray-project/ray/pull/22006))
- [Single wait refactor.](https://github.com/ray-project/ray/commit/323511b716416088859967686c71889ef8425204) ([21852](https://github.com/ray-project/ray/pull/21852))

📖Documentation:
- Tune docs overhaul (first part) (22112)
- [Tune overhaul part II](https://github.com/ray-project/ray/commit/372c620f58c2269dccd5a871a72aebb9df76e32c) ([22656](https://github.com/ray-project/ray/pull/22656))
- Note TPESampler performance issues in docs (22545)
- hyperopt notebook (22315)

Train
🎉 New Features
- Integration with PyTorch profiler. Easily enable the PyTorch profiler with Ray Train to profile training and visualize stats in TensorBoard (22345).
- Automatic pipelining of host to device transfer. While training is happening on one batch of data, the next batch of data is concurrently being moved from CPU to GPU (22716, 22974)
- Automatic Mixed Precision. Easily enable PyTorch automatic mixed precision during training (22227).
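
The host-to-device pipelining item describes overlapping the transfer of the next batch with training on the current one. The overlap pattern itself can be sketched with a background thread and a bounded queue. This is a CPU-only stand-in, not Ray Train's implementation: the `prefetch` name and the `transfer` callable (a placeholder for a real device copy) are assumptions.

```python
import queue
import threading
from typing import Callable, Iterable, Iterator, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def prefetch(
    batches: Iterable[A], transfer: Callable[[A], B], buffer_size: int = 1
) -> Iterator[B]:
    """Yield transferred batches while the next one is prepared in the background."""
    results: "queue.Queue" = queue.Queue(maxsize=buffer_size)

    def worker() -> None:
        try:
            for batch in batches:
                # Blocks when the buffer is full, so at most `buffer_size`
                # batches are staged ahead of the consumer.
                results.put(("ok", transfer(batch)))
        except Exception as exc:  # Forward failures to the consumer.
            results.put(("error", exc))
        else:
            results.put(("done", None))

    # Daemon thread: if the consumer stops early, the worker does not
    # keep the process alive (a simplification for this sketch).
    threading.Thread(target=worker, daemon=True).start()

    while True:
        kind, payload = results.get()
        if kind == "done":
            return
        if kind == "error":
            raise payload
        yield payload


# `lambda b: b * 10` stands in for moving a batch to an accelerator.
staged = list(prefetch(range(4), lambda b: b * 10))
```

A bounded queue keeps memory use predictable: the producer can only run `buffer_size` batches ahead of the training loop.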

💫 Enhancements
- Add utility function to enable reproducibility for PyTorch training (22851)
- Add initial support for metrics aggregation (22099)
- Add support for `trainer.best_checkpoint` and `Trainer.load_checkpoint_path`. You can now directly access the best in memory checkpoint, or load an arbitrary checkpoint path to memory. (22306)

🔨 Fixes
- Add a utility function to turn off TF autosharding (21887)
- Fix fault tolerance for Tensorflow training (22508)
- Train utility methods (`train.report()`, etc.) can now be called outside of a Train session (21969)
- Fix accuracy calculation for CIFAR example (22292)
- Better error message for placement group time out (22845)

📖 Documentation
- Update docs for ray.train.torch import (22555)
- Clarify shuffle documentation in `prepare_data_loader` (22876)
- Denote `train.torch.get_device` as a Public API (22024)
- Minor fixes on Ray Train user guide doc (22379)


Serve
🎉 New Features
- [Deployment Graph API](https://docs.ray.io/en/master/serve/deployment-graph.html) is now in alpha. It provides a way to build, test, and deploy complex inference graphs composed of many deployments. (23177, 23252, 23301, 22840, 22710, 22878, 23208, 23290, 23256, 23324, 23289, 23285, 22473, 23125, 23210)
- New experimental REST API and CLI for creating and managing deployments. (22839, 22257, 23198, 23027, 22039, 22547, 22578, 22611, 22648, 22714, 22805, 22760, 22917, 23059, 23195, 23265, 23157, 22706, 23017, 23026, 23215)
- New set of [HTTP adapters](https://docs.ray.io/en/master/serve/http-servehandle.html#http-adapters) that make it easy to build simple applications, as well as [Ray AI Runtime model wrappers](https://docs.ray.io/en/master/ray-air/getting-started.html#air-serve-integration) in alpha. (22913, 22914, 22915, 22995)
- New `health_check` API for end-to-end, user-provided health checks. (22178, 22121, 22297)
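
A user-provided health check can be sketched without Ray as a method that raises when the replica is unhealthy. The class `ModelReplica`, the method name `check_health`, and the probe `is_healthy` are assumptions of this sketch; consult the Serve documentation for the actual hook name and configuration options.

```python
class ModelReplica:
    """Stand-in for a deployment replica; not a real Ray Serve class."""

    def __init__(self):
        self.model_loaded = True

    def check_health(self):
        # Raise to signal that this replica should be treated as unhealthy.
        if not self.model_loaded:
            raise RuntimeError("model is not loaded")


def is_healthy(replica):
    """Hypothetical controller-side probe: healthy unless the check raises."""
    try:
        replica.check_health()
    except Exception:
        return False
    return True


replica = ModelReplica()
healthy_before = is_healthy(replica)  # check passes while the model is loaded
replica.model_loaded = False
healthy_after = is_healthy(replica)  # check raises, so the probe reports unhealthy
```

Signalling failure by raising keeps the check end to end: it can exercise whatever the replica actually depends on instead of returning a static flag.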

🔨 Fixes
- Autoscaling algorithm will now relinquish the most idle nodes when scaling down (22669)
- Serve can now manage Java replicas (22628)
- Added a hands-on self-contained MLflow and Ray Serve deployment example (22192)
- Added `root_path` setting to `http_options` (21090)
- Remove `shard_key`, `http_method`, and `http_headers` in `ServeHandle` (21590)
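
The scale-down behavior noted in the first fix above can be pictured as follows. This is a simplified sketch and not Serve's actual algorithm: `pick_replicas_to_stop` and the replica-to-node mapping are hypothetical, and "most idle" is approximated here by "hosting the fewest replicas".

```python
from collections import Counter


def pick_replicas_to_stop(replica_to_node, num_to_stop):
    """Choose replicas to stop, draining the least-loaded nodes first."""
    replicas_per_node = Counter(replica_to_node.values())
    # Order by the replica count of the hosting node, then by node and
    # replica name so the result is deterministic.
    ordered = sorted(
        replica_to_node,
        key=lambda replica: (
            replicas_per_node[replica_to_node[replica]],
            replica_to_node[replica],
            replica,
        ),
    )
    return ordered[:num_to_stop]


# Hypothetical placement: node-b hosts a single replica, so it is drained first.
placement = {"r1": "node-a", "r2": "node-a", "r3": "node-a", "r4": "node-b"}
to_stop = pick_replicas_to_stop(placement, 2)
```

Stopping `r4` first empties `node-b`, which is what lets a cluster autoscaler release that node.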


Dashboard
🔨Fixes:
- Update CPU and memory reporting in Kubernetes. (21688)

Thanks
Many thanks to all those who contributed to this release!
edoakes, pcmoritz, jiaodong, iycheng, krfricke, smorad, kfstorm, jjyyxx, rodrigodelazcano, scv119, dmatrix, avnishn, fyrestone, clarkzinzow, wumuzi520, gramhagen, XuehaiPan, iasoon, birgerbr, n30111, tbabej, Zyiqin-Miranda, suquark, pdames, tupui, ArturNiederfahrenhorst, ashione, ckw017, siddgoel, Catch-Bull, vicyap, spolcyn, stephanie-wang, mopga, Chong-Li, jjyao, raulchen, sven1977, nikitavemuri, jbedorf, mattip, bveeramani, czgdp1807, dependabot[bot], Fabien-Couthouis, willfrey, mwtian, SlowShip, Yard1, WangTaoTheTonic, Wendi-anyscale, kaushikb11, kennethlien, acxz, DmitriGekhtman, matthewdeng, mraheja, orcahmlee, richardliaw, dsctt, yupbank, Jeffwan, gjoliver, jovany-wang, clay4444, shrekris-anyscale, jwyyy, kyle-chen-uber, simon-mo, ericl, amogkam, jianoaix, rkooo567, maxpumperla, architkulkarni, chenk008, xwjiang2010, robertnishihara, qicosmos, sriram-anyscale, SongGuyang, jon-chuang, wuisawesome, valiantljk, simonsays1980, ijrsvt

© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.