Ray


2.10.0

Release Highlights<a id="release-highlights"></a>
The Ray 2.10 release brings important stability improvements and enhancements to Ray Data, which is now generally available (GA).

- [Data] Ray Data becomes generally available, with stability improvements in streaming execution, reading and writing data, better task concurrency control, and improved debuggability via the dashboard, logging, and metrics visualization.
- [RLlib] β€œ**New API Stack**” officially announced as alpha for PPO and SAC.
- [Serve] Added a default autoscaling policy set via `num_replicas="auto"` ([42613](https://github.com/ray-project/ray/issues/42613)); see the example after this list.
- [Serve] Added support for active load shedding via `max_queued_requests` ([42950](https://github.com/ray-project/ray/issues/42950)).
- [Serve] Added replica queue length caching to the DeploymentHandle scheduler ([42943](https://github.com/ray-project/ray/pull/42943)).
  - This should reduce overhead in the Serve proxy and handles.
  - `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced ([42947](https://github.com/ray-project/ray/issues/42947)).
  - If you see any issues, please report them on GitHub; you can disable this behavior by setting `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
  - `max_concurrent_queries` -> `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` -> `target_ongoing_requests`
  - `downscale_smoothing_factor` -> `downscaling_factor`
  - `upscale_smoothing_factor` -> `upscaling_factor`
- [Serve] **WARNING**: the following default values will change in Ray 2.11:
  - Default for `max_ongoing_requests` will change from 100 to 5.
  - Default for `target_ongoing_requests` will change from 1 to 2.
- [Core] [Autoscaler v2](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#kuberay-autoscaler-v2) is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
- [Train] Added support for accelerator types via `ScalingConfig(accelerator_type)`.
- [Train] Revamped the `XGBoostTrainer` and `LightGBMTrainer` to no longer depend on `xgboost_ray` and `lightgbm_ray`. A new, more flexible API will be released in a future release.
- [Train/Tune] Refactored local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`.
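
For illustration, here is a minimal sketch (not from the original notes) of the new Serve options highlighted above, assuming a Ray 2.10 installation; the deployment body and values are placeholders.

```python
from ray import serve

@serve.deployment(
    num_replicas="auto",      # new default autoscaling policy (42613)
    max_queued_requests=100,  # actively shed load beyond this queue size (42950)
)
class Model:
    async def __call__(self, request) -> str:
        return "ok"

app = Model.bind()
# serve.run(app)  # deploy onto a running Ray cluster
```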

Ray Libraries<a id="ray-libraries"></a>

Ray Data<a id="ray-data"></a>

πŸŽ‰ New Features:
- Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (43026, 43171, 43298, 43299, 42930, 42504)
- Metadata read stability improvements to avoid transient AWS errors, including retries on application-level exceptions, spreading tasks across multiple nodes, and a configurable retry interval (42044, 43216, 42922, 42759)
- Allow concurrency control over tasks for the read, map, and write APIs (42849, 43113, 43177, 42637); see the example after this list
- Data dashboard and statistics improvements with more runtime metrics for each component (43790, 43628, 43241, 43477, 43110, 43112)
- Allow specifying application-level errors to retry for actor tasks (42492)
- Add `num_rows_per_file` parameter to file-based writes (42694)
- Add `DataIterator.materialize` (43210)
- Skip schema call in `DataIterator.to_tf` if `tf.TypeSpec` is provided (42917)
- Add option to append for `Dataset.write_bigquery` (42584)
- Deprecate legacy components and classes (43575, 43178, 43347, 43349, 43342, 43341, 42936, 43144, 43022, 43023)
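
A minimal sketch, assuming Ray 2.10, of the new task concurrency controls and the `num_rows_per_file` option referenced above; the paths are placeholders.

```python
import ray

# Cap the number of concurrent read tasks for this read.
ds = ray.data.read_parquet("s3://my-bucket/input/", concurrency=8)
# Cap the number of concurrent map tasks.
ds = ds.map_batches(lambda batch: batch, concurrency=4)
# Bound the number of rows written to each output file.
ds.write_parquet("/tmp/output/", num_rows_per_file=100_000)
```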

πŸ’« Enhancements:

- Restructure stdout logging for better readability (43360)
- Add a more performant way to read large TFRecord datasets (42277)
- Modify `ImageDatasource` to use `Image.BILINEAR` as the default image resampling filter (43484)
- Reduce internal stack trace output by default (43251)
- Perform incremental writes to Parquet files (43563)
- Warn on excessive driver memory usage during shuffle ops (42574)
- Distributed reads for `ray.data.from_huggingface` (42599); see the example after this list
- Remove `Stage` class and related usages (42685)
- Improve stability of reading JSON files to avoid PyArrow errors (42558, 42357)
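
As an illustration of the distributed `ray.data.from_huggingface` read, a minimal sketch assuming the Hugging Face `datasets` package is installed:

```python
import datasets
import ray

hf_ds = datasets.load_dataset("imdb", split="train")
ds = ray.data.from_huggingface(hf_ds)  # public datasets are now read in parallel
print(ds.schema())
```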

πŸ”¨ Fixes:

- Turn off actor locality by default (44124)
- Normalize block types before internal multi-block operations (43764)
- Fix memory metrics for `OutputSplitter` (43740)
- Fix race condition issue in `OpBufferQueue` (43015)
- Fix early stop for multiple `Limit` operators (42958)
- Fix deadlocks caused by `Dataset.streaming_split` that hung jobs (42601)

πŸ“– Documentation:

- Revamp Ray Data documentation for GA (44006, 44007, 44008, 44098, 44168, 44093, 44105)

Ray Train<a id="ray-train"></a>

πŸŽ‰ New Features:

- Add support for accelerator types via `ScalingConfig(accelerator_type)` for improved worker scheduling (43090)
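
A minimal sketch of the new `accelerator_type` option; the training function is a stub, and the `"A100"` label assumes such nodes exist in the cluster.

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    ...  # per-worker training logic (placeholder)

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,
        use_gpu=True,
        accelerator_type="A100",  # schedule each worker onto an A100 GPU
    ),
)
# result = trainer.fit()
```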

πŸ’« Enhancements:

- Add a backend-specific context manager for `train_func` for setup/teardown logic (43209)
- Remove `DEFAULT_NCCL_SOCKET_IFNAME` to simplify network configuration (42808)
- Colocate the Trainer with the rank 0 worker to improve scheduling behavior (43115)

πŸ”¨ Fixes:

- Enable scheduling workers with `memory` resource requirements (42999)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (42037)
- [Lightning] Fix resuming from checkpoint when using `RayFSDPStrategy` (43594)
- [Lightning] Fix deadlock in `RayTrainReportCallback` (42751)
- [Transformers] Fix checkpoint reporting behavior when `get_latest_checkpoint` returns None (42953)

πŸ“– Documentation:

- Enhance docstring and user guides for `train_loop_config` (43691)
- Clarify in `ray.train.report` docstring that it is not a barrier (42422)
- Improve documentation for `prepare_data_loader` shuffle behavior and `set_epoch` (41807)

πŸ— Architecture refactoring:

- Simplify XGBoost and LightGBM Trainer integrations. Implemented `XGBoostTrainer` and `LightGBMTrainer` as `DataParallelTrainer`. Removed dependency on `xgboost_ray` and `lightgbm_ray`. (42111, 42767, 43244, 43424)
- Refactor local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user’s home directory (`~/ray_results`). (43369, 43403, 43689)
- Split the overloaded `ray.train.torch.get_device` into a separate `get_devices` API for multi-GPU worker setups (42314); see the example after this list
- Refactor restoration configuration to be centered around `storage_path` (42853, 43179)
- Deprecations related to `SyncConfig` (42909)
- Remove deprecated `preprocessor` argument from Trainers (43146, 43234)
- Hard-deprecate `MosaicTrainer` and remove `SklearnTrainer` (42814)
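
A short sketch of the `get_device`/`get_devices` split referenced above, inside a training function running on a multi-GPU worker:

```python
from ray.train.torch import get_device, get_devices

def train_func():
    device = get_device()    # the single device assigned to this worker
    devices = get_devices()  # all assigned devices, for multi-GPU-per-worker setups
```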


Ray Tune<a id="ray-tune"></a>

πŸ’« Enhancements:

- Increase the minimum number of allowed pending trials for faster auto-scaleup (43455)
- Add support to `TBXLogger` for logging images (37822)
- Improve validation of `Experiment(config)` to handle RLlib `AlgorithmConfig` (42816, 42116)

πŸ”¨ Fixes:

- Fix `reuse_actors` error on actor cleanup for function trainables (42951)
- Make path behavior OS-agnostic by using `Path.as_posix` over `os.path.join` (42037)

πŸ“– Documentation:

- Minor documentation fixes (42118, 41982)

πŸ— Architecture refactoring:

- Refactor local staging directory to remove the need for `local_dir` and `RAY_AIR_LOCAL_CACHE_DIR`. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to `storage_path`, rather than having another copy in the user’s home directory (`~/ray_results`). (43369, 43403, 43689)
- Deprecations related to `SyncConfig` and `chdir_to_trial_dir` (42909)
- Refactor restoration configuration to be centered around `storage_path` (42853, 43179)
- Add back `NevergradSearch` (42305)
- Clean up invalid `checkpoint_dir` and `reporter` deprecation notices (42698)

Ray Serve<a id="ray-serve"></a>

πŸŽ‰ New Features:

- Added support for active load shedding via `max_queued_requests` ([42950](https://github.com/ray-project/ray/issues/42950)).
- Added a default autoscaling policy set via `num_replicas="auto"` ([42613](https://github.com/ray-project/ray/issues/42613)).

πŸ— API Changes:

- Renamed the following parameters. Each of the old names will be supported for another release before removal.
  - `max_concurrent_queries` to `max_ongoing_requests`
  - `target_num_ongoing_requests_per_replica` to `target_ongoing_requests`
  - `downscale_smoothing_factor` to `downscaling_factor`
  - `upscale_smoothing_factor` to `upscaling_factor`
- **WARNING**: the following default values will change in Ray 2.11:
  - Default for `max_ongoing_requests` will change from 100 to 5.
  - Default for `target_ongoing_requests` will change from 1 to 2.
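
For reference, a hedged sketch of a deployment using the new parameter names; the values are arbitrary examples, not recommendations.

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,  # formerly max_concurrent_queries
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 2,  # formerly target_num_ongoing_requests_per_replica
        "upscaling_factor": 1.0,       # formerly upscale_smoothing_factor
        "downscaling_factor": 0.5,     # formerly downscale_smoothing_factor
    },
)
class Worker:
    async def __call__(self, request) -> str:
        return "ok"
```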

πŸ’« Enhancements:

- Add the `RAY_SERVE_LOG_ENCODING` environment variable to set the global logging behavior for Serve ([42781](https://github.com/ray-project/ray/pull/42781)).
- Configure Serve's gRPC proxy to allow large payloads ([43114](https://github.com/ray-project/ray/pull/43114)).
- Add a `blocking` flag to `serve.run()` ([43227](https://github.com/ray-project/ray/pull/43227)); see the example after this list.
- Add actor ID and worker ID to Serve structured logs ([43725](https://github.com/ray-project/ray/pull/43725)).
- Added replica queue length caching to the DeploymentHandle scheduler ([42943](https://github.com/ray-project/ray/pull/42943)).
  - This should reduce overhead in the Serve proxy and handles.
  - `max_ongoing_requests` (`max_concurrent_queries`) is also now strictly enforced ([42947](https://github.com/ray-project/ray/issues/42947)).
  - If you see any issues, please report them on GitHub; you can disable this behavior by setting `RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0`.
- Autoscaling metrics (tracking ongoing and queued requests) are now collected at deployment handles by default instead of at the Serve replicas ([42578](https://github.com/ray-project/ray/pull/42578)).
  - This means you can now set `max_ongoing_requests=1` for autoscaling deployments and still upscale properly, because requests queued at handles are properly taken into account for autoscaling.
  - You should expect deployments to upscale more aggressively during bursty traffic, because requests will likely queue up at handles during bursts.
  - If you see any issues, please report them on GitHub; you can switch back to the old method of collecting metrics by setting the environment variable `RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0`.
- Improved the downscaling behavior of `smoothing_factor` for low numbers of replicas ([42612](https://github.com/ray-project/ray/issues/42612)).
- Various logging improvements ([43707](https://github.com/ray-project/ray/pull/43707), [#43708](https://github.com/ray-project/ray/pull/43708), [#43629](https://github.com/ray-project/ray/pull/43629), [#43557](https://github.com/ray-project/ray/pull/43557)).
- During in-place upgrades or when replicas become unhealthy, Serve will no longer wait for old replicas to gracefully terminate before starting new ones ([43187](https://github.com/ray-project/ray/pull/43187)). New replicas will be eagerly started to satisfy the target number of healthy replicas.
  - This new behavior is on by default and can be turned off by setting `RAY_SERVE_EAGERLY_START_REPLACEMENT_REPLICAS=0`.
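
A minimal sketch of the new `blocking` flag referenced above; the deployment is a stub.

```python
from ray import serve

@serve.deployment
def hello(request) -> str:
    return "hello"

# Blocks until interrupted instead of returning once the app is deployed.
serve.run(hello.bind(), blocking=True)
```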

πŸ”¨ Fixes:

- Fix the deployment route prefix being overridden by the default route prefix from the `serve run` CLI ([43805](https://github.com/ray-project/ray/pull/43805)).
- Fixed a bug causing batch methods to hang upon cancellation ([42593](https://github.com/ray-project/ray/issues/42593)).
- Unpinned FastAPI dependency version ([42711](https://github.com/ray-project/ray/issues/42711)).
- Delay proxy marking itself as healthy until it has routes from the controller ([43076](https://github.com/ray-project/ray/issues/43076)).
- Fixed an issue where multiplexed deployments could go into infinite backoff ([43965](https://github.com/ray-project/ray/issues/43965)).
- Silence noisy `KeyError` on disconnects ([43713](https://github.com/ray-project/ray/pull/43713)).
- Fixed a bug where Prometheus counter metrics were emitted as gauges ([43795](https://github.com/ray-project/ray/pull/43795), [#43901](https://github.com/ray-project/ray/pull/43901)).
  - All Serve counter metrics are now emitted as counters with a `_total` suffix. The old gauge metrics are still emitted for compatibility.

πŸ“– Documentation:

- Update Serve logging config docs ([43483](https://github.com/ray-project/ray/pull/43483)).
- Added documentation for `max_replicas_per_node` ([42743](https://github.com/ray-project/ray/pull/42743)).

RLlib<a id="rllib"></a>

πŸŽ‰ New Features:

- The **"new API stack"** is now in alpha stage and available for **PPO** (single-agent (42272) and multi-agent) and for **SAC** (single-agent) ([42571](https://github.com/ray-project/ray/pull/42571), [#42570](https://github.com/ray-project/ray/pull/42570), [#42568](https://github.com/ray-project/ray/pull/42568)); see the example after this list
- **ConnectorV2 API** ([43669](https://github.com/ray-project/ray/pull/43669), [#43680](https://github.com/ray-project/ray/pull/43680), [#43040](https://github.com/ray-project/ray/pull/43040), [#41074](https://github.com/ray-project/ray/pull/41074), [#41212](https://github.com/ray-project/ray/pull/41212))
- **Episode APIs** (SingleAgentEpisode and MultiAgentEpisode) ([42009](https://github.com/ray-project/ray/pull/42009), [#43275](https://github.com/ray-project/ray/pull/43275), [#42296](https://github.com/ray-project/ray/pull/42296), [#43818](https://github.com/ray-project/ray/pull/43818), [#41631](https://github.com/ray-project/ray/pull/41631))
- **EnvRunner APIs** (SingleAgentEnvRunner and MultiAgentEnvRunner) ([41558](https://github.com/ray-project/ray/pull/41558), [#41825](https://github.com/ray-project/ray/pull/41825), [#42296](https://github.com/ray-project/ray/pull/42296), [#43779](https://github.com/ray-project/ray/pull/43779))
- In preparation of **DQN** on the new API stack: PrioritizedEpisodeReplayBuffer ([43258](https://github.com/ray-project/ray/pull/43258), [#42832](https://github.com/ray-project/ray/pull/42832))
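
A hedged sketch of opting a PPO config into the alpha stack; because the stack is in alpha, the exact opt-in flag (`_enable_new_api_stack` here) may change in later releases.

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .experimental(_enable_new_api_stack=True)  # opt in to the alpha API stack
)
algo = config.build()
# algo.train()
```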

πŸ’« Enhancements:

- **Old API Stack cleanups:**
  - Move `SampleBatch` column names (e.g. `SampleBatch.OBS`) into a new class (`Columns`). ([43665](https://github.com/ray-project/ray/pull/43665))
  - Remove old exec_plan API code. ([41585](https://github.com/ray-project/ray/pull/41585))
  - Introduce `OldAPIStack` decorator ([43657](https://github.com/ray-project/ray/pull/43657))
- **RLModule API:** Add functionality to define kernel and bias initializers via config. ([42137](https://github.com/ray-project/ray/pull/42137))
- **Learner/LearnerGroup APIs**:
  - Replace Learner/LearnerGroup-specific config classes (e.g. `LearnerHyperparameters`) with `AlgorithmConfig`. ([41296](https://github.com/ray-project/ray/pull/41296))
  - Allow updating Learner/LearnerGroup from Episodes. ([41235](https://github.com/ray-project/ray/pull/41235))
  - In preparation of **DQN** on the new API stack: ([43199](https://github.com/ray-project/ray/pull/43199), [#43196](https://github.com/ray-project/ray/pull/43196))

πŸ”¨ Fixes:

- New API stack bug fixes: fix `policy_to_train` logic ([41529](https://github.com/ray-project/ray/pull/41529)); fix multi-GPU for PPO on the new API stack ([#44001](https://github.com/ray-project/ray/pull/44001)); issue 40347 ([#42090](https://github.com/ray-project/ray/pull/42090))
- Other fixes: `MultiAgentEnv` would NOT call `env.close()` on a failed sub-env ([43664](https://github.com/ray-project/ray/pull/43664)); issue 42152 ([#43317](https://github.com/ray-project/ray/pull/43317)); issue 42396 ([#43316](https://github.com/ray-project/ray/pull/43316)); issue 41518 ([#42011](https://github.com/ray-project/ray/pull/42011)); issue 42385 ([#43313](https://github.com/ray-project/ray/pull/43313))

πŸ“– Documentation:

- New API Stack examples: Self-play and league-based self-play ([43276](https://github.com/ray-project/ray/pull/43276)), MeanStdFilter (for both single-agent and multi-agent) ([#43274](https://github.com/ray-project/ray/pull/43274)), Prev-actions/prev-rewards for multi-agent ([#43491](https://github.com/ray-project/ray/pull/43491))
- Other docs fixes and enhancements: ([43438](https://github.com/ray-project/ray/pull/43438), [#41472](https://github.com/ray-project/ray/pull/41472), [#42117](https://github.com/ray-project/ray/pull/42177), [#43458](https://github.com/ray-project/ray/pull/43458))

Ray Core and Ray Clusters<a id="ray-core-and-ray-clusters"></a>

Ray Core<a id="ray-core"></a>

πŸŽ‰ New Features:

- [Autoscaler v2](https://docs.ray.io/en/master/cluster/kubernetes/user-guides/configuring-autoscaling.html#kuberay-autoscaler-v2) is in alpha and can be tried out with KubeRay.
- Introduced [subreaper](https://docs.ray.io/en/master/ray-core/user-spawn-processes.html) to prevent leaks of sub-processes created by user code. (#42992)

πŸ’« Enhancements:

- The Ray state API `get_task()` now accepts an `ObjectRef` (43507); see the example after this list
- Add an option to disable task tracing for tasks and actors (42431)
- Improved object transfer throughput (43434)
- Ray Client now compares the Ray and Python versions for compatibility with the remote Ray cluster (42760)
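
A short sketch of passing an `ObjectRef` to the state API `get_task()`, assuming Ray 2.10:

```python
import ray
from ray.util.state import get_task

@ray.remote
def f() -> int:
    return 1

ref = f.remote()
ray.get(ref)
task_state = get_task(ref)  # previously required a task ID string
```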

πŸ”¨ Fixes:

- Fixed several bugs in the streaming generator (43775, 43772, 43413)
- Fixed a bug where Ray counter metrics were emitted as gauges (43795)
- Fixed a bug where tasks with empty resources didn't work with placement groups (43448)
- Fixed a bug where CPU resources were not released for blocked workers inside placement groups (43270)
- Fixed GCS crashes when the placement group commit phase failed due to node failure (43405)
- Fixed a bug where the Ray memory monitor prematurely killed tasks (43071)
- Fixed a placement group resource leak (42942)
- Upgraded cloudpickle to 3.0, which fixes an incompatibility with dataclasses (42730)

πŸ“– Documentation:

- Updated the docs for Ray accelerator support (41849)

Ray Clusters<a id="ray-clusters"></a>

πŸ’« Enhancements:

- [spark] Add a `heap_memory` param to the `setup_ray_cluster` API, and change the default per-worker-node and head-node configuration for global Ray clusters (42604)
- [spark] Add global mode for Ray-on-Spark clusters (41153)

πŸ”¨ Fixes:

- [VSphere] Only deploy the OVF to the first host of the cluster (42258)

Thanks

Many thanks to all those who contributed to this release!

ronyw7, xsqian, justinvyu, matthewdeng, sven1977, thomasdesr, veryhannibal, klebster2, can-anyscale, simran-2797, stephanie-wang, simonsays1980, kouroshHakha, Zandew, akshay-anyscale, matschaffer-roblox, WeichenXu123, matthew29tang, vitsai, Hank0626, anmyachev, kira-lin, ericl, zcin, sihanwang41, peytondmurray, raulchen, aslonnie, ruisearch42, vszal, pcmoritz, rickyyx, chrislevn, brycehuang30, alexeykudinkin, vonsago, shrekris-anyscale, andrewsykim, c21, mattip, hongchaodeng, dabauxi, fishbone, scottjlee, justina777, surenyufuz, robertnishihara, nikitavemuri, Yard1, huchen2021, shomilj, architkulkarni, liuxsh9, Jocn2020, liuyang-my, rkooo567, alanwguo, KPostOffice, woshiyyya, n30111, edoakes, y-abe, martinbomio, jiwq, arunppsg, ArturNiederfahrenhorst, kevin85421, khluu, JingChen23, masariello, angelinalg, jjyao, omatthew98, jonathan-anyscale, sjoshi6, gaborgsomogyi, rynewang, ratnopamc, chris-ray-zhang, ijrsvt, scottsun94, raychen911, franklsf95, GeneDer, madhuri-rai07, scv119, bveeramani, anyscalesam, zen-xu, npuichigo

2.9.3

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core

πŸ”¨ Fixes:

- Fix a protobuf breaking change by adding a compatibility layer. ([43172](https://github.com/ray-project/ray/pull/43172))
- Promote task failure logs to warnings so that failures can be diagnosed ([43147](https://github.com/ray-project/ray/pull/43147))
- Fix placement group leaks ([42942](https://github.com/ray-project/ray/pull/42942))

Ray Data

πŸ”¨ Fixes:

- Skip `schema` call in `to_tf` if `tf.TypeSpec` is provided ([42917](https://github.com/ray-project/ray/pull/42917))
- Skip recording memory-spilled stats when `get_memory_info_reply` fails ([42824](https://github.com/ray-project/ray/pull/42824))

Ray Serve

πŸ”¨ Fixes:

- Fixed `DeploymentStateManager` prematurely qualifying replicas as running ([43075](https://github.com/ray-project/ray/pull/43075))

Thanks

Many thanks to all those who contributed to this release!

rynewang, GeneDer, alexeykudinkin, edoakes, c21, rkooo567

2.9.2

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core
πŸ”¨ Fixes:
- Fix out of disk test on release branch (https://github.com/ray-project/ray/pull/42724)

Ray Data
πŸ”¨ Fixes:
- Fix failing huggingface test (https://github.com/ray-project/ray/pull/42727)
- Fix deadlocks caused by streaming_split (https://github.com/ray-project/ray/pull/42601) (https://github.com/ray-project/ray/pull/42755)
- Fix locality config not being respected in DataConfig (https://github.com/ray-project/ray/pull/42204) (https://github.com/ray-project/ray/pull/42722)
- Stability & accuracy improvements for Data+Train benchmark (https://github.com/ray-project/ray/pull/42027)
- Add retry for `_sample_fragment` during `ParquetDatasource._estimate_files_encoding_ratio()` (https://github.com/ray-project/ray/pull/42759) (https://github.com/ray-project/ray/pull/42774)
- Skip recording memory-spilled stats when `get_memory_info_reply` fails (https://github.com/ray-project/ray/pull/42824) (https://github.com/ray-project/ray/pull/42834)

Ray Serve
πŸ”¨ Fixes:
- Pin the fastapi & starlette versions to avoid breaking the proxy (https://github.com/ray-project/ray/pull/42740)
- Fix IS_PYDANTIC_2 logic for pydantic<1.9.0 (https://github.com/ray-project/ray/pull/42704) (https://github.com/ray-project/ray/pull/42708)
- Fix missing message body for JSON log formats (https://github.com/ray-project/ray/pull/42729) (https://github.com/ray-project/ray/pull/42874)

Thanks

Many thanks to all those who contributed to this release!

c21, raulchen, can-anyscale, edoakes, peytondmurray, scottjlee, aslonnie, architkulkarni, GeneDer, Zandew, sihanwang41

2.9.1

This patch release contains fixes for Ray Core, Ray Data, and Ray Serve.

Ray Core
πŸ”¨ Fixes:
- Add debugpy as the Ray debugger (42311)
- Fix a per-task leak of profile events in task events (42248)
- Make sure the Redis sync context and async context connect to the same Redis instance (42040)

Ray Data
πŸ”¨ Fixes:
- [Data] Retry write if error during file clean up (42326)

Ray Serve
πŸ”¨ Fixes:
- Improve handling of the WebSocket server disconnect scenario (42130)
- Fix pydantic config documentation (42216)
- Address issues under high network delays:
  - Enable setting the queue length response deadline via an environment variable (42001)
  - Add exponential backoff for `queue_len_response_deadline_s` (42041)

2.9.0

Release Highlights<a id="release-highlights"></a>

- This release contains fixes for the Ray Dashboard. Additional context can be found here: <https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023>
- Ray Train now has upgraded support for spot node preemption, allowing Ray Train to handle preemption-caused node failures differently from application errors.
- Ray is now compatible with Pydantic versions <2.0.0 and >=2.5.0, addressing a piece of user feedback we've consistently received.
- The Ray Dashboard now has a page for Ray Data to monitor real-time execution metrics.
- [Streaming generator](https://docs.ray.io/en/latest/ray-core/ray-generator.html) is now officially a public API (#41436, 38784). The streaming generator allows writing streaming applications easily on top of Ray via the Python generator API and has been used by Ray Serve and Ray Data for several releases. See the [documentation](https://docs.ray.io/en/master/ray-core/ray-generator.html) for details.
- We've added experimental support for new accelerators: Intel GPU (38553), Intel Gaudi Accelerators (40561), and Huawei Ascend NPU (41256).

Ray Libraries<a id="ray-libraries"></a>

Ray Data<a id="ray-data"></a>

πŸŽ‰ New Features:
* Add the dashboard for Ray Data to monitor real-time execution metrics and log file for debugging (<https://docs.ray.io/en/master/data/monitoring-your-workload.html>).
* Introduce `concurrency` argument to replace `ComputeStrategy` in map-like APIs (41461); see the example after this list
* Allow task failures during execution (41226)
* Support PyArrow 14.0.1 (41036)
* Add new API for reading and writing Datasource (<https://github.com/ray-project/ray/issues/40296>)
* Enable group-by over multiple keys in datasets (37832)
* Add support for multiple group keys in `map_groups` (40778)
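
A minimal sketch of the `concurrency` argument referenced above, covering both task-based and actor-based transforms:

```python
import ray

class Doubler:
    def __call__(self, batch):
        batch["id"] = batch["id"] * 2
        return batch

ds = ray.data.range(1000)
ds = ds.map_batches(lambda batch: batch, concurrency=4)  # task pool capped at 4 tasks
ds = ds.map_batches(Doubler, concurrency=(2, 8))         # autoscaling pool of 2 to 8 actors
print(ds.take(2))
```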

πŸ’« Enhancements:
- Optimize `OpState.outqueue_num_blocks` (41748)
- Improve stall detection for `StreamingOutputsBackpressurePolicy` (41637)
- Enable read-only Datasets to be executed on new execution backend (41466, 41597)
- Inherit block size from downstream ops (41019)
- Use runtime object memory for scheduling (41383)
- Add retries to file writes (41263)
- Make range datasource streaming (41302)
- Test core performance metrics (40757)
- Allow `ConcurrencyCapBackpressurePolicy._cap_multiplier` to be set to 1.0 (41222)
- Create `StatsManager` to manage `_StatsActor` remote calls (40913)
- Expose `max_retry_cnt` parameter for `BigQuery` Write (41163)
- Add rows output to Data metrics (40280)
- Add fault tolerance to remote tasks (41084)
- Add operator-level dropdown to the Ray Data overview (40981)
- Avoid slicing too-small blocks (40840)
- Ray Data jobs detail table (40756)
- Update default shuffle block size to 1GB (40839)
- Log progress bar to data logs (40814)
- Operator level metrics (40805)

πŸ”¨ Fixes:
- Partial fix for `Dataset.context` not being sealed after creation (41569)
- Fix the issue that `DataContext` is not propagated when using `streaming_split` (41473)
- Fix Parquet partition filter bug (40947)
- Fix split read output blocks (41070)
- Fix `BigQueryDatasource` fault tolerance bugs (40986)

πŸ“– Documentation:
- Add example of how to read and write custom file types (41785)
- Fix `ray.data.read_databricks_tables` doc (41366)
- Add `read_json` docs example for setting PyArrow block size when reading large files (40533)
- Add `AllToAllAPI` to dataset methods (40842)


Ray Train<a id="ray-train"></a>

πŸŽ‰ New Features:
- Support reading `Result` from cloud storage (40622)

πŸ’« Enhancements:
- Sort local Train workers by GPU ID (40953)
- Improve logging for Train worker scheduling information (40536)
- Load the latest unflattened metrics with `Result.from_path` (40684)
- Skip incrementing failure counter on preemption node died failures (41285)
- Update TensorFlow `ReportCheckpointCallback` to delete temporary directory (41033)

πŸ”¨ Fixes:
- Update config dataclass repr to check against None (40851)
- Add a barrier in Lightning `RayTrainReportCallback` to ensure synchronous reporting. (40875)
- Restore Tuner and `Result`s properly from moved storage path (40647)

πŸ“– Documentation:
- Improve torch, lightning quickstarts and migration guides + fix torch restoration example (41843)
- Clarify error message when trying to use local storage for multi-node distributed training and checkpointing (41844)
- Copy edits and adding links to docstrings (39617)
- Fix the missing ray module import in PyTorch Guide (41300)
- Fix typo in lightning_mnist_example.ipynb (40577)
- Fix typo in deepspeed.rst (40320)

πŸ— Architecture refactoring:
- Remove Legacy Trainers (41276)


Ray Tune<a id="ray-tune"></a>

πŸŽ‰ New Features:
- Support reading `Result` from cloud storage (40622)

πŸ’« Enhancements:
- Skip incrementing failure counter on preemption node died failures (41285)

πŸ”¨ Fixes:
- Restore Tuner and `Result`s properly from moved storage path (40647)

πŸ“– Documentation:
- Remove low-value Tune examples and references to them (41348)
- Clarify when to use `MLflowLoggerCallback` and `setup_mlflow` (37854)

πŸ— Architecture refactoring:
- Delete legacy `TuneClient`/`TuneServer` APIs (41469)
- Delete legacy `Searcher`s (41414)
- Delete legacy persistence utilities (`air.remote_storage`, etc.) (40207)


Ray Serve<a id="ray-serve"></a>

πŸŽ‰ New Features:
- Introduce logging config so that users can set different logging parameters for different applications & deployments; see the example after this list.
- Added a gRPC context object to gRPC deployments so users can set custom status codes and details returned to the client.
- Introduce a runtime environment feature that allows running applications in different containers with different images. This feature is experimental; a new guide can be found in the Serve docs.
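
A hedged sketch of the per-deployment logging config referenced above; the dict schema shown is an assumption and may differ from the final API.

```python
from ray import serve

@serve.deployment(logging_config={"log_level": "DEBUG"})  # assumed schema
def endpoint(request) -> str:
    return "ok"

# serve.run(endpoint.bind())
```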

πŸ’« Enhancements:
- Explicitly handle gRPC proxy task cancellation when the client drops a request, so compute resources are not wasted.
- Enable async `__del__` in deployments to execute custom cleanup steps.
- Make Ray Serve compatible with Pydantic versions <2.0.0 and >=2.5.0.

πŸ”¨ Fixes:
- Fixed gRPC proxy streaming request latency metrics to include the entire lifecycle of the request, including the time to consume the generator.
- Fixed gRPC proxy timeout request status from CANCELLED to DEADLINE_EXCEEDED.
- Fixed Serve shutdown spamming log files with a log line for each event loop; it now logs only once on shutdown.
- Fixed an issue where, if a request was dropped during batch requests, the batch loop would be killed and would not process any future requests.
- Updated replica log filenames to only include POSIX-compliant characters (removed the β€œβ€ character).
- Replicas will now be gracefully shut down after being marked unhealthy due to health check failures instead of being force killed.
  - This behavior can be toggled using the environment variable `RAY_SERVE_FORCE_STOP_UNHEALTHY_REPLICAS=1`, but it is planned to be removed in the near future. If you rely on this behavior, please file an issue on GitHub.


RLlib<a id="rllib"></a>

πŸŽ‰ New Features:
- New API stack (in progress):
  - New `MultiAgentEpisode` class introduced. Basis for the upcoming multi-agent EnvRunner, which will replace RolloutWorker APIs. (40263, 40799)
  - PPO runs with the new `SingleAgentEnvRunner` (without Policy/RolloutWorker APIs). CI learning tests added. (39732, 41074, 41075)
  - PPO reverted to the old API stack by default, for now, pending feature-completion of the new API stack (incl. multi-agent, RNN support, new EnvRunners, etc.). (40706)
- Old API stack:
  - APPO/IMPALA: Enable using 2 separate optimizers for policy and value function (and 2 learning rates) on the old API stack. (40927)
  - Added `on_workers_recreated` callback to Algorithm, which is triggered after workers have failed and been restarted. (40354)

πŸ’« Enhancements:
- Old API stack and `rllib_contrib` cleanups: 40939, 40744, 40789, 40444, 37271

πŸ”¨ Fixes:
- Fixed restoring from a checkpoint created with an older wheel (where `AlgorithmConfig.rl_module_spec` was not yet a property). (41157)
- Fixed SampleBatch slicing crashing when using tf + SEQ_LENS + zero-padding. (40905)
- Other fixes: 39978, 40788, 41168, 41204

πŸ“– Documentation:
- Updated codeblocks in RLlib. (37271)


Ray Core and Ray Clusters<a id="ray-core-and-ray-clusters"></a>

Ray Core<a id="ray-core"></a>

πŸŽ‰ New Features:
- [Streaming generator](https://docs.ray.io/en/master/ray-core/ray-generator.html) is now officially a public API (#41436, 38784). The streaming generator allows writing streaming applications easily on top of Ray via the Python generator API and has been used by Ray Serve and Ray Data for several releases. See the [documentation](https://docs.ray.io/en/master/ray-core/ray-generator.html) for details, and the example after this list.
  - As part of the change, `num_returns="dynamic"` is planned to be deprecated, and its return type has changed from `ObjectRefGenerator` to `DynamicObjectRefGenerator`.
- Add experimental accelerator support for new hardware.
  - Add experimental support for Intel GPU (38553)
  - Add experimental support for Intel Gaudi Accelerators (40561)
  - Add experimental support for Huawei Ascend NPU (41256)
- Add initial support for running MPI-based code on top of Ray (40917, 41349)
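
A minimal sketch of the streaming generator referenced above; each yielded value becomes available to the caller as soon as it is produced, rather than when the task finishes.

```python
import ray

@ray.remote
def stream(n: int):
    for i in range(n):
        yield i  # each yield is shipped to the caller immediately

gen = stream.remote(5)
for ref in gen:          # iterate as results stream in
    print(ray.get(ref))
```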

πŸ’« Enhancements:
- Optimize next/anext performance for streaming generator (41270)
- Make the number of connections and thread number of the object manager client tunable. (41421)
- Add `__ray_call__` default actor method (41534)
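
A short sketch of the new `__ray_call__` default actor method; `Counter` is a stand-in actor.

```python
import ray

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

c = Counter.remote()
# Run an arbitrary function inside the actor process; the function
# receives the actor instance as its first argument.
value = ray.get(c.__ray_call__.remote(lambda self: self.n + 1))
```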

πŸ”¨ Fixes:
- Fix a NullPointerException caused by an empty raylet ID when getting actor info in the Java worker (40560)
- Fix a bug where SIGTERM sent to worker processes was ignored (40210)
- Fix an mmap file leak. (40370)
- Fix a lifetime issue when the Plasma server client releases an object. (40809)
- Upgrade grpc from 1.50.2 to 1.57.1 to include security fixes (39090)
- Fix a bug where two head nodes are shown by `ray list nodes` (40838)
- Fix a crash when the GCS address is not valid. (41253)
- Fix unexpectedly high socket usage in Ray core worker processes. (41121)
- Make `worker_process_setup_hook` work with strings instead of Python functions (41479)


Ray Clusters<a id="ray-clusters"></a>

πŸ’« Enhancements:
- Stability improvements for the vSphere cluster launcher
- Better CLI output for cluster launcher

πŸ”¨ Fixes:
- Fixed `run_init` for TPU command runner

πŸ“– Documentation:
- Added missing steps and simplified YAML in top-level clusters quickstart
- Clarify that job entrypoints run on the head node by default and how to override it


Dashboard<a id="dashboard"></a>

πŸ’« Enhancements:
- Improvements to the Ray Data Dashboard
  - Added Ray Data-specific overview on the jobs page, including a table view with Dataset-level metrics
  - Added operator-level metrics granularity to drill down on Dataset operators
  - Added additional metrics for monitoring iteration over Datasets

Docs<a id="docs"></a>
πŸŽ‰ New Features:
- Updated to Sphinx version 7.1.2. Previously, the docs build used Sphinx 4.3.2. Upgrading to a recent version provides a more modern user experience while fixing many long-standing issues. Let us know how you like the upgrade, or any other docs issues on your mind, on the Ray Slack docs channel.

Thanks

Many thanks to all those who contributed to this release!

justinvyu, zcin, avnishn, jonathan-anyscale, shrekris-anyscale, LeonLuttenberger, c21, JingChen23, liuyang-my, ahmed-mahran, huchen2021, raulchen, scottjlee, jiwq, z4y1b2, jjyao, JoshTanke, marxav, ArturNiederfahrenhorst, SongGuyang, jerome-habana, rickyyx, rynewang, batuhanfaik, can-anyscale, allenwang28, wingkitlee0, angelinalg, peytondmurray, rueian, KamenShah, stephanie-wang, bryanjuho, sihanwang41, ericl, sofianhnaide, RaffaGonzo, xychu, simonsays1980, pcmoritz, aslonnie, WeichenXu123, architkulkarni, matthew29tang, larrylian, iycheng, hongchaodeng, rudeigerc, rkooo567, robertnishihara, alanwguo, emmyscode, kevin85421, alexeykudinkin, michaelhly, ijrsvt, ArkAung, mattip, harborn, sven1977, liuxsh9, woshiyyya, hahahannes, GeneDer, vitsai, Zandew, evalaiyc98, edoakes, matthewdeng, bveeramani

2.8.1

Release Highlights
The Ray 2.8.1 patch release contains fixes for the Ray Dashboard.

Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023

Ray Dashboard
πŸ”¨ Fixes:

- [core][state][log] Cherry pick changes to prevent state API from reading files outside the Ray log directory (41520)
- [Dashboard] Migrate Logs page to use state api. (41474) (41522)
