Stable-baselines

Latest version: v2.10.2


2.10.1

Breaking Changes:

- The ``render()`` method of ``VecEnvs`` now accepts only one argument: ``mode`` (see the sketch below)
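
A minimal sketch (not from the changelog) of the updated call, assuming a CartPole environment wrapped in a ``DummyVecEnv``; the environment choice is illustrative:

```python
import gym
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
env.reset()
# Only the rendering mode is passed now
env.render(mode="human")
env.close()
```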

New Features:

- Added momentum parameter to A2C for the embedded RMSPropOptimizer (kantneel)
- ActionNoise is now an abstract base class and implements ``__call__``; ``NormalActionNoise`` and ``OrnsteinUhlenbeckActionNoise`` now have return types (PartiallyTyped); see the sketch after this list
- HER now passes info dictionary to compute_reward, allowing for the computation of rewards that are independent of the goal (tirafesi)
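
A minimal sketch of the callable noise classes mentioned above; the action dimension and sigma values are illustrative, not taken from the changelog:

```python
import numpy as np
from stable_baselines.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise

n_actions = 2  # illustrative action dimension
normal_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
ou_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

# ActionNoise subclasses implement __call__ and return numpy arrays
sample = normal_noise()
```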

Bug Fixes:

- Fixed DDPG sampling empty replay buffer when combined with HER (tirafesi)
- Fixed a bug in ``HindsightExperienceReplayWrapper``, where the openai-gym signature for ``compute_reward`` was not matched correctly (johannes-dornheim)
- Fixed SAC/TD3 checking time to update on learn steps instead of total steps (PartiallyTyped)
- Added ``**kwarg`` pass through for ``reset`` method in ``atari_wrappers.FrameStack`` (PartiallyTyped)
- Fix consistency in ``setup_model()`` for SAC, ``target_entropy`` now uses ``self.action_space`` instead of ``self.env.action_space`` (PartiallyTyped)
- Fix reward threshold in ``test_identity.py``
- Partially fix tensorboard indexing for PPO2 (enderdead)
- Fixed potential bug in ``DummyVecEnv`` where ``copy()`` was used instead of ``deepcopy()``
- Fixed a bug in ``GAIL`` where the dataloader was not available after saving, causing an error when using ``CheckpointCallback``
- Fixed a bug in ``SAC`` where any convolutional layers were not included in the target network parameters.
- Fixed ``render()`` method for ``VecEnvs``
- Fixed ``seed()`` method for ``SubprocVecEnv``
- Fixed a bug where ``callback.locals`` did not have the correct values (PartiallyTyped)
- Fixed a bug in the ``close()`` method of ``SubprocVecEnv``, causing wrappers further down in the wrapper stack to not be closed. (NeoExtended)
- Fixed a bug in the ``generate_expert_traj()`` method in ``record_expert.py`` when using a non-image vectorized environment (jbarsce)
- Fixed a bug in CloudPickleWrapper's (used by VecEnvs) ``__setstate___`` where loading was incorrectly using ``pickle.loads`` (shwang).
- Fixed a bug in ``SAC`` and ``TD3`` where the logged timesteps were not correct (YangRui2015)
- Fixed a bug where the environment was reset twice when using ``evaluate_policy``

Others:

- Added ``version.txt`` to manage version number in an easier way
- Added ``.readthedocs.yml`` to install requirements with read the docs
- Added a test for seeding ``SubprocVecEnv`` and rendering

Documentation:

- Fix typos (caburu)
- Fix typos in PPO2 (kvenkman)
- Removed ``stable_baselines/deepq/experiments/custom_cartpole.py`` (aakash94)
- Added Google's motion imitation project
- Added documentation page for monitor
- Fixed typos and update ``VecNormalize`` example to show normalization at test-time
- Fixed ``train_mountaincar`` description
- Added imitation baselines project
- Updated install instructions
- Added Slime Volleyball project (hardmaru)
- Added a table of the variables accessible from the ``on_step`` function of the callbacks for each algorithm (PartiallyTyped)
- Fix typo in README.md (ColinLeongUDRI)

2.10.0

Breaking Changes


- ``evaluate_policy`` now returns the standard deviation of the reward per episode as its second return value (instead of ``n_steps``); see the sketch at the end of this list
- ``evaluate_policy`` now returns a list of the episode lengths as its second return value when ``return_episode_rewards`` is set to ``True`` (instead of ``n_steps``)
- Callbacks are now called after each ``env.step()`` for consistency (they used to be called every ``n_steps`` in algorithms like ``A2C`` or ``PPO2``)
- Removed unused code in ``common/a2c/utils.py`` (``calc_entropy_softmax``, ``make_path``)
- **Refactoring, including removed files and moving functions.**

- Algorithms no longer import from each other, and ``common`` does not import from algorithms.
- ``a2c/utils.py`` removed and split into other files:

- common/tf_util.py: ``sample``, ``calc_entropy``, ``mse``, ``avg_norm``, ``total_episode_reward_logger``,
``q_explained_variance``, ``gradient_add``, ``avg_norm``, ``check_shape``,
``seq_to_batch``, ``batch_to_seq``.
- common/tf_layers.py: ``conv``, ``linear``, ``lstm``, ``_ln``, ``lnlstm``, ``conv_to_fc``, ``ortho_init``.
- a2c/a2c.py: ``discount_with_dones``.
- acer/acer_simple.py: ``get_by_index``, ``EpisodeStats``.
- common/schedules.py: ``constant``, ``linear_schedule``, ``middle_drop``, ``double_linear_con``, ``double_middle_drop``,
``SCHEDULES``, ``Scheduler``.

- ``trpo_mpi/utils.py`` functions moved (``traj_segment_generator`` moved to ``common/runners.py``, ``flatten_lists`` to ``common/misc_util.py``).
- ``ppo2/ppo2.py`` functions moved (``safe_mean`` to ``common/math_util.py``, ``constfn`` and ``get_schedule_fn`` to ``common/schedules.py``).
- ``sac/policies.py`` function ``mlp`` moved to ``common/tf_layers.py``.
- ``sac/sac.py`` function ``get_vars`` removed (replaced with ``tf.util.get_trainable_vars``).
- ``deepq/replay_buffer.py`` renamed to ``common/buffers.py``.
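
A minimal sketch of the updated ``evaluate_policy`` return values described at the top of this list; the model, environment and hyperparameters are illustrative:

```python
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy

model = PPO2("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=10000)

# The second return value is now the std of the per-episode reward
mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=10)

# With return_episode_rewards=True, the second value is the list of episode lengths
episode_rewards, episode_lengths = evaluate_policy(
    model, model.get_env(), n_eval_episodes=10, return_episode_rewards=True
)
```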


New Features:

- Parallelized updating and sampling from the replay buffer in DQN. (flodorner)
- Docker build script, `scripts/build_docker.sh`, can push images automatically.
- Added callback collection (see the sketch after this list)
- Added ``unwrap_vec_normalize`` and ``sync_envs_normalization`` in the ``vec_env`` module to synchronize two ``VecNormalize`` environments
- Added a seeding method for vectorized environments. (NeoExtended)
- Added extend method to store batches of experience in ReplayBuffer. (solliet)
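
A minimal sketch of the callback collection mentioned above, assuming it provides ``CheckpointCallback``, ``EvalCallback`` and ``CallbackList``; paths, frequencies and timesteps are illustrative:

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.callbacks import CallbackList, CheckpointCallback, EvalCallback

eval_env = gym.make("CartPole-v1")
# Save a checkpoint every 10000 steps and evaluate the agent every 5000 steps
checkpoint_cb = CheckpointCallback(save_freq=10000, save_path="./logs/")
eval_cb = EvalCallback(eval_env, best_model_save_path="./logs/best/", eval_freq=5000)

model = PPO2("MlpPolicy", "CartPole-v1", verbose=0)
model.learn(total_timesteps=50000, callback=CallbackList([checkpoint_cb, eval_cb]))
```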


Bug Fixes:

- Fixed Docker images via ``scripts/build_docker.sh`` and ``Dockerfile``: GPU image now contains ``tensorflow-gpu``,
and both images have ``stable_baselines`` installed in developer mode at correct directory for mounting.
- Fixed Docker GPU run script, ``scripts/run_docker_gpu.sh``, to work with new NVidia Container Toolkit.
- Repeated calls to ``RLModel.learn()`` now preserve internal counters for some episode
logging statistics that used to be zeroed at the start of every call.
- Fix `DummyVecEnv.render` for ``num_envs > 1``. This used to print a warning and then not render at all. (shwang)
- Fixed a bug in PPO2, ACER, A2C, and ACKTR where repeated calls to ``learn(total_timesteps)`` reset
the environment on every call, potentially biasing samples toward early episode timesteps.
(shwang)
- Fixed by adding lazy property ``ActorCriticRLModel.runner``. Subclasses now use lazily-generated
``self.runner`` instead of reinitializing a new Runner every time ``learn()`` is called.
- Fixed a bug in ``check_env`` where it would fail on high dimensional action spaces
- Fixed ``Monitor.close()`` that was not calling the parent method
- Fixed a bug in ``BaseRLModel`` when seeding vectorized environments. (NeoExtended)
- Fixed ``num_timesteps`` computation to be consistent between algorithms (updated after ``env.step()``). Only ``TRPO`` and ``PPO1`` update it differently (after synchronization) because they rely on MPI
- Fixed bug in ``TRPO`` with NaN standardized advantages (richardwu)
- Fixed partial minibatch computation in ExpertDataset (richardwu)
- Fixed normalization (with ``VecNormalize``) for off-policy algorithms
- Fixed ``sync_envs_normalization`` to sync the reward normalization too
- Bump minimum Gym version (>=0.11)


Others:

- Removed redundant return value from ``a2c.utils::total_episode_reward_logger``. (shwang)
- Cleanup and refactoring in ``common/identity_env.py`` (shwang)
- Added a Makefile to simplify common development tasks (build the doc, type check, run the tests)


Documentation:

- Add dedicated page for callbacks
- Fixed example for creating a GIF (KuKuXia)
- Change Colab links in the README to point to the notebooks repo
- Fix typo in Reinforcement Learning Tips and Tricks page. (mmcenta)

2.9.0

Breaking Changes:

- The `seed` argument has been moved from the `learn()` method to the model constructor in order to have reproducible results (see the sketch after this list)
- `allow_early_resets` of the `Monitor` wrapper now defaults to `True`
- `make_atari_env` now returns a `DummyVecEnv` by default (instead of a `SubprocVecEnv`); this usually improves performance.
- Fix inconsistency of sample type, so that mode/sample function returns tensor of tf.int64 in CategoricalProbabilityDistribution/MultiCategoricalProbabilityDistribution (seheevic)
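
A minimal sketch of passing the seed through the constructor; the algorithm, environment and timesteps are illustrative, and ``n_cpu_tf_sess=1`` is the new single-threaded Tensorflow session option from this release:

```python
from stable_baselines import A2C

# The seed is now set in the constructor; a single Tensorflow thread also helps reproducibility
model = A2C("MlpPolicy", "CartPole-v1", seed=42, n_cpu_tf_sess=1)
model.learn(total_timesteps=10000)
```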

New Features:

- Add `n_cpu_tf_sess` to model constructor to choose the number of threads used by Tensorflow
- Environments are automatically wrapped in a `DummyVecEnv` if needed when passing them to the model constructor
- Added `stable_baselines.common.make_vec_env` helper to simplify VecEnv creation
- Added `stable_baselines.common.evaluation.evaluate_policy` helper to simplify model evaluation
- `VecNormalize` changes:

- Now supports being pickled and unpickled (AdamGleave).
- New methods `.normalize_obs(obs)` and `normalize_reward(rews)` apply normalization to arbitrary observations or rewards without updating statistics (shwang)
- `.get_original_reward()` returns the unnormalized rewards from the most recent timestep
- `.reset()` now collects observation statistics (used to only apply normalization)

- Add parameter `exploration_initial_eps` to DQN. (jdossgollin)
- Add type checking and PEP 561 compliance.
Note: most functions are still not annotated, this will be a gradual process.
- DDPG, TD3 and SAC accept non-symmetric action spaces. (Antymon)
- Add `check_env` util to check if a custom environment follows the gym interface (araffin and justinkterry)
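
A minimal sketch of the `check_env` utility from the last item; it is typically run on a custom environment, a built-in gym environment is used here only to keep the example self-contained:

```python
import gym
from stable_baselines.common.env_checker import check_env

env = gym.make("CartPole-v1")  # stand-in for a custom gym.Env
check_env(env, warn=True)      # warns/raises if the gym interface is not respected
```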

Bug Fixes:

- Fix seeding, so it is now possible to have deterministic results on cpu
- Fix a bug in DDPG where `predict` method with `deterministic=False` would fail
- Fix a bug in TRPO: mean_losses was not initialized, causing the logger to crash when there were no gradients (MarvineGothic)
- Fix a bug in `cmd_util` from API change in recent Gym versions
- Fix a bug in DDPG, TD3 and SAC where warmup and random exploration actions would end up scaled in the replay buffer (Antymon)

Deprecations:

- `nprocs` (ACKTR) and `num_procs` (ACER) are deprecated in favor of `n_cpu_tf_sess` which is now common
to all algorithms
- `VecNormalize`: `load_running_average` and `save_running_average` are deprecated in favour of using pickle.
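
A minimal sketch of the pickle-based workflow replacing `load_running_average` / `save_running_average`, assuming `VecNormalize` excludes the wrapped environment from its pickled state and exposes `set_venv` to re-attach one; paths and environment are illustrative:

```python
import pickle
import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

make_venv = lambda: DummyVecEnv([lambda: gym.make("CartPole-v1")])
env = VecNormalize(make_venv())
# ... train, updating the running averages ...

with open("vec_normalize.pkl", "wb") as f:   # save the normalization statistics
    pickle.dump(env, f)

with open("vec_normalize.pkl", "rb") as f:   # restore them later
    env = pickle.load(f)
env.set_venv(make_venv())                    # re-attach a vectorized environment
```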

Others:

- Add upper bound for Tensorflow version (<2.0.0).
- Refactored tests to remove duplicated code
- Add pull request template
- Replaced redundant code in load_results (jbulow)
- Minor PEP8 fixes in dqn.py (justinkterry)
- Add a message to the assert in `PPO2`
- Update replay buffer docstring
- Fix `VecEnv` docstrings

Documentation:

- Add plotting to the Monitor example (rusu24edward)
- Add Snake Game AI project (pedrohbtp)
- Add note on the supported Tensorflow versions.
- Remove unnecessary steps required for Windows installation.
- Remove `DummyVecEnv` creation when not needed
- Added `make_vec_env` to the examples to simplify VecEnv creation
- Add QuaRL project (srivatsankrishnan)
- Add Pwnagotchi project (evilsocket)
- Fix multiprocessing example (rusu24edward)
- Fix `result_plotter` example
- Add JNRR19 tutorial (by edbeeching, hill-a and araffin)
- Updated notebooks link
- Fix typo in algos.rst, "containes" to "contains" (SyllogismRXS)
- Fix outdated source documentation for load_results
- Add PPO_CPP project (Antymon)
- Add section on C++ portability of Tensorflow models (Antymon)
- Update custom env documentation to reflect new gym API for the `close()` method (justinkterry)
- Update custom env documentation to clarify what step and reset return (justinkterry)
- Add RL tips and tricks for doing RL experiments
- Corrected lots of typos
- Add spell check to documentation if available

2.8.0

Breaking Changes:

- OpenMPI-dependent algorithms (PPO1, TRPO, GAIL, DDPG) are disabled in the default installation of stable_baselines. mpi4py is now installed as an extra. When mpi4py is not available, stable-baselines skips imports of OpenMPI-dependent algorithms. See the installation notes on OpenMPI and [Issue #430](https://github.com/hill-a/stable-baselines/issues/430).
- SubprocVecEnv now defaults to a thread-safe start method, forkserver when available and otherwise spawn. This may require application code to be wrapped in ``if __name__ == '__main__'``. You can restore the previous behavior by explicitly setting ``start_method='fork'``. See [PR #428](https://github.com/hill-a/stable-baselines/pull/428) and the sketch after this list.
- Updated dependencies: tensorflow v1.8.0 is now required
- Removed ``checkpoint_path`` and ``checkpoint_freq`` arguments from DQN that were not used
- Removed ``bench/benchmark.py`` that was not used
- Removed several functions from ``common/tf_util.py`` that were not used
- Removed ``ppo1/run_humanoid.py``
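
A minimal sketch of the new ``SubprocVecEnv`` default start method; the environment and number of workers are illustrative, and the ``__main__`` guard is what forkserver/spawn require:

```python
import gym
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env():
    return gym.make("CartPole-v1")

if __name__ == "__main__":
    # forkserver/spawn need the entry point guarded;
    # pass start_method="fork" to restore the previous behaviour on POSIX systems
    env = SubprocVecEnv([make_env for _ in range(4)])
    obs = env.reset()
    env.close()
```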

New Features:

- **important change** Switch to using zip-archived JSON and Numpy
savez for storing models for better support across library/Python
versions. (Miffyli)
- ACKTR now supports continuous actions
- Add ``double_q`` argument to DQN constructor
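
A minimal sketch combining the two items above, the new ``double_q`` argument and the zip-archive save format; the environment, timesteps and file name are illustrative:

```python
from stable_baselines import DQN

model = DQN("MlpPolicy", "CartPole-v1", double_q=True)
model.learn(total_timesteps=10000)

model.save("dqn_cartpole")           # stored as a zip archive (JSON + numpy savez)
loaded_model = DQN.load("dqn_cartpole")
```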

Bug Fixes:

- Skip automatic imports of OpenMPI-dependent algorithms to avoid an issue where OpenMPI would cause stable-baselines to hang on Ubuntu installs. See the installation notes on OpenMPI and [Issue #430](https://github.com/hill-a/stable-baselines/issues/430).
- Fix a bug when calling logger.configure() with MPI enabled
(keshaviyengar)
- set ``allow_pickle=True`` for ``numpy>=1.17.0`` when loading expert dataset
- Fix a bug when using VecCheckNan with numpy ndarray as state. [Issue #489](https://github.com/hill-a/stable-baselines/issues/489). (ruifeng96150)

Deprecations:

- Models saved with cloudpickle format (stable-baselines<=2.7.0) are now deprecated in favor of zip-archive format for better support across Python/Tensorflow versions. (Miffyli)

Others:

- Implementations of noise classes (AdaptiveParamNoiseSpec, NormalActionNoise, OrnsteinUhlenbeckActionNoise) were moved from ``stable_baselines.ddpg.noise`` to ``stable_baselines.common.noise``. The API remains backward-compatible; for example, ``from stable_baselines.ddpg.noise import NormalActionNoise`` is still okay. (shwang)
- Docker images were updated
- Cleaned up files in common/ folder and in acktr/ folder that were
only used by old ACKTR version (e.g. filter.py)
- Renamed ``acktr_disc.py`` to ``acktr.py``

Documentation:

- Add WaveRL project (jaberkow)
- Add Fenics-DRL project (DonsetPG)
- Fix and rename custom policy names (eavelardev)
- Add documentation on exporting models.
- Update maintainers list (Welcome to Miffyli)

2.7.0

New Features

- added Twin Delayed DDPG (TD3) algorithm, with HER support
- added support for continuous action spaces to `action_probability`, computing the PDF of a Gaussian policy in addition to the existing support for categorical stochastic policies.
- added flag to `action_probability` to return log-probabilities.
- added support for python lists and numpy arrays in `logger.writekvs`. (dwiel)
- the info dict returned by VecEnvs now include a `terminal_observation` key providing access to the last observation in a trajectory. (qxcv)
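
A minimal sketch of reading the new `terminal_observation` key from the last item; the environment and loop are illustrative:

```python
import gym
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
obs = env.reset()
for _ in range(1000):
    obs, rewards, dones, infos = env.step([env.action_space.sample()])
    if dones[0]:
        # VecEnvs reset automatically on done; the true last observation
        # of the finished episode is stored in the info dict
        last_obs = infos[0]["terminal_observation"]
        break
env.close()
```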

Bug Fixes

- fixed a bug in `traj_segment_generator` where `episode_starts` was wrongly recorded, resulting in a wrong calculation of Generalized Advantage Estimation (GAE); this affects TRPO, PPO1 and GAIL (thanks to miguelrass for spotting the bug)
- added missing property `n_batch` in `BasePolicy`.

Others

- renamed some keys in `traj_segment_generator` to be more meaningful
- retrieve unnormalized rewards when using the Monitor wrapper with TRPO, PPO1 and GAIL, to display them in the logs (mean episode reward)
- clean up DDPG code (renamed variables)

Documentation

- doc fix for the hyperparameter tuning command in the rl zoo
- added an example on how to log additional variable with tensorboard and a callback

2.6.0

If loading a DDPG model trained with an older version fails because ``stable_baselines.ddpg.memory`` no longer exists, the removed module can be aliased to the new replay buffer before loading:

    import sys
    import stable_baselines.deepq.replay_buffer

    sys.modules['stable_baselines.ddpg.memory'] = stable_baselines.deepq.replay_buffer
    stable_baselines.deepq.replay_buffer.Memory = stable_baselines.deepq.replay_buffer.ReplayBuffer

We recommend saving the model again afterward, so the fix won't be needed the next time the trained agent is loaded.

New Features:

- **revamped HER implementation**: clean re-implementation from scratch, now supports DQN, SAC and DDPG
- add `action_noise` param for SAC; it helps exploration for problems with deceptive rewards
- The parameter `filter_size` of the function `conv` in A2C utils now supports passing a list/tuple of two integers (height and width), in order to have a non-square kernel matrix. (yutingsz)
- add `random_exploration` parameter for DDPG and SAC; it may be useful when using HER + DDPG/SAC. This hack was present in the original OpenAI Baselines DDPG + HER implementation.
- added `load_parameters` and `get_parameters` to base RL class. With these methods, users are able to load and get parameters to/from existing model, without touching tensorflow. (Miffyli)
- added specific hyperparameter for PPO2 to clip the value function (`cliprange_vf`)
- added `VecCheckNan` wrapper
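
A minimal sketch of the `VecCheckNan` wrapper from the last item, assuming a `raise_exception` option to fail fast on invalid values; the environment is illustrative:

```python
import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecCheckNan

env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
# Raise an error as soon as a NaN or inf appears in actions or observations
env = VecCheckNan(env, raise_exception=True)
```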

Bug Fixes:

- bugfix for `VecEnvWrapper.__getattr__` which enables access to class attributes inherited from parent classes.
- fixed path splitting in `TensorboardWriter._get_latest_run_id()` on Windows machines (PatrickWalter214)
- fixed a bug where initial learning rate is logged instead of its placeholder in `A2C.setup_model` (sc420)
- fixed a bug where number of timesteps is incorrectly updated and logged in `A2C.learn` and `A2C._train_step` (sc420)
- fixed `num_timesteps` (total_timesteps) variable in PPO2 that was wrongly computed.
- fixed a bug in DDPG/DQN/SAC that occurred when the number of samples in the replay buffer was smaller than the batch size (thanks to dwiel for spotting the bug)
- **removed** `a2c.utils.find_trainable_params` please use `common.tf_util.get_trainable_vars` instead. `find_trainable_params` was returning all trainable variables, discarding the scope argument. This bug was causing the model to save duplicated parameters (for DDPG and SAC) but did not affect the performance.

Deprecations:

- **deprecated** `memory_limit` and `memory_policy` in DDPG, please use `buffer_size` instead. (will be removed in v3.x.x)

Others:

- **important change** switched to using dictionaries rather than lists when storing parameters, with tensorflow Variable names being the keys. (Miffyli)
- removed unused dependencies (tqdm, dill, progressbar2, seaborn, glob2, click)
- removed `get_available_gpus` function which hadn't been used anywhere (Pastafarianist)

Documentation:

- added guide for managing `NaN` and `inf`
- updated `vec_env` doc
- misc doc updates
