(changelog)=

# Changelog

## Release 2.8.0 (2026-04-01)

### Breaking Changes:

- Removed support for Python 3.9, please upgrade to Python >= 3.10
- Upgraded to Stable-Baselines3 >= 2.8.0
- Set `strict=True` for every call to `zip(...)`

### New Features:

- Added official support for Python 3.13

### Bug Fixes:

- Fixed `MaskablePPO` and `RecurrentPPO` inaccurate `n_updates` counting when `target_kl` early exits the training loop
- Fixed `RecurrentPPO` and `MaskablePPO` `forward` and `predict` not reshaping the action before clipping it (@immortal-boy)
- Do not call `forward()` method directly in `RecurrentPPO` (@immortal-boy)
- Fixed `MaskableCategorical.apply_masking()` crashing with `ValueError: Simplex` when cached `probs` deviate from sum=1 in float32 with large action spaces (torch 2.9+) (@kirann-05)

### Deprecations:

### Others:

### Documentation:

- Switched to markdown documentation (using MyST parser)

## Release 2.7.1 (2025-12-05)

:::{warning}
Stable-Baselines3 (SB3) v2.7.1 will be the last one supporting Python 3.9 (end of life in October 2025).
We highly recommend upgrading to Python >= 3.10.
:::

### Breaking Changes:

### New Features:

### Bug Fixes:

- Fix tensorboard log name for `MaskablePPO`

### Deprecations:

### Others:

### Documentation:

## Release 2.7.0 (2025-07-25)

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 2.7.0

### New Features:

- Added support for n-step returns for off-policy algorithms via the `n_steps` parameter

### Bug Fixes:

- Use the `FloatSchedule` and `LinearSchedule` classes instead of lambdas in the ARS, PPO, and QRDQN implementations to improve model portability across different operating systems

### Deprecations:

### Others:

### Documentation:

## Release 2.6.0 (2025-03-24)

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 2.6.0
- Renamed `_dump_logs()` to `dump_logs()`

### New Features:

- Added support for Gymnasium v1.1.0

### Bug Fixes:

- Fixed issues with `SubprocVecEnv` and `MaskablePPO` by using `vec_env.has_attr()` (pickling issues, mask function not present)

## Release 2.5.0 (2025-01-27)

### Breaking Changes:

- Upgraded to PyTorch 2.3.0
- Dropped Python 3.8 support
- Upgraded to Stable-Baselines3 >= 2.5.0

### New Features:

- Added Python 3.12 support
- Added NumPy v2.0 support

## Release 2.4.0 (2024-11-18)

**New algorithm: added CrossQ, Gymnasium v1.0 support**

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 2.4.0

### New Features:

- Added `CrossQ` algorithm, from "Batch Normalization in Deep Reinforcement Learning" paper (@danielpalen)
- Added `BatchRenorm` PyTorch layer used in `CrossQ` (@danielpalen)
- Added support for Gymnasium v1.0

### Bug Fixes:

- Updated QR-DQN optimizer input to only include `quantile_net` parameters (@corentinlger)
- Updated QR-DQN paper link in docs (@corentinlger)
- Fixed a warning with PyTorch 2.4 when loading a `RecurrentPPO` model (`You are using torch.load with weights_only=False`)
- Fixed `target_update_interval` being changed when loading a `QRDQN` model (@jak3122)

### Others:

- Updated PyTorch version on CI to 2.3.1
- Remove unnecessary SDE noise resampling in PPO/TRPO update
- Switched to uv to download packages on GitHub CI

## Release 2.3.0 (2024-03-31)

**New default hyperparameters for QR-DQN**

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 2.3.0
- The default `learning_starts` parameter of `QRDQN` has been changed to be consistent with the other off-policy algorithms

```python
from sb3_contrib import QRDQN

env = "CartPole-v1"  # illustrative choice; any discrete-action Gymnasium env works

# SB3 < 2.3.0 default hyperparameters, 50_000 corresponded to Atari default hyperparameters
# model = QRDQN("MlpPolicy", env, learning_starts=50_000)
# SB3 >= 2.3.0:
model = QRDQN("MlpPolicy", env, learning_starts=100)
```

### New Features:

- Added `rollout_buffer_class` and `rollout_buffer_kwargs` arguments to `MaskablePPO`
- Log success rate `rollout/success_rate` when available for on-policy algorithms

### Others:

- Fixed `train_freq` type annotation for TQC and QRDQN (@Armandpl)
- Fixed `sb3_contrib/common/maskable/*.py` type annotations
- Fixed `sb3_contrib/ppo_mask/ppo_mask.py` type annotations
- Fixed `sb3_contrib/common/vec_env/async_eval.py` type annotations

### Documentation:

- Add some additional notes about `MaskablePPO` (evaluation and multi-process) (@icheered)

## Release 2.2.1 (2023-11-17)

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 2.2.1
- Switched to `ruff` for sorting imports (isort is no longer needed), black and ruff now require a minimum version
- Dropped `x is False` in favor of `not x`, which means that callbacks that wrongly returned None (instead of a boolean) will cause the training to stop (@iwishiwasaneagle)

### New Features:

- Added `set_options` for `AsyncEval`
- Added `rollout_buffer_class` and `rollout_buffer_kwargs` arguments to `TRPO`

### Others:

- Fixed `ActorCriticPolicy.extract_features()` signature by adding an optional `features_extractor` argument
- Update dependencies (accept newer Shimmy/Sphinx version and remove `sphinx_autodoc_typehints`)

## Release 2.1.0 (2023-08-17)

### Breaking Changes:

- Removed Python 3.7 support
- SB3 now requires PyTorch > 1.13
- Upgraded to Stable-Baselines3 >= 2.1.0

### New Features:

- Added Python 3.11 support

### Bug Fixes:

- Fixed `MaskablePPO` ignoring `stats_window_size` argument

## Release 2.0.0 (2023-06-22)

**Gymnasium support**

:::{warning}
Stable-Baselines3 (SB3) v2.0 will be the last one supporting Python 3.7 (end of life in June 2023).
We highly recommend upgrading to Python >= 3.8.
:::

### Breaking Changes:

- Switched to Gymnasium as primary backend, Gym 0.21 and 0.26 are still supported via the `shimmy` package (@carlosluis, @arjun-kg, @tlpss)
- Upgraded to Stable-Baselines3 >= 2.0.0

### Bug Fixes:

- Fixed QRDQN update interval for multi envs

### Others:

- Fixed `sb3_contrib/tqc/*.py` type hints
- Fixed `sb3_contrib/trpo/*.py` type hints
- Fixed `sb3_contrib/common/envs/invalid_actions_env.py` type hints

### Documentation:

- Update documentation, switch from Gym to Gymnasium

## Release 1.8.0 (2023-04-07)

:::{warning}
Stable-Baselines3 (SB3) v1.8.0 will be the last one to use Gym as a backend.
Starting with v2.0.0, Gymnasium will be the default backend (though SB3 will have compatibility layers for Gym envs).
You can find a migration guide here: .
If you want to try the SB3 v2.0 alpha version, you can take a look at [PR #1327](https://github.com/DLR-RM/stable-baselines3/pull/1327).
:::

### Breaking Changes:

- Removed shared layers in `mlp_extractor` (@AlexPasqua)
- Upgraded to Stable-Baselines3 >= 1.8.0

### New Features:

- Added `stats_window_size` argument to control smoothing in rollout logging (@jonasreiher)

### Others:

- Moved to pyproject.toml
- Added GitHub issue forms
- Fixed Atari ROMs download in CI
- Fixed `sb3_contrib/qrdqn/*.py` type hints
- Switched from `flake8` to `ruff`

### Documentation:

- Added warning about potential crashes caused by `check_env` in the `MaskablePPO` docs (@AlexPasqua)

## Release 1.7.0 (2023-01-10)

:::{warning}
Shared layers in MLP policy (`mlp_extractor`) are now deprecated for PPO, A2C and TRPO.
This feature will be removed in SB3 v1.8.0, after which `net_arch=[64, 64]` will create **separate** networks with the same architecture, to be consistent with the off-policy algorithms.
:::

### Breaking Changes:

- Removed deprecated `create_eval_env`, `eval_env`, `eval_log_path`, `n_eval_episodes` and `eval_freq` parameters, please use an `EvalCallback` instead
- Removed deprecated `sde_net_arch` parameter
- Upgraded to Stable-Baselines3 >= 1.7.0

### New Features:

- Introduced mypy type checking
- Added support for Python 3.10
- Added `with_bias` parameter to `ARSPolicy`
- Added option to have non-shared features extractor between actor and critic in on-policy algorithms (@AlexPasqua)
- Features extractors now properly support unnormalized image-like observations (3D tensor) when passing `normalize_images=False`

### Bug Fixes:

- Fixed a bug in `RecurrentPPO` where the LSTM states were incorrectly reshaped for `n_lstm_layers > 1` (thanks @kolbytn)
- Fixed `RuntimeError: rnn: hx is not contiguous` while predicting terminal values for `RecurrentPPO` when `n_lstm_layers > 1`

### Deprecations:

- You should now explicitly pass a `features_extractor` parameter when calling `extract_features()`
- Deprecated shared layers in `MlpExtractor` (@AlexPasqua)

### Others:

- Fixed flake8 config
- Fixed `sb3_contrib/common/utils.py` type hint
- Fixed `sb3_contrib/common/recurrent/type_aliases.py` type hint
- Fixed `sb3_contrib/ars/policies.py` type hint
- Exposed modules in `__init__.py` with `__all__` attribute (@ZikangXiong)
- Removed ignores on Flake8 F401 (@ZikangXiong)
- Upgraded GitHub CI/setup-python to v4 and checkout to v3
- Set tensors construction directly on the device
- Standardized the use of `from gym import spaces`

## Release 1.6.2 (2022-10-10)

**Progress bar and upgrade to latest SB3 version**

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 1.6.2

### New Features:

- Added `progress_bar` argument in the `learn()` method, displayed using TQDM and rich packages

### Deprecations:

- Deprecate parameters `eval_env`, `eval_freq` and `create_eval_env`

### Others:

- Fixed the return type of `.load()` methods so that they now use `TypeVar`

## Release 1.6.1 (2022-09-29)

**Bug fix release**

### Breaking Changes:

- Fixed the issue that `predict` does not always return action as `np.ndarray` (@qgallouedec)
- Upgraded to Stable-Baselines3 >= 1.6.1

### New Features:

### Bug Fixes:

- Fixed the issue of wrongly passing policy arguments when using `CnnLstmPolicy` or `MultiInputLstmPolicy` with `RecurrentPPO` (@mlodel)
- Fixed division by zero error when computing FPS when only a small amount of time has elapsed on operating systems with low-precision timers
- Fixed calling child callbacks in `MaskableEvalCallback` (@CppMaster)
- Fixed missing verbose parameter passing in the `MaskableEvalCallback` constructor (@burakdmb)
- Fixed the issue that when updating the target network in QRDQN, TQC, the `running_mean` and `running_var` properties of batch norm layers are not updated (@honglu2875)

### Deprecations:

### Others:

- Changed the default buffer device from `"cpu"` to `"auto"`

## Release 1.6.0 (2022-07-11)

**Add RecurrentPPO (aka PPO LSTM)**

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 1.6.0
- Changed the way policy "aliases" are handled ("MlpPolicy", "CnnPolicy", ...), removing the former `register_policy` helper, `policy_base` parameter and using `policy_aliases` static attributes instead (@Gregwar)
- Renamed `rollout/exploration rate` key to `rollout/exploration_rate` for QRDQN (to be consistent with SB3 DQN)
- Upgraded to Python 3.7+ syntax using `pyupgrade`
- SB3 now requires PyTorch >= 1.11
- Changed the default network architecture when using `CnnPolicy` or `MultiInputPolicy` with TQC: `share_features_extractor` is now set to False by default and `net_arch=[256, 256]` is used (instead of the previous `net_arch=[]`)

### New Features:

- Added `RecurrentPPO` (aka PPO LSTM)

### Bug Fixes:

- Fixed a bug in `RecurrentPPO` when calculating the masked loss functions (@rnederstigt)
- Fixed a bug in `TRPO` where KL divergence was not implemented for `MultiDiscrete` space

### Deprecations:

## Release 1.5.0 (2022-03-25)

### Breaking Changes:

- Switched minimum Gym version to 0.21.0.
- Upgraded to Stable-Baselines3 >= 1.5.0

### New Features:

- Allow PPO to turn off advantage normalization (see [PR #61](https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/pull/61)) (@vwxyzjn)

### Bug Fixes:

- Removed explicit calls to `forward()` method as per PyTorch guidelines

### Deprecations:

### Others:

### Documentation:

## Release 1.4.0 (2022-01-19)

**Add Trust Region Policy Optimization (TRPO) and Augmented Random Search (ARS) algorithms**

### Breaking Changes:

- Dropped Python 3.6 support
- Upgraded to Stable-Baselines3 >= 1.4.0
- `MaskablePPO` was updated to match latest SB3 `PPO` version (timeout handling and new method for the policy object)

### New Features:

- Added `TRPO` (@cyprienc)
- Added experimental support to train off-policy algorithms with multiple envs (note: `HerReplayBuffer` currently not supported)
- Added Augmented Random Search (ARS) (@sgillen)

### Bug Fixes:

### Deprecations:

### Others:

- Improve test coverage for `MaskablePPO`

### Documentation:

## Release 1.3.0 (2021-10-23)

**Add Invalid action masking for PPO**

:::{warning}
This version will be the last one supporting Python 3.6 (end of life in Dec 2021).
We highly recommend upgrading to Python >= 3.7.
:::

### Breaking Changes:

- Removed `sde_net_arch`
- Upgraded to Stable-Baselines3 >= 1.3.0

### New Features:

- Added `MaskablePPO` algorithm (@kronion)
- `MaskablePPO` Dictionary Observation support (@glmcdona)

### Bug Fixes:

### Deprecations:

### Others:

### Documentation:

## Release 1.2.0 (2021-09-08)

**Train/Eval mode support**

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 1.2.0

### Bug Fixes:

- QR-DQN and TQC updated so that their policies are switched between train and eval mode at the correct time (@ayeright)

### Deprecations:

### Others:

- Fixed type annotation
- Added Python 3.9 to CI

### Documentation:

## Release 1.1.0 (2021-07-01)

**Dictionary observation support and timeout handling**

### Breaking Changes:

- Added support for Dictionary observation spaces (cf. SB3 doc)
- Upgraded to Stable-Baselines3 >= 1.1.0
- Added proper handling of timeouts for off-policy algorithms (cf. SB3 doc)
- Updated usage of logger (cf. SB3 doc)

### Bug Fixes:

- Removed unused code in `TQC`

### Deprecations:

### Others:

- SB3 docs and tests dependencies are no longer required for installing SB3 contrib

### Documentation:

- Updated QR-DQN docs checkmark typo (@minhlong94)

## Release 1.0 (2021-03-17)

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 1.0

### Bug Fixes:

- Fixed a bug with `QR-DQN` predict method when using `deterministic=False` with image space

## Pre-Release 0.11.1 (2021-02-27)

### Bug Fixes:

- Upgraded to Stable-Baselines3 >= 0.11.1

## Pre-Release 0.11.0 (2021-02-27)

### Breaking Changes:

- Upgraded to Stable-Baselines3 >= 0.11.0

### New Features:

- Added `TimeFeatureWrapper` to the wrappers
- Added `QR-DQN` algorithm ([@ku2482])

### Bug Fixes:

- Fixed bug in `TQC` when saving/loading the policy only with non-default number of quantiles
- Fixed bug in `QR-DQN` when calculating the target quantiles (@ku2482, @guyk1971)

### Deprecations:

### Others:

- Updated `TQC` to match new SB3 version
- Updated SB3 min version
- Moved `quantile_huber_loss` to `common/utils.py` (@ku2482)

### Documentation:

## Pre-Release 0.10.0 (2020-10-28)

**Truncated Quantiles Critic (TQC)**

### Breaking Changes:

### New Features:

- Added `TQC` algorithm (@araffin)

### Bug Fixes:

- Fixed features extractor issue (`TQC` with `CnnPolicy`)

### Deprecations:

### Others:

### Documentation:

- Added initial documentation
- Added contribution guide and related PR templates

## Maintainers

Stable-Baselines3 is currently maintained by [Antonin Raffin] (aka [@araffin]), [Ashley Hill] (aka @hill-a),
[Maximilian Ernestus] (aka @ernestum), [Adam Gleave] ([@AdamGleave]) and [Anssi Kanervisto] (aka [@Miffyli]).

## Contributors:

@ku2482 @guyk1971 @minhlong94 @ayeright @kronion @glmcdona @cyprienc @sgillen @Gregwar @rnederstigt @qgallouedec
@mlodel @CppMaster @burakdmb @honglu2875 @ZikangXiong @AlexPasqua @jonasreiher @icheered @Armandpl @danielpalen
@corentinlger @immortal-boy

[@adamgleave]: https://github.com/adamgleave
[@araffin]: https://github.com/araffin
[@ku2482]: https://github.com/ku2482
[@miffyli]: https://github.com/Miffyli
[adam gleave]: https://gleave.me/
[anssi kanervisto]: https://github.com/Miffyli
[antonin raffin]: https://araffin.github.io/
[ashley hill]: https://github.com/hill-a
[maximilian ernestus]: https://github.com/ernestum