TQC¶
Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics (TQC). Truncated Quantile Critics (TQC) builds on SAC, TD3 and QR-DQN, using quantile regression to predict a distribution for the value function (instead of a mean value). It truncates the quantiles predicted by the different critic networks (a bit like what is done in TD3).
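To make the truncation step concrete, here is a minimal NumPy sketch. It assumes the quantiles of all critics are pooled, sorted, and the largest top_quantiles_to_drop_per_net * n_critics values are discarded. The function name truncate_quantiles and the toy numbers are illustrative only and not part of sb3_contrib; the actual algorithm applies this to the target critics inside the training loop.

```python
import numpy as np


def truncate_quantiles(critic_quantiles: np.ndarray, top_quantiles_to_drop_per_net: int) -> np.ndarray:
    """Pool the quantiles of all critics, sort them, and drop the largest ones.

    critic_quantiles has shape (n_critics, n_quantiles).
    (Hypothetical helper for illustration, not a library function.)
    """
    n_critics, n_quantiles = critic_quantiles.shape
    n_total = n_critics * n_quantiles
    # Number of quantiles kept after truncation
    n_keep = n_total - top_quantiles_to_drop_per_net * n_critics
    if n_keep <= 0:
        raise ValueError("Truncation would remove every quantile")
    # Sort the pooled quantiles in ascending order and keep the smallest n_keep
    pooled = np.sort(critic_quantiles.reshape(-1))
    return pooled[:n_keep]


# Toy example: 2 critics with 5 quantiles each, dropping 1 quantile per critic.
quantiles = np.array([
    [0.0, 1.0, 2.0, 3.0, 10.0],  # this critic has one overly optimistic quantile
    [0.5, 1.5, 2.5, 3.5, 4.5],
])
kept = truncate_quantiles(quantiles, top_quantiles_to_drop_per_net=1)
print(quantiles.mean(), kept.mean())  # the truncated mean is lower
```

Dropping the top of the pooled distribution lowers the value estimate, which is how the truncation counteracts overestimation. With n_critics=2, n_quantiles=25 and top_quantiles_to_drop_per_net=2, this rule keeps 46 of the 50 pooled quantiles.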
Available Policies
Notes¶
Original paper: https://arxiv.org/abs/2005.04269
Original Implementation: https://github.com/bayesgroup/tqc_pytorch
Can I use?¶
Recurrent policies: ❌
Multi processing: ✔️
Gym spaces:
Space          Action   Observation
Discrete       ❌        ✔️
Box            ✔️        ✔️
MultiDiscrete  ❌        ✔️
MultiBinary    ❌        ✔️
Dict           ❌        ✔️
Example¶
import gym
import numpy as np

from sb3_contrib import TQC

env = gym.make("Pendulum-v1")

# Two critics, each predicting 25 quantiles of the return distribution
policy_kwargs = dict(n_critics=2, n_quantiles=25)
model = TQC("MlpPolicy", env, top_quantiles_to_drop_per_net=2, verbose=1, policy_kwargs=policy_kwargs)
model.learn(total_timesteps=10000, log_interval=4)
model.save("tqc_pendulum")

del model  # remove to demonstrate saving and loading

model = TQC.load("tqc_pendulum")

obs = env.reset()
while True:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
Results¶
Results on the PyBullet benchmark (1M steps) and on BipedalWalkerHardcore-v3 (2M steps) using 3 seeds. The complete learning curves are available in the associated PR.
The main difference with SAC is on harder environments (BipedalWalkerHardcore, Walker2D).
Note
Hyperparameters from the gSDE paper were used (as they are tuned for SAC on PyBullet envs), including using gSDE for exploration instead of unstructured Gaussian noise, but this should not affect results in simulation.
Note
We are using the open source PyBullet environments and not the MuJoCo simulator (as done in the original paper). You can find a complete benchmark on PyBullet envs in the gSDE paper if you want to compare TQC results to those of A2C/PPO/SAC/TD3.
Environments            SAC            TQC
                        gSDE           gSDE
HalfCheetah             2984 +/- 202   3041 +/- 157
Ant                     3102 +/- 37    3700 +/- 37
Hopper                  2262 +/- 1     2401 +/- 62*
Walker2D                2136 +/- 67    2535 +/- 94
BipedalWalkerHardcore   13 +/- 18      228 +/- 18

* with tuned hyperparameter top_quantiles_to_drop_per_net taken from the original paper
How to replicate the results?¶
Clone RL-Zoo and checkout the branch feat/tqc:
git clone https://github.com/DLR-RM/rl-baselines3-zoo
cd rl-baselines3-zoo/
git checkout feat/tqc
Run the benchmark (replace $ENV_ID by the envs mentioned above):
python train.py --algo tqc --env $ENV_ID --eval-episodes 10 --eval-freq 10000
Plot the results:
python scripts/all_plots.py -a tqc -e HalfCheetah Ant Hopper Walker2D BipedalWalkerHardcore -f logs/ -o logs/tqc_results
python scripts/plot_from_file.py -i logs/tqc_results.pkl -latex -l TQC
Comments¶
This implementation is based on the SB3 SAC implementation and uses the code from the original TQC implementation for the quantile Huber loss.
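For readers unfamiliar with the quantile Huber loss, the following is a minimal NumPy sketch for a single transition. It is not the library's code: the function name quantile_huber_loss, the kappa parameter (Huber threshold, assumed to default to 1), and the use of quantile midpoints as fractions are assumptions made for illustration.

```python
import numpy as np


def quantile_huber_loss(current_quantiles, target_quantiles, kappa=1.0):
    """Quantile Huber loss between predicted quantiles and target samples.

    current_quantiles: shape (n_quantiles,), predictions of one critic.
    target_quantiles: shape (n_targets,), e.g. truncated target quantiles.
    (Illustrative sketch, not the sb3_contrib implementation.)
    """
    current = np.asarray(current_quantiles, dtype=float)
    target = np.asarray(target_quantiles, dtype=float)
    n_quantiles = current.shape[0]
    # Quantile fractions at the midpoints: (i + 0.5) / n_quantiles
    tau = (np.arange(n_quantiles) + 0.5) / n_quantiles
    # Pairwise differences, shape (n_quantiles, n_targets)
    delta = target[None, :] - current[:, None]
    abs_delta = np.abs(delta)
    # Huber loss: quadratic near zero, linear beyond kappa
    huber = np.where(abs_delta <= kappa, 0.5 * delta**2, kappa * (abs_delta - 0.5 * kappa))
    # Asymmetric weight: over- and under-estimation are penalized differently per quantile
    weight = np.abs(tau[:, None] - (delta < 0).astype(float))
    return float(np.mean(weight * huber))


# Two predicted quantiles (fractions 0.25 and 0.75) against a single target value
loss = quantile_huber_loss([0.0, 0.0], [1.0])
print(loss)
```

The asymmetric weight is what turns an ordinary Huber regression into quantile regression: the quantile with fraction 0.75 is penalized three times more than the one with fraction 0.25 for sitting below the target.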