Description
Describe the bug
Hi,
I noticed a difference in the behavior of the NormalizeReward wrapper depending on whether it is applied to a single Env or to a SyncVectorEnv.
When applied to a SyncVectorEnv, the reward from the autoreset step appears to be included in the normalization statistics. I don't think this should happen. This behavior seems to have started in version 1.0.0, with the introduction of the new autoreset mechanism for vectorized environments.
In contrast, when NormalizeReward is applied directly to an Env, the reset-step reward is not included in the normalization statistics.
This leads to two different results with the code below, even though I would expect them to be the same. Am I right?
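For illustration, the effect of counting one extra (autoreset) step can be reproduced with a toy running-statistics update. This is only a sketch: it assumes the wrapper maintains Welford-style parallel mean/variance statistics (as Gymnasium's RunningMeanStd does), and the reward values are made up.

```python
import numpy as np

class RunningMeanStd:
    """Minimal Welford-style running mean/variance, seeded with a tiny count."""
    def __init__(self, epsilon=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, x):
        batch_mean, batch_var, batch_count = np.mean(x), np.var(x), len(x)
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        m2 = m_a + m_b + delta**2 * self.count * batch_count / tot
        self.mean, self.var, self.count = new_mean, m2 / tot, tot

rewards = [1.0, 1.2, 0.9]  # made-up per-step rewards before an episode ends

# Per-env wrapper: the autoreset step's reward is skipped.
rms_env = RunningMeanStd()
for r in rewards:
    rms_env.update([r])

# Vector wrapper: the autoreset step's 0.0 reward is also counted.
rms_vec = RunningMeanStd()
for r in rewards + [0.0]:
    rms_vec.update([r])

print(rms_vec.count - rms_env.count)  # one extra counted step
print(rms_env.mean, rms_vec.mean)     # the extra 0.0 pulls the mean down
```

This mirrors the count discrepancy in the output below (48 vs. 50 after 50 steps, i.e., two autoreset steps either excluded or included).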
Code example
import gymnasium as gym

# SyncVectorEnv -> NormalizeReward -> Env
def make_env(env_id: str, seed: int):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.NormalizeReward(env, gamma=0.99, epsilon=1e-8)
        env.observation_space.seed(seed)
        env.action_space.seed(seed)
        return env

    return thunk

envs = gym.vector.SyncVectorEnv([make_env("Hopper-v5", 123)])
_ = envs.reset(seed=123)
for _ in range(50):
    observation, reward, terminated, truncated, info = envs.step(envs.action_space.sample())

print("SyncVectorEnv -> NormalizeReward -> Env")
print("Return RMS count:", envs.envs[0].return_rms.count)
print("Return RMS mean:", envs.envs[0].return_rms.mean)
print("Return RMS var:", envs.envs[0].return_rms.var)
envs.close()
# NormalizeReward -> SyncVectorEnv -> Env
def make_env(env_id: str, seed: int):
    def thunk():
        env = gym.make(env_id)
        env.observation_space.seed(seed)
        env.action_space.seed(seed)
        return env

    return thunk

envs = gym.vector.SyncVectorEnv([make_env("Hopper-v5", 123)])
envs = gym.wrappers.vector.NormalizeReward(envs, gamma=0.99, epsilon=1e-8)
_ = envs.reset(seed=123)
for _ in range(50):
    observation, reward, terminated, truncated, info = envs.step(envs.action_space.sample())

print("NormalizeReward -> SyncVectorEnv -> Env")
print("Return RMS count:", envs.return_rms.count)
print("Return RMS mean:", envs.return_rms.mean)
print("Return RMS var:", envs.return_rms.var)
envs.close()
Output:
SyncVectorEnv -> NormalizeReward -> Env
Return RMS count: 48.0001
Return RMS mean: 6.056133384415303
Return RMS var: 14.202972971858856
NormalizeReward -> SyncVectorEnv -> Env
Return RMS count: 50.0001
Return RMS mean: 5.781981257241382
Return RMS var: 15.481151179083554
System info
Gymnasium v1.2.0 installed with pip (Python 3.12.3) on Ubuntu 24.04 WSL2 (Windows 11).
Additional context
No response
Checklist
- I have checked that there is no similar issue in the repo