
Goal Misgeneralisation Models

All models were trained with blocks environment generation and a world size of 13x13, using jaxgmg; see the jaxgmg repository, branch david, for more details.

#!/bin/bash
# Sweep over prob-shift values (alpha); each run trains for 200M env steps
# on 13x13 blocks layouts and keeps a checkpoint every 64 cycles.
for alpha in 1e-0 1e-1 1e-2 1e-3 1e-4 3.3e-1 3.3e-2 3.3e-3 3.3e-4; do
    python -m jaxgmg train corner --num-total-env-steps 200_000_000 --keep-all-checkpoints --num-cycles-per-checkpoint 64 --wandb-project jaxgmg2 --wandb-name alpha:${alpha}-steps:200M-theta:0 --prob-shift ${alpha} --env-size 13 --env-layout blocks
done

The theta parameter specifies the reward function: reward = proxy_goal * theta + true_goal * (1 - theta)
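As a minimal sketch of this interpolation (the function name is hypothetical; this is not jaxgmg's implementation):

# Illustrative only: interpolate between proxy and true reward signals.
def interpolated_reward(proxy_goal: float, true_goal: float, theta: float) -> float:
    return proxy_goal * theta + true_goal * (1 - theta)

# theta=0 recovers the true reward (cheese only); theta=1 would give the proxy reward only.
assert interpolated_reward(proxy_goal=1.0, true_goal=0.0, theta=0.0) == 0.0
assert interpolated_reward(proxy_goal=0.0, true_goal=1.0, theta=0.0) == 1.0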

All these models were trained with theta=0, i.e. rewarded only for the true goal of getting the cheese.

The alpha parameter (prob-shift) controls the fraction of distinguishing vs. undistinguishing environments. For example, alpha=1 means the agent always sees distinguishing environments (ones where the cheese is not in the corner), and alpha=0 means the agent always sees undistinguishing environments (ones where the cheese is always in the corner).
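For intuition, a rough sketch of the prob-shift mixture (the function name and placement logic are illustrative assumptions, not jaxgmg's actual level generator):

# Illustrative only: with probability alpha, place the cheese away from the
# corner (distinguishing level); otherwise put it in the corner
# (undistinguishing level, where proxy and true goal coincide).
import numpy as np

def sample_cheese_position(alpha: float, size: int, rng: np.random.Generator):
    corner = (1, 1)  # top-left interior cell of the walled grid
    if rng.random() < alpha:
        # distinguishing: any interior cell except the corner
        while True:
            pos = (int(rng.integers(1, size - 1)), int(rng.integers(1, size - 1)))
            if pos != corner:
                return pos
    return corner

rng = np.random.default_rng(0)
print(sample_cheese_position(alpha=0.1, size=13, rng=rng))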

Checkpoints are taken every 64 cycles (roughly 512 env steps per cycle, so roughly 32k env steps per checkpoint).

E.g. the path run-alpha_1e1-steps_200M/files/checkpoints/128 corresponds to training with alpha=1e-1, after 128 cycles (the second checkpoint).
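A small hedged helper for locating checkpoints, assuming the path format shown in the example above and the approximate per-cycle step count quoted earlier:

# Assumed path scheme and step counts, inferred from this README; treat the
# env-step figures as rough approximations.
CYCLES_PER_CHECKPOINT = 64
ENV_STEPS_PER_CYCLE = 512  # approximate

def checkpoint_path(alpha_tag: str, cycles: int) -> str:
    # e.g. checkpoint_path("1e1", 128) -> "run-alpha_1e1-steps_200M/files/checkpoints/128"
    return f"run-alpha_{alpha_tag}-steps_200M/files/checkpoints/{cycles}"

def approx_env_steps(cycles: int) -> int:
    return cycles * ENV_STEPS_PER_CYCLE

print(checkpoint_path("1e1", 2 * CYCLES_PER_CHECKPOINT))  # the second checkpoint
print(approx_env_steps(128))                              # ~65k env steps into training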