Fantastic Pretraining Optimizers and Where to Find Them
Paper: 2509.02046
scion300m24B

| Hyperparameter | Value |
|---|---|
| beta1 | 0.98 |
| decay | 0.8 |
| learning_rate | 0.008 |
| lr_schedule | linear |
| max_grad_norm | 2 |
| min_lr_ratio | 0 |
| momentum | 0.95 |
| scion_epsilon | 1e-05 |
| scion_to_signum_lr | 0.1 |
| train_batch_size | 128 |
| warmup | 0 |
| weight_decay | 0.1 |
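
As a rough illustration of how these values might be consumed, the sketch below copies the table into a plain Python dict and derives a learning rate per step. It rests on assumptions that the table itself does not confirm: that `warmup` and `decay` are fractions of the total step count, that `lr_schedule: linear` means the peak rate is held and then decayed linearly over the final `decay` fraction of training, and that `min_lr_ratio` scales the peak rate to give the final rate. The names `SCION_300M_24B` and `lr_at` are hypothetical and are not taken from any training codebase.

```python
# Values copied from the table above; key names mirror the table rows.
SCION_300M_24B = {
    "beta1": 0.98,
    "decay": 0.8,
    "learning_rate": 0.008,
    "lr_schedule": "linear",
    "max_grad_norm": 2,
    "min_lr_ratio": 0,
    "momentum": 0.95,
    "scion_epsilon": 1e-05,
    "scion_to_signum_lr": 0.1,
    "train_batch_size": 128,
    "warmup": 0,
    "weight_decay": 0.1,
}


def lr_at(step, total_steps, cfg=SCION_300M_24B):
    """Hypothetical helper: learning rate at a given step.

    ASSUMPTION: `warmup` and `decay` are fractions of `total_steps`, and the
    schedule is linear warmup, constant peak, then linear decay down to
    `min_lr_ratio * learning_rate`. The table does not state this.
    """
    peak = cfg["learning_rate"]
    floor = peak * cfg["min_lr_ratio"]
    warmup_steps = round(cfg["warmup"] * total_steps)
    decay_steps = round(cfg["decay"] * total_steps)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear ramp from 0 towards the peak (skipped when warmup is 0).
        return peak * (step + 1) / warmup_steps
    if step < decay_start:
        # Constant phase at the peak learning rate.
        return peak
    # Linear interpolation from peak to floor across the decay phase.
    frac = min((step - decay_start) / max(decay_steps, 1), 1.0)
    return peak + (floor - peak) * frac
```

Under these assumptions, a 1000-step run would hold the rate at 0.008 for the first 200 steps and then decay it linearly to 0 by step 1000. If the training code interprets `decay` differently, only `lr_at` needs to change; the dict stays a verbatim copy of the table.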