# DaisyCore (daisy_milli)
## Model Description

DaisyCore is a transformer with 26 layers, 14 attention heads, and a model dimension of 1,792. It uses block-causal sliding-window attention (window size 2,048) with the standard attention implementation.
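As a rough illustration of the attention pattern: with a sliding window of 2,048, query position i attends only to the most recent 2,048 key positions (including itself). This sketch covers only the causal sliding-window part; the block-causal (document-boundary) masking details are not specified here, so they are omitted.

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window_size: int) -> np.ndarray:
    """Boolean mask: True where query i may attend to key j.

    Query i sees keys j with i - window_size < j <= i, i.e. at most
    `window_size` positions, including itself.
    """
    i = np.arange(seq_len)[:, None]  # query positions (column vector)
    j = np.arange(seq_len)[None, :]  # key positions (row vector)
    return (j <= i) & (j > i - window_size)

# Tiny example: 8 positions, window of 4.
mask = sliding_window_causal_mask(seq_len=8, window_size=4)
print(mask.sum(axis=1))  # per-query count of attended positions, capped at 4
```

At the model's actual window size of 2,048, every position past the first 2,047 attends to exactly 2,048 keys.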
## Architecture

| Property | Value |
|---|---|
| Architecture | DaisyCore |
| Layers | 26 |
| Attention Heads | 14 |
| Model Dimension | 1,792 |
| Head Dimension | 128 |
| Sliding Window Size | 2,048 |
| Max Sequence Length | 131,072 |
| Vocabulary Size | 49,152 |
| Attention Implementation | standard |
| Value Embeddings | True |
| Tied Embeddings | False |
| Skip Mix Mode | linear |
| Tokenizer | jonathanmiddleton/daisy |
| Dtype | bfloat16 |
| Parameters (total) | 2,323,120,245 |
| Parameters (non-embedding) | 1,001,914,485 |
| Parameters (embedding) | 1,321,205,760 |
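The parameter counts above are internally consistent, as a quick arithmetic check shows. Note that the factor-of-15 decomposition of the embedding parameters is an inference from the numbers (untied input and output embeddings plus additional value-embedding tables), not something the card states explicitly.

```python
vocab_size, model_dim = 49_152, 1_792

total = 2_323_120_245
non_embedding = 1_001_914_485
embedding = 1_321_205_760

# Total is exactly the sum of the two reported subtotals.
assert total == non_embedding + embedding

# One token-embedding table holds vocab_size * model_dim parameters.
table = vocab_size * model_dim   # 88,080,384
# The embedding total is exactly 15 such tables; with untied input/output
# embeddings (2 tables), the remaining 13 would be value-embedding tables
# (inferred, not stated in the card).
assert embedding == 15 * table
```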
## Training Progress

| Metric | Value |
|---|---|
| Checkpoint Step | 2,750 |
| Tokens Processed | 11.53B (11,534,336,000) |
| Target Tokens | 13.94B (13,941,866,496) |
| Progress | 82.7% |
| Best Validation Loss | 1.58289 |
| Evaluations Performed | 55 |
| HellaSwag (acc_norm) | 60.95% |
| MMLU (acc) | 33.43% |
| Saved | 2026-03-09 18:23 UTC |
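The progress figures line up with the configuration reported below; a quick check, assuming one evaluation per `val_loss_every_tokens` tokens (an assumption based on the counts, not stated explicitly):

```python
tokens = 11_534_336_000   # Tokens Processed
target = 13_941_866_496   # Target Tokens
step = 2_750              # Checkpoint Step
val_every = 209_715_200   # val_loss_every_tokens (200 MiB of tokens)

progress = tokens / target
print(f"{progress:.1%}")              # 82.7%

tokens_per_step = tokens // step
assert tokens_per_step * step == tokens
assert tokens_per_step == 2**22       # 4,194,304 tokens per optimizer step

evals = tokens // val_every
print(evals)                          # 55 evaluations, matching the table
```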
## Training Configuration

### Optimizers

| Optimizer | Parameter Group | Learning Rate |
|---|---|---|
| AdamW | head_params | 0.003216 |
| AdamW | embed_params | 0.1865 |
| AdamW | scalar_params | 0.02099 |
| Muon | hidden_matrix_params | 0.025 |
### Schedule & Regularization

| Parameter | Value |
|---|---|
| LR Scale | 1.0 |
| LR Schedule | n_phase_linear |
| LR Schedule · begin_after_fraction | 0.0 |
| LR Schedule · cooldown_fraction | 0.0 |
| LR Schedule · floor | 0.0 |
| LR Schedule · phases | [{'progress': 0.0, 'scale': 0.10171}, {'progress': 0.3, 'scale': 0.1}, {'progress': 1.0, 'scale': 0.05}] |
| LR Schedule · warmup_fraction | 0.0 |
| Gradient Accumulation Steps | 4 |
| Muon Warmup Steps | 300 |
| Seed | 1337 |
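The `n_phase_linear` schedule can be read as piecewise-linear interpolation of an LR multiplier over training progress, through the phase points listed above. This is a minimal sketch of that interpretation, not the actual scheduler code; since `warmup_fraction`, `cooldown_fraction`, `begin_after_fraction`, and `floor` are all 0.0 here, those features are omitted.

```python
def n_phase_linear(progress: float, phases) -> float:
    """Linearly interpolate the LR scale between consecutive phase points.

    `phases` is a list of (progress, scale) pairs sorted by progress.
    """
    for (p0, s0), (p1, s1) in zip(phases, phases[1:]):
        if p0 <= progress <= p1:
            t = (progress - p0) / (p1 - p0)
            return s0 + t * (s1 - s0)
    return phases[-1][1]  # past the last phase point

# Phase points from the table above.
phases = [(0.0, 0.10171), (0.3, 0.1), (1.0, 0.05)]

print(n_phase_linear(0.0, phases))    # 0.10171 at the start of training
print(n_phase_linear(0.827, phases))  # multiplier at the current 82.7% progress
```

Each optimizer group's learning rate would then be its base LR times this multiplier (times `lr_scale`, which is 1.0 here).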
## Training Data

| Type | Sequence Length | Path |
|---|---|---|
| fineweb-edu-shuffled | 16,384 | data/fineweb-edu-shuffled/train/*.bin |
| daisypie_chat | 16,384 | data/daisypie_chat/ |
## All Hyperparameters

| Parameter | Value |
|---|---|
| window_size | 2048 |
| vocab_size | 49152 |
| eos_token_id | 49131 |
| num_layers | 26 |
| num_heads | 14 |
| model_dim | 1792 |
| head_dim | 128 |
| max_seq_len | 131072 |
| model_spec | daisy_milli |
| model_class | models.daisy.daisy_core.DaisyCore |
| target_tokens | 13941866496 |
| full_window_target_tokens | 13941866496 |
| torch_coordinate_descent_tuning | False |
| torch_inductor_config_max_autotune | False |
| overfit | False |
| full_windows | True |
| wandb_log | True |
| wandb_project | milli |
| wandb_run_name | milli_v18de_v2 |
| wandb_group | pretrain |
| init_model | JonathanMiddleton/daisy-milli-base-v18d.e-tokens296879128576 |
| use_value_embeddings | True |
| use_tied_embeddings | False |
| seed | 1337 |
| task_val_debug_log_samples | False |
| log_interval | 16384 |
| muon_warmup_steps | 300 |
| lr_scale | 1.0 |
| cooldown_fraction | 0.0 |
| lr_schedule | {"name": "n_phase_linear", "config": {"cooldown_fraction": 0.0, "phases": [{"progress": 0.0, "scale": 0.10171}, {"progress": 0.3, "scale": 0.1}, {"progress": 1.0, "scale": 0.05}], "floor": 0.0, "warmup_fraction": 0.0, "begin_after_fraction": 0.0}} |
| grad_acc_steps | 4 |
| val_loss_every_tokens | 209715200 |
| checkpoint_warmup_tokens | 6000000000 |
| checkpoint_per_n_tokens | 0 |
| save_checkpoint | True |
| benchmarks_frequency | 1 |
| mmlu_cache_bin_path | data/mmlu_cache/mmlu_cache.bin |
| mmlu_cache_bin_rebuild | False |
| task_training | False |
| track_last_n_layers | 0 |