2025-08-28 22:07:06 - pico-train - INFO - Step 0 -- 📊 Evaluation Results
2025-08-28 22:07:06 - pico-train - INFO - └── paloma: inf
2025-08-28 22:07:06 - pico-train - INFO - ==================================================
2025-08-28 22:07:06 - pico-train - INFO - ✨ Training Configuration
2025-08-28 22:07:06 - pico-train - INFO - ==================================================
2025-08-28 22:07:06 - pico-train - INFO - ╭─────────────────────────────────────────────────────╮
2025-08-28 22:07:06 - pico-train - INFO - │ checkpointing:                                      │
2025-08-28 22:07:06 - pico-train - INFO - │   checkpoints_dir: checkpoints                      │
2025-08-28 22:07:06 - pico-train - INFO - │   evaluation:                                       │
2025-08-28 22:07:06 - pico-train - INFO - │     eval_results_dir: eval_results                  │
2025-08-28 22:07:06 - pico-train - INFO - │   fabric_checkpoint_dir: fabric_state               │
2025-08-28 22:07:06 - pico-train - INFO - │   fabric_checkpoint_filename: checkpoint.pt         │
2025-08-28 22:07:06 - pico-train - INFO - │   hf_checkpoint:                                    │
2025-08-28 22:07:06 - pico-train - INFO - │     collection_slug: null                           │
2025-08-28 22:07:06 - pico-train - INFO - │     repo_id: ThomasTheMaker/pico-decoder-tiny       │
2025-08-28 22:07:06 - pico-train - INFO - │   learning_dynamics:                                │
2025-08-28 22:07:06 - pico-train - INFO - │     batch_size: 4                                   │
2025-08-28 22:07:06 - pico-train - INFO - │     eval_data: null                                 │
2025-08-28 22:07:06 - pico-train - INFO - │     layer_suffixes:                                 │
2025-08-28 22:07:06 - pico-train - INFO - │     - attention.v_proj                              │
2025-08-28 22:07:06 - pico-train - INFO - │     - attention.o_proj                              │
2025-08-28 22:07:06 - pico-train - INFO - │     - swiglu.w_2                                    │
2025-08-28 22:07:06 - pico-train - INFO - │     sequence_idx: -1                                │
2025-08-28 22:07:06 - pico-train - INFO - │   learning_dynamics_dir: learning_dynamics          │
2025-08-28 22:07:06 - pico-train - INFO - │   logs_dir: logs                                    │
2025-08-28 22:07:06 - pico-train - INFO - │   run_name: pico-decoder-tiny                       │
2025-08-28 22:07:06 - pico-train - INFO - │   runs_dir: runs                                    │
2025-08-28 22:07:06 - pico-train - INFO - │   save_every_n_steps: 1000                          │
2025-08-28 22:07:06 - pico-train - INFO - │   save_to_hf: true                                  │
2025-08-28 22:07:06 - pico-train - INFO - │   training:                                         │
2025-08-28 22:07:06 - pico-train - INFO - │     auto_resume: true                               │
2025-08-28 22:07:06 - pico-train - INFO - │ data:                                               │
2025-08-28 22:07:06 - pico-train - INFO - │   dataloader:                                       │
2025-08-28 22:07:06 - pico-train - INFO - │     batch_size: 256                                 │
2025-08-28 22:07:06 - pico-train - INFO - │   dataset:                                          │
2025-08-28 22:07:06 - pico-train - INFO - │     name: pico-lm/pretokenized-dolma-tinsy          │
2025-08-28 22:07:06 - pico-train - INFO - │   tokenizer:                                        │
2025-08-28 22:07:06 - pico-train - INFO - │     name: allenai/OLMo-7B-0724-hf                   │
2025-08-28 22:07:06 - pico-train - INFO - │     vocab_size: 50304                               │
2025-08-28 22:07:06 - pico-train - INFO - │ evaluation:                                         │
2025-08-28 22:07:06 - pico-train - INFO - │   metrics:                                          │
2025-08-28 22:07:06 - pico-train - INFO - │   - paloma                                          │
2025-08-28 22:07:06 - pico-train - INFO - │   paloma:                                           │
2025-08-28 22:07:06 - pico-train - INFO - │     batch_size: 1                                   │
2025-08-28 22:07:06 - pico-train - INFO - │     dataset_name: pico-lm/pretokenized-paloma-tinsy │
2025-08-28 22:07:06 - pico-train - INFO - │     dataset_split: val                              │
2025-08-28 22:07:06 - pico-train - INFO - │     max_length: 2048                                │
2025-08-28 22:07:06 - pico-train - INFO - │ model:                                              │
2025-08-28 22:07:06 - pico-train - INFO - │   activation_hidden_dim: 384                        │
2025-08-28 22:07:06 - pico-train - INFO - │   attention_n_heads: 12                             │
2025-08-28 22:07:06 - pico-train - INFO - │   attention_n_kv_heads: 4                           │
2025-08-28 22:07:06 - pico-train - INFO - │   batch_size: 1024                                  │
2025-08-28 22:07:06 - pico-train - INFO - │   d_model: 96                                       │
2025-08-28 22:07:06 - pico-train - INFO - │   max_seq_len: 2048                                 │
2025-08-28 22:07:06 - pico-train - INFO - │   model_type: pico_decoder                          │
2025-08-28 22:07:06 - pico-train - INFO - │   n_layers: 12                                      │
2025-08-28 22:07:06 - pico-train - INFO - │   norm_eps: 1.0e-06                                 │
2025-08-28 22:07:06 - pico-train - INFO - │   position_emb_theta: 10000.0                       │
2025-08-28 22:07:06 - pico-train - INFO - │   vocab_size: 50304                                 │
2025-08-28 22:07:06 - pico-train - INFO - │ monitoring:                                         │
2025-08-28 22:07:06 - pico-train - INFO - │   logging:                                          │
2025-08-28 22:07:06 - pico-train - INFO - │     log_every_n_steps: 100                          │
2025-08-28 22:07:06 - pico-train - INFO - │     log_level: INFO                                 │
2025-08-28 22:07:06 - pico-train - INFO - │   save_to_wandb: false                              │
2025-08-28 22:07:06 - pico-train - INFO - │   wandb:                                            │
2025-08-28 22:07:06 - pico-train - INFO - │     entity: boymyc                                  │
2025-08-28 22:07:06 - pico-train - INFO - │     project: pico-decoder-tiny                      │
2025-08-28 22:07:06 - pico-train - INFO - │ training:                                           │
2025-08-28 22:07:06 - pico-train - INFO - │   fabric:                                           │
2025-08-28 22:07:06 - pico-train - INFO - │     accelerator: cuda                               │
2025-08-28 22:07:06 - pico-train - INFO - │     num_devices: 1                                  │
2025-08-28 22:07:06 - pico-train - INFO - │     num_nodes: 1                                    │
2025-08-28 22:07:06 - pico-train - INFO - │     precision: bf16-mixed                           │
2025-08-28 22:07:06 - pico-train - INFO - │   max_steps: 200000                                 │
2025-08-28 22:07:06 - pico-train - INFO - │   optimization:                                     │
2025-08-28 22:07:06 - pico-train - INFO - │     gradient_accumulation_steps: 4                  │
2025-08-28 22:07:06 - pico-train - INFO - │     lr: 0.0003                                      │
2025-08-28 22:07:06 - pico-train - INFO - │     lr_scheduler: linear_with_warmup                │
2025-08-28 22:07:06 - pico-train - INFO - │     lr_warmup_steps: 2500                           │
2025-08-28 22:07:06 - pico-train - INFO - │     optimizer: adamw                                │
2025-08-28 22:07:06 - pico-train - INFO - │                                                     │
2025-08-28 22:07:06 - pico-train - INFO - ╰─────────────────────────────────────────────────────╯
2025-08-28 22:07:06 - pico-train - INFO - ==================================================
2025-08-28 22:07:06 - pico-train - INFO - ⛭ Runtime Summary:
2025-08-28 22:07:06 - pico-train - INFO - ==================================================
2025-08-28 22:07:06 - pico-train - INFO - Starting from step: 0
2025-08-28 22:07:06 - pico-train - INFO - Model Setup:
2025-08-28 22:07:06 - pico-train - INFO - └─ Total Parameters: 11,282,784
2025-08-28 22:07:06 - pico-train - INFO - └─ Trainable Parameters: 11,282,784
2025-08-28 22:07:06 - pico-train - INFO - Distributed Setup:
2025-08-28 22:07:06 - pico-train - INFO - └─ Number of Devices: 1
2025-08-28 22:07:06 - pico-train - INFO - └─ Device Type: NVIDIA GeForce RTX 5090
2025-08-28 22:07:06 - pico-train - INFO - └─ Available Memory: 33.68 GB
2025-08-28 22:07:06 - pico-train - INFO - Software Setup:
2025-08-28 22:07:06 - pico-train - INFO - └─ Python Version: 3.10.12
2025-08-28 22:07:06 - pico-train - INFO - └─ PyTorch Version: 2.8.0+cu128
2025-08-28 22:07:06 - pico-train - INFO - └─ CUDA Version: 12.8
2025-08-28 22:07:06 - pico-train - INFO - └─ Operating System: Linux 6.8.0-63-generic
2025-08-28 22:07:06 - pico-train - INFO - Batch Size Configuration:
2025-08-28 22:07:06 - pico-train - INFO - └─ Global Batch Size: 4
2025-08-28 22:07:06 - pico-train - INFO - └─ Per Device Batch Size: 1
2025-08-28 22:07:06 - pico-train - INFO - └─ Gradient Accumulation Steps: 4
2025-08-28 22:07:06 - pico-train - INFO - ==================================================
2025-08-28 22:07:07 - pico-train - INFO - Step 0 -- 🔄 Training Metrics
2025-08-28 22:07:07 - pico-train - INFO - ├── Loss: 10.9886
2025-08-28 22:07:07 - pico-train - INFO - ├── Learning Rate: 0.00e+00
2025-08-28 22:07:07 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:07:07 - pico-train - INFO - Step 0 -- 📈 Saving Learning Dynamics
2025-08-28 22:08:00 - pico-train - INFO - Step 100 -- 🔄 Training Metrics
2025-08-28 22:08:00 - pico-train - INFO - ├── Loss: 10.9373
2025-08-28 22:08:00 - pico-train - INFO - ├── Learning Rate: 1.20e-05
2025-08-28 22:08:00 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:08:51 - pico-train - INFO - Step 200 -- 🔄 Training Metrics
2025-08-28 22:08:51 - pico-train - INFO - ├── Loss: 10.5423
2025-08-28 22:08:51 - pico-train - INFO - ├── Learning Rate: 2.40e-05
2025-08-28 22:08:51 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:09:43 - pico-train - INFO - Step 300 -- 🔄 Training Metrics
2025-08-28 22:09:43 - pico-train - INFO - ├── Loss: 9.9452
2025-08-28 22:09:43 - pico-train - INFO - ├── Learning Rate: 3.60e-05
2025-08-28 22:09:43 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:10:34 - pico-train - INFO - Step 400 -- 🔄 Training Metrics
2025-08-28 22:10:34 - pico-train - INFO - ├── Loss: 9.4490
2025-08-28 22:10:34 - pico-train - INFO - ├── Learning Rate: 4.80e-05
2025-08-28 22:10:34 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:11:25 - pico-train - INFO - Step 500 -- 🔄 Training Metrics
2025-08-28 22:11:25 - pico-train - INFO - ├── Loss: 8.8455
2025-08-28 22:11:25 - pico-train - INFO - ├── Learning Rate: 6.00e-05
2025-08-28 22:11:25 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:12:16 - pico-train - INFO - Step 600 -- 🔄 Training Metrics
2025-08-28 22:12:16 - pico-train - INFO - ├── Loss: 8.1482
2025-08-28 22:12:16 - pico-train - INFO - ├── Learning Rate: 7.20e-05
2025-08-28 22:12:16 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:13:08 - pico-train - INFO - Step 700 -- 🔄 Training Metrics
2025-08-28 22:13:08 - pico-train - INFO - ├── Loss: 7.4303
2025-08-28 22:13:08 - pico-train - INFO - ├── Learning Rate: 8.40e-05
2025-08-28 22:13:08 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:13:59 - pico-train - INFO - Step 800 -- 🔄 Training Metrics
2025-08-28 22:13:59 - pico-train - INFO - ├── Loss: 7.0363
2025-08-28 22:13:59 - pico-train - INFO - ├── Learning Rate: 9.60e-05
2025-08-28 22:13:59 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:14:50 - pico-train - INFO - Step 900 -- 🔄 Training Metrics
2025-08-28 22:14:50 - pico-train - INFO - ├── Loss: 6.9702
2025-08-28 22:14:50 - pico-train - INFO - ├── Learning Rate: 1.08e-04
2025-08-28 22:14:50 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:15:40 - pico-train - INFO - Step 1000 -- 💾 Saving Checkpoint
2025-08-28 22:17:41 - pico-train - INFO - Step 1000 -- 📊 Evaluation Results
2025-08-28 22:17:41 - pico-train - INFO - └── paloma: 9.54583880403771e+19
2025-08-28 22:17:43 - pico-train - INFO - Step 1000 -- 🔄 Training Metrics
2025-08-28 22:17:43 - pico-train - INFO - ├── Loss: 6.8975
2025-08-28 22:17:43 - pico-train - INFO - ├── Learning Rate: 1.20e-04
2025-08-28 22:17:43 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:17:43 - pico-train - INFO - Step 1000 -- 📈 Saving Learning Dynamics
2025-08-28 22:18:37 - pico-train - INFO - Step 1100 -- 🔄 Training Metrics
2025-08-28 22:18:37 - pico-train - INFO - ├── Loss: 6.8920
2025-08-28 22:18:37 - pico-train - INFO - ├── Learning Rate: 1.32e-04
2025-08-28 22:18:37 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:19:28 - pico-train - INFO - Step 1200 -- 🔄 Training Metrics
2025-08-28 22:19:28 - pico-train - INFO - ├── Loss: 6.6684
2025-08-28 22:19:28 - pico-train - INFO - ├── Learning Rate: 1.44e-04
2025-08-28 22:19:28 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:20:18 - pico-train - INFO - Step 1300 -- 🔄 Training Metrics
2025-08-28 22:20:18 - pico-train - INFO - ├── Loss: 6.4754
2025-08-28 22:20:18 - pico-train - INFO - ├── Learning Rate: 1.56e-04
2025-08-28 22:20:18 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:21:09 - pico-train - INFO - Step 1400 -- 🔄 Training Metrics
2025-08-28 22:21:09 - pico-train - INFO - ├── Loss: 6.3649
2025-08-28 22:21:09 - pico-train - INFO - ├── Learning Rate: 1.68e-04
2025-08-28 22:21:09 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:22:00 - pico-train - INFO - Step 1500 -- 🔄 Training Metrics
2025-08-28 22:22:00 - pico-train - INFO - ├── Loss: 6.2981
2025-08-28 22:22:00 - pico-train - INFO - ├── Learning Rate: 1.80e-04
2025-08-28 22:22:00 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:22:51 - pico-train - INFO - Step 1600 -- 🔄 Training Metrics
2025-08-28 22:22:51 - pico-train - INFO - ├── Loss: 6.1551
2025-08-28 22:22:51 - pico-train - INFO - ├── Learning Rate: 1.92e-04
2025-08-28 22:22:51 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:23:42 - pico-train - INFO - Step 1700 -- 🔄 Training Metrics
2025-08-28 22:23:42 - pico-train - INFO - ├── Loss: 5.9163
2025-08-28 22:23:42 - pico-train - INFO - ├── Learning Rate: 2.04e-04
2025-08-28 22:23:42 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:24:09 - pico-train - INFO - Step 1755 -- 💾 Saving Final Checkpoint
2025-08-28 22:26:24 - pico-train - INFO - Step 1755 -- 📊 Evaluation Results
2025-08-28 22:26:24 - pico-train - INFO - └── paloma: 2.945795672816324e+21
2025-08-28 22:26:24 - pico-train - INFO - 🎉 Training complete! Final step: 1755
2025-08-28 22:26:24 - pico-train - WARNING - Note: Training stopped before max steps (200000)