2025-08-28 22:55:45 - pico-train - INFO - Step 1000 -- 📊 Evaluation Results
2025-08-28 22:55:45 - pico-train - INFO - └── paloma: 2.5468931158531133e+19
2025-08-28 22:55:47 - pico-train - INFO - ==================================================
2025-08-28 22:55:47 - pico-train - INFO - ✨ Training Configuration
2025-08-28 22:55:47 - pico-train - INFO - ==================================================
2025-08-28 22:55:47 - pico-train - INFO - ╭─────────────────────────────────────────────────────╮
2025-08-28 22:55:47 - pico-train - INFO - │ checkpointing: │
2025-08-28 22:55:47 - pico-train - INFO - │ checkpoints_dir: checkpoints │
2025-08-28 22:55:47 - pico-train - INFO - │ evaluation: │
2025-08-28 22:55:47 - pico-train - INFO - │ eval_results_dir: eval_results │
2025-08-28 22:55:47 - pico-train - INFO - │ fabric_checkpoint_dir: fabric_state │
2025-08-28 22:55:47 - pico-train - INFO - │ fabric_checkpoint_filename: checkpoint.pt │
2025-08-28 22:55:47 - pico-train - INFO - │ hf_checkpoint: │
2025-08-28 22:55:47 - pico-train - INFO - │ collection_slug: null │
2025-08-28 22:55:47 - pico-train - INFO - │ repo_id: ThomasTheMaker/pico-decoder-tiny │
2025-08-28 22:55:47 - pico-train - INFO - │ learning_dynamics: │
2025-08-28 22:55:47 - pico-train - INFO - │ batch_size: 1 │
2025-08-28 22:55:47 - pico-train - INFO - │ eval_data: null │
2025-08-28 22:55:47 - pico-train - INFO - │ layer_suffixes: │
2025-08-28 22:55:47 - pico-train - INFO - │ - attention.v_proj │
2025-08-28 22:55:47 - pico-train - INFO - │ - attention.o_proj │
2025-08-28 22:55:47 - pico-train - INFO - │ - swiglu.w_2 │
2025-08-28 22:55:47 - pico-train - INFO - │ sequence_idx: -1 │
2025-08-28 22:55:47 - pico-train - INFO - │ learning_dynamics_dir: learning_dynamics │
2025-08-28 22:55:47 - pico-train - INFO - │ logs_dir: logs │
2025-08-28 22:55:47 - pico-train - INFO - │ run_name: pico-decoder-tiny-dolma29k │
2025-08-28 22:55:47 - pico-train - INFO - │ runs_dir: runs │
2025-08-28 22:55:47 - pico-train - INFO - │ save_every_n_steps: 1000 │
2025-08-28 22:55:47 - pico-train - INFO - │ save_to_hf: true │
2025-08-28 22:55:47 - pico-train - INFO - │ training: │
2025-08-28 22:55:47 - pico-train - INFO - │ auto_resume: true │
2025-08-28 22:55:47 - pico-train - INFO - │ data: │
2025-08-28 22:55:47 - pico-train - INFO - │ dataloader: │
2025-08-28 22:55:47 - pico-train - INFO - │ batch_size: 4 │
2025-08-28 22:55:47 - pico-train - INFO - │ dataset: │
2025-08-28 22:55:47 - pico-train - INFO - │ name: pico-lm/pretokenized-dolma │
2025-08-28 22:55:47 - pico-train - INFO - │ tokenizer: │
2025-08-28 22:55:47 - pico-train - INFO - │ name: allenai/OLMo-7B-0724-hf │
2025-08-28 22:55:47 - pico-train - INFO - │ vocab_size: 50304 │
2025-08-28 22:55:47 - pico-train - INFO - │ evaluation: │
2025-08-28 22:55:47 - pico-train - INFO - │ metrics: │
2025-08-28 22:55:47 - pico-train - INFO - │ - paloma │
2025-08-28 22:55:47 - pico-train - INFO - │ paloma: │
2025-08-28 22:55:47 - pico-train - INFO - │ batch_size: 1 │
2025-08-28 22:55:47 - pico-train - INFO - │ dataset_name: pico-lm/pretokenized-paloma-tinsy │
2025-08-28 22:55:47 - pico-train - INFO - │ dataset_split: val │
2025-08-28 22:55:47 - pico-train - INFO - │ max_length: 2048 │
2025-08-28 22:55:47 - pico-train - INFO - │ model: │
2025-08-28 22:55:47 - pico-train - INFO - │ activation_hidden_dim: 384 │
2025-08-28 22:55:47 - pico-train - INFO - │ attention_n_heads: 12 │
2025-08-28 22:55:47 - pico-train - INFO - │ attention_n_kv_heads: 4 │
2025-08-28 22:55:47 - pico-train - INFO - │ batch_size: 1024 │
2025-08-28 22:55:47 - pico-train - INFO - │ d_model: 96 │
2025-08-28 22:55:47 - pico-train - INFO - │ max_seq_len: 2048 │
2025-08-28 22:55:47 - pico-train - INFO - │ model_type: pico_decoder │
2025-08-28 22:55:47 - pico-train - INFO - │ n_layers: 12 │
2025-08-28 22:55:47 - pico-train - INFO - │ norm_eps: 1.0e-06 │
2025-08-28 22:55:47 - pico-train - INFO - │ position_emb_theta: 10000.0 │
2025-08-28 22:55:47 - pico-train - INFO - │ vocab_size: 50304 │
2025-08-28 22:55:47 - pico-train - INFO - │ monitoring: │
2025-08-28 22:55:47 - pico-train - INFO - │ logging: │
2025-08-28 22:55:47 - pico-train - INFO - │ log_every_n_steps: 100 │
2025-08-28 22:55:47 - pico-train - INFO - │ log_level: INFO │
2025-08-28 22:55:47 - pico-train - INFO - │ save_to_wandb: false │
2025-08-28 22:55:47 - pico-train - INFO - │ wandb: │
2025-08-28 22:55:47 - pico-train - INFO - │ entity: boymyc │
2025-08-28 22:55:47 - pico-train - INFO - │ project: pico-decoder-tiny │
2025-08-28 22:55:47 - pico-train - INFO - │ training: │
2025-08-28 22:55:47 - pico-train - INFO - │ fabric: │
2025-08-28 22:55:47 - pico-train - INFO - │ accelerator: cuda │
2025-08-28 22:55:47 - pico-train - INFO - │ num_devices: 1 │
2025-08-28 22:55:47 - pico-train - INFO - │ num_nodes: 1 │
2025-08-28 22:55:47 - pico-train - INFO - │ precision: bf16-mixed │
2025-08-28 22:55:47 - pico-train - INFO - │ max_steps: 200000 │
2025-08-28 22:55:47 - pico-train - INFO - │ optimization: │
2025-08-28 22:55:47 - pico-train - INFO - │ gradient_accumulation_steps: 4 │
2025-08-28 22:55:47 - pico-train - INFO - │ lr: 0.0003 │
2025-08-28 22:55:47 - pico-train - INFO - │ lr_scheduler: linear_with_warmup │
2025-08-28 22:55:47 - pico-train - INFO - │ lr_warmup_steps: 2500 │
2025-08-28 22:55:47 - pico-train - INFO - │ optimizer: adamw │
2025-08-28 22:55:47 - pico-train - INFO - │ │
2025-08-28 22:55:47 - pico-train - INFO - ╰─────────────────────────────────────────────────────╯
2025-08-28 22:55:47 - pico-train - INFO - ==================================================
2025-08-28 22:55:47 - pico-train - INFO - ⛭ Runtime Summary:
2025-08-28 22:55:47 - pico-train - INFO - ==================================================
2025-08-28 22:55:47 - pico-train - INFO - Starting from step: 1000
2025-08-28 22:55:47 - pico-train - INFO - Model Setup:
2025-08-28 22:55:47 - pico-train - INFO - └─ Total Parameters: 11,282,784
2025-08-28 22:55:47 - pico-train - INFO - └─ Trainable Parameters: 11,282,784
2025-08-28 22:55:47 - pico-train - INFO - Distributed Setup:
2025-08-28 22:55:47 - pico-train - INFO - └─ Number of Devices: 1
2025-08-28 22:55:47 - pico-train - INFO - └─ Device Type: NVIDIA GeForce RTX 5090
2025-08-28 22:55:47 - pico-train - INFO - └─ Available Memory: 33.68 GB
2025-08-28 22:55:47 - pico-train - INFO - Software Setup:
2025-08-28 22:55:47 - pico-train - INFO - └─ Python Version: 3.10.12
2025-08-28 22:55:47 - pico-train - INFO - └─ PyTorch Version: 2.8.0+cu128
2025-08-28 22:55:47 - pico-train - INFO - └─ CUDA Version: 12.8
2025-08-28 22:55:47 - pico-train - INFO - └─ Operating System: Linux 6.8.0-63-generic
2025-08-28 22:55:47 - pico-train - INFO - Batch Size Configuration:
2025-08-28 22:55:47 - pico-train - INFO - └─ Global Batch Size: 4
2025-08-28 22:55:47 - pico-train - INFO - └─ Per Device Batch Size: 1
2025-08-28 22:55:47 - pico-train - INFO - └─ Gradient Accumulation Steps: 4
2025-08-28 22:55:47 - pico-train - INFO - ==================================================
2025-08-28 22:55:49 - pico-train - INFO - Step 1000 -- 🔄 Training Metrics
2025-08-28 22:55:49 - pico-train - INFO - ├── Loss: 7.7657
2025-08-28 22:55:49 - pico-train - INFO - ├── Learning Rate: 1.20e-04
2025-08-28 22:55:49 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:55:49 - pico-train - INFO - Step 1000 -- 📈 Saving Learning Dynamics
2025-08-28 22:56:43 - pico-train - INFO - Step 1100 -- 🔄 Training Metrics
2025-08-28 22:56:43 - pico-train - INFO - ├── Loss: 7.6733
2025-08-28 22:56:43 - pico-train - INFO - ├── Learning Rate: 1.32e-04
2025-08-28 22:56:43 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:57:34 - pico-train - INFO - Step 1200 -- 🔄 Training Metrics
2025-08-28 22:57:34 - pico-train - INFO - ├── Loss: 7.5969
2025-08-28 22:57:34 - pico-train - INFO - ├── Learning Rate: 1.44e-04
2025-08-28 22:57:34 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:58:25 - pico-train - INFO - Step 1300 -- 🔄 Training Metrics
2025-08-28 22:58:25 - pico-train - INFO - ├── Loss: 7.4765
2025-08-28 22:58:25 - pico-train - INFO - ├── Learning Rate: 1.56e-04
2025-08-28 22:58:25 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 22:59:16 - pico-train - INFO - Step 1400 -- 🔄 Training Metrics
2025-08-28 22:59:16 - pico-train - INFO - ├── Loss: 7.3686
2025-08-28 22:59:16 - pico-train - INFO - ├── Learning Rate: 1.68e-04
2025-08-28 22:59:16 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:00:07 - pico-train - INFO - Step 1500 -- 🔄 Training Metrics
2025-08-28 23:00:07 - pico-train - INFO - ├── Loss: 7.3251
2025-08-28 23:00:07 - pico-train - INFO - ├── Learning Rate: 1.80e-04
2025-08-28 23:00:07 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:00:58 - pico-train - INFO - Step 1600 -- 🔄 Training Metrics
2025-08-28 23:00:58 - pico-train - INFO - ├── Loss: 7.1840
2025-08-28 23:00:58 - pico-train - INFO - ├── Learning Rate: 1.92e-04
2025-08-28 23:00:58 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:01:50 - pico-train - INFO - Step 1700 -- 🔄 Training Metrics
2025-08-28 23:01:50 - pico-train - INFO - ├── Loss: 7.1116
2025-08-28 23:01:50 - pico-train - INFO - ├── Learning Rate: 2.04e-04
2025-08-28 23:01:50 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:02:41 - pico-train - INFO - Step 1800 -- 🔄 Training Metrics
2025-08-28 23:02:41 - pico-train - INFO - ├── Loss: 7.0565
2025-08-28 23:02:41 - pico-train - INFO - ├── Learning Rate: 2.16e-04
2025-08-28 23:02:41 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:03:32 - pico-train - INFO - Step 1900 -- 🔄 Training Metrics
2025-08-28 23:03:32 - pico-train - INFO - ├── Loss: 6.9964
2025-08-28 23:03:32 - pico-train - INFO - ├── Learning Rate: 2.28e-04
2025-08-28 23:03:32 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:04:23 - pico-train - INFO - Step 2000 -- 💾 Saving Checkpoint
2025-08-28 23:06:18 - pico-train - INFO - Step 2000 -- 📊 Evaluation Results
2025-08-28 23:06:18 - pico-train - INFO - └── paloma: 3.627192449295412e+21
2025-08-28 23:06:21 - pico-train - INFO - Step 2000 -- 🔄 Training Metrics
2025-08-28 23:06:21 - pico-train - INFO - ├── Loss: 6.9690
2025-08-28 23:06:21 - pico-train - INFO - ├── Learning Rate: 2.40e-04
2025-08-28 23:06:21 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:06:21 - pico-train - INFO - Step 2000 -- 📈 Saving Learning Dynamics
2025-08-28 23:07:15 - pico-train - INFO - Step 2100 -- 🔄 Training Metrics
2025-08-28 23:07:15 - pico-train - INFO - ├── Loss: 6.8840
2025-08-28 23:07:15 - pico-train - INFO - ├── Learning Rate: 2.52e-04
2025-08-28 23:07:15 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:08:06 - pico-train - INFO - Step 2200 -- 🔄 Training Metrics
2025-08-28 23:08:06 - pico-train - INFO - ├── Loss: 6.8334
2025-08-28 23:08:06 - pico-train - INFO - ├── Learning Rate: 2.64e-04
2025-08-28 23:08:06 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:08:57 - pico-train - INFO - Step 2300 -- 🔄 Training Metrics
2025-08-28 23:08:57 - pico-train - INFO - ├── Loss: 6.8150
2025-08-28 23:08:57 - pico-train - INFO - ├── Learning Rate: 2.76e-04
2025-08-28 23:08:57 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:09:48 - pico-train - INFO - Step 2400 -- 🔄 Training Metrics
2025-08-28 23:09:48 - pico-train - INFO - ├── Loss: 6.7519
2025-08-28 23:09:48 - pico-train - INFO - ├── Learning Rate: 2.88e-04
2025-08-28 23:09:48 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:10:39 - pico-train - INFO - Step 2500 -- 🔄 Training Metrics
2025-08-28 23:10:39 - pico-train - INFO - ├── Loss: 6.6908
2025-08-28 23:10:39 - pico-train - INFO - ├── Learning Rate: 3.00e-04
2025-08-28 23:10:39 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:11:30 - pico-train - INFO - Step 2600 -- 🔄 Training Metrics
2025-08-28 23:11:30 - pico-train - INFO - ├── Loss: 6.6351
2025-08-28 23:11:30 - pico-train - INFO - ├── Learning Rate: 3.00e-04
2025-08-28 23:11:30 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:12:21 - pico-train - INFO - Step 2700 -- 🔄 Training Metrics
2025-08-28 23:12:21 - pico-train - INFO - ├── Loss: 6.5568
2025-08-28 23:12:21 - pico-train - INFO - ├── Learning Rate: 3.00e-04
2025-08-28 23:12:21 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:13:12 - pico-train - INFO - Step 2800 -- 🔄 Training Metrics
2025-08-28 23:13:12 - pico-train - INFO - ├── Loss: 6.5799
2025-08-28 23:13:12 - pico-train - INFO - ├── Learning Rate: 3.00e-04
2025-08-28 23:13:12 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:14:03 - pico-train - INFO - Step 2900 -- 🔄 Training Metrics
2025-08-28 23:14:03 - pico-train - INFO - ├── Loss: 6.5467
2025-08-28 23:14:03 - pico-train - INFO - ├── Learning Rate: 2.99e-04
2025-08-28 23:14:03 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:14:53 - pico-train - INFO - Step 3000 -- 💾 Saving Checkpoint
2025-08-28 23:16:58 - pico-train - INFO - Step 3000 -- 📊 Evaluation Results
2025-08-28 23:16:58 - pico-train - INFO - └── paloma: 9.90975658825673e+22
2025-08-28 23:17:01 - pico-train - INFO - Step 3000 -- 🔄 Training Metrics
2025-08-28 23:17:01 - pico-train - INFO - ├── Loss: 6.4865
2025-08-28 23:17:01 - pico-train - INFO - ├── Learning Rate: 2.99e-04
2025-08-28 23:17:01 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:17:01 - pico-train - INFO - Step 3000 -- 📈 Saving Learning Dynamics
2025-08-28 23:17:55 - pico-train - INFO - Step 3100 -- 🔄 Training Metrics
2025-08-28 23:17:55 - pico-train - INFO - ├── Loss: 6.4604
2025-08-28 23:17:55 - pico-train - INFO - ├── Learning Rate: 2.99e-04
2025-08-28 23:17:55 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:18:46 - pico-train - INFO - Step 3200 -- 🔄 Training Metrics
2025-08-28 23:18:46 - pico-train - INFO - ├── Loss: 6.4205
2025-08-28 23:18:46 - pico-train - INFO - ├── Learning Rate: 2.99e-04
2025-08-28 23:18:46 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:19:36 - pico-train - INFO - Step 3300 -- 🔄 Training Metrics
2025-08-28 23:19:36 - pico-train - INFO - ├── Loss: 6.4127
2025-08-28 23:19:36 - pico-train - INFO - ├── Learning Rate: 2.99e-04
2025-08-28 23:19:36 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:20:27 - pico-train - INFO - Step 3400 -- 🔄 Training Metrics
2025-08-28 23:20:27 - pico-train - INFO - ├── Loss: 6.3692
2025-08-28 23:20:27 - pico-train - INFO - ├── Learning Rate: 2.99e-04
2025-08-28 23:20:27 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:21:18 - pico-train - INFO - Step 3500 -- 🔄 Training Metrics
2025-08-28 23:21:18 - pico-train - INFO - ├── Loss: 6.3761
2025-08-28 23:21:18 - pico-train - INFO - ├── Learning Rate: 2.98e-04
2025-08-28 23:21:18 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:22:09 - pico-train - INFO - Step 3600 -- 🔄 Training Metrics
2025-08-28 23:22:09 - pico-train - INFO - ├── Loss: 6.2796
2025-08-28 23:22:09 - pico-train - INFO - ├── Learning Rate: 2.98e-04
2025-08-28 23:22:09 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:23:00 - pico-train - INFO - Step 3700 -- 🔄 Training Metrics
2025-08-28 23:23:00 - pico-train - INFO - ├── Loss: 6.2988
2025-08-28 23:23:00 - pico-train - INFO - ├── Learning Rate: 2.98e-04
2025-08-28 23:23:00 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:23:51 - pico-train - INFO - Step 3800 -- 🔄 Training Metrics
2025-08-28 23:23:51 - pico-train - INFO - ├── Loss: 6.2673
2025-08-28 23:23:51 - pico-train - INFO - ├── Learning Rate: 2.98e-04
2025-08-28 23:23:51 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:24:42 - pico-train - INFO - Step 3900 -- 🔄 Training Metrics
2025-08-28 23:24:42 - pico-train - INFO - ├── Loss: 6.2715
2025-08-28 23:24:42 - pico-train - INFO - ├── Learning Rate: 2.98e-04
2025-08-28 23:24:42 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:25:32 - pico-train - INFO - Step 4000 -- 💾 Saving Checkpoint
2025-08-28 23:27:27 - pico-train - INFO - Step 4000 -- 📊 Evaluation Results
2025-08-28 23:27:27 - pico-train - INFO - └── paloma: 2.6252526658823776e+24
2025-08-28 23:27:29 - pico-train - INFO - Step 4000 -- 🔄 Training Metrics
2025-08-28 23:27:29 - pico-train - INFO - ├── Loss: 6.1890
2025-08-28 23:27:29 - pico-train - INFO - ├── Learning Rate: 2.98e-04
2025-08-28 23:27:29 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:27:29 - pico-train - INFO - Step 4000 -- 📈 Saving Learning Dynamics
2025-08-28 23:28:23 - pico-train - INFO - Step 4100 -- 🔄 Training Metrics
2025-08-28 23:28:23 - pico-train - INFO - ├── Loss: 6.1832
2025-08-28 23:28:23 - pico-train - INFO - ├── Learning Rate: 2.98e-04
2025-08-28 23:28:23 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:29:13 - pico-train - INFO - Step 4200 -- 🔄 Training Metrics
2025-08-28 23:29:13 - pico-train - INFO - ├── Loss: 6.1553
2025-08-28 23:29:13 - pico-train - INFO - ├── Learning Rate: 2.97e-04
2025-08-28 23:29:13 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:30:04 - pico-train - INFO - Step 4300 -- 🔄 Training Metrics
2025-08-28 23:30:04 - pico-train - INFO - ├── Loss: 6.1629
2025-08-28 23:30:04 - pico-train - INFO - ├── Learning Rate: 2.97e-04
2025-08-28 23:30:04 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:30:56 - pico-train - INFO - Step 4400 -- 🔄 Training Metrics
2025-08-28 23:30:56 - pico-train - INFO - ├── Loss: 6.1061
2025-08-28 23:30:56 - pico-train - INFO - ├── Learning Rate: 2.97e-04
2025-08-28 23:30:56 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:31:47 - pico-train - INFO - Step 4500 -- 🔄 Training Metrics
2025-08-28 23:31:47 - pico-train - INFO - ├── Loss: 6.1601
2025-08-28 23:31:47 - pico-train - INFO - ├── Learning Rate: 2.97e-04
2025-08-28 23:31:47 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:32:38 - pico-train - INFO - Step 4600 -- 🔄 Training Metrics
2025-08-28 23:32:38 - pico-train - INFO - ├── Loss: 6.0963
2025-08-28 23:32:38 - pico-train - INFO - ├── Learning Rate: 2.97e-04
2025-08-28 23:32:38 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:33:29 - pico-train - INFO - Step 4700 -- 🔄 Training Metrics
2025-08-28 23:33:29 - pico-train - INFO - ├── Loss: 6.0780
2025-08-28 23:33:29 - pico-train - INFO - ├── Learning Rate: 2.97e-04
2025-08-28 23:33:29 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:34:20 - pico-train - INFO - Step 4800 -- 🔄 Training Metrics
2025-08-28 23:34:20 - pico-train - INFO - ├── Loss: 6.0835
2025-08-28 23:34:20 - pico-train - INFO - ├── Learning Rate: 2.97e-04
2025-08-28 23:34:20 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:35:11 - pico-train - INFO - Step 4900 -- 🔄 Training Metrics
2025-08-28 23:35:11 - pico-train - INFO - ├── Loss: 6.0519
2025-08-28 23:35:11 - pico-train - INFO - ├── Learning Rate: 2.96e-04
2025-08-28 23:35:11 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:36:01 - pico-train - INFO - Step 5000 -- 💾 Saving Checkpoint
2025-08-28 23:38:14 - pico-train - INFO - Step 5000 -- 📊 Evaluation Results
2025-08-28 23:38:14 - pico-train - INFO - └── paloma: 7.294956881845611e+25
2025-08-28 23:38:16 - pico-train - INFO - Step 5000 -- 🔄 Training Metrics
2025-08-28 23:38:16 - pico-train - INFO - ├── Loss: 6.0661
2025-08-28 23:38:16 - pico-train - INFO - ├── Learning Rate: 2.96e-04
2025-08-28 23:38:16 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:38:16 - pico-train - INFO - Step 5000 -- 📈 Saving Learning Dynamics
2025-08-28 23:39:10 - pico-train - INFO - Step 5100 -- 🔄 Training Metrics
2025-08-28 23:39:10 - pico-train - INFO - ├── Loss: 6.0121
2025-08-28 23:39:10 - pico-train - INFO - ├── Learning Rate: 2.96e-04
2025-08-28 23:39:10 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:40:02 - pico-train - INFO - Step 5200 -- 🔄 Training Metrics
2025-08-28 23:40:02 - pico-train - INFO - ├── Loss: 6.0544
2025-08-28 23:40:02 - pico-train - INFO - ├── Learning Rate: 2.96e-04
2025-08-28 23:40:02 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:40:53 - pico-train - INFO - Step 5300 -- 🔄 Training Metrics
2025-08-28 23:40:53 - pico-train - INFO - ├── Loss: 6.0224
2025-08-28 23:40:53 - pico-train - INFO - ├── Learning Rate: 2.96e-04
2025-08-28 23:40:53 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:41:44 - pico-train - INFO - Step 5400 -- 🔄 Training Metrics
2025-08-28 23:41:44 - pico-train - INFO - ├── Loss: 5.9831
2025-08-28 23:41:44 - pico-train - INFO - ├── Learning Rate: 2.96e-04
2025-08-28 23:41:44 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:42:35 - pico-train - INFO - Step 5500 -- 🔄 Training Metrics
2025-08-28 23:42:35 - pico-train - INFO - ├── Loss: 5.9553
2025-08-28 23:42:35 - pico-train - INFO - ├── Learning Rate: 2.95e-04
2025-08-28 23:42:35 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:43:26 - pico-train - INFO - Step 5600 -- 🔄 Training Metrics
2025-08-28 23:43:26 - pico-train - INFO - ├── Loss: 5.9493
2025-08-28 23:43:26 - pico-train - INFO - ├── Learning Rate: 2.95e-04
2025-08-28 23:43:26 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:44:17 - pico-train - INFO - Step 5700 -- 🔄 Training Metrics
2025-08-28 23:44:17 - pico-train - INFO - ├── Loss: 5.9943
2025-08-28 23:44:17 - pico-train - INFO - ├── Learning Rate: 2.95e-04
2025-08-28 23:44:17 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:45:08 - pico-train - INFO - Step 5800 -- 🔄 Training Metrics
2025-08-28 23:45:08 - pico-train - INFO - ├── Loss: 5.9630
2025-08-28 23:45:08 - pico-train - INFO - ├── Learning Rate: 2.95e-04
2025-08-28 23:45:08 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:46:00 - pico-train - INFO - Step 5900 -- 🔄 Training Metrics
2025-08-28 23:46:00 - pico-train - INFO - ├── Loss: 5.9349
2025-08-28 23:46:00 - pico-train - INFO - ├── Learning Rate: 2.95e-04
2025-08-28 23:46:00 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:46:50 - pico-train - INFO - Step 6000 -- 💾 Saving Checkpoint
2025-08-28 23:48:48 - pico-train - INFO - Step 6000 -- 📊 Evaluation Results
2025-08-28 23:48:48 - pico-train - INFO - └── paloma: 1.6856570425562805e+27
2025-08-28 23:48:50 - pico-train - INFO - Step 6000 -- 🔄 Training Metrics
2025-08-28 23:48:50 - pico-train - INFO - ├── Loss: 5.9087
2025-08-28 23:48:50 - pico-train - INFO - ├── Learning Rate: 2.95e-04
2025-08-28 23:48:50 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:48:50 - pico-train - INFO - Step 6000 -- 📈 Saving Learning Dynamics
2025-08-28 23:49:44 - pico-train - INFO - Step 6100 -- 🔄 Training Metrics
2025-08-28 23:49:44 - pico-train - INFO - ├── Loss: 5.8818
2025-08-28 23:49:44 - pico-train - INFO - ├── Learning Rate: 2.95e-04
2025-08-28 23:49:44 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:50:35 - pico-train - INFO - Step 6200 -- 🔄 Training Metrics
2025-08-28 23:50:35 - pico-train - INFO - ├── Loss: 5.8535
2025-08-28 23:50:35 - pico-train - INFO - ├── Learning Rate: 2.94e-04
2025-08-28 23:50:35 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:51:26 - pico-train - INFO - Step 6300 -- 🔄 Training Metrics
2025-08-28 23:51:26 - pico-train - INFO - ├── Loss: 5.8896
2025-08-28 23:51:26 - pico-train - INFO - ├── Learning Rate: 2.94e-04
2025-08-28 23:51:26 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:52:18 - pico-train - INFO - Step 6400 -- 🔄 Training Metrics
2025-08-28 23:52:18 - pico-train - INFO - ├── Loss: 5.9007
2025-08-28 23:52:18 - pico-train - INFO - ├── Learning Rate: 2.94e-04
2025-08-28 23:52:18 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:53:09 - pico-train - INFO - Step 6500 -- 🔄 Training Metrics
2025-08-28 23:53:09 - pico-train - INFO - ├── Loss: 5.8617
2025-08-28 23:53:09 - pico-train - INFO - ├── Learning Rate: 2.94e-04
2025-08-28 23:53:09 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:54:00 - pico-train - INFO - Step 6600 -- 🔄 Training Metrics
2025-08-28 23:54:00 - pico-train - INFO - ├── Loss: 5.8201
2025-08-28 23:54:00 - pico-train - INFO - ├── Learning Rate: 2.94e-04
2025-08-28 23:54:00 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:54:51 - pico-train - INFO - Step 6700 -- 🔄 Training Metrics
2025-08-28 23:54:51 - pico-train - INFO - ├── Loss: 5.8544
2025-08-28 23:54:51 - pico-train - INFO - ├── Learning Rate: 2.94e-04
2025-08-28 23:54:51 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:55:42 - pico-train - INFO - Step 6800 -- 🔄 Training Metrics
2025-08-28 23:55:42 - pico-train - INFO - ├── Loss: 5.8532
2025-08-28 23:55:42 - pico-train - INFO - ├── Learning Rate: 2.93e-04
2025-08-28 23:55:42 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:56:33 - pico-train - INFO - Step 6900 -- 🔄 Training Metrics
2025-08-28 23:56:33 - pico-train - INFO - ├── Loss: 5.7950
2025-08-28 23:56:33 - pico-train - INFO - ├── Learning Rate: 2.93e-04
2025-08-28 23:56:33 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:57:24 - pico-train - INFO - Step 7000 -- 💾 Saving Checkpoint
2025-08-28 23:59:22 - pico-train - INFO - Step 7000 -- 📊 Evaluation Results
2025-08-28 23:59:22 - pico-train - INFO - └── paloma: 9.22180682233585e+28
2025-08-28 23:59:23 - pico-train - INFO - Step 7000 -- 🔄 Training Metrics
2025-08-28 23:59:23 - pico-train - INFO - ├── Loss: 5.8146
2025-08-28 23:59:23 - pico-train - INFO - ├── Learning Rate: 2.93e-04
2025-08-28 23:59:23 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-28 23:59:23 - pico-train - INFO - Step 7000 -- 📈 Saving Learning Dynamics
2025-08-29 00:00:17 - pico-train - INFO - Step 7100 -- 🔄 Training Metrics
2025-08-29 00:00:17 - pico-train - INFO - ├── Loss: 5.7930
2025-08-29 00:00:17 - pico-train - INFO - ├── Learning Rate: 2.93e-04
2025-08-29 00:00:17 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:01:09 - pico-train - INFO - Step 7200 -- 🔄 Training Metrics
2025-08-29 00:01:09 - pico-train - INFO - ├── Loss: 5.7827
2025-08-29 00:01:09 - pico-train - INFO - ├── Learning Rate: 2.93e-04
2025-08-29 00:01:09 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:02:00 - pico-train - INFO - Step 7300 -- 🔄 Training Metrics
2025-08-29 00:02:00 - pico-train - INFO - ├── Loss: 5.7816
2025-08-29 00:02:00 - pico-train - INFO - ├── Learning Rate: 2.93e-04
2025-08-29 00:02:00 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:02:51 - pico-train - INFO - Step 7400 -- 🔄 Training Metrics
2025-08-29 00:02:51 - pico-train - INFO - ├── Loss: 5.7300
2025-08-29 00:02:51 - pico-train - INFO - ├── Learning Rate: 2.93e-04
2025-08-29 00:02:51 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:03:42 - pico-train - INFO - Step 7500 -- 🔄 Training Metrics
2025-08-29 00:03:42 - pico-train - INFO - ├── Loss: 5.7670
2025-08-29 00:03:42 - pico-train - INFO - ├── Learning Rate: 2.92e-04
2025-08-29 00:03:42 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:04:33 - pico-train - INFO - Step 7600 -- 🔄 Training Metrics
2025-08-29 00:04:33 - pico-train - INFO - ├── Loss: 5.7450
2025-08-29 00:04:33 - pico-train - INFO - ├── Learning Rate: 2.92e-04
2025-08-29 00:04:33 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:05:25 - pico-train - INFO - Step 7700 -- 🔄 Training Metrics
2025-08-29 00:05:25 - pico-train - INFO - ├── Loss: 5.7499
2025-08-29 00:05:25 - pico-train - INFO - ├── Learning Rate: 2.92e-04
2025-08-29 00:05:25 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:06:16 - pico-train - INFO - Step 7800 -- 🔄 Training Metrics
2025-08-29 00:06:16 - pico-train - INFO - ├── Loss: 5.7233
2025-08-29 00:06:16 - pico-train - INFO - ├── Learning Rate: 2.92e-04
2025-08-29 00:06:16 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:07:07 - pico-train - INFO - Step 7900 -- 🔄 Training Metrics
2025-08-29 00:07:07 - pico-train - INFO - ├── Loss: 5.7219
2025-08-29 00:07:07 - pico-train - INFO - ├── Learning Rate: 2.92e-04
2025-08-29 00:07:07 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:07:57 - pico-train - INFO - Step 8000 -- 💾 Saving Checkpoint
2025-08-29 00:10:09 - pico-train - INFO - Step 8000 -- 📊 Evaluation Results
2025-08-29 00:10:09 - pico-train - INFO - └── paloma: 3.1300823362207656e+29
2025-08-29 00:10:11 - pico-train - INFO - Step 8000 -- 🔄 Training Metrics
2025-08-29 00:10:11 - pico-train - INFO - ├── Loss: 5.7523
2025-08-29 00:10:11 - pico-train - INFO - ├── Learning Rate: 2.92e-04
2025-08-29 00:10:11 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:10:11 - pico-train - INFO - Step 8000 -- 📈 Saving Learning Dynamics
2025-08-29 00:11:05 - pico-train - INFO - Step 8100 -- 🔄 Training Metrics
2025-08-29 00:11:05 - pico-train - INFO - ├── Loss: 5.7145
2025-08-29 00:11:05 - pico-train - INFO - ├── Learning Rate: 2.91e-04
2025-08-29 00:11:05 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:11:57 - pico-train - INFO - Step 8200 -- 🔄 Training Metrics
2025-08-29 00:11:57 - pico-train - INFO - ├── Loss: 5.7469
2025-08-29 00:11:57 - pico-train - INFO - ├── Learning Rate: 2.91e-04
2025-08-29 00:11:57 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:12:48 - pico-train - INFO - Step 8300 -- 🔄 Training Metrics
2025-08-29 00:12:48 - pico-train - INFO - ├── Loss: 5.7363
2025-08-29 00:12:48 - pico-train - INFO - ├── Learning Rate: 2.91e-04
2025-08-29 00:12:48 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:13:38 - pico-train - INFO - Step 8400 -- 🔄 Training Metrics
2025-08-29 00:13:38 - pico-train - INFO - ├── Loss: 5.6938
2025-08-29 00:13:38 - pico-train - INFO - ├── Learning Rate: 2.91e-04
2025-08-29 00:13:38 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:14:29 - pico-train - INFO - Step 8500 -- 🔄 Training Metrics
2025-08-29 00:14:29 - pico-train - INFO - ├── Loss: 5.6994
2025-08-29 00:14:29 - pico-train - INFO - ├── Learning Rate: 2.91e-04
2025-08-29 00:14:29 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:15:20 - pico-train - INFO - Step 8600 -- 🔄 Training Metrics
2025-08-29 00:15:20 - pico-train - INFO - ├── Loss: 5.6583
2025-08-29 00:15:20 - pico-train - INFO - ├── Learning Rate: 2.91e-04
2025-08-29 00:15:20 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:16:11 - pico-train - INFO - Step 8700 -- 🔄 Training Metrics
2025-08-29 00:16:11 - pico-train - INFO - ├── Loss: 5.6885
2025-08-29 00:16:11 - pico-train - INFO - ├── Learning Rate: 2.91e-04
2025-08-29 00:16:11 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:17:02 - pico-train - INFO - Step 8800 -- 🔄 Training Metrics
2025-08-29 00:17:02 - pico-train - INFO - ├── Loss: 5.6313
2025-08-29 00:17:02 - pico-train - INFO - ├── Learning Rate: 2.90e-04
2025-08-29 00:17:02 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:17:53 - pico-train - INFO - Step 8900 -- 🔄 Training Metrics
2025-08-29 00:17:53 - pico-train - INFO - ├── Loss: 5.6314
2025-08-29 00:17:53 - pico-train - INFO - ├── Learning Rate: 2.90e-04
2025-08-29 00:17:53 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:18:44 - pico-train - INFO - Step 9000 -- 💾 Saving Checkpoint
2025-08-29 00:20:42 - pico-train - INFO - Step 9000 -- 📊 Evaluation Results
2025-08-29 00:20:42 - pico-train - INFO - └── paloma: 4.983924509492406e+30
2025-08-29 00:20:43 - pico-train - INFO - Step 9000 -- 🔄 Training Metrics
2025-08-29 00:20:43 - pico-train - INFO - ├── Loss: 5.6501
2025-08-29 00:20:43 - pico-train - INFO - ├── Learning Rate: 2.90e-04
2025-08-29 00:20:43 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:20:43 - pico-train - INFO - Step 9000 -- 📈 Saving Learning Dynamics
2025-08-29 00:21:37 - pico-train - INFO - Step 9100 -- 🔄 Training Metrics
2025-08-29 00:21:37 - pico-train - INFO - ├── Loss: 5.6357
2025-08-29 00:21:37 - pico-train - INFO - ├── Learning Rate: 2.90e-04
2025-08-29 00:21:37 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:22:28 - pico-train - INFO - Step 9200 -- 🔄 Training Metrics
2025-08-29 00:22:28 - pico-train - INFO - ├── Loss: 5.6045
2025-08-29 00:22:28 - pico-train - INFO - ├── Learning Rate: 2.90e-04
2025-08-29 00:22:28 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:23:19 - pico-train - INFO - Step 9300 -- 🔄 Training Metrics
2025-08-29 00:23:19 - pico-train - INFO - ├── Loss: 5.6405
2025-08-29 00:23:19 - pico-train - INFO - ├── Learning Rate: 2.90e-04
2025-08-29 00:23:19 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:24:10 - pico-train - INFO - Step 9400 -- 🔄 Training Metrics
2025-08-29 00:24:10 - pico-train - INFO - ├── Loss: 5.6241
2025-08-29 00:24:10 - pico-train - INFO - ├── Learning Rate: 2.90e-04
2025-08-29 00:24:10 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:25:00 - pico-train - INFO - Step 9500 -- 🔄 Training Metrics
2025-08-29 00:25:00 - pico-train - INFO - ├── Loss: 5.6247
2025-08-29 00:25:00 - pico-train - INFO - ├── Learning Rate: 2.89e-04
2025-08-29 00:25:00 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:25:51 - pico-train - INFO - Step 9600 -- 🔄 Training Metrics
2025-08-29 00:25:51 - pico-train - INFO - ├── Loss: 5.5983
2025-08-29 00:25:51 - pico-train - INFO - ├── Learning Rate: 2.89e-04
2025-08-29 00:25:51 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:26:43 - pico-train - INFO - Step 9700 -- 🔄 Training Metrics
2025-08-29 00:26:43 - pico-train - INFO - ├── Loss: 5.5978
2025-08-29 00:26:43 - pico-train - INFO - ├── Learning Rate: 2.89e-04
2025-08-29 00:26:43 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-29 00:27:34 - pico-train - INFO - Step 9800 -- 🔄 Training Metrics
2025-08-29 00:27:34 - pico-train - INFO - ├── Loss: 5.5746
2025-08-29 00:27:34 - pico-train - INFO - ├── Learning Rate: 2.89e-04
2025-08-29 00:27:34 - pico-train - INFO - └── Inf/NaN count: 0