2025-08-30 01:43:03 - pico-train - INFO - Step 32000 -- 📊 Evaluation Results
2025-08-30 01:43:03 - pico-train - INFO - └── paloma: 2.977755235898109e+26
2025-08-30 01:43:05 - pico-train - INFO - ==================================================
2025-08-30 01:43:05 - pico-train - INFO - ✨ Training Configuration
2025-08-30 01:43:05 - pico-train - INFO - ==================================================
2025-08-30 01:43:05 - pico-train - INFO - ╭─────────────────────────────────────────────────────╮
2025-08-30 01:43:05 - pico-train - INFO - │ checkpointing:                                      │
2025-08-30 01:43:05 - pico-train - INFO - │   checkpoints_dir: checkpoints                      │
2025-08-30 01:43:05 - pico-train - INFO - │   evaluation:                                       │
2025-08-30 01:43:05 - pico-train - INFO - │     eval_results_dir: eval_results                  │
2025-08-30 01:43:05 - pico-train - INFO - │   fabric_checkpoint_dir: fabric_state               │
2025-08-30 01:43:05 - pico-train - INFO - │   fabric_checkpoint_filename: checkpoint.pt         │
2025-08-30 01:43:05 - pico-train - INFO - │   hf_checkpoint:                                    │
2025-08-30 01:43:05 - pico-train - INFO - │     collection_slug: null                           │
2025-08-30 01:43:05 - pico-train - INFO - │     repo_id: ThomasTheMaker/pico-decoder-tiny       │
2025-08-30 01:43:05 - pico-train - INFO - │   learning_dynamics:                                │
2025-08-30 01:43:05 - pico-train - INFO - │     batch_size: 1                                   │
2025-08-30 01:43:05 - pico-train - INFO - │     eval_data: null                                 │
2025-08-30 01:43:05 - pico-train - INFO - │     layer_suffixes:                                 │
2025-08-30 01:43:05 - pico-train - INFO - │     - attention.v_proj                              │
2025-08-30 01:43:05 - pico-train - INFO - │     - attention.o_proj                              │
2025-08-30 01:43:05 - pico-train - INFO - │     - swiglu.w_2                                    │
2025-08-30 01:43:05 - pico-train - INFO - │     sequence_idx: -1                                │
2025-08-30 01:43:05 - pico-train - INFO - │   learning_dynamics_dir: learning_dynamics          │
2025-08-30 01:43:05 - pico-train - INFO - │   logs_dir: logs                                    │
2025-08-30 01:43:05 - pico-train - INFO - │   run_name: pico-decoder-tiny-dolma5M-v1            │
2025-08-30 01:43:05 - pico-train - INFO - │   runs_dir: runs                                    │
2025-08-30 01:43:05 - pico-train - INFO - │   save_every_n_steps: 500                           │
2025-08-30 01:43:05 - pico-train - INFO - │   save_to_hf: true                                  │
2025-08-30 01:43:05 - pico-train - INFO - │   training:                                         │
2025-08-30 01:43:05 - pico-train - INFO - │     auto_resume: true                               │
2025-08-30 01:43:05 - pico-train - INFO - │ data:                                               │
2025-08-30 01:43:05 - pico-train - INFO - │   dataloader:                                       │
2025-08-30 01:43:05 - pico-train - INFO - │     batch_size: 4                                   │
2025-08-30 01:43:05 - pico-train - INFO - │   dataset:                                          │
2025-08-30 01:43:05 - pico-train - INFO - │     name: ThomasTheMaker/pretokenized-dolma-5M      │
2025-08-30 01:43:05 - pico-train - INFO - │   tokenizer:                                        │
2025-08-30 01:43:05 - pico-train - INFO - │     name: allenai/OLMo-7B-0724-hf                   │
2025-08-30 01:43:05 - pico-train - INFO - │     vocab_size: 50304                               │
2025-08-30 01:43:05 - pico-train - INFO - │ evaluation:                                         │
2025-08-30 01:43:05 - pico-train - INFO - │   metrics:                                          │
2025-08-30 01:43:05 - pico-train - INFO - │   - paloma                                          │
2025-08-30 01:43:05 - pico-train - INFO - │   paloma:                                           │
2025-08-30 01:43:05 - pico-train - INFO - │     batch_size: 1                                   │
2025-08-30 01:43:05 - pico-train - INFO - │     dataset_name: pico-lm/pretokenized-paloma-tinsy │
2025-08-30 01:43:05 - pico-train - INFO - │     dataset_split: val                              │
2025-08-30 01:43:05 - pico-train - INFO - │     max_length: 2048                                │
2025-08-30 01:43:05 - pico-train - INFO - │ model:                                              │
2025-08-30 01:43:05 - pico-train - INFO - │   activation_hidden_dim: 384                        │
2025-08-30 01:43:05 - pico-train - INFO - │   attention_n_heads: 12                             │
2025-08-30 01:43:05 - pico-train - INFO - │   attention_n_kv_heads: 4                           │
2025-08-30 01:43:05 - pico-train - INFO - │   batch_size: 1024                                  │
2025-08-30 01:43:05 - pico-train - INFO - │   d_model: 96                                       │
2025-08-30 01:43:05 - pico-train - INFO - │   max_seq_len: 2048                                 │
2025-08-30 01:43:05 - pico-train - INFO - │   model_type: pico_decoder                          │
2025-08-30 01:43:05 - pico-train - INFO - │   n_layers: 12                                      │
2025-08-30 01:43:05 - pico-train - INFO - │   norm_eps: 1.0e-06                                 │
2025-08-30 01:43:05 - pico-train - INFO - │   position_emb_theta: 10000.0                       │
2025-08-30 01:43:05 - pico-train - INFO - │   vocab_size: 50304                                 │
2025-08-30 01:43:05 - pico-train - INFO - │ monitoring:                                         │
2025-08-30 01:43:05 - pico-train - INFO - │   logging:                                          │
2025-08-30 01:43:05 - pico-train - INFO - │     log_every_n_steps: 25                           │
2025-08-30 01:43:05 - pico-train - INFO - │     log_level: INFO                                 │
2025-08-30 01:43:05 - pico-train - INFO - │   save_to_wandb: false                              │
2025-08-30 01:43:05 - pico-train - INFO - │   wandb:                                            │
2025-08-30 01:43:05 - pico-train - INFO - │     entity: boymyc                                  │
2025-08-30 01:43:05 - pico-train - INFO - │     project: pico-decoder-tiny                      │
2025-08-30 01:43:05 - pico-train - INFO - │ training:                                           │
2025-08-30 01:43:05 - pico-train - INFO - │   fabric:                                           │
2025-08-30 01:43:05 - pico-train - INFO - │     accelerator: cuda                               │
2025-08-30 01:43:05 - pico-train - INFO - │     num_devices: 1                                  │
2025-08-30 01:43:05 - pico-train - INFO - │     num_nodes: 1                                    │
2025-08-30 01:43:05 - pico-train - INFO - │     precision: bf16-mixed                           │
2025-08-30 01:43:05 - pico-train - INFO - │   max_steps: 20000                                  │
2025-08-30 01:43:05 - pico-train - INFO - │   optimization:                                     │
2025-08-30 01:43:05 - pico-train - INFO - │     gradient_accumulation_steps: 4                  │
2025-08-30 01:43:05 - pico-train - INFO - │     lr: 5.0e-05                                     │
2025-08-30 01:43:05 - pico-train - INFO - │     lr_scheduler: cosine                            │
2025-08-30 01:43:05 - pico-train - INFO - │     lr_warmup_steps: 8000                           │
2025-08-30 01:43:05 - pico-train - INFO - │     optimizer: adamw                                │
2025-08-30 01:43:05 - pico-train - INFO - │                                                     │
2025-08-30 01:43:05 - pico-train - INFO - ╰─────────────────────────────────────────────────────╯
2025-08-30 01:43:05 - pico-train - INFO - ==================================================
2025-08-30 01:43:05 - pico-train - INFO - ⛭ Runtime Summary:
2025-08-30 01:43:05 - pico-train - INFO - ==================================================
2025-08-30 01:43:05 - pico-train - INFO - Starting from step: 32000
2025-08-30 01:43:05 - pico-train - INFO - Model Setup:
2025-08-30 01:43:05 - pico-train - INFO - └─ Total Parameters: 11,282,784
2025-08-30 01:43:05 - pico-train - INFO - └─ Trainable Parameters: 11,282,784
2025-08-30 01:43:05 - pico-train - INFO - Distributed Setup:
2025-08-30 01:43:05 - pico-train - INFO - └─ Number of Devices: 1
2025-08-30 01:43:05 - pico-train - INFO - └─ Device Type: NVIDIA GeForce RTX 5090
2025-08-30 01:43:05 - pico-train - INFO - └─ Available Memory: 33.68 GB
2025-08-30 01:43:05 - pico-train - INFO - Software Setup:
2025-08-30 01:43:05 - pico-train - INFO - └─ Python Version: 3.10.12
2025-08-30 01:43:05 - pico-train - INFO - └─ PyTorch Version: 2.8.0+cu128
2025-08-30 01:43:05 - pico-train - INFO - └─ CUDA Version: 12.8
2025-08-30 01:43:05 - pico-train - INFO - └─ Operating System: Linux 6.8.0-63-generic
2025-08-30 01:43:05 - pico-train - INFO - Batch Size Configuration:
2025-08-30 01:43:05 - pico-train - INFO - └─ Global Batch Size: 4
2025-08-30 01:43:05 - pico-train - INFO - └─ Per Device Batch Size: 1
2025-08-30 01:43:05 - pico-train - INFO - └─ Gradient Accumulation Steps: 4
2025-08-30 01:43:05 - pico-train - INFO - ==================================================
2025-08-30 01:43:06 - pico-train - INFO - Step 32000 -- 🔄 Training Metrics
2025-08-30 01:43:06 - pico-train - INFO - ├── Loss: 6.3376
2025-08-30 01:43:06 - pico-train - INFO - ├── Learning Rate: 7.32e-06
2025-08-30 01:43:06 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:43:06 - pico-train - INFO - Step 32000 -- 📈 Saving Learning Dynamics
2025-08-30 01:43:20 - pico-train - INFO - Step 32025 -- 🔄 Training Metrics
2025-08-30 01:43:20 - pico-train - INFO - ├── Loss: 6.1999
2025-08-30 01:43:20 - pico-train - INFO - ├── Learning Rate: 7.28e-06
2025-08-30 01:43:20 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:43:33 - pico-train - INFO - Step 32050 -- 🔄 Training Metrics
2025-08-30 01:43:33 - pico-train - INFO - ├── Loss: 6.1488
2025-08-30 01:43:33 - pico-train - INFO - ├── Learning Rate: 7.24e-06
2025-08-30 01:43:33 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:43:45 - pico-train - INFO - Step 32075 -- 🔄 Training Metrics
2025-08-30 01:43:45 - pico-train - INFO - ├── Loss: 6.0460
2025-08-30 01:43:45 - pico-train - INFO - ├── Learning Rate: 7.19e-06
2025-08-30 01:43:45 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:43:58 - pico-train - INFO - Step 32100 -- 🔄 Training Metrics
2025-08-30 01:43:58 - pico-train - INFO - ├── Loss: 6.1627
2025-08-30 01:43:58 - pico-train - INFO - ├── Learning Rate: 7.15e-06
2025-08-30 01:43:58 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:44:11 - pico-train - INFO - Step 32125 -- 🔄 Training Metrics
2025-08-30 01:44:11 - pico-train - INFO - ├── Loss: 6.2085
2025-08-30 01:44:11 - pico-train - INFO - ├── Learning Rate: 7.11e-06
2025-08-30 01:44:11 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:44:23 - pico-train - INFO - Step 32150 -- 🔄 Training Metrics
2025-08-30 01:44:23 - pico-train - INFO - ├── Loss: 6.1659
2025-08-30 01:44:23 - pico-train - INFO - ├── Learning Rate: 7.06e-06
2025-08-30 01:44:23 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:44:36 - pico-train - INFO - Step 32175 -- 🔄 Training Metrics
2025-08-30 01:44:36 - pico-train - INFO - ├── Loss: 6.1719
2025-08-30 01:44:36 - pico-train - INFO - ├── Learning Rate: 7.02e-06
2025-08-30 01:44:36 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:44:48 - pico-train - INFO - Step 32200 -- 🔄 Training Metrics
2025-08-30 01:44:48 - pico-train - INFO - ├── Loss: 6.2081
2025-08-30 01:44:48 - pico-train - INFO - ├── Learning Rate: 6.98e-06
2025-08-30 01:44:48 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:45:01 - pico-train - INFO - Step 32225 -- 🔄 Training Metrics
2025-08-30 01:45:01 - pico-train - INFO - ├── Loss: 6.1955
2025-08-30 01:45:01 - pico-train - INFO - ├── Learning Rate: 6.94e-06
2025-08-30 01:45:01 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:45:14 - pico-train - INFO - Step 32250 -- 🔄 Training Metrics
2025-08-30 01:45:14 - pico-train - INFO - ├── Loss: 6.1139
2025-08-30 01:45:14 - pico-train - INFO - ├── Learning Rate: 6.89e-06
2025-08-30 01:45:14 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:45:26 - pico-train - INFO - Step 32275 -- 🔄 Training Metrics
2025-08-30 01:45:26 - pico-train - INFO - ├── Loss: 6.1075
2025-08-30 01:45:26 - pico-train - INFO - ├── Learning Rate: 6.85e-06
2025-08-30 01:45:26 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:45:39 - pico-train - INFO - Step 32300 -- 🔄 Training Metrics
2025-08-30 01:45:39 - pico-train - INFO - ├── Loss: 6.0814
2025-08-30 01:45:39 - pico-train - INFO - ├── Learning Rate: 6.81e-06
2025-08-30 01:45:39 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:45:51 - pico-train - INFO - Step 32325 -- 🔄 Training Metrics
2025-08-30 01:45:51 - pico-train - INFO - ├── Loss: 6.0880
2025-08-30 01:45:51 - pico-train - INFO - ├── Learning Rate: 6.77e-06
2025-08-30 01:45:51 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:46:04 - pico-train - INFO - Step 32350 -- 🔄 Training Metrics
2025-08-30 01:46:04 - pico-train - INFO - ├── Loss: 6.1997
2025-08-30 01:46:04 - pico-train - INFO - ├── Learning Rate: 6.73e-06
2025-08-30 01:46:04 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:46:16 - pico-train - INFO - Step 32375 -- 🔄 Training Metrics
2025-08-30 01:46:16 - pico-train - INFO - ├── Loss: 6.1376
2025-08-30 01:46:16 - pico-train - INFO - ├── Learning Rate: 6.68e-06
2025-08-30 01:46:16 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:46:29 - pico-train - INFO - Step 32400 -- 🔄 Training Metrics
2025-08-30 01:46:29 - pico-train - INFO - ├── Loss: 6.1077
2025-08-30 01:46:29 - pico-train - INFO - ├── Learning Rate: 6.64e-06
2025-08-30 01:46:29 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:46:42 - pico-train - INFO - Step 32425 -- 🔄 Training Metrics
2025-08-30 01:46:42 - pico-train - INFO - ├── Loss: 6.2641
2025-08-30 01:46:42 - pico-train - INFO - ├── Learning Rate: 6.60e-06
2025-08-30 01:46:42 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:46:54 - pico-train - INFO - Step 32450 -- 🔄 Training Metrics
2025-08-30 01:46:54 - pico-train - INFO - ├── Loss: 6.1020
2025-08-30 01:46:54 - pico-train - INFO - ├── Learning Rate: 6.56e-06
2025-08-30 01:46:54 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:47:07 - pico-train - INFO - Step 32475 -- 🔄 Training Metrics
2025-08-30 01:47:07 - pico-train - INFO - ├── Loss: 6.2170
2025-08-30 01:47:07 - pico-train - INFO - ├── Learning Rate: 6.52e-06
2025-08-30 01:47:07 - pico-train - INFO - └── Inf/NaN count: 0
2025-08-30 01:47:19 - pico-train - INFO - Step 32500 -- 💾 Saving Checkpoint