2025-08-30 18:41:06 - pico-train - INFO - Step 0 -- 📊 Evaluation Results
2025-08-30 18:41:06 - pico-train - INFO - └── paloma: inf
2025-08-30 18:41:06 - pico-train - INFO - ==================================================
2025-08-30 18:41:06 - pico-train - INFO - ✨ Training Configuration
2025-08-30 18:41:06 - pico-train - INFO - ==================================================
2025-08-30 18:41:06 - pico-train - INFO - ╭─────────────────────────────────────────────────────╮
2025-08-30 18:41:06 - pico-train - INFO - │ checkpointing:                                      │
2025-08-30 18:41:06 - pico-train - INFO - │   checkpoints_dir: checkpoints                      │
2025-08-30 18:41:06 - pico-train - INFO - │   evaluation:                                       │
2025-08-30 18:41:06 - pico-train - INFO - │     eval_results_dir: eval_results                  │
2025-08-30 18:41:06 - pico-train - INFO - │   fabric_checkpoint_dir: fabric_state               │
2025-08-30 18:41:06 - pico-train - INFO - │   fabric_checkpoint_filename: checkpoint.pt         │
2025-08-30 18:41:06 - pico-train - INFO - │   hf_checkpoint:                                    │
2025-08-30 18:41:06 - pico-train - INFO - │     collection_slug: null                           │
2025-08-30 18:41:06 - pico-train - INFO - │     repo_id: ThomasTheMaker/pico-decoder-tiny       │
2025-08-30 18:41:06 - pico-train - INFO - │   learning_dynamics:                                │
2025-08-30 18:41:06 - pico-train - INFO - │     batch_size: 1                                   │
2025-08-30 18:41:06 - pico-train - INFO - │     eval_data: null                                 │
2025-08-30 18:41:06 - pico-train - INFO - │     layer_suffixes:                                 │
2025-08-30 18:41:06 - pico-train - INFO - │     - attention.v_proj                              │
2025-08-30 18:41:06 - pico-train - INFO - │     - attention.o_proj                              │
2025-08-30 18:41:06 - pico-train - INFO - │     - swiglu.w_2                                    │
2025-08-30 18:41:06 - pico-train - INFO - │     sequence_idx: -1                                │
2025-08-30 18:41:06 - pico-train - INFO - │   learning_dynamics_dir: learning_dynamics          │
2025-08-30 18:41:06 - pico-train - INFO - │   logs_dir: logs                                    │
2025-08-30 18:41:06 - pico-train - INFO - │   run_name: pico-decoder-tiny-wikipedia_en-v1       │
2025-08-30 18:41:06 - pico-train - INFO - │   runs_dir: runs                                    │
2025-08-30 18:41:06 - pico-train - INFO - │   save_every_n_steps: 2000                          │
2025-08-30 18:41:06 - pico-train - INFO - │   save_to_hf: false                                 │
2025-08-30 18:41:06 - pico-train - INFO - │   training:                                         │
2025-08-30 18:41:06 - pico-train - INFO - │     auto_resume: true                               │
2025-08-30 18:41:06 - pico-train - INFO - │ data:                                               │
2025-08-30 18:41:06 - pico-train - INFO - │   dataloader:                                       │
2025-08-30 18:41:06 - pico-train - INFO - │     batch_size: 16                                  │
2025-08-30 18:41:06 - pico-train - INFO - │   dataset:                                          │
2025-08-30 18:41:06 - pico-train - INFO - │     name: ThomasTheMaker/pretokenized_wiki_en       │
2025-08-30 18:41:06 - pico-train - INFO - │   tokenizer:                                        │
2025-08-30 18:41:06 - pico-train - INFO - │     name: allenai/OLMo-7B-0724-hf                   │
2025-08-30 18:41:06 - pico-train - INFO - │     vocab_size: 50304                               │
2025-08-30 18:41:06 - pico-train - INFO - │ evaluation:                                         │
2025-08-30 18:41:06 - pico-train - INFO - │   metrics:                                          │
2025-08-30 18:41:06 - pico-train - INFO - │   - paloma                                          │
2025-08-30 18:41:06 - pico-train - INFO - │   paloma:                                           │
2025-08-30 18:41:06 - pico-train - INFO - │     batch_size: 1                                   │
2025-08-30 18:41:06 - pico-train - INFO - │     dataset_name: pico-lm/pretokenized-paloma-tinsy │
2025-08-30 18:41:06 - pico-train - INFO - │     dataset_split: val                              │
2025-08-30 18:41:06 - pico-train - INFO - │     max_length: 2048                                │
2025-08-30 18:41:06 - pico-train - INFO - │ model:                                              │
2025-08-30 18:41:06 - pico-train - INFO - │   activation_hidden_dim: 384                        │
2025-08-30 18:41:06 - pico-train - INFO - │   attention_n_heads: 12                             │
2025-08-30 18:41:06 - pico-train - INFO - │   attention_n_kv_heads: 4                           │
2025-08-30 18:41:06 - pico-train - INFO - │   batch_size: 1024                                  │
2025-08-30 18:41:06 - pico-train - INFO - │   d_model: 96                                       │
2025-08-30 18:41:06 - pico-train - INFO - │   max_seq_len: 2048                                 │
2025-08-30 18:41:06 - pico-train - INFO - │   model_type: pico_decoder                          │
2025-08-30 18:41:06 - pico-train - INFO - │   n_layers: 12                                      │
2025-08-30 18:41:06 - pico-train - INFO - │   norm_eps: 1.0e-06                                 │
2025-08-30 18:41:06 - pico-train - INFO - │   position_emb_theta: 10000.0                       │
2025-08-30 18:41:06 - pico-train - INFO - │   vocab_size: 50304                                 │
2025-08-30 18:41:06 - pico-train - INFO - │ monitoring:                                         │
2025-08-30 18:41:06 - pico-train - INFO - │   logging:                                          │
2025-08-30 18:41:06 - pico-train - INFO - │     log_every_n_steps: 100                          │
2025-08-30 18:41:06 - pico-train - INFO - │     log_level: INFO                                 │
2025-08-30 18:41:06 - pico-train - INFO - │   save_to_wandb: false                              │
2025-08-30 18:41:06 - pico-train - INFO - │   wandb:                                            │
2025-08-30 18:41:06 - pico-train - INFO - │     entity: boymyc                                  │
2025-08-30 18:41:06 - pico-train - INFO - │     project: pico-decoder-tiny                      │
2025-08-30 18:41:06 - pico-train - INFO - │ training:                                           │
2025-08-30 18:41:06 - pico-train - INFO - │   fabric:                                           │
2025-08-30 18:41:06 - pico-train - INFO - │     accelerator: cuda                               │
2025-08-30 18:41:06 - pico-train - INFO - │     num_devices: 1                                  │
2025-08-30 18:41:06 - pico-train - INFO - │     num_nodes: 1                                    │
2025-08-30 18:41:06 - pico-train - INFO - │     precision: bf16-mixed                           │
2025-08-30 18:41:06 - pico-train - INFO - │   max_steps: 100000                                 │
2025-08-30 18:41:06 - pico-train - INFO - │   optimization:                                     │
2025-08-30 18:41:06 - pico-train - INFO - │     gradient_accumulation_steps: 1                  │
2025-08-30 18:41:06 - pico-train - INFO - │     lr: 0.0002                                      │
2025-08-30 18:41:06 - pico-train - INFO - │     lr_scheduler: cosine                            │
2025-08-30 18:41:06 - pico-train - INFO - │     lr_warmup_steps: 2000                           │
2025-08-30 18:41:06 - pico-train - INFO - │     optimizer: adamw                                │
2025-08-30 18:41:06 - pico-train - INFO - │                                                     │
2025-08-30 18:41:06 - pico-train - INFO - ╰─────────────────────────────────────────────────────╯
2025-08-30 18:41:06 - pico-train - INFO - ==================================================
2025-08-30 18:41:06 - pico-train - INFO - ⛭ Runtime Summary:
2025-08-30 18:41:06 - pico-train - INFO - ==================================================
2025-08-30 18:41:06 - pico-train - INFO - Starting from step: 0
2025-08-30 18:41:06 - pico-train - INFO - Model Setup:
2025-08-30 18:41:06 - pico-train - INFO - └─ Total Parameters: 11,282,784
2025-08-30 18:41:06 - pico-train - INFO - └─ Trainable Parameters: 11,282,784
2025-08-30 18:41:06 - pico-train - INFO - Distributed Setup:
2025-08-30 18:41:06 - pico-train - INFO - └─ Number of Devices: 1
2025-08-30 18:41:06 - pico-train - INFO - └─ Device Type: NVIDIA GeForce RTX 5090
2025-08-30 18:41:06 - pico-train - INFO - └─ Available Memory: 33.68 GB
2025-08-30 18:41:06 - pico-train - INFO - Software Setup:
2025-08-30 18:41:06 - pico-train - INFO - └─ Python Version: 3.10.12
2025-08-30 18:41:06 - pico-train - INFO - └─ PyTorch Version: 2.8.0+cu128
2025-08-30 18:41:06 - pico-train - INFO - └─ CUDA Version: 12.8
2025-08-30 18:41:06 - pico-train - INFO - └─ Operating System: Linux 6.8.0-63-generic
2025-08-30 18:41:06 - pico-train - INFO - Batch Size Configuration:
2025-08-30 18:41:06 - pico-train - INFO - └─ Global Batch Size: 16
2025-08-30 18:41:06 - pico-train - INFO - └─ Per Device Batch Size: 16
2025-08-30 18:41:06 - pico-train - INFO - └─ Gradient Accumulation Steps: 1
2025-08-30 18:41:06 - pico-train - INFO - ==================================================