This is the first pre-trained version of the Qwen3-10M model, trained for 1 epoch on a 400K-sample subset of Facebook's Recycling the Web dataset.
This phase-1 pre-training trains the base model with a context size of 256; in phase 2 the context will be extended to 1024 tokens.
My hardware is an M4 Mac mini with 32 GB of RAM.
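
A minimal sketch of loading this checkpoint with Hugging Face transformers for a quick generation test; the repo id below is a placeholder for this model's actual Hub path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<user>/Qwen3-10M"  # placeholder: replace with this repo's Hub id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation from the base (non-instruct) model
prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```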
Stats:
batch_size = 32
val_batches = 16
epochs = 1
optimizer_name = "adamw"
scheduler = "cosine_decay"
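
The optimizer and scheduler names above match MLX's `mlx.optimizers` (a natural fit for training on Apple silicon), so here is a minimal sketch of how that pairing could be wired up; the learning rate is an illustrative placeholder, not a value reported for this run.

```python
import mlx.optimizers as optim

# ~1 epoch over the 400K-sample subset at batch_size 32
total_steps = 400_000 // 32  # 12,500 steps

# Cosine decay from the initial learning rate down to 0 over the run
lr_schedule = optim.cosine_decay(1e-4, total_steps)  # 1e-4 is a placeholder

# AdamW accepts a schedule callable directly as its learning rate
optimizer = optim.AdamW(learning_rate=lr_schedule)
```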
WandB run:
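
For reference, a minimal sketch of how the config above could be logged to Weights & Biases; the project name is a placeholder.

```python
import wandb

run = wandb.init(
    project="qwen3-10m-pretrain",  # placeholder project name
    config={
        "batch_size": 32,
        "val_batches": 16,
        "epochs": 1,
        "optimizer_name": "adamw",
        "scheduler": "cosine_decay",
        "context_size": 256,
    },
)
# inside the training loop: wandb.log({"train/loss": loss}, step=step)
```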
