Draft Models
Tiny "draft" models for speculative decoding.
A 0.8B-parameter draft model (for speculative decoding) intended for use with command-a-03-2025.
See command-a-03-2025-DRAFT-0.8B-v3.0-GGUF for the models in GGUF format for use with llama.cpp.
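The GGUF builds are intended for llama.cpp's speculative decoding, but the same draft model can also be exercised with Hugging Face transformers' assisted generation, which implements the same idea. A minimal sketch, assuming local copies of both models (the paths below are placeholders) and enough memory for the full command-a-03-2025 target:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder local paths; adjust to wherever the models are stored.
target_path = "command-a-03-2025"
draft_path = "command-a-03-2025-DRAFT-0.8B"

tokenizer = AutoTokenizer.from_pretrained(target_path)
target = AutoModelForCausalLM.from_pretrained(
    target_path, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    draft_path, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Write a haiku about speculative decoding.",
                   return_tensors="pt").to(target.device)

# assistant_model enables assisted (speculative) generation: the draft
# proposes tokens and the target verifies them in one forward pass.
output = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Because the draft shares the target's vocabulary (that is the whole point of the transplant below), no re-tokenization between the two models is needed.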
The current config.json is set for a context length of up to 32k tokens. To enable YaRN for longer contexts, add a "rope_scaling" section to config.json, e.g.:
"max_position_embeddings": 65536,
...
"rope_scaling": {
"factor": 2.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
},
"max_position_embeddings": 131072,
...
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
},
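To apply the change programmatically, here is a minimal Python sketch that patches config.json for the 128k variant (the path is a placeholder; substitute the 64k values from the first snippet if preferred):

import json

# Placeholder path: config.json inside the draft model folder.
config_path = "command-a-03-2025-DRAFT-0.8B/config.json"

with open(config_path) as f:
    config = json.load(f)

# Values taken from the 128k example above.
config["max_position_embeddings"] = 131072
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)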
NOTE: Because llama.cpp uses "static" YaRN, the scaling factor remains constant regardless of input length! Only add the rope_scaling configuration when processing long contexts is actually required.

The untrained draft model was created by transplanting Qwen2.5-0.5B-Instruct onto the command-a-03-2025 vocabulary using transplant_vocab.py:
> python ./transplant_vocab.py \
./Qwen2.5-0.5B-Instruct \
./command-a-03-2025 \
./command-a-03-2025-DRAFT-0.8B-UNTRAINED \
--override "<PAD>" "<|endoftext|>" \
--override "<UNK>" "<|endoftext|>" \
--override "<CLS>" "<|endoftext|>" \
--override "<SEP>" "<|endoftext|>" \
--override "<MASK_TOKEN>" "<|endoftext|>" \
--override "<BOS_TOKEN>" "<|endoftext|>" \
--override "<EOS_TOKEN>" "<|endoftext|>" \
--override "<EOP_TOKEN>" "<|endoftext|>" \
--override "<|START_OF_TURN_TOKEN|>" "<|im_start|>" \
--override "<|END_OF_TURN_TOKEN|>" "<|im_end|>" \
--override "<|YES_TOKEN|>" "<|endoftext|>" \
--override "<|NO_TOKEN|>" "<|endoftext|>" \
--override "<|GOOD_TOKEN|>" "<|endoftext|>" \
--override "<|BAD_TOKEN|>" "<|endoftext|>" \
--override "<|USER_TOKEN|>" "user\n" \
--override "<|CHATBOT_TOKEN|>" "assistant\n" \
--override "<|SYSTEM_TOKEN|>" "system\n" \
--override "<|START_THINKING|>" "<think>" \
--override "<|END_THINKING|>" "</think>" \
--override "<|START_RESPONSE|>" "<|endoftext|>" \
--override "<|END_RESPONSE|>" "<|endoftext|>" \
--override "<|START_ACTION|>" "<tool_call>" \
--override "<|END_ACTION|>" "</tool_call>" \
--override "<|START_TOOL_RESULT|>" "<tool_response>" \
--override "<|END_TOOL_RESULT|>" "</tool_response>" \
--override "<|BEGINNING_OF_PREFIX_FIM_TOKEN|>" "<|fim_prefix|>" \
--override "<|BEGINNING_OF_MIDDLE_FIM_TOKEN|>" "<|fim_middle|>" \
--override "<|BEGINNING_OF_SUFFIX_FIM_TOKEN|>" "<|fim_suffix|>" \
--override "<|END_OF_MIDDLE_FIM_TOKEN|>" "<|fim_middle|>"
Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'command-a-03-2025'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'command-a-03-2025'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'... Done.
Input model configuration:
- Target vocabulary size : 256000 (used = 255033, unused = 967)
- Donor vocabulary size : 151936
- Donor num layers : 24 (tied embeddings = True)
- Donor hidden size : 896
- Donor attention heads : 14
- Donor intermediate size : 4864 (ratio = 1:5.4)
- Donor total parameters : 494032768 (0.49B)
-- Embedding parameters : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)
Processing 3 automatic token overrides:
✔ 'bos_token_id' : 5 '<BOS_TOKEN>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 255001 '<|END_OF_TURN_TOKEN|>' → [151645] '<|im_end|>'
✔ 'pad_token_id' : 0 '<PAD>' → [151643] '<|endoftext|>'
Processing 29 manual token overrides:
✔ 0 : '<PAD>' → [151643] '<|endoftext|>'
✔ 1 : '<UNK>' → [151643] '<|endoftext|>'
✔ 2 : '<CLS>' → [151643] '<|endoftext|>'
✔ 3 : '<SEP>' → [151643] '<|endoftext|>'
✔ 4 : '<MASK_TOKEN>' → [151643] '<|endoftext|>'
✔ 5 : '<BOS_TOKEN>' → [151643] '<|endoftext|>'
✔ 6 : '<EOS_TOKEN>' → [151643] '<|endoftext|>'
✔ 7 : '<EOP_TOKEN>' → [151643] '<|endoftext|>'
✔ 255000 : '<|START_OF_TURN_TOKEN|>' → [151644] '<|im_start|>'
✔ 255001 : '<|END_OF_TURN_TOKEN|>' → [151645] '<|im_end|>'
✔ 255002 : '<|YES_TOKEN|>' → [151643] '<|endoftext|>'
✔ 255003 : '<|NO_TOKEN|>' → [151643] '<|endoftext|>'
✔ 255004 : '<|GOOD_TOKEN|>' → [151643] '<|endoftext|>'
✔ 255005 : '<|BAD_TOKEN|>' → [151643] '<|endoftext|>'
✔ 255006 : '<|USER_TOKEN|>' → [872, 198] 'user\n'
✔ 255007 : '<|CHATBOT_TOKEN|>' → [77091, 198] 'assistant\n'
✔ 255008 : '<|SYSTEM_TOKEN|>' → [8948, 198] 'system\n'
✔ 255019 : '<|START_THINKING|>' → [13708, 766, 29] '<think>'
✔ 255020 : '<|END_THINKING|>' → [522, 26865, 29] '</think>'
✔ 255021 : '<|START_RESPONSE|>' → [151643] '<|endoftext|>'
✔ 255022 : '<|END_RESPONSE|>' → [151643] '<|endoftext|>'
✔ 255023 : '<|START_ACTION|>' → [151657] '<tool_call>'
✔ 255024 : '<|END_ACTION|>' → [151658] '</tool_call>'
✔ 255025 : '<|START_TOOL_RESULT|>' → [27, 14172, 9655, 29] '<tool_response>'
✔ 255026 : '<|END_TOOL_RESULT|>' → [522, 14172, 9655, 29] '</tool_response>'
✔ 255029 : '<|BEGINNING_OF_PREFIX_FIM_TOKEN|>' → [151659] '<|fim_prefix|>'
✔ 255030 : '<|BEGINNING_OF_MIDDLE_FIM_TOKEN|>' → [151660] '<|fim_middle|>'
✔ 255031 : '<|BEGINNING_OF_SUFFIX_FIM_TOKEN|>' → [151661] '<|fim_suffix|>'
✔ 255032 : '<|END_OF_MIDDLE_FIM_TOKEN|>' → [151660] '<|fim_middle|>'
NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...
Transplanting tokens: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 255033/255033 [01:58<00:00, 2145.75token/s]
Transplant mappings:
- 1 to 1 : 87077 (34%)
- 2 to 1 : 117015 (46%)
- 3 to 1 : 33833 (13%)
- 4 to 1 : 10325 (4%)
- 5 to 1 : 3415 (1.3%)
- 6 to 1 : 1486 (0.58%)
- 7 to 1 : 761 (0.3%)
- 8 to 1 : 440 (0.17%)
- 9 to 1 : 302 (0.12%)
- 10 to 1 : 177 (0.069%)
- 11 to 1 : 88 (0.035%)
- 12 to 1 : 47 (0.018%)
- 13 to 1 : 28 (0.011%)
- 14 to 1 : 15 (0.0059%)
- 15 to 1 : 8 (0.0031%)
- 16 to 1 : 6 (0.0024%)
- 17 to 1 : 1 (0.00039%)
- 18 to 1 : 3 (0.0012%)
- 19 to 1 : 1 (0.00039%)
- 21 to 1 : 2 (0.00078%)
- 36 to 1 : 1 (0.00039%)
- 37 to 1 : 1 (0.00039%)
- 39 to 1 : 1 (0.00039%)
Head initialized with:
- Copies : 87077 (34%)
- Means : 167956 (66%)
- Zeros : 967 (0.38%)
Output model configuration:
- Output vocabulary size : 256000
- Output num layers : 24 (tied embeddings = False)
- Output hidden size : 896
- Output attention heads : 14
- Output intermediate size : 4864 (ratio = 1:5.4)
- Output total parameters : 816650112 (0.82B)
-- Embedding parameters : 458752000 (0.46B)
-- Non-embedding parameters : 357898112 (0.36B)
Saving model and tokenizer to 'command-a-03-2025-DRAFT-0.8B-UNTRAINED' folder
Patching 'torch_dtype' in 'command-a-03-2025-DRAFT-0.8B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype
Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
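The mapping statistics above summarize how the new 256000-token embedding matrix is initialized: target tokens whose text re-encodes to a single donor token get a direct copy of that donor embedding ("Copies", 34%), tokens that re-encode to several donor tokens get the mean of those embeddings ("Means", 66%), and the 967 unused slots stay zeroed ("Zeros"). A rough Python sketch of that idea follows; it is not the actual transplant_vocab.py implementation (which also handles the manual overrides and initializes the untied lm_head separately), and it assumes torch plus both tokenizers loaded via transformers:

import torch

def transplant_embeddings(donor_embed, donor_tok, target_tok, target_vocab_size):
    """Rough sketch of the embedding transplant reported in the log above.

    - 1-to-1 mappings copy the donor embedding directly ("Copies").
    - n-to-1 mappings average the donor token embeddings ("Means").
    - Unused or undecodable target slots stay zero-initialized ("Zeros").
    """
    hidden_size = donor_embed.shape[1]
    new_embed = torch.zeros(target_vocab_size, hidden_size, dtype=donor_embed.dtype)
    for target_id in range(target_vocab_size):
        text = target_tok.decode([target_id])
        donor_ids = donor_tok.encode(text, add_special_tokens=False)
        if len(donor_ids) == 1:
            new_embed[target_id] = donor_embed[donor_ids[0]]           # copy
        elif len(donor_ids) > 1:
            new_embed[target_id] = donor_embed[donor_ids].mean(dim=0)  # mean
        # else: unused slot remains zero
    return new_embed

The untied copy used for lm_head.weight is why the output model reports tied embeddings = False and why roughly 0.46B of its 0.82B parameters sit in the embeddings and head.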
The model was then trained on the datasets below (output field only for the instruction data), formatted just between <|END_OF_TURN_TOKEN|> tags, using the following configuration:
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================
model_dir = 'models/command-a-03-2025-DRAFT-0.8B-UNTRAINED'
output_dir = 'finetuned'
# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================
full_fine_tune = true
# =======================
# OPTIMIZER CONFIGURATION
# =======================
lr = 5e-5
# ======================
# TRAINING CONFIGURATION
# ======================
sequence_len = 32768
gradient_accumulation_steps = 10 # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step
# =====================
# DATASET CONFIGURATION
# =====================
drop_tails = true
[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
I used six RTX A6000 GPUs across three nodes, hence the effective batch size of 60 (6 GPUs x 10 gradient accumulation steps = 60).
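For reference, the batch-size arithmetic from the comment above as a tiny sketch (the per-GPU micro-batch of 1 is an assumption implied by the "6 x 10 = 60" figure):

# Effective batch size and tokens per optimizer step for the run above.
num_gpus = 6              # six RTX A6000s across three nodes
grad_accum_steps = 10     # gradient_accumulation_steps from the config
sequence_len = 32768      # sequence_len from the config
micro_batch_per_gpu = 1   # assumption implied by "6 x 10 = 60"

effective_batch = num_gpus * grad_accum_steps * micro_batch_per_gpu
tokens_per_step = effective_batch * sequence_len

print(effective_batch)    # 60 sequences per optimizer step
print(tokens_per_step)    # 1966080, i.e. roughly 2M tokens per step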