SmolLM3-3B — Spanish Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)
Model Description
This model is a Supervised Fine-Tuned (SFT) version of:
HuggingFaceTB/SmolLM3-3B
Fine-tuned on the Spanish (es) split of:
DGurgurov/Nemotron-Multilingual-Reasoning
The goal of this training run was to improve:
- Spanish instruction following
- multi-step reasoning
- conversational behavior
- long-context understanding
Training used structured chat conversations and completion-only loss, meaning only the assistant responses were optimized.
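As a rough illustration of what completion-only loss means in practice (a sketch of the idea, not the training code used for this run): prompt tokens are masked to -100 in the labels so that cross-entropy is computed only over the assistant's tokens. Chat templates differ in detail, and SFT frameworks typically handle this masking automatically.

```python
# Sketch of completion-only loss: mask prompt tokens with -100 so the loss
# covers only the assistant response. Illustrative only; not the training code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [
    {"role": "system", "content": "Eres un asistente útil."},
    {"role": "user", "content": "Explica la fotosíntesis paso a paso."},
    {"role": "assistant", "content": "Claro. Primero, la planta absorbe luz solar..."},
]

# Token ids for the full conversation, and for the prompt-only prefix.
full_ids = tokenizer.apply_chat_template(messages, tokenize=True)
prompt_ids = tokenizer.apply_chat_template(
    messages[:-1], tokenize=True, add_generation_prompt=True
)

# Labels: ignore (-100) every prompt token; keep real ids only for the completion.
# This assumes the prompt tokenization is a prefix of the full tokenization,
# which holds for typical chat templates.
labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
```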
Key Characteristics
- Base model: SmolLM3-3B
- Language specialization: Spanish
- Context length during training: 16,384 tokens
- Chat-format training
- Packed sequences
- Long-context reasoning tuning
Intended Uses
Suitable
- Spanish conversational assistants
- tutoring or educational assistants
- reasoning and explanation tasks
- document question answering
- research on efficient small LLMs
Not Suitable
- legal or medical advice
- autonomous decision making
- safety-critical systems
- high-risk financial use
Training Data
Dataset:
DGurgurov/Nemotron-Multilingual-Reasoning
Processing configuration:
- Language filter: Spanish only
- Converted to chat messages (prepare_messages=True)
- Assistant-only optimization (completion_only_loss=True)
User and system messages were masked during training.
Consult the dataset card for data sources and limitations.
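A rough sketch of this preparation step follows. Whether the Spanish data lives in a separate config/split or a language column is an assumption here, as are the column names ("language", "prompt", "response"); check the dataset card for the actual schema.

```python
# Sketch of the preprocessing described above; column names and split layout
# are assumptions, not the dataset's documented schema.
from datasets import load_dataset

ds = load_dataset("DGurgurov/Nemotron-Multilingual-Reasoning", split="train")

# Keep only the Spanish examples.
ds_es = ds.filter(lambda ex: ex["language"] == "es", num_proc=16)

# Convert each row into the chat-message format expected by SFT trainers.
def to_messages(ex):
    return {
        "messages": [
            {"role": "user", "content": ex["prompt"]},
            {"role": "assistant", "content": ex["response"]},
        ]
    }

ds_es = ds_es.map(to_messages, num_proc=16, remove_columns=ds_es.column_names)
```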
Training Procedure
Training was performed with Hugging Face Accelerate using Fully Sharded Data Parallel (FSDP) across 8 processes. The settings below are summarized in a configuration sketch at the end of this section.
Core Setup
- Method: Supervised fine-tuning (SFT)
- Epochs: 3
- Maximum sequence length: 16,384 tokens
- Sequence packing: enabled
- Precision: bfloat16
- Gradient checkpointing: enabled
- Liger kernel: enabled
- Distributed training: FSDP
Optimization
- Optimizer: adamw_torch_fused
- Batch size per device: 4
- Gradient accumulation steps: 4
- Effective batch size per GPU: 16 sequences per step (4 × 4 accumulation); 128 across all 8 GPUs
- Weight decay: 0.05
Learning rate schedule:
- Scheduler: cosine_with_min_lr
- Warmup ratio: 0.05
- Minimum LR: 5e-6
Logging & Checkpoints
- Logging every 5 steps
- Checkpoint every 450 steps
- Weights & Biases tracking
- Token accuracy logged during training
Data Processing
- Dataset preprocessing workers: 16
- Chat formatting enabled
- Dataset preparation enabled
- Language split: es
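For orientation, the settings above map roughly onto a recent TRL SFTConfig as sketched below. The actual sft/sft_trainer.py script is not reproduced here, and FSDP itself is configured through the accelerate launcher (see Reproducibility), so treat this as an approximation rather than the exact configuration used.

```python
# Approximate mapping of the hyperparameters above onto TRL's SFTConfig
# (recent TRL; output_dir is an assumption). FSDP is set up via accelerate.
from trl import SFTConfig

config = SFTConfig(
    output_dir="smollm3-3b-es-sft",
    num_train_epochs=3,
    max_length=16384,                # maximum packed sequence length
    packing=True,
    bf16=True,
    gradient_checkpointing=True,
    use_liger_kernel=True,
    completion_only_loss=True,       # optimize assistant tokens only
    optim="adamw_torch_fused",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    weight_decay=0.05,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 5.0e-6},
    warmup_ratio=0.05,
    logging_steps=5,
    save_strategy="steps",
    save_steps=450,
    report_to="wandb",
)
```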
Usage
Transformers Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Eres un asistente útil."},
    {"role": "user", "content": "¿Por qué el cielo es azul?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Important:
Use apply_chat_template() when prompting. The model was trained on chat-formatted conversations and performance will degrade without it.
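To see what the template actually adds (role markers and special tokens the model saw during fine-tuning), you can render the prompt as plain text before tokenizing. This snippet reuses the `tokenizer` loaded in the example above and is purely illustrative.

```python
# Inspect the chat-formatted prompt (reuses `tokenizer` from the example above).
rendered = tokenizer.apply_chat_template(
    [{"role": "user", "content": "¿Por qué el cielo es azul?"}],  # "Why is the sky blue?"
    tokenize=False,
    add_generation_prompt=True,
)
print(rendered)  # shows the role markers and special tokens the model expects
```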
Evaluation
During training, token accuracy was logged as a diagnostic metric.
Token accuracy:
- monitors training stability
- is not a benchmark
- does not measure reasoning ability
For meaningful evaluation, use:
- instruction-following benchmarks
- reasoning datasets
- long-context tasks
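As one possible route (no benchmark results are reported in this card), EleutherAI's lm-evaluation-harness can be run against the checkpoint. The task names and API usage below are illustrative assumptions; Spanish or multilingual tasks from the harness's task list would be more appropriate for this model.

```python
# One possible evaluation route (not results from this card): run standard
# benchmarks with lm-evaluation-harness. Task names here are placeholders;
# substitute Spanish/multilingual tasks from the harness's task registry.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=YOUR_USERNAME/YOUR_MODEL_REPO,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```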
Limitations
- May hallucinate incorrect information
- Reasoning chains may contain logical errors
- Performance near 16k tokens depends heavily on prompt structure
- As a small (3B) model, it has weaker world knowledge than larger LLMs
- Not suitable for safety-critical deployment
Bias & Safety
The model inherits biases from:
- the base model
- the training dataset
Recommended mitigations:
- moderation filtering
- safety-oriented system prompts
- human review for sensitive applications
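As an illustration of the system-prompt mitigation (the wording below is an assumption, not a vetted safety policy):

```python
# Illustrative safety-oriented system prompt (not a vetted policy).
# English gloss: "You are a helpful assistant. Refuse harmful requests and do
# not give medical, legal, or financial advice; recommend consulting a professional."
messages = [
    {
        "role": "system",
        "content": (
            "Eres un asistente útil. Rechaza solicitudes dañinas y no des consejos "
            "médicos, legales ni financieros; recomienda consultar a un profesional."
        ),
    },
    {"role": "user", "content": "¿Puedes ayudarme a resumir este documento?"},
]
```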
License
This is a derivative model of:
HuggingFaceTB/SmolLM3-3B
The original base model license and restrictions apply, along with dataset terms.
Verify compatibility before commercial use.
Reproducibility (Training Arguments)
```bash
accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
  --model_name HuggingFaceTB/SmolLM3-3B \
  --tokenizer_name HuggingFaceTB/SmolLM3-3B \
  --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
  --skip_prepare_dataset False \
  --lang_split es \
  --prepare_messages True \
  --completion_only_loss True \
  --max_length 16384 \
  --dataset_num_proc 16 \
  --packing True \
  --use_liger_kernel True \
  --bf16 True \
  --log_token_accuracy True \
  --optim adamw_torch_fused \
  --gradient_checkpointing True \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --ddp_find_unused_parameters False \
  --lr_scheduler_type cosine_with_min_lr \
  --lr_scheduler_kwargs '{"min_lr": 5.0e-6}' \
  --warmup_ratio 0.05 \
  --weight_decay 0.05 \
  --report_to wandb \
  --run_name smol_3b_3epochs_lns_es \
  --num_train_epochs 3 \
  --save_strategy steps \
  --logging_steps 5 \
  --save_steps 450
```
Citation
If you use this model, please cite:
- HuggingFaceTB/SmolLM3-3B
- DGurgurov/Nemotron-Multilingual-Reasoning
Acknowledgements
- HuggingFaceTB — SmolLM3 base model
- Nemotron Multilingual Reasoning dataset authors
- HuggingFace Accelerate and Transformers libraries