SmolLM3-3B — Spanish Reasoning Instruction Fine-Tune (Nemotron Multilingual Reasoning)

Model Description

This model is a Supervised Fine-Tuned (SFT) version of:

HuggingFaceTB/SmolLM3-3B

Fine-tuned on the Spanish (es) split of:

DGurgurov/Nemotron-Multilingual-Reasoning

The goal of this training run was to improve:

  • Spanish instruction following
  • multi-step reasoning
  • conversational behavior
  • long-context understanding

Training used structured chat conversations with completion-only loss, meaning the loss was computed only on assistant-response tokens.
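
To illustrate what completion-only loss means in practice, the sketch below masks every non-assistant token with the label -100 so PyTorch's cross-entropy ignores it. This is a toy illustration, not the actual training code; the helper name and the token values are made up for the example.

IGNORE_INDEX = -100  # positions with this label are ignored by cross-entropy

def mask_non_assistant_tokens(input_ids, assistant_mask):
    """assistant_mask[i] is True when token i belongs to an assistant turn."""
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(input_ids, assistant_mask)]

# Toy example: six tokens, the last three produced by the assistant.
labels = mask_non_assistant_tokens(
    [11, 22, 33, 44, 55, 66],
    [False, False, False, True, True, True],
)
print(labels)  # [-100, -100, -100, 44, 55, 66]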

Key Characteristics

  • Base model: SmolLM3-3B
  • Language specialization: Spanish
  • Context length during training: 16,384 tokens
  • Chat-format training
  • Packed sequences
  • Long-context reasoning tuning

Intended Uses

Suitable

  • Spanish conversational assistants
  • tutoring or educational assistants
  • reasoning and explanation tasks
  • document question answering
  • research on efficient small LLMs

Not Suitable

  • legal or medical advice
  • autonomous decision making
  • safety-critical systems
  • high-risk financial use

Training Data

Dataset:

DGurgurov/Nemotron-Multilingual-Reasoning

Processing configuration:

  • Language filter: Spanish only
  • Converted to chat messages (prepare_messages=True)
  • Assistant-only optimization (completion_only_loss=True)

User and system messages were masked during training.

Consult the dataset card for data sources and limitations.
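
A minimal sketch of this filtering and chat conversion is shown below. It assumes the Spanish data is selected via a dataset configuration and that rows expose prompt/response style fields; the actual layout and column names are assumptions, so check the dataset card before running.

from datasets import load_dataset

# Load the Spanish portion; whether "es" is a config, a split, or a column
# depends on the dataset layout (this sketch assumes a config named "es").
ds = load_dataset("DGurgurov/Nemotron-Multilingual-Reasoning", "es", split="train")

# Convert each row into chat messages; "prompt" and "response" are assumed
# field names, not confirmed by the dataset card.
def to_messages(row):
    return {"messages": [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["response"]},
    ]}

ds = ds.map(to_messages, num_proc=16)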


Training Procedure

Training was performed using HuggingFace Accelerate with Fully Sharded Data Parallel (FSDP) across 8 processes.

Core Setup

  • Method: Supervised fine-tuning (SFT)
  • Epochs: 3
  • Maximum sequence length: 16,384 tokens
  • Sequence packing: enabled
  • Precision: bfloat16
  • Gradient checkpointing: enabled
  • Liger kernel: enabled
  • Distributed training: FSDP

Optimization

  • Optimizer: adamw_torch_fused
  • Batch size per device: 4
  • Gradient accumulation steps: 4
  • Effective batch size: 16 sequences per GPU per step (128 sequences globally across 8 GPUs)
  • Weight decay: 0.05

Learning rate schedule (summarized in the configuration sketch after this list):

  • Scheduler: cosine_with_min_lr
  • Warmup ratio: 0.05
  • Minimum LR: 5e-6
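
The optimization and training settings above map onto TRL's SFTConfig. The following is an approximate sketch, not the project's actual sft_trainer.py; some field names (for example max_length, which was max_seq_length in older TRL releases, and completion_only_loss) vary across TRL versions, so verify them against your installation.

from trl import SFTConfig

config = SFTConfig(
    output_dir="smol_3b_3epochs_lns_es",   # assumed; any output path works
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="adamw_torch_fused",
    weight_decay=0.05,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 5.0e-6},
    warmup_ratio=0.05,
    bf16=True,
    gradient_checkpointing=True,
    use_liger_kernel=True,
    max_length=16384,                      # max_seq_length in older TRL versions
    packing=True,
    completion_only_loss=True,
    dataset_num_proc=16,
    logging_steps=5,
    save_strategy="steps",
    save_steps=450,
    report_to="wandb",
    run_name="smol_3b_3epochs_lns_es",
)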

Logging & Checkpoints

  • Logging every 5 steps
  • Checkpoint every 450 steps
  • Weights & Biases tracking
  • Token accuracy logged during training

Data Processing

  • Dataset preprocessing workers: 16
  • Chat formatting enabled
  • Dataset preparation enabled
  • Language split: es

Usage

Transformers Example

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "Eres un asistente útil."},
    {"role": "user", "content": "¿Por qué el cielo es azul?"}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Important:
Use apply_chat_template() when prompting. The model was trained on chat-formatted conversations and performance will degrade without it.
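
For quick experiments, recent transformers versions also accept chat messages directly through the text-generation pipeline, which applies the chat template for you. A minimal sketch using the same placeholder repository name as above:

from transformers import pipeline
import torch

generator = pipeline(
    "text-generation",
    model="YOUR_USERNAME/YOUR_MODEL_REPO",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Eres un asistente útil."},          # "You are a helpful assistant."
    {"role": "user", "content": "Explica la fotosíntesis en tres pasos."},  # "Explain photosynthesis in three steps."
]

out = generator(messages, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(out[0]["generated_text"][-1]["content"])  # last message is the generated assistant reply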


Evaluation

During training, token accuracy was logged as a diagnostic metric.

Token accuracy:

  • monitors training stability
  • is not a benchmark
  • does not measure reasoning ability

For meaningful evaluation, use the following (a minimal evaluation sketch appears after this list):

  • instruction-following benchmarks
  • reasoning datasets
  • long-context tasks
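
As one possible starting point, the sketch below runs the model through EleutherAI's lm-evaluation-harness Python API. The task name is a placeholder assumption, not a benchmark used for this model; consult the harness's task registry for Spanish or multilingual benchmarks that fit your use case.

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=YOUR_USERNAME/YOUR_MODEL_REPO,dtype=bfloat16",
    tasks=["belebele_spa_Latn"],   # placeholder; pick tasks from lm-eval's task list
    batch_size=8,
)
print(results["results"])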

Limitations

  • May hallucinate incorrect information
  • Reasoning chains may contain logical errors
  • Performance near 16k tokens depends heavily on prompt structure
  • As a smaller model, it has weaker world knowledge than larger LLMs
  • Not suitable for safety-critical deployment

Bias & Safety

The model inherits biases from:

  • the base model
  • the training dataset

Recommended mitigations:

  • moderation filtering
  • safety-oriented system prompts
  • human review for sensitive applications

License

This is a derivative model of:

HuggingFaceTB/SmolLM3-3B

The original base model license and restrictions apply, along with dataset terms.

Verify compatibility before commercial use.


Reproducibility (Training Arguments)

accelerate launch --use_fsdp --num_processes 8 --config_file sft/my_config.yaml sft/sft_trainer.py \
    --model_name HuggingFaceTB/SmolLM3-3B \
    --tokenizer_name HuggingFaceTB/SmolLM3-3B \
    --dataset_path DGurgurov/Nemotron-Multilingual-Reasoning \
    --skip_prepare_dataset False \
    --lang_split es \
    --prepare_messages True \
    --completion_only_loss True \
    --max_length 16384 \
    --dataset_num_proc 16 \
    --packing True \
    --use_liger_kernel True \
    --bf16 True \
    --log_token_accuracy True \
    --optim adamw_torch_fused \
    --gradient_checkpointing True \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --ddp_find_unused_parameters False \
    --lr_scheduler_type cosine_with_min_lr \
    --lr_scheduler_kwargs '{"min_lr": 5.0e-6}' \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --report_to wandb \
    --run_name smol_3b_3epochs_lns_es \
    --num_train_epochs 3 \
    --save_strategy steps \
    --logging_steps 5 \
    --save_steps 450

Citation

If you use this model, please cite:

  • HuggingFaceTB/SmolLM3-3B
  • DGurgurov/Nemotron-Multilingual-Reasoning

Acknowledgements

  • HuggingFaceTB — SmolLM3 base model
  • Nemotron Multilingual Reasoning dataset authors
  • HuggingFace Accelerate and Transformers libraries