🚨 UNDERFITTING: When your AI pretends to learn 🤖🔥
📖 Definition
Underfitting = your AI model hasn't learned enough from its training data to be useful, not even on examples it has already seen.
Signs:
- High train_loss (poor memorization)
- High eval_loss (poor generalization)
- Garbage responses (character spam, weird repeated symbols)
- Disappointing performance on known AND unknown data
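A minimal diagnostic sketch based on those two losses (the 2.0 threshold is illustrative for NLP losses, not a universal constant):

def diagnose(train_loss, eval_loss, high_loss=2.0):
    # Underfitting: the model struggles even on data it has already seen
    if train_loss > high_loss and eval_loss > high_loss:
        return "likely underfitting"
    # Overfitting: memorizes training data but fails to generalize
    if eval_loss > train_loss * 1.5:
        return "likely overfitting"
    return "losses look OK, but always verify with real prompts"

print(diagnose(train_loss=2.8, eval_loss=2.9))  # illustrative values -> "likely underfitting"

As the tutorial below shows, losses can look acceptable on paper while the outputs are still garbage, so pair any numeric check with manual prompts.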
⚡ Advantages / Disadvantages / Limitations
✅ "Advantages" (if we can call them that)
- No overfitting (at least there's that...)
- Fast training (but useless)
- Low resource consumption
❌ Disadvantages
- Unusable model in production
- Waste of time/money on compute
- Maximum developer frustration
- Catastrophic performance
⚠️ Limitations
- Often detected late (only after a full training run)
- Easy to confuse with other issues (data quality, preprocessing bugs)
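To soften the late-detection problem, you can flag a suspiciously high eval loss while training is still running. A minimal sketch using a Hugging Face TrainerCallback (the 2.0 threshold is an illustrative assumption):

from transformers import TrainerCallback

class UnderfitWarning(TrainerCallback):
    # Warns as soon as an evaluation reports an eval_loss above the threshold
    def __init__(self, loss_threshold=2.0):
        self.loss_threshold = loss_threshold

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        eval_loss = (metrics or {}).get("eval_loss")
        if eval_loss is not None and eval_loss > self.loss_threshold:
            print(f"[step {state.global_step}] eval_loss={eval_loss:.2f} "
                  f"still above {self.loss_threshold}: possible underfitting, inspect sample outputs")

# Usage: Trainer(..., callbacks=[UnderfitWarning()])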
🛠️ Practical tutorial: My real case
📌 Setup
- Model: GPT-2 Small (124M parameters)
- Dataset: 80 MB, 125,705 texts, 123,518 Q&A
- Config: 1 epoch, LR=5e-5, batch_size=8, max_length=512
- Hardware: GTX 1080 Ti, Ryzen 5600G, 48 GB RAM
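For context, here is roughly how a Q&A text dataset can be tokenized for GPT-2 with max_length=512. This is a sketch, not the exact preprocessing used here, and it assumes a Hugging Face datasets object with a "text" column:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()  # causal LM: the labels are the inputs
    return enc

# train_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])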
📊 Results obtained
train_loss: 1.63
eval_loss: 1.42
perplexity: 4.16
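Note that the perplexity here is just the exponential of the eval cross-entropy loss, easy to check by hand:

import math

eval_loss = 1.42
print(f"perplexity ≈ {math.exp(eval_loss):.2f}")  # ≈ 4.14, roughly matching the reported 4.16 up to rounding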
🧪 Real-world testing
Input: "Hi there!"
Output: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
Input: "What's DHCP?"
Output: "DHCP ??????????????????????????????"
Input: "Natural satellites of Earth?"
Output: "Earth???????????????????????????"
Verdict: 🚨 UNDERFITTING CONFIRMED
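A sketch of this kind of smoke test, assuming the fine-tuned checkpoint was saved under ./results (the path and sampling settings are illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="./results", tokenizer="gpt2")

for prompt in ["Hi there!", "What's DHCP?", "Natural satellites of Earth?"]:
    out = generator(prompt, max_new_tokens=60, do_sample=True, top_p=0.9)
    print(prompt, "->", out[0]["generated_text"])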
💡 Concrete examples
Typical underfitting cases
- Insufficient epochs (1 instead of 6-10)
- Learning rate too low (1e-6 instead of 5e-5)
- Dataset too complex for the model
- Inadequate architecture (model too simple)
Affected models
- GPT-2 Small trained 1 epoch
- BERT Base on a massive dataset (1 epoch)
- Custom transformers that are too small for their task
📋 Cheat sheet: Diagnosing underfitting
📉 Warning signals
- Train_loss > 2.0 (for NLP)
- Eval_loss close to train_loss while both stay high (the model isn't learning, not just failing to memorize)
- Perplexity > 10 (confused model)
- Repetitive/incoherent outputs
🛠️ Solutions
- More epochs (6-10 minimum)
- Higher learning rate (5e-5 → 1e-4)
- More data (if possible)
- More complex architecture
⚙️ Recommended config
epochs: 6-10
learning_rate: 5e-5 to 1e-4
warmup: 10%
batch_size: 8-16
max_length: 512
💻 Code example
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=6,              # 6 epochs instead of 1
    learning_rate=5e-5,
    warmup_ratio=0.1,                # 10% warmup
    per_device_train_batch_size=8,
    evaluation_strategy="steps",     # evaluate during training, not only at the end
    eval_steps=500,
    save_steps=1000,
    logging_steps=100
)

trainer = Trainer(
    model=model,                     # e.g. a GPT2LMHeadModel
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
📝 Summary
Underfitting = under-trained model producing catastrophic results. 1 epoch on 125k examples = recipe for failure. Solution: more epochs, metrics monitoring, patience.
🎯 Conclusion
My underfitted GPT-2 taught me a valuable lesson: data quantity doesn't replace training quality. Next step: 6 epochs, tight monitoring, enriched InfiniGPT dataset.
❓ Q&A
Q: How many epochs minimum to avoid underfitting? A: 6-10 epochs for a 125k-example dataset. Watch the loss curve to adjust.
Q: My model spams characters, is it necessarily underfitting?
A: Yes, very likely. Also check data quality and tokenization.
Q: How to differentiate underfitting from bad data? A: Test on a clean mini-dataset. If it works, it's the data. If it fails, it's underfitting.
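A sketch of that sanity check, assuming you keep a small hand-verified subset (the example pairs are illustrative): rerun the exact same training config on it; clean outputs point to a data problem, garbage outputs point to real underfitting.

from datasets import Dataset

# A tiny, hand-checked Q&A subset (illustrative examples)
clean_pairs = [
    {"text": "Q: What's DHCP?\nA: DHCP automatically assigns IP addresses to hosts on a network."},
    {"text": "Q: Natural satellites of Earth?\nA: The Moon is Earth's only natural satellite."},
]
mini_dataset = Dataset.from_list(clean_pairs)

# Reuse the same tokenization and Trainer setup on mini_dataset,
# then compare the generated answers against these known-good ones.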
🤔 Did you know?
Underfitting was the main problem of early neural networks in the 1960s! Researchers thought their models were "too dumb" to learn, when in reality they simply didn't have enough computing power to do more than 1-2 epochs. It took GPUs and the 2010s to discover that these architectures were viable with sufficient training! 🚀
Théo CHARLET
IT Systems & Networks Student - AI/ML Specialization
Creator of AG-BPE (Attention-Guided Tokenization)
🔗 LinkedIn: https://www.linkedin.com/in/théo-charlet
🚀 Seeking internship opportunities