🚨 UNDERFITTING: When your AI pretends to learn 🤖🔥
📖 Definition
Underfitting = your AI model hasn't learned enough from its training data to be useful, not even on examples it has already seen.
Signs:
- High train_loss (poor memorization)
- High eval_loss (poor generalization)
- Garbage responses (character spam, weird repeated symbols)
- Disappointing performance on known AND unknown data
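A minimal diagnostic sketch based on those two losses (the 2.0 threshold is illustrative for NLP losses, not a universal constant):

def diagnose(train_loss, eval_loss, high_loss=2.0):
    # Underfitting: the model struggles even on data it has already seen
    if train_loss > high_loss and eval_loss > high_loss:
        return "likely underfitting"
    # Overfitting: memorizes training data but fails to generalize
    if eval_loss > train_loss * 1.5:
        return "likely overfitting"
    return "losses look OK, but always verify with real prompts"

print(diagnose(train_loss=2.8, eval_loss=2.9))  # illustrative values -> "likely underfitting"

As the tutorial below shows, losses can look acceptable on paper while the outputs are still garbage, so pair any numeric check with manual prompts.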
⚡ Advantages / Disadvantages / Limitations
✅ "Advantages" (if we can call them that)
- No overfitting (at least there's that...)
- Fast training (but useless)
- Low resource consumption
❌ Disadvantages
- Unusable model in production
- Waste of time/money on compute
- Maximum developer frustration
- Catastrophic performance
⚠️ Limitations
- Often detected late (only after a full training run)
- Easy to confuse with other issues (data quality, preprocessing bugs)
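To soften the late-detection problem, you can flag a suspiciously high eval loss while training is still running. A minimal sketch using a Hugging Face TrainerCallback (the 2.0 threshold is an illustrative assumption):

from transformers import TrainerCallback

class UnderfitWarning(TrainerCallback):
    # Warns as soon as an evaluation reports an eval_loss above the threshold
    def __init__(self, loss_threshold=2.0):
        self.loss_threshold = loss_threshold

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        eval_loss = (metrics or {}).get("eval_loss")
        if eval_loss is not None and eval_loss > self.loss_threshold:
            print(f"[step {state.global_step}] eval_loss={eval_loss:.2f} "
                  f"still above {self.loss_threshold}: possible underfitting, inspect sample outputs")

# Usage: Trainer(..., callbacks=[UnderfitWarning()])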
🛠️ Practical tutorial: My real case
📌 Setup
- Model: GPT-2 Small (124M parameters)
- Dataset: 80 MB, 125,705 texts, 123,518 Q&A
- Config: 1 epoch, LR=5e-5, batch_size=8, max_length=512
- Hardware: GTX 1080 Ti, Ryzen 5600G, 48 GB RAM
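For context, here is roughly how a Q&A text dataset can be tokenized for GPT-2 with max_length=512. This is a sketch, not the exact preprocessing used here, and it assumes a Hugging Face datasets object with a "text" column:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()  # causal LM: the labels are the inputs
    return enc

# train_dataset = raw_dataset.map(tokenize, batched=True, remove_columns=["text"])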
📊 Results obtained
train_loss: 1.63
eval_loss: 1.42
perplexity: 4.16
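Note that the perplexity here is just the exponential of the eval cross-entropy loss, easy to check by hand:

import math

eval_loss = 1.42
print(f"perplexity ≈ {math.exp(eval_loss):.2f}")  # ≈ 4.14, roughly matching the reported 4.16 up to rounding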
🧪 Real-world testing
Input: "Hi there!"
Output: "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
Input: "What's DHCP?"
Output: "DHCP ??????????????????????????????"
Input: "Natural satellites of Earth?"
Output: "Earth???????????????????????????"
Verdict: 🚨 UNDERFITTING CONFIRMED
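A sketch of this kind of smoke test, assuming the fine-tuned checkpoint was saved under ./results (the path and sampling settings are illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="./results", tokenizer="gpt2")

for prompt in ["Hi there!", "What's DHCP?", "Natural satellites of Earth?"]:
    out = generator(prompt, max_new_tokens=60, do_sample=True, top_p=0.9)
    print(prompt, "->", out[0]["generated_text"])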
💡 Concrete examples
Typical underfitting cases
- Insufficient epochs (1 instead of 6-10)
- Learning rate too low (1e-6 instead of 5e-5)
- Dataset too complex for the model
- Inadequate architecture (model too simple)
Affected models
- GPT-2 Small trained 1 epoch
- BERT Base on a massive dataset (1 epoch)
- Custom transformers that are too small for their task
📋 Cheat sheet: Diagnosing underfitting
📉 Warning signals
- Train_loss > 2.0 (for NLP)
- Eval_loss close to train_loss while both stay high (the model isn't learning, not just failing to memorize)
- Perplexity > 10 (confused model)
- Repetitive/incoherent outputs
🛠️ Solutions
- More epochs (6-10 minimum)
- Higher learning rate (5e-5 → 1e-4)
- More data (if possible)
- More complex architecture
⚙️ Recommended config
epochs: 6-10
learning_rate: 5e-5 to 1e-4
warmup: 10%
batch_size: 8-16
max_length: 512
💻 Code example
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=6,              # 6 epochs instead of 1
    learning_rate=5e-5,
    warmup_ratio=0.1,                # 10% warmup
    per_device_train_batch_size=8,
    evaluation_strategy="steps",     # evaluate during training, not only at the end
    eval_steps=500,
    save_steps=1000,
    logging_steps=100
)

trainer = Trainer(
    model=model,                     # e.g. a GPT2LMHeadModel
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
📝 Summary
Underfitting = under-trained model producing catastrophic results. 1 epoch on 125k examples = recipe for failure. Solution: more epochs, metrics monitoring, patience.
🎯 Conclusion
My underfitted GPT-2 taught me a valuable lesson: data quantity doesn't replace training quality. Next step: 6 epochs, tight monitoring, enriched InfiniGPT dataset.
❓ Q&A
Q: How many epochs minimum to avoid underfitting? A: 6-10 epochs for a 125k-example dataset. Watch the loss curve to adjust.
Q: My model spams characters, is it necessarily underfitting?
A: Yes, very likely. Also check data quality and tokenization.
Q: How to differentiate underfitting from bad data? A: Test on a clean mini-dataset. If it works, it's the data. If it fails, it's underfitting.
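A sketch of that sanity check, assuming you keep a small hand-verified subset (the example pairs are illustrative): rerun the exact same training config on it; clean outputs point to a data problem, garbage outputs point to real underfitting.

from datasets import Dataset

# A tiny, hand-checked Q&A subset (illustrative examples)
clean_pairs = [
    {"text": "Q: What's DHCP?\nA: DHCP automatically assigns IP addresses to hosts on a network."},
    {"text": "Q: Natural satellites of Earth?\nA: The Moon is Earth's only natural satellite."},
]
mini_dataset = Dataset.from_list(clean_pairs)

# Reuse the same tokenization and Trainer setup on mini_dataset,
# then compare the generated answers against these known-good ones.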
🤔 Did you know?
Underfitting was the main problem of early neural networks in the 1960s! Researchers thought their models were "too dumb" to learn, when in reality they simply didn't have enough computing power to do more than 1-2 epochs. It took GPUs and the 2010s to discover that these architectures were viable with sufficient training! 🚀
Théo CHARLET
IT Systems & Networks Student - AI/ML Specialization
Creator of AG-BPE (Attention-Guided Tokenization)
🔗 LinkedIn: https://www.linkedin.com/in/théo-charlet
🚀 Seeking internship opportunities