# LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification
LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from PhoBERT-base using a leakage-free and reproducible training recipe.
The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.
## Paper
A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification
Authors:
Duc Dat Pham¹,³, Trung Quang Nguyen⁴, Ngoc Tram Huynh Thi²,³,
Nguyen Thi Bich Ngoc²,³, and Tan Duy Le*²,³

¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam
## Model Description
- Backbone: PhoBERT-base (`vinai/phobert-base`)
- Task: Single-label, multi-class emotion classification
- Language: Vietnamese
- Domain: Social media text
- Number of classes: 7 (Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)
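For reference, `id2label.json` (listed in the files section below) maps integer class indices to these names. A minimal sketch of the mapping, assuming alphabetical index order; verify against the shipped file:

```python
# Hypothetical label mapping; the authoritative one is id2label.json in this repo.
id2label = {
    0: "Anger", 1: "Disgust", 2: "Enjoyment", 3: "Fear",
    4: "Neutral", 5: "Sadness", 6: "Surprise",
}
label2id = {name: idx for idx, name in id2label.items()}
```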
LF-PhoBERT is trained using a unified objective that combines:
- Class-Balanced Focal Loss
- R-Drop consistency regularization
- Supervised Contrastive Learning
- FGM-based adversarial training
All class statistics are computed only on the training split to prevent information leakage.
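For illustration, here is a minimal PyTorch sketch of a class-balanced focal loss in the spirit of this recipe, using effective-number class weights (Cui et al., 2019) derived from label counts of the training split only. The `beta` and `gamma` values are assumed hyperparameters, and this is not necessarily the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, train_class_counts,
                              beta=0.9999, gamma=2.0):
    """Focal loss with effective-number class weights.

    `train_class_counts` must come from the TRAINING split only, so the
    test distribution never leaks into the loss.
    """
    counts = torch.as_tensor(train_class_counts, dtype=torch.float,
                             device=logits.device)
    # Effective-number weighting: w_c = (1 - beta) / (1 - beta^{n_c})
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, counts))
    weights = weights / weights.sum() * len(counts)  # normalize to n_classes

    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()                          # probability of the true class
    focal = (1.0 - pt).pow(gamma) * (-log_pt)  # down-weight easy examples
    return (weights[targets] * focal).mean()

# Shape check with random inputs and hypothetical train-split counts:
logits = torch.randn(4, 7)
targets = torch.tensor([0, 3, 6, 4])
counts = [500, 120, 900, 80, 1500, 300, 60]
print(class_balanced_focal_loss(logits, targets, counts))
```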
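Likewise, a minimal sketch of FGM-style adversarial training, assuming the perturbation targets the encoder's word-embedding weights (the parameter-name substring `word_embeddings` and `epsilon=1.0` are assumptions, not confirmed settings):

```python
import torch

class FGM:
    """Fast Gradient Method: perturb the embedding weights along the
    gradient direction, run a second backward pass, then restore."""

    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model = model
        self.epsilon = epsilon
        self.emb_name = emb_name  # substring of the embedding parameter name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    # r_adv = epsilon * g / ||g||
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical training step (sketch):
#   loss = criterion(model(**batch).logits, labels); loss.backward()
#   fgm.attack(); criterion(model(**batch).logits, labels).backward(); fgm.restore()
#   optimizer.step(); optimizer.zero_grad()
```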
## Performance (SentiV)
Evaluated on the SentiV dataset using a stratified 80/20 split; results are averaged over 3 random seeds.
- Macro-F1: 0.8040 ± 0.0003
- Accuracy: 0.8144 ± 0.0004
The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and demonstrates stable behavior across random seeds.
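As a point of reference, a stratified split keeps the 7-class proportions identical in both partitions. A minimal scikit-learn sketch on placeholder data (the arrays and seed below are stand-ins, not the actual SentiV pipeline):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the SentiV texts and their 7-class labels.
texts = np.array([f"sample {i}" for i in range(1000)])
labels = np.random.randint(0, 7, size=1000)

# Stratified 80/20 split: per-class proportions match in train and test.
# Any class statistics (counts, weights) are then computed on the train
# portion only, matching the leakage-free recipe described above.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```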
## Files in This Repository
- `model.safetensors` – fine-tuned model weights
- `config.json` – model configuration
- `tokenizer_config.json`, `vocab.txt`, `bpe.codes` – tokenizer files
- `id2label.json` – label mapping
- `special_tokens_map.json`, `added_tokens.json` – tokenizer metadata
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ducdatit2002/LF-PhoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# "This campaign makes me very disappointed" + angry-face emoji
text = "Chiến dịch này làm tôi rất thất vọng 😡"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]  # keys are ints after loading
print(label)
```
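Alternatively, the checkpoint can be exercised through the `pipeline` API; a brief sketch (passing `top_k=None` to return scores for all seven classes):

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="ducdatit2002/LF-PhoBERT", top_k=None)
# "This campaign makes me very disappointed"
print(classifier("Chiến dịch này làm tôi rất thất vọng"))
```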
## Reproducibility
- Training performed on a single NVIDIA A100 (80GB)
- PyTorch 2.9.1, CUDA 12.8
- Results reported as mean Β± std over 3 random seeds
- Identical preprocessing and optimization settings across runs
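A minimal sketch of the seed control this protocol implies (the concrete seed values below are hypothetical):

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Pin every RNG so repeated runs differ only in the chosen seed.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (42, 43, 44):  # hypothetical seeds; the card reports 3 runs
    set_seed(seed)
    # ... train, evaluate, and collect Macro-F1 / Accuracy per run,
    # then report mean ± std across the three runs.
```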
This checkpoint is released to support reproducibility and practical deployment.
## Limitations
- Single-label classification cannot fully capture mixed or ambiguous emotions
- Sarcasm and context-dependent expressions remain challenging
- Performance is evaluated on SentiV; cross-domain generalization is not guaranteed
## Citation
If you use this model, please cite:
Pham, D.D., Nguyen, T.Q., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026).
A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.
## License
This model is released for research and educational purposes. Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.