
LF-PhoBERT: Leakage-Free Robust Vietnamese Emotion Classification

LF-PhoBERT is a robust Vietnamese emotion classification model fine-tuned from PhoBERT-base using a leakage-free and reproducible training recipe.
The model is designed for noisy social media text and imbalanced emotion distributions, with a focus on stability and deployment-oriented evaluation.

📄 Paper

A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification

Authors:
Duc Dat Pham¹,³, Trung Quang Nguyen⁴, Ngoc Tram Huynh Thi²,³,
Nguyen Thi Bich Ngoc²,³, and Tan Duy Le*²,³

¹ University of Science, Ho Chi Minh City, Vietnam
² International University, VNU-HCM, Vietnam
³ Vietnam National University, Ho Chi Minh City, Vietnam
⁴ Ho Chi Minh City University of Economics and Finance, Vietnam

🧠 Model Description

  • Backbone: PhoBERT-base (vinai/phobert-base)
  • Task: Single-label, multi-class emotion classification
  • Language: Vietnamese
  • Domain: Social media text
  • Number of classes: 7
    (Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise)

LF-PhoBERT is trained using a unified objective that combines:

  • Class-Balanced Focal Loss
  • R-Drop consistency regularization
  • Supervised Contrastive Learning
  • FGM-based adversarial training

All class statistics are computed only on the training split to prevent information leakage.
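As a concrete illustration, the class-balanced focal component of the objective can be sketched as follows. This is a minimal sketch, not the released training code: the β and γ values are placeholder defaults, and the effective-number weighting scheme (Cui et al.) is an assumption about the exact class-balancing used. Note that the class counts are taken from the training split only, matching the leakage-free constraint above:

```python
import torch
import torch.nn.functional as F

def class_balanced_focal_loss(logits, targets, train_counts, beta=0.999, gamma=2.0):
    """Focal loss with class-balanced weights.

    `train_counts` are per-class example counts computed on the TRAINING
    split only, which is what keeps the statistic leakage-free.
    """
    counts = torch.tensor(train_counts, dtype=torch.float)
    effective_num = 1.0 - torch.pow(beta, counts)      # effective number of samples
    weights = (1.0 - beta) / effective_num             # inverse effective number
    weights = weights / weights.sum() * len(counts)    # normalize to sum to C

    log_probs = F.log_softmax(logits, dim=-1)
    pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)   # p of true class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    w = weights[targets]                               # per-example class weight
    return (-w * (1.0 - pt) ** gamma * log_pt).mean()
```

The focal term `(1 - pt)^gamma` down-weights easy examples, while the class-balanced weights counteract the skewed emotion distribution.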

📊 Performance (SentiV)

Evaluated on the SentiV dataset using a stratified 80/20 split, averaged over 3 random seeds.

  • Macro-F1: 0.8040 ± 0.0003
  • Accuracy: 0.8144 ± 0.0004

The model outperforms standard PhoBERT fine-tuning with cross-entropy loss and demonstrates stable behavior across random seeds.
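The split-and-average protocol can be reproduced along these lines; the helper names and seed values below are illustrative, not taken from the paper's code:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_split(texts, labels, seed):
    # 80/20 stratified split: per-class proportions are preserved
    # in both partitions, which matters for imbalanced emotion labels
    return train_test_split(texts, labels, test_size=0.2,
                            stratify=labels, random_state=seed)

def report(scores):
    # Aggregate a metric over several seeded runs as mean and std
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std()
```

Running training once per seed and passing the per-seed Macro-F1 values to `report` yields the mean ± std figures quoted above.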

📦 Files in This Repository

  • model.safetensors – fine-tuned model weights
  • config.json – model configuration
  • tokenizer_config.json, vocab.txt, bpe.codes – tokenizer files
  • id2label.json – label mapping
  • special_tokens_map.json, added_tokens.json – tokenizer metadata

🚀 Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "ducdatit2002/LF-PhoBERT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Chiến dịch này làm tôi rất thất vọng 😡"  # "This campaign made me very disappointed"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

predicted_label_id = outputs.logits.argmax(dim=-1).item()
label = model.config.id2label[predicted_label_id]  # id2label keys are ints after loading
print(label)
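To obtain per-class probabilities rather than a single argmax label, apply a softmax over the logits. The logits tensor below is a hard-coded stand-in so the snippet runs without downloading the checkpoint; with the real model, use `outputs.logits` from the snippet above instead:

```python
import torch

# Stand-in logits for one input over the 7 emotion classes
logits = torch.tensor([[0.1, 2.3, -0.5, 0.0, 1.1, -1.2, 0.4]])
probs = torch.softmax(logits, dim=-1)   # normalized class probabilities
top_prob, top_id = probs.max(dim=-1)    # most likely class and its probability
```

Thresholding `top_prob` is a simple way to flag low-confidence predictions on noisy social media inputs.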

🔁 Reproducibility

  • Training performed on a single NVIDIA A100 (80GB)
  • PyTorch 2.9.1, CUDA 12.8
  • Results reported as mean ± std over 3 random seeds
  • Identical preprocessing and optimization settings across runs
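A seed-fixing helper along the following lines is the usual way to make such runs comparable; `set_seed` and the seed value are illustrative, not taken from the released code:

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix all relevant RNGs so repeated runs draw identical random numbers."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
```

Calling `set_seed` once per run, before model initialization and data shuffling, is what makes the per-seed results directly comparable.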

This checkpoint is released to support reproducibility and practical deployment.

⚠️ Limitations

  • Single-label classification cannot fully capture mixed or ambiguous emotions
  • Sarcasm and context-dependent expressions remain challenging
  • Performance is evaluated on SentiV; cross-domain generalization is not guaranteed

📚 Citation

If you use this model, please cite:

Pham, D.D., Nguyen, T.Q., Huynh Thi, N.T., Nguyen, N.T.B., & Le, T.D. (2026).
A Leakage-Free Robust Fine-Tuning Recipe for Vietnamese Emotion Classification.

📜 License

This model is released for research and educational purposes. Please refer to the PhoBERT license and the SentiV dataset terms for downstream usage.
