# PhoBERT Tourism Topic Classifier

Fine-tuned PhoBERT model for multi-label topic classification of Vietnamese tourism comments.
## Model Description
This model assigns one or more of the following 7 topics to each Vietnamese tourism comment:
- scenery (Phong cảnh)
- food (Ẩm thực)
- service (Dịch vụ)
- pricing (Giá cả)
- facilities (Cơ sở vật chất)
- activities (Hoạt động)
- accessibility (Giao thông)
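The classification head emits one logit per topic. A minimal sketch of the label bookkeeping, assuming the list order above matches the head's output order (the same ordering the Usage snippet relies on):

```python
# Topic labels in assumed head-output order: (english_label, vietnamese_label)
TOPICS = [
    ("scenery", "Phong cảnh"),
    ("food", "Ẩm thực"),
    ("service", "Dịch vụ"),
    ("pricing", "Giá cả"),
    ("facilities", "Cơ sở vật chất"),
    ("activities", "Hoạt động"),
    ("accessibility", "Giao thông"),
]

def label_of(index: int) -> str:
    """Map a head-output index to its English topic label."""
    return TOPICS[index][0]

print(label_of(0))  # scenery
```

Keeping a single shared list avoids index/label drift between training and inference code.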
## Training Data
- Language: Vietnamese only
- Dataset: 5,433 Vietnamese tourism comments
- Sources: Google Maps reviews and social media posts (Facebook, TikTok, YouTube)
- Quality: filtered for meaningful content; each comment carries a `quality_tier` of high, medium, or low
## Performance
| Metric | Score |
|---|---|
| F1 Macro | 56.38% |
| F1 Micro | 68.23% |
| Hamming Loss | 12.34% |
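Hamming loss is the fraction of individual topic decisions that are wrong, averaged over every comment and all 7 label slots (so lower is better). A minimal illustration with made-up predictions, not the actual evaluation data:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots where prediction and ground truth disagree."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    return wrong / total

# Two toy comments, 7 binary topic labels each (illustrative values only)
y_true = [[1, 1, 0, 0, 0, 0, 0],
          [0, 0, 1, 1, 0, 0, 0]]
y_pred = [[1, 0, 0, 0, 0, 0, 0],   # missed 'food'
          [0, 0, 1, 1, 1, 0, 0]]   # spurious 'facilities'
print(hamming_loss(y_true, y_pred))  # 2 wrong out of 14 slots ≈ 0.143
```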
### Per-Topic F1 Scores
| Topic | F1 Score |
|---|---|
| scenery | 72.34% |
| food | 68.91% |
| service | 56.23% |
| pricing | 50.12% |
| facilities | 48.23% |
| activities | 45.12% |
| accessibility | 38.67% |
## Usage
```python
import torch
from transformers import AutoTokenizer

# Load the PhoBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('vinai/phobert-base')

# Load the fine-tuned model (saved as a full torch object, so the original
# model class definition must be importable on your path)
model = torch.load('phobert_best_model.pt', map_location='cpu')
model.eval()

# Tokenize an example comment ("The scenery is so beautiful, the food is delicious")
text = "Cảnh đẹp quá, đồ ăn ngon"
encoding = tokenizer(text, return_tensors='pt', max_length=256,
                     padding='max_length', truncation=True)

# Multi-label prediction: sigmoid per topic, thresholded at 0.5
with torch.no_grad():
    logits = model(encoding['input_ids'], encoding['attention_mask'])
    probs = torch.sigmoid(logits)
    predictions = (probs > 0.5).float()

# Map predicted indices back to topic names
topics = ['scenery', 'food', 'service', 'pricing', 'facilities', 'activities', 'accessibility']
predicted_topics = [topics[i] for i, pred in enumerate(predictions[0]) if pred == 1]
print(f"Predicted topics: {predicted_topics}")
```
## Training Details
- Base Model: vinai/phobert-base
- Architecture: PhoBERT + Classification Head (768 → 7)
- Parameters: ~135M (base) + 5,376 (classifier)
- Training Time: ~25-30 minutes on RTX 4060
- Epochs: 5
- Batch Size: 16
- Learning Rate: 2e-5
- Optimizer: AdamW
- Loss: BCEWithLogitsLoss (multi-label)
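BCEWithLogitsLoss applies a sigmoid to each of the 7 logits and averages binary cross-entropy over all label slots, which is what makes the task multi-label (each topic decided independently) rather than multi-class. A pure-Python sketch of the computation on illustrative values, not the actual training code:

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over label slots, sigmoid applied inside."""
    losses = []
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns a logit into a probability
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

# One comment: 7 topic logits and its gold multi-hot labels (illustrative)
logits = [2.0, 1.5, -1.0, -2.0, -0.5, -3.0, -1.5]
targets = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(round(bce_with_logits(logits, targets), 4))  # ≈ 0.2132
```

In practice `torch.nn.BCEWithLogitsLoss` fuses the sigmoid and log terms for numerical stability rather than computing them separately as above.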
## Limitations
- Only works with Vietnamese text
- Performance varies by topic (scenery: 72% vs accessibility: 39%)
- Trained only on the tourism domain; may not generalize to other domains
## Citation

```bibtex
@misc{phobert-tourism-classifier,
  author       = {Your Name},
  title        = {PhoBERT Tourism Topic Classifier},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Strawberry0604/phobert-tourism-topic-classifier}}
}
```
## Contact
- Repository: tourism-data-monitor