PhoBERT Tourism Topic Classifier

Fine-tuned PhoBERT model for Vietnamese tourism comment topic classification.

Model Description

This model classifies Vietnamese tourism comments into one or more of 7 topics (multi-label classification):

  • scenery (Phong cảnh)
  • food (Ẩm thực)
  • service (Dịch vụ)
  • pricing (Giá cả)
  • facilities (Cơ sở vật chất)
  • activities (Hoạt động)
  • accessibility (Giao thông)
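Because a single comment can touch several topics at once, targets are naturally encoded as a multi-hot vector over the seven classes. A minimal sketch (the helper name is illustrative, not part of the released code):

```python
TOPICS = ['scenery', 'food', 'service', 'pricing', 'facilities', 'activities', 'accessibility']

def to_multi_hot(topic_names):
    """Encode a set of topic names as a 7-dim multi-hot label vector."""
    return [1.0 if t in topic_names else 0.0 for t in TOPICS]

# A comment praising both the views and the food gets two active labels:
to_multi_hot({'scenery', 'food'})  # -> [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```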

Training Data

  • Language: Vietnamese only
  • Dataset: 5,433 Vietnamese tourism comments from social media (Facebook, TikTok, YouTube)
  • Sources: Google Maps reviews, social media posts
  • Quality: Filtered for meaningful content; each comment is tagged with a quality_tier (high, medium, or low)

Performance

Metric          Score
F1 Macro        56.38%
F1 Micro        68.23%
Hamming Loss    12.34%

Per-Topic F1 Scores

Topic           F1 Score
scenery         72.34%
food            68.91%
service         56.23%
pricing         50.12%
facilities      48.23%
activities      45.12%
accessibility   38.67%
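The aggregate and per-topic scores above relate in a standard way; a toy two-label illustration with scikit-learn (the arrays are made up for illustration, not the model's actual outputs):

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Toy multi-hot ground truth and predictions: 2 samples x 2 labels
y_true = np.array([[1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 1]])

macro = f1_score(y_true, y_pred, average='macro')    # mean of per-label F1
micro = f1_score(y_true, y_pred, average='micro')    # F1 over pooled label decisions
per_label = f1_score(y_true, y_pred, average=None)   # per-label F1, as in the table above
h_loss = hamming_loss(y_true, y_pred)                # fraction of wrong label decisions
```

Macro F1 (the headline 56.38%) weights every topic equally, which is why the weak accessibility class drags it well below micro F1.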

Usage

import torch
from transformers import AutoTokenizer

# Load the PhoBERT tokenizer the model was trained with
tokenizer = AutoTokenizer.from_pretrained('vinai/phobert-base')

# Load the fine-tuned model; the checkpoint stores the full module, so the
# model class definition must be importable, and PyTorch 2.6+ needs
# weights_only=False to unpickle it
model = torch.load('phobert_best_model.pt', map_location='cpu', weights_only=False)
model.eval()

# Predict ("Cảnh đẹp quá, đồ ăn ngon" = "The scenery is so beautiful, the food is delicious")
text = "Cảnh đẹp quá, đồ ăn ngon"
encoding = tokenizer(text, return_tensors='pt', max_length=256,
                     padding='max_length', truncation=True)

with torch.no_grad():
    logits = model(encoding['input_ids'], encoding['attention_mask'])
    probs = torch.sigmoid(logits)        # independent per-topic probabilities
    predictions = (probs > 0.5).float()  # threshold each topic at 0.5

# Map the multi-hot prediction back to topic names
topics = ['scenery', 'food', 'service', 'pricing', 'facilities', 'activities', 'accessibility']
predicted_topics = [topics[i] for i, pred in enumerate(predictions[0]) if pred == 1]
print(f"Predicted topics: {predicted_topics}")
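The sigmoid-and-threshold step above can be wrapped in a small helper so the 0.5 cutoff is easy to tune (given the weaker accessibility F1, a per-topic threshold may be worth exploring). A sketch using only torch; the function name is illustrative:

```python
import torch

TOPICS = ['scenery', 'food', 'service', 'pricing', 'facilities', 'activities', 'accessibility']

def decode_logits(logits, threshold=0.5):
    """Map a (batch, 7) logit tensor to per-sample lists of topic names."""
    mask = torch.sigmoid(logits) > threshold
    return [[TOPICS[i] for i in range(len(TOPICS)) if mask[b, i]]
            for b in range(logits.size(0))]

# Dummy logits, strongly positive on scenery and food only:
logits = torch.tensor([[3.0, 2.0, -2.0, -3.0, -1.0, -4.0, -2.0]])
decode_logits(logits)  # -> [['scenery', 'food']]
```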

Training Details

  • Base Model: vinai/phobert-base
  • Architecture: PhoBERT + Classification Head (768 → 7)
  • Parameters: ~135M (base) + 5,376 (classifier)
  • Training Time: ~25-30 minutes on RTX 4060
  • Epochs: 5
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Optimizer: AdamW
  • Loss: BCEWithLogitsLoss (multi-label)
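The loss/optimizer combination above can be sketched as a minimal multi-label training step. A dummy linear layer stands in for the full PhoBERT encoder so the snippet stays self-contained; the hyperparameters come from the list above, but the data is random:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the fine-tuned model: PhoBERT's pooled output is 768-dim and
# the classification head maps it to 7 independent topic logits
classifier = nn.Linear(768, 7)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=2e-5)
criterion = nn.BCEWithLogitsLoss()  # one sigmoid/BCE term per topic

features = torch.randn(16, 768)                 # batch of 16 pooled embeddings
labels = torch.randint(0, 2, (16, 7)).float()   # multi-hot targets

losses = []
for _ in range(5):
    optimizer.zero_grad()
    loss = criterion(classifier(features), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Unlike softmax cross-entropy, BCEWithLogitsLoss treats the 7 topics as independent yes/no decisions, which is what allows a comment to be tagged with several topics at once.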

Limitations

  • Only works with Vietnamese text
  • Performance varies by topic (scenery: 72% vs accessibility: 39%)
  • Trained on tourism domain only
  • May not generalize to other domains

Citation

@misc{phobert-tourism-classifier,
  author = {Your Name},
  title = {PhoBERT Tourism Topic Classifier},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Strawberry0604/phobert-tourism-topic-classifier}}
}

Contact
