Vietnamese POS Tagger (TRE-1)
A Conditional Random Field (CRF) based Part-of-Speech tagger for Vietnamese, trained on the Universal Dependencies Dataset (UDD-v0.1).
Model Description
This model uses a CRF with handcrafted features inspired by the underthesea NLP library. On a held-out UDD-v0.1 test set it reaches about 94% accuracy (see Performance).
Features
- Architecture: CRF (python-crfsuite)
- Language: Vietnamese
- Tagset: Universal POS tags (UPOS)
- Training Data: undertheseanlp/UDD-v0.1
Feature Templates
The model uses the following feature templates:
- Current token features: word form, lowercase, prefix/suffix (2-3 chars), character type checks
- Context features: previous and next 1-2 tokens
- Bigram features: adjacent token combinations
- Dictionary features: in-vocabulary checks
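As an illustration of the template groups listed above, here is a minimal sketch of per-token feature extraction. The feature names, the boundary markers `<BOS>`/`<EOS>`, and the `vocab` argument are assumptions made for this example; the model's actual templates may differ.

```python
def _context(tokens, j):
    """Lowercased token at position j, or a boundary marker outside the sentence."""
    if j < 0:
        return "<BOS>"
    if j >= len(tokens):
        return "<EOS>"
    return tokens[j].lower()


def token_features(tokens, i, vocab=frozenset()):
    """Build a feature dict for tokens[i]. All feature names are illustrative."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        # Current token: word form, lowercase, 2-3 character prefixes/suffixes
        "w": word,
        "w.lower": lower,
        "w.prefix2": word[:2],
        "w.prefix3": word[:3],
        "w.suffix2": word[-2:],
        "w.suffix3": word[-3:],
        # Character type checks
        "w.istitle": word.istitle(),
        "w.isupper": word.isupper(),
        "w.isdigit": word.isdigit(),
        # Dictionary feature: is the lowercased token in a known vocabulary?
        "w.in_vocab": lower in vocab,
    }
    # Context: previous and next 1-2 tokens
    for offset in (-2, -1, 1, 2):
        feats[f"w[{offset:+d}]"] = _context(tokens, i + offset)
    # Bigrams: adjacent token combinations
    feats["w[-1]|w[0]"] = _context(tokens, i - 1) + "|" + lower
    feats["w[0]|w[+1]"] = lower + "|" + _context(tokens, i + 1)
    return feats


def sentence_features(tokens, vocab=frozenset()):
    """Feature dicts for every token in a pre-tokenized sentence."""
    return [token_features(tokens, i, vocab) for i in range(len(tokens))]


features = sentence_features("Tôi yêu Việt Nam".split())
```

Each sentence becomes a list of feature dicts, one per token, which is a form python-crfsuite accepts as an input sequence.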
Usage
Using the Inference API
import requests

API_URL = "https://api-inference.huggingface.co/models/undertheseanlp/tre-1"
# Replace YOUR_TOKEN with a Hugging Face access token.
headers = {"Authorization": "Bearer YOUR_TOKEN"}

def query(payload):
    # Send the payload as JSON and return the decoded JSON response.
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({"inputs": "Tôi yêu Việt Nam"})
print(output)
# [{"token": "Tôi", "tag": "PRON"}, {"token": "yêu", "tag": "VERB"}, ...]
Local Usage
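# Assumes this is run from a local copy of the model repository,
# so that handler.py and the model files are found under "./".
import pycrfsuite  # python-crfsuite must be installed
from handler import EndpointHandler

handler = EndpointHandler(path="./")
result = handler({"inputs": "Tôi yêu Việt Nam"})
print(result)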
Training
The model was trained using:
- L1 regularization (c1): 1.0
- L2 regularization (c2): 1e-3
- Max iterations: 100
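The hyperparameters above correspond to the CRFsuite training parameters `c1`, `c2`, and `max_iterations`. The following is a minimal sketch of how they could be passed to python-crfsuite, assuming the default L-BFGS training algorithm. The `train_crf` helper and the `model.crfsuite` file name are hypothetical; the actual training script is not part of this card.

```python
# Hyperparameters from the list above, keyed by CRFsuite parameter names.
CRF_PARAMS = {
    "c1": 1.0,              # L1 regularization coefficient
    "c2": 1e-3,             # L2 regularization coefficient
    "max_iterations": 100,  # upper bound on training iterations
}


def train_crf(X_train, y_train, model_path="model.crfsuite", params=CRF_PARAMS):
    """Train a CRF model and write it to model_path (hypothetical file name).

    X_train: list of sentences, each a list of per-token feature dicts.
    y_train: list of tag sequences (UPOS strings) aligned with X_train.
    """
    import pycrfsuite  # imported lazily so CRF_PARAMS is usable without it

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    trainer.set_params(params)
    trainer.train(model_path)
    return model_path
```

A model file produced this way can be loaded for tagging with `pycrfsuite.Tagger()`, using `open(model_path)` followed by `tag(xseq)`.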
Performance
Evaluated on a held-out test set from UDD-v0.1:
- Accuracy: ~94%
- F1 (macro): ~90%
- F1 (weighted): ~94%
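For reference, the three metrics above can be computed from flat lists of gold and predicted tags as sketched below. This is an illustrative implementation, not the evaluation script used for this model. As an assumption of the sketch, macro F1 averages over the tags that occur in the gold data only.

```python
from collections import Counter


def pos_metrics(gold, pred):
    """Return (accuracy, macro_f1, weighted_f1) for two aligned lists of tags."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    support = Counter(gold)  # number of gold tokens per tag
    # Per-tag F1 = 2*TP / (2*TP + FP + FN), for tags present in the gold data
    f1 = {t: 2 * tp[t] / (2 * tp[t] + fp[t] + fn[t]) for t in support}
    accuracy = sum(tp.values()) / len(gold)
    macro_f1 = sum(f1.values()) / len(f1)  # every tag counts equally
    weighted_f1 = sum(f1[t] * support[t] for t in f1) / len(gold)  # weighted by support
    return accuracy, macro_f1, weighted_f1


gold = ["PRON", "VERB", "PROPN", "PROPN"]
pred = ["PRON", "VERB", "PROPN", "NOUN"]
accuracy, macro_f1, weighted_f1 = pos_metrics(gold, pred)
```

Because macro F1 gives every tag equal weight, a macro score below the weighted score typically indicates weaker performance on less frequent tags.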
Limitations
- Requires pre-tokenized input (whitespace-separated tokens)
- Performance may vary on out-of-domain text
- Does not handle Vietnamese word segmentation
Citation
If you use this model, please cite:
@misc{tre1-pos-tagger,
author = {undertheseanlp},
title = {Vietnamese POS Tagger TRE-1},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/undertheseanlp/tre-1}
}
License
Apache 2.0