Vietnamese POS Tagger (TRE-1)

A part-of-speech (POS) tagger for Vietnamese based on Conditional Random Fields (CRF), trained on the Universal Dependencies Dataset (UDD-v0.1).

Model Description

This model uses a CRF with handcrafted features inspired by the underthesea NLP library. On a held-out test set from UDD-v0.1 it reaches roughly 94% accuracy (see Performance).

Features

  • Architecture: CRF (python-crfsuite)
  • Language: Vietnamese
  • Tagset: Universal POS tags (UPOS)
  • Training Data: undertheseanlp/UDD-v0.1

Feature Templates

The model uses the following feature templates:

  • Current token features: word form, lowercase, prefix/suffix (2-3 chars), character type checks
  • Context features: previous and next 1-2 tokens
  • Bigram features: adjacent token combinations
  • Dictionary features: in-vocabulary checks
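
The templates above can be sketched as a per-token feature function. This is only an illustration of the general shape, not the model's actual feature code: the function names, the feature keys, and the `vocab` argument are all hypothetical, and the real templates may differ in detail.

```python
def token_features(tokens, i, vocab=frozenset()):
    """Build a feature dict for tokens[i]. All feature names are illustrative."""
    word = tokens[i]
    lower = word.lower()
    feats = {
        "bias": 1.0,
        # Current token: word form, lowercase, 2-3 character prefix/suffix
        "w": word,
        "w.lower": lower,
        "w.prefix2": word[:2],
        "w.prefix3": word[:3],
        "w.suffix2": word[-2:],
        "w.suffix3": word[-3:],
        # Character type checks
        "w.istitle": word.istitle(),
        "w.isupper": word.isupper(),
        "w.isdigit": word.isdigit(),
        # Dictionary feature: is the lowercased token in a (hypothetical) vocabulary?
        "w.in_vocab": lower in vocab,
    }
    # Context features: previous and next 1-2 tokens, when they exist
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"w[{offset:+d}].lower"] = tokens[j].lower()
    # Bigram features: current token combined with each adjacent token
    if i > 0:
        feats["w[-1]|w"] = f"{tokens[i - 1].lower()}|{lower}"
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["w|w[+1]"] = f"{lower}|{tokens[i + 1].lower()}"
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens, vocab=frozenset()):
    """Feature dicts for a whole pre-tokenized sentence."""
    return [token_features(tokens, i, vocab) for i in range(len(tokens))]
```

A list of dicts like this, one per token, is the kind of input python-crfsuite accepts for a single sentence.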

Usage

Using the Inference API

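import requests

API_URL = "https://api-inference.huggingface.co/models/undertheseanlp/tre-1"
headers = {"Authorization": "Bearer YOUR_TOKEN"}  # replace YOUR_TOKEN with your Hugging Face access token

def query(payload):
    # POST the input text to the hosted model and return the parsed JSON response
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()  # raise on HTTP errors instead of returning an error body
    return response.json()

output = query({"inputs": "Tôi yêu Việt Nam"})
print(output)
# [{"token": "Tôi", "tag": "PRON"}, {"token": "yêu", "tag": "VERB"}, ...]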

Local Usage

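import pycrfsuite  # python-crfsuite; not called directly in this snippet
from handler import EndpointHandler  # handler.py is expected alongside the model files

# path points at the directory that contains the model files
handler = EndpointHandler(path="./")
result = handler({"inputs": "Tôi yêu Việt Nam"})
print(result)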

Training

The model was trained with the following CRF hyperparameters:

  • L1 regularization (c1): 1.0
  • L2 regularization (c2): 1e-3
  • Max iterations: 100
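
As a rough sketch, these hyperparameters could be passed to python-crfsuite as shown below. The function name and the `model.crfsuite` path are hypothetical. The sketch also assumes python-crfsuite's default L-BFGS trainer, which accepts both `c1` and `c2`; the card does not state which training algorithm was actually used.

```python
# Hyperparameters as listed above
CRF_PARAMS = {
    "c1": 1.0,              # L1 regularization coefficient
    "c2": 1e-3,             # L2 regularization coefficient
    "max_iterations": 100,  # upper bound on optimizer iterations
}


def train_crf(feature_seqs, tag_seqs, model_path="model.crfsuite"):
    """Train a CRF on parallel lists of feature sequences and tag sequences.

    model_path is a hypothetical output file name.
    """
    # Imported here so the parameters above can be inspected
    # even where python-crfsuite is not installed.
    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(feature_seqs, tag_seqs):
        trainer.append(xseq, yseq)  # one sentence per call
    trainer.set_params(CRF_PARAMS)
    trainer.train(model_path)       # writes the trained model to disk
    return model_path
```

Here `feature_seqs` is a list of sentences, each a list of per-token feature dicts, and `tag_seqs` holds the matching UPOS tags.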

Performance

Evaluated on a held-out test set from UDD-v0.1:

  • Accuracy: ~94%
  • F1 (macro): ~90%
  • F1 (weighted): ~94%
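
For reference, the three metrics can be computed from flat lists of gold and predicted tags roughly as follows. This is a minimal pure-Python sketch, not the evaluation script used for the numbers above. It assumes that macro F1 averages over every tag seen in either the gold or the predicted sequence, and that weighted F1 weights each tag by its gold frequency.

```python
from collections import Counter


def pos_metrics(gold, pred):
    """Token-level accuracy, macro F1 and weighted F1 for two tag lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it was wrong
            fn[g] += 1  # missed gold tag g

    support = Counter(gold)  # number of gold tokens per tag
    labels = sorted(set(gold) | set(pred))
    f1 = {}
    for tag in labels:
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        f1[tag] = 2 * tp[tag] / denom if denom else 0.0

    return {
        "accuracy": sum(tp.values()) / len(gold),
        "f1_macro": sum(f1.values()) / len(labels),
        "f1_weighted": sum(f1[t] * support[t] for t in labels) / len(gold),
    }
```

Macro F1 gives every tag equal weight, so rare tags that are tagged poorly pull it down more than they pull down accuracy or weighted F1. That is consistent with the macro score being lower than the other two.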

Limitations

  • Requires pre-tokenized input (whitespace-separated tokens)
  • Performance may vary on out-of-domain text
  • Does not handle Vietnamese word segmentation

Citation

If you use this model, please cite:

@misc{tre1-pos-tagger,
  author = {undertheseanlp},
  title = {Vietnamese POS Tagger TRE-1},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/undertheseanlp/tre-1}
}

License

Apache 2.0
