ViHSD-UIT-ViSoBERT

📝 Introduction

ViHSD-UIT-ViSoBERT is a deep learning model fine-tuned from the pre-trained ViSoBERT model for Vietnamese hate speech detection. It was trained on the ViHSD dataset.

🧠 Architecture

Base model: uitnlp/visobert
Model type: Text Classification
Number of labels: 3
- 0: clean (the comment contains no hateful or offensive language)
- 1: offensive (the comment is offensive or unpleasant, but not clearly hateful)
- 2: hate (the comment contains hate speech, usually targeting a specific individual or group)
Training data: sonlam1102/vihsd

📊 Evaluation

Metric	Value
Accuracy	88.32%
F1-macro	0.690
F1-weighted	0.881
Loss	0.358
Epochs	10
Eval time	32.99s
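
A macro F1 well below the weighted F1 usually means the less frequent classes score lower than the dominant one, because macro averaging gives every class equal weight while weighted averaging scales each class by its support. The following is a minimal sketch of that difference in plain Python. The labels in `y_true` and `y_pred` are made-up toy values, not ViHSD data, and the helper names are illustrative only.

```python
from collections import Counter

def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} computed from true/false positives and false negatives."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)

def weighted_f1(y_true, y_pred, labels):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in labels) / len(y_true)

# Toy, imbalanced example (hypothetical values): 8 clean, 1 offensive, 1 hate.
labels = [0, 1, 2]
y_true = [0] * 8 + [1] + [2]
y_pred = [0] * 8 + [0] + [2]  # the single "offensive" sample is missed

macro = macro_f1(y_true, y_pred, labels)
weighted = weighted_f1(y_true, y_pred, labels)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

In this toy case the missed minority class pulls the macro score down much more than the weighted score.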

🚀 Usage

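import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("nd-khoa/vihsd-uit-visobert")
tokenizer = AutoTokenizer.from_pretrained("nd-khoa/vihsd-uit-visobert")
model.eval()  # make sure the model is in evaluation mode (dropout off)

text = input("Enter any comment:\n")
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

# Gradients are not needed for prediction
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
prediction = logits.argmax(dim=1).item()  # 0 = clean, 1 = offensive, 2 = hate
print(f"Label: {prediction}")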
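
The snippet above prints only the numeric class id. As a rough sketch, the logits can also be turned into a label name and a confidence score with a softmax. The `ID2LABEL` mapping follows the label list in this card. The helper names and the example logits are hypothetical and are not part of the published model or its configuration.

```python
import math

# Label ids as documented in this model card
ID2LABEL = {0: "clean", 1: "offensive", 2: "hate"}

def softmax(logits):
    """Convert raw scores to probabilities (max is subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[idx], probs[idx]

# Made-up logits for illustration only; real values come from the model output
example_logits = [2.0, 0.5, -1.0]
label, confidence = decode(example_logits)
print(f"Label: {label} (p = {confidence:.2f})")
```

With the inference code above, the input would be `logits[0].tolist()`, the scores for the single comment in the batch.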

📚 Citation

🔹 ViSoBERT

@inproceedings{nguyen-etal-2023-visobert,
  title = "ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing",
  author = "Nguyen, Nam  and Phan, Thang  and Nguyen, Duc-Vu  and Nguyen, Kiet",
  booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
  year = "2023",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.emnlp-main.315",
  pages = "5191--5207"
}

🔹 ViHSD Dataset

@inproceedings{luu2021large,
  title={A large-scale dataset for hate speech detection on Vietnamese social media texts},
  author={Luu, Son T and Nguyen, Kiet Van and Nguyen, Ngan Luu-Thuy},
  booktitle={Advances and Trends in Artificial Intelligence. Artificial Intelligence Practices: 34th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2021},
  pages={415--426},
  year={2021},
  publisher={Springer}
}

👤 Author

Name: Nguyễn Đăng Khoa
Email: 23520746@gm.uit.edu.vn

Model size: 97.6M params
Tensor type: F32
Format: Safetensors