📦 ViSoNorm Toolkit — Vietnamese Text Normalization & Processing

ViSoNorm is a toolkit for Vietnamese text normalization and processing, built for NLP workflows and installable from PyPI. Its resources (datasets and models) are hosted on Hugging Face Hub and GitHub Releases.


🚀 Key Features

1. 🔧 BasicNormalizer — Basic Text Normalization

  • Case folding: convert the entire text to lowercase, uppercase, or capitalized form.
  • Tone normalization: normalize Vietnamese tone marks.
  • Basic preprocessing: remove extra whitespace and special characters, and apply sentence formatting.
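
The kind of steps listed above can be illustrated with the Python standard library alone. This is a minimal sketch, not ViSoNorm's implementation: the function name `basic_normalize` and its `case` parameter are hypothetical, and the tone step here only composes combining tone marks into single code points (Unicode NFC). It does not apply Vietnamese tone-placement rules.

```python
import re
import unicodedata


def basic_normalize(text: str, case: str = "lower") -> str:
    """Illustrative stand-in for basic normalization (hypothetical helper)."""
    # Compose base letters and combining tone marks into single code points,
    # so that "a" + U+0300 becomes the precomposed "à".
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Apply the requested case folding; any other value leaves case unchanged.
    if case == "lower":
        return text.lower()
    if case == "upper":
        return text.upper()
    if case == "capitalize":
        return text.capitalize()
    return text


if __name__ == "__main__":
    print(basic_normalize("  Xin   CHÀO  "))
```

Composing to NFC first means later comparisons and dictionary lookups see one representation per accented letter, regardless of how the input was typed.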

2. 😀 EmojiHandler — Emoji Processing

  • Detect emojis: find the emojis present in a text.
  • Split emoji text: separate emojis from sentences.
  • Remove emojis: remove all emojis.
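
The three operations above can be approximated with a regular expression over a few emoji code-point ranges. This is a rough sketch under stated assumptions, not the EmojiHandler API: the function names are hypothetical, and the ranges cover common pictographs only (no flags, keycaps, or joined sequences).

```python
import re

# Approximate emoji ranges (an assumption for this sketch, not a full list):
# U+1F300..U+1FAFF covers most pictographs, U+2600..U+27BF covers
# miscellaneous symbols and dingbats.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def detect_emojis(text: str) -> list:
    """Return every matched emoji character, in order of appearance."""
    return EMOJI_PATTERN.findall(text)


def split_emojis(text: str) -> str:
    """Put a space around each emoji so it becomes its own token."""
    spaced = EMOJI_PATTERN.sub(lambda m: f" {m.group(0)} ", text)
    return " ".join(spaced.split())


def remove_emojis(text: str) -> str:
    """Drop all matched emojis and tidy the leftover whitespace."""
    return " ".join(EMOJI_PATTERN.sub(" ", text).split())


if __name__ == "__main__":
    sample = "vui quá😀😀 nha"
    print(detect_emojis(sample))
    print(split_emojis(sample))
    print(remove_emojis(sample))
```

A production handler would typically rely on a maintained emoji table rather than hand-written ranges, since new emojis are added with each Unicode release.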

3. ✏️ Lexical Normalization — Social Media Text Normalization

  • ViSoLexNormalizer: Normalize text using deep learning models from Hugging Face.
  • NswDetector: Detect non-standard words (NSW).
  • detect_nsw(): Utility function to detect NSW.
  • normalize_sentence(): Utility function to normalize sentences.
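
To make the task itself concrete, the sketch below detects and replaces non-standard words with a plain lookup table. It is a toy illustration only: ViSoNorm uses deep learning models for this, and the lexicon, function names, and signatures here are all hypothetical rather than the library's `detect_nsw()` or `normalize_sentence()`.

```python
# Hypothetical toy lexicon mapping a non-standard word to its standard form.
TOY_LEXICON = {"ko": "không", "dc": "được", "j": "gì"}


def toy_detect_nsw(sentence: str, lexicon: dict = TOY_LEXICON) -> list:
    """Return (position, token) pairs for tokens found in the toy lexicon."""
    tokens = sentence.split()
    return [(i, tok) for i, tok in enumerate(tokens) if tok.lower() in lexicon]


def toy_normalize_sentence(sentence: str, lexicon: dict = TOY_LEXICON) -> str:
    """Replace each known non-standard token; leave all other tokens as-is."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())


if __name__ == "__main__":
    text = "mình ko đi dc"
    print(toy_detect_nsw(text))
    print(toy_normalize_sentence(text))
```

A lookup table cannot handle unseen spellings or words whose standard form depends on context, which is the gap that model-based normalization is meant to close.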

4. 📊 Resource Management — Dataset Management

  • list_datasets() — List available datasets.
  • load_dataset() — Load dataset from GitHub Releases.
  • get_dataset_info() — View detailed dataset information.

5. 🧠 Task Models — Task Processing Models

  • SpamReviewDetection — Spam detection.
  • HateSpeechDetection — Hate speech detection.
  • HateSpeechSpanDetection — Hate speech span detection.
  • EmotionRecognition — Emotion recognition.
  • AspectSentimentAnalysis — Aspect-based sentiment analysis.

📥 Installation

Install from PyPI (Recommended)

pip install visonorm
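
One way to confirm that the install succeeded is to query the installed distribution's metadata with the standard library. This is a small sketch: the helper name `installed_version` is hypothetical, and it assumes the distribution name matches the `visonorm` name used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("visonorm")
    print(f"visonorm {found}" if found else "visonorm is not installed")
```

Checking metadata avoids importing the package itself, so the check works without knowing the import name or triggering any model downloads.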

📝 Citation

ViSoLex is developed at the University of Information Technology, Vietnam National University Ho Chi Minh City (UIT, VNU-HCM). If you use ViSoLex in your research, please cite:

@article{nguyen_weakly_2025,
    title = {A {Weakly} {Supervised} {Data} {Labeling} {Framework} for {Machine} {Lexical} {Normalization} in {Vietnamese} {Social} {Media}},
    volume = {17},
    issn = {1866-9964},
    url = {https://doi.org/10.1007/s12559-024-10356-3},
    doi = {10.1007/s12559-024-10356-3},
    number = {1},
    journal = {Cognitive Computation},
    author = {Nguyen, Dung Ha and Nguyen, Anh Thi Hoang and Van Nguyen, Kiet},
    month = jan,
    year = {2025},
    pages = {57},
}
@inproceedings{nguyen-etal-2025-visolex,
    title = "{V}i{S}o{L}ex: An Open-Source Repository for {V}ietnamese Social Media Lexical Normalization",
    author = "Nguyen, Anh Thi-Hoang  and
      Nguyen, Dung Ha  and
      Nguyen, Kiet Van",
    editor = "Rambow, Owen  and
      Wanner, Leo  and
      Apidianaki, Marianna  and
      Al-Khalifa, Hend  and
      Eugenio, Barbara Di  and
      Schockaert, Steven  and
      Mather, Brodie  and
      Dras, Mark",
    booktitle = "Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.coling-demos.18/",
    pages = "183--188",
}