---
language: ar
license: apache-2.0
library_name: onnxruntime
pipeline_tag: voice-activity-detection
tags:
- turn-detection
- end-of-utterance
- distilbert
- onnx
- quantized
- conversational-ai
- voice-assistant
- real-time
base_model: distilbert-base-multilingual-cased
datasets:
- videosdk-live/Namo-Turn-Detector-v1-Train
model-index:
- name: Namo Turn Detector v1 - Arabic
  results:
  - task:
      type: text-classification
      name: Turn Detection
    dataset:
      name: Namo Turn Detector v1 Test - Arabic
      type: videosdk-live/Namo-Turn-Detector-v1-Test
      split: train
    metrics:
    - type: accuracy
      value: 0.797254
      name: Accuracy
    - type: f1
      value: 0.827027
      name: F1 Score
    - type: precision
      value: 0.72973
      name: Precision
    - type: recall
      value: 0.954262
      name: Recall
---

# Namo Turn Detector v1 - Arabic

<div align="center">

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[ONNX](https://onnx.ai/)
[Model on Hugging Face](https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Arabic)

**Namo Turn Detection Model for Arabic**

</div>

---

## Overview

The **Namo Turn Detector** is a specialized AI model designed to solve one of the most challenging problems in conversational AI: **knowing when a user has finished speaking**.

This Arabic-specific model reads the transcript of what the user has said so far and classifies it as one of:
- **Complete utterances** (user is done speaking)
- **Incomplete utterances** (user will continue speaking)

Built on the DistilBERT architecture and shipped as a quantized ONNX model, it is designed for real-time use, with inference latency under 15 ms.

## Key Features

- **Turn Detection Specialist**: Detects end-of-turn vs. continuation in Arabic speech transcripts.
- **Low Latency**: Optimized with **quantized ONNX** for <15ms inference.
- **Robust Performance**: 79.7% accuracy on diverse Arabic utterances.
- **Easy Integration**: Compatible with Python, ONNX Runtime, and VideoSDK Agents SDK.
- **Enterprise Ready**: Supports real-time conversational AI and voice assistants.

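
Because the model works on transcripts, a typical integration feeds it the text accumulated so far each time the speech recognizer emits a new fragment. The sketch below illustrates that loop under stated assumptions: `TranscriptTurnBuffer` and `toy_classifier` are hypothetical names, and the toy rule merely stands in for a real call to the model so the snippet runs without downloading anything.

```python
from typing import Callable, List, Optional


class TranscriptTurnBuffer:
    """Accumulates transcript fragments until a classifier reports end of turn."""

    def __init__(self, is_end_of_turn: Callable[[str], bool]):
        self._is_end_of_turn = is_end_of_turn
        self._fragments: List[str] = []

    def add_fragment(self, fragment: str) -> Optional[str]:
        """Add a fragment; return the full utterance if the turn is complete."""
        self._fragments.append(fragment.strip())
        utterance = " ".join(self._fragments)
        if self._is_end_of_turn(utterance):
            # The turn is over: reset the buffer and hand back the utterance.
            self._fragments.clear()
            return utterance
        return None


def toy_classifier(text: str) -> bool:
    """Stand-in rule (assumption): a trailing "and" means the speaker continues."""
    return not text.endswith(" and")


buffer = TranscriptTurnBuffer(toy_classifier)
print(buffer.add_fragment("I need a taxi and"))    # None
print(buffer.add_fragment("a hotel for tonight"))  # I need a taxi and a hotel for tonight
```

In a real pipeline, the classifier callable would wrap the model's prediction instead of the toy rule.
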
## Performance Metrics
<div>

| Metric | Score |
|--------|-------|
| **Accuracy** | **79.72%** |
| **F1-Score** | **82.70%** |
| **Precision** | **72.97%** |
| **Recall** | **95.42%** |
| **Latency** | **<15ms** |
| **Model Size** | **~135MB** |

</div>
<img src="./confusion_matrices.png" alt="Confusion matrices" width="600" height="400"/>

> *Evaluated on 800+ Arabic utterances from diverse conversational contexts*

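
As a quick consistency check on the table above, the F1 score can be recomputed from precision and recall as their harmonic mean. The sketch below uses the unrounded values from this card's metadata; the function name `f1_from_precision_recall` is illustrative and not part of any released script.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Unrounded precision and recall as reported in the model card metadata.
precision = 0.72973
recall = 0.954262

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.8270
```

The gap between recall and precision indicates that the positive class is rarely missed but is sometimes predicted when it should not be.
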
## Speed Analysis

<img src="./performance_analysis.png" alt="Performance analysis" width="600" height="400"/>

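
Latency depends on hardware, input length, and ONNX Runtime settings, so it is worth measuring on the target machine. The helper below is a minimal sketch using only the standard library: `measure_latency_ms` is a hypothetical name, and it times any callable, so `detector.predict` from the Quick Start section can be passed in directly. A stand-in function is used here so the snippet runs without downloading the model.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[str], object], text: str,
                       warmup: int = 3, runs: int = 20) -> dict:
    """Time fn(text) over several runs and return latency stats in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew results.
    for _ in range(warmup):
        fn(text)

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(text)
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "runs": runs,
    }


def fake_predict(text: str) -> tuple:
    """Stand-in for TurnDetector.predict (assumption); replace with the real method."""
    return (0, 0.5)


stats = measure_latency_ms(fake_predict, "example utterance")
print(f"median: {stats['median_ms']:.3f} ms over {stats['runs']} runs")
```

Reporting the median alongside the maximum helps separate typical latency from occasional slow calls.
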
## Train & Test Scripts

<div align="center">

[Colab notebook 1](https://colab.research.google.com/drive/1DqSUYfcya0r2iAEZB9fS4mfrennubduV) [Colab notebook 2](https://colab.research.google.com/drive/19ZOlNoHS2WLX2V4r5r492tsCUnYLXnQR)

</div>

## Installation

Install the following libraries to use this model:

```bash
pip install onnxruntime transformers huggingface_hub
```

## Quick Start

You can run inference directly from the Hugging Face repository.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download


class TurnDetector:
    def __init__(self, repo_id="videosdk-live/Namo-Turn-Detector-v1-Arabic"):
        """
        Initializes the detector by downloading the model and tokenizer
        from the Hugging Face Hub.
        """
        print(f"Loading model from repo: {repo_id}")

        # Download the model and tokenizer from the Hub
        # Authentication is handled automatically if you are logged in
        model_path = hf_hub_download(repo_id=repo_id, filename="model_quant.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(repo_id)

        # Set up the ONNX Runtime inference session
        self.session = ort.InferenceSession(model_path)
        self.max_length = 512
        print("✅ Model and tokenizer loaded successfully.")

    def predict(self, text: str) -> tuple:
        """
        Predicts if a given text utterance is the end of a turn.
        Returns (predicted_label, confidence) where:
        - predicted_label: 0 for "Not End of Turn", 1 for "End of Turn"
        - confidence: confidence score between 0 and 1
        """
        # Tokenize the input text
        inputs = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            return_tensors="np"
        )

        # Prepare the feed dictionary for the ONNX model
        feed_dict = {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"]
        }

        # Run inference
        outputs = self.session.run(None, feed_dict)
        logits = outputs[0]

        # Convert the logits of the single input to class probabilities
        probabilities = self._softmax(logits[0])
        predicted_label = int(np.argmax(probabilities))
        confidence = float(np.max(probabilities))

        return predicted_label, confidence

    def _softmax(self, x, axis=None):
        if axis is None:
            axis = -1
        # Subtract the maximum before exponentiating for numerical stability
        exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


# --- Example Usage ---
if __name__ == "__main__":
    detector = TurnDetector()

    sentences = [
        "يضخم على الكبد و"  # Expected: Not End of Turn
    ]

    for sentence in sentences:
        predicted_label, confidence = detector.predict(sentence)
        result = "End of Turn" if predicted_label == 1 else "Not End of Turn"
        print(f"'{sentence}' -> {result} (confidence: {confidence:.3f})")
        print("-" * 50)
```

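
In a voice pipeline, the pair returned by `predict` usually feeds a decision about whether the agent should respond or keep listening. The sketch below is one possible policy, not part of the model or the VideoSDK API: the function name `should_respond` and the `threshold` default of 0.7 are assumptions chosen for illustration, and the threshold should be tuned on your own data.

```python
def should_respond(predicted_label: int, confidence: float,
                   threshold: float = 0.7) -> bool:
    """Return True only for a confident "End of Turn" prediction.

    predicted_label follows the Quick Start convention:
    0 = "Not End of Turn", 1 = "End of Turn".
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return predicted_label == 1 and confidence >= threshold


# Hypothetical (label, confidence) pairs, shaped like predict() output.
examples = [(1, 0.93), (1, 0.55), (0, 0.88)]
decisions = [should_respond(label, conf) for label, conf in examples]
print(decisions)  # [True, False, False]
```

Raising the threshold trades responsiveness for fewer interruptions: low-confidence end-of-turn predictions are treated as if the user were still speaking.
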
## VideoSDK Agents Integration

Integrate this turn detector directly with VideoSDK Agents for production-ready conversational AI applications.

```python
from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download the model ahead of time
pre_download_namo_turn_v1_model(language="ar")

# Initialize Arabic turn detector for VideoSDK Agents
turn_detector = NamoTurnDetectorV1(language="ar")
```

> [**Complete Integration Guide**](https://docs.videosdk.live/ai_agents/plugins/namo-turn-detector) - Learn how to use `NamoTurnDetectorV1` with VideoSDK Agents

## Citation

```bibtex
@model{namo_turn_detector_ar_2025,
  title={Namo Turn Detector v1: Arabic},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Arabic},
  note={ONNX-optimized DistilBERT for turn detection in Arabic}
}
```

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

<div align="center">

**Made with ❤️ by the VideoSDK Team**

[VideoSDK](https://videosdk.live)

</div>