language: bn
license: apache-2.0
library_name: onnxruntime
pipeline_tag: text-classification
tags:
  - turn-detection
  - end-of-utterance
  - distilbert
  - onnx
  - quantized
  - conversational-ai
  - voice-assistant
  - real-time
base_model: distilbert-base-multilingual-cased
datasets:
  - videosdk-live/Namo-Turn-Detector-v1-Train
model-index:
  - name: Namo Turn Detector v1 - Bengali
    results:
      - task:
          type: text-classification
          name: Turn Detection
        dataset:
          name: Namo Turn Detector v1 Test - Bengali
          type: videosdk-live/Namo-Turn-Detector-v1-Test
          split: train
        metrics:
          - type: accuracy
            value: 0.792
            name: Accuracy
          - type: f1
            value: 0.789474
            name: F1 Score
          - type: precision
            value: 0.783133
            name: Precision
          - type: recall
            value: 0.795918
            name: Recall

🎯 Namo Turn Detector v1 - Bengali


🚀 Namo Turn Detection Model for Bengali


📋 Overview

The Namo Turn Detector is a specialized AI model designed to solve one of the most challenging problems in conversational AI: knowing when a user has finished speaking.

This Bengali-specific model classifies each transcribed utterance as one of two types:

  • Complete utterances (user is done speaking)
  • 🔄 Incomplete utterances (user will continue speaking)

Built on the DistilBERT architecture and exported to a quantized ONNX format, it runs inference in under 14 ms, which makes it suitable for real-time voice pipelines.
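
To illustrate how a complete/incomplete decision is typically used downstream, here is a minimal sketch that merges streaming transcript segments until a classifier reports a finished turn. The names `group_into_turns` and `toy_classifier` are hypothetical and not part of this repository; the toy classifier is a punctuation-based stand-in for the model, used only so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def group_into_turns(
    segments: Iterable[str],
    is_end_of_turn: Callable[[str], bool],
) -> List[str]:
    """Merge consecutive transcript segments until the classifier reports a complete turn."""
    turns: List[str] = []
    buffer: List[str] = []
    for segment in segments:
        buffer.append(segment)
        candidate = " ".join(buffer)
        # Ask the classifier about everything said so far in this turn.
        if is_end_of_turn(candidate):
            turns.append(candidate)
            buffer = []
    # Flush trailing text that was never marked complete.
    if buffer:
        turns.append(" ".join(buffer))
    return turns


def toy_classifier(text: str) -> bool:
    """Hypothetical stand-in for the model: sentence-final punctuation means 'complete'."""
    return text.rstrip().endswith((".", "?", "!", "।"))


turns = group_into_turns(
    ["I would like to", "book a table.", "For two people"],
    toy_classifier,
)
print(turns)
```

In a real pipeline the callable would wrap the model's prediction instead of the punctuation check, and whether to flush an unfinished buffer is a design choice that depends on the application.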

🔑 Key Features

  • Turn Detection Specialist: Detects end-of-turn vs. continuation in Bengali speech transcripts.
  • Low Latency: Optimized with quantized ONNX for <14 ms inference.
  • Robust Performance: 79.2% accuracy on diverse Bengali utterances.
  • Easy Integration: Compatible with Python, ONNX Runtime, and VideoSDK Agents SDK.
  • Enterprise Ready: Supports real-time conversational AI and voice assistants.

📊 Performance Metrics

| Metric | Score |
|--------|-------|
| 🎯 Accuracy | 79.20% |
| 📈 F1-Score | 78.95% |
| 🎪 Precision | 78.31% |
| 🎭 Recall | 79.59% |
| ⚡ Latency | <14 ms |
| 💾 Model Size | ~135 MB |

📊 Evaluated on 900+ Bengali utterances from diverse conversational contexts
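
As a quick consistency check, the F1 score can be recomputed from the precision and recall values listed in the model metadata (0.783133 and 0.795918). The helper name `f1_score` below is just for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values taken from the model-index metadata of this card.
precision = 0.783133
recall = 0.795918

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.7895
```

The result agrees with the reported F1 value of 0.789474 to within rounding.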

⚡️ Speed Analysis

[Figure: inference speed analysis]
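
Latency depends on hardware and input length, so it is worth measuring on your own machine. The following is a minimal timing sketch, not the benchmark used for this card; `measure_latency_ms` is a hypothetical helper, and the lambda at the bottom is a placeholder workload so the example runs on its own.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(
    fn: Callable[[], object],
    warmup: int = 3,
    runs: int = 20,
) -> Dict[str, float]:
    """Time repeated calls to fn and report latency statistics in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the result.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


# Placeholder workload; replace with a call to the detector.
stats = measure_latency_ms(lambda: sum(range(10_000)))
print(stats)
```

To time the model itself, pass something like `lambda: detector.predict(text)` using the `TurnDetector` class from the Quick Start section.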

🔧 Train & Test Scripts

Train Script Test Script

🛠️ Installation

To use this model, you will need to install the following libraries.

pip install onnxruntime transformers huggingface_hub

🚀 Quick Start

You can run inference directly from the Hugging Face repository.

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

class TurnDetector:
    def __init__(self, repo_id="videosdk-live/Namo-Turn-Detector-v1-Bengali"):
        """
        Initializes the detector by downloading the model and tokenizer
        from the Hugging Face Hub.
        """
        print(f"Loading model from repo: {repo_id}")
        
        # Download the model and tokenizer from the Hub
        # Authentication is handled automatically if you are logged in
        model_path = hf_hub_download(repo_id=repo_id, filename="model_quant.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(repo_id)
        
        # Set up the ONNX Runtime inference session
        self.session = ort.InferenceSession(model_path)
        self.max_length = 512
        print("✅ Model and tokenizer loaded successfully.")

    def predict(self, text: str) -> str:
        """
        Predicts if a given text utterance is the end of a turn.
        Returns "End of Turn" or "Not End of Turn".
        """
        # Tokenize the input text
        inputs = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            return_tensors="np"
        )
        
        # Prepare the feed dictionary for the ONNX model
        feed_dict = {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"]
        }
        
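        # Run inference; session.run returns a list of outputs, and the
        # first entry holds the logits (expected shape: batch x classes)
        outputs = self.session.run(None, feed_dict)
        logits = outputs[0]
        
        # Get the predicted class (0 or 1) for the single input utterance
        prediction_index = int(np.argmax(logits, axis=1)[0])
        
        return "End of Turn" if prediction_index == 1 else "Not End of Turn"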

# --- Example Usage ---
if __name__ == "__main__":
    detector = TurnDetector()
    
    sentences = [
        "চিংগ্রি উত্পাদন হোয়েছে, আছা, 6,965 মেট্রিক টন।",      # Expected: End of Turn
        "এর আযোতন মানে প্রায় শাই ত্রিশ হাজার ছোয়ে শো।", # Expected: Not End of Turn
    ]
    
    for sentence in sentences:
        result = detector.predict(sentence)
        print(f"'{sentence}' -> {result}")
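
If a confidence score is more useful than a hard label, the logits can be converted to a probability with a softmax and compared against a threshold. This is a sketch under stated assumptions: a single utterance, two-class logits, and class index 1 meaning "End of Turn" (the mapping used in the Quick Start code above). The function names and the default threshold of 0.5 are illustrative, not part of the published model.

```python
import numpy as np


def end_of_turn_probability(logits) -> float:
    """Softmax over two-class logits for one utterance; returns P(class 1)."""
    # Flatten (1, 2) or (2,) input to a 1-D vector of two logits.
    logits = np.asarray(logits, dtype=np.float64).reshape(-1)
    # Subtract the max before exponentiating for numerical stability.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    probs = exp / exp.sum()
    return float(probs[1])


def is_end_of_turn(logits, threshold: float = 0.5) -> bool:
    """Assumed decision rule: end of turn when P(class 1) reaches the threshold."""
    return end_of_turn_probability(logits) >= threshold


# Made-up logits for illustration only.
example_logits = np.array([[-1.2, 2.3]])
p = end_of_turn_probability(example_logits)
print(f"P(end of turn) = {p:.3f}")
```

Raising the threshold makes the detector wait longer before declaring a turn finished, which reduces interruptions at the cost of slower responses.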

🤖 VideoSDK Agents Integration

Integrate this turn detector directly with VideoSDK Agents for production-ready conversational AI applications.

from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download the model ahead of time
pre_download_namo_turn_v1_model(language="bn")

# Initialize Bengali turn detector for VideoSDK Agents
turn_detector = NamoTurnDetectorV1(language="bn")

📚 Complete Integration Guide - Learn how to use NamoTurnDetectorV1 with VideoSDK Agents

📖 Citation

@model{namo_turn_detector_bn_2025,
  title={Namo Turn Detector v1: Bengali},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Bengali},
  note={ONNX-optimized DistilBERT for turn detection in Bengali}
}

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Made with ❤️ by the VideoSDK Team

VideoSDK