
viEar-V1.0: State-of-the-Art Multilingual Speaker Verification

🎯 Production-Ready | 🌍 Multilingual | ⚡ Fast Inference | 🛡️ Commercial Version

📋 Overview

viEar-V1.0 is a production-grade multilingual speaker verification model developed by BRIGHTO. Trained on a diverse corpus of 8,686 speakers across 11+ languages with approximately 3 million utterances, this model achieves state-of-the-art performance on multilingual speaker verification tasks.

Unlike traditional models trained with Angular Margin losses (ArcFace, AAM-Softmax), viEar-V1.0 employs Pure Supervised Contrastive Learning (SupCon) with a novel Differentiable Global Batching strategy, enabling robust cross-lingual speaker representations.
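For readers unfamiliar with the objective, below is a minimal sketch of a SupCon loss in PyTorch. It is illustrative only: the actual training pairs it with the Differentiable Global Gather batching described later, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor, tau: float = 0.07):
    """Supervised contrastive loss: pull same-speaker embeddings together."""
    z = F.normalize(embeddings, dim=1)                      # [B, D], unit norm
    sim = (z @ z.T) / tau                                   # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))         # drop self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    # Zero out non-positive positions before summing (avoids -inf * 0 = NaN)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts
    return per_anchor[pos_mask.any(dim=1)].mean()           # anchors with positives
```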

✨ Key Features

  • 🌍 Multilingual: Trained on 11+ languages including English, Vietnamese, and other Asian/European languages
  • 🎯 High Accuracy: 1.67% EER with Adaptive Score Normalization
  • ⚡ Lightweight: 192-dimensional embeddings, ~20.8M parameters
  • 🔧 Production-Ready: Extensively tested for deployment scenarios
  • 🛡️ Robust: Resistant to domain mismatch and acoustic variations

🚀 Performance

Evaluated on a challenging held-out test set of 1,250 speakers (Closed-Set, Multilingual):

| Metric | Raw Cosine | With AS-Norm | Improvement |
|---|---|---|---|
| Equal Error Rate (EER) | 1.90% | 1.67% | -12.1% |
| minDCF (P=0.01) | 0.0029 | 0.0012 | -58.6% |
| minDCF (P=0.001) | 0.0006 | 0.0004 | -33.3% |
| FAR @ FRR=1% | 4.53% | 3.75% | -17.2% |
| FAR @ FRR=5% | 0.75% | 0.70% | -6.7% |
| AUC | 0.9971 | | |
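As a refresher, EER is the operating point where the false-acceptance and false-rejection rates cross. A rough NumPy sketch (the `scores` and `labels` arrays are hypothetical; label 1 marks a genuine same-speaker pair):

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal Error Rate: threshold where FAR (impostors accepted) == FRR (genuine rejected)."""
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))    # false acceptance rate
        frrs.append(np.mean(~accept[labels == 1]))   # false rejection rate
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))             # closest crossing point
    return float((fars[idx] + frrs[idx]) / 2)
```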

💡 AS-Norm = Adaptive Symmetric Score Normalization with a fixed cohort from the training set. HIGHLY RECOMMENDED for maximum accuracy.
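Concretely, given a raw cosine score $s$ between an enrollment embedding $e$ and a test embedding $t$, AS-Norm standardizes it against the top-$k$ cohort similarities of each side (this mirrors the `ASNorm` class in the inference code below):

$$
s_{\mathrm{norm}} = \frac{1}{2}\left(\frac{s - \mu_e^{(k)}}{\sigma_e^{(k)}} + \frac{s - \mu_t^{(k)}}{\sigma_t^{(k)}}\right)
$$

where $\mu^{(k)}$ and $\sigma^{(k)}$ are the mean and standard deviation of the top-$k$ similarities between the given embedding and the cohort.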

Production Deployment Tiers

| Use Case | Metric | Value | Status |
|---|---|---|---|
| Convenience (Voice Assistant) | FAR @ FRR=5% | 0.70% | ✅ Ready |
| Balanced (Banking) | FAR @ FRR=1% | 3.75% | ✅ Ready |
| High Security (Access Control) | EER | 1.67% | ✅ Ready |

📊 Training Data

viEar-V1.0 was trained on a carefully curated multilingual corpus:

| Source | Utterances | Languages | Domain |
|---|---|---|---|
| VoxCeleb2 | ~1.2M | English | Celebrity interviews |
| Multilingual (Cleaned) | ~800K | 10+ languages | Broadcast media |
| Vietnamese Internal | ~1M | Vietnamese | Multi-dialect conversational |
| **Total** | **~3M** | **11+** | **Diverse** |

Dataset Statistics:

  • Total Speakers: 8,686
  • Average Samples/Speaker: ~403
  • Audio Format: 16kHz, Mono
  • Minimum Duration: 2 seconds
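A minimal loader-side check against these constraints (16 kHz mono, at least 2 seconds) might look like the sketch below; `check_utterance` is a hypothetical helper, not part of the released code:

```python
import torchaudio

def check_utterance(path: str, target_sr: int = 16000, min_sec: float = 2.0) -> bool:
    """Return True if the file matches the corpus constraints (16 kHz, mono, >= 2 s)."""
    info = torchaudio.info(path)  # reads metadata without decoding the full file
    duration = info.num_frames / info.sample_rate
    return (
        info.num_channels == 1
        and info.sample_rate == target_sr
        and duration >= min_sec
    )
```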

🛠️ Model Architecture

| Component | Specification |
|---|---|
| Backbone | ECAPA-TDNN |
| Channels | [1024, 1024, 1024, 1024, 3072] |
| Embedding Dimension | 192 |
| Input Features | 80-dim Mel-filterbanks (16 kHz) |
| Parameters | ~20.8 Million |
| Pre-training | VoxCeleb2 (0.87% EER baseline) |
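For orientation, the table above corresponds to speechbrain's ECAPA_TDNN constructor arguments roughly as follows; this is an illustrative assumption, and the shipped config.json remains authoritative:

```python
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN

# Hypothetical config mirroring the architecture table; the released
# config.json may use different or additional keys.
model = ECAPA_TDNN(
    input_size=80,                            # 80-dim Mel-filterbanks
    channels=[1024, 1024, 1024, 1024, 3072],  # TDNN channel widths
    lin_neurons=192,                          # embedding dimension
)
```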

Training Strategy

| Aspect | Configuration |
|---|---|
| Loss Function | Pure Supervised Contrastive Loss (SupCon) |
| Batching | Differentiable Global Gather (B=128 effective negatives) |
| Temperature | Step-wise annealing (τ: 0.10 → 0.07) |
| Optimizer | AdamW (lr=5e-5, weight_decay=1e-4) |
| Scheduler | Cosine decay |
| Augmentation | Simple noise/music, random crop, time-domain masking |
| Training Duration | 14 epochs on 4× NVIDIA A6000 |
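The step-wise temperature annealing can be pictured as a staircase schedule. This is a sketch under an assumed number of stages, not the exact training schedule:

```python
def annealed_tau(step: int, total_steps: int,
                 tau_start: float = 0.10, tau_end: float = 0.07,
                 n_stages: int = 4) -> float:
    """Step-wise (staircase) interpolation of the SupCon temperature."""
    # Which stage are we in? 0 .. n_stages-1
    stage = min(int(step / total_steps * n_stages), n_stages - 1)
    return tau_start + (tau_end - tau_start) * stage / (n_stages - 1)

# Example: tau drops 0.10 -> 0.09 -> 0.08 -> 0.07 over training
print([round(annealed_tau(s, 1000), 3) for s in (0, 300, 600, 999)])
```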

💻 Quick Start

Installation

```bash
# 1. Install ffmpeg for torchaudio
sudo apt-get update && sudo apt-get install -y ffmpeg

# 2. Install PyTorch (CUDA 12.6 build)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

# 3. Install audio & model dependencies
pip install speechbrain huggingface_hub scipy "numpy<2.0"
```
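A quick sanity check after installation confirms that the stack imports cleanly and sees the GPU:

```python
import torch
import torchaudio
import speechbrain

print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```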

Inference Code

```python
import torch
import torchaudio
import torchaudio.transforms as T
import torch.nn.functional as F
import json
import os
from typing import Union, List, Optional
from speechbrain.lobes.models.ECAPA_TDNN import ECAPA_TDNN
from speechbrain.lobes.features import Fbank
from speechbrain.processing.features import InputNormalization
from huggingface_hub import hf_hub_download

# ═════════════════════════════════════════════════════════════════════════
# 1. CORE UTILS
# ═════════════════════════════════════════════════════════════════════════

def safe_normalize(x, p=2.0, dim=1, eps=1e-6):
    """
    Safe L2 normalization for mixed-precision stability.
    Matches training logic exactly.
    """
    return F.normalize(x, p=p, dim=dim, eps=eps)

def safe_load_audio(
    waveform: torch.Tensor,
    sr: int,
    target_sr: int = 16000,
    peak_clip: bool = True,
    global_gain: float = None,
):
    """
    Safe waveform normalization for speaker embedding.
    """
    # Convert to mono
    if waveform.dim() == 2 and waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    elif waveform.dim() == 1:
        waveform = waveform.unsqueeze(0)
    
    # Resample if needed
    if sr != target_sr:
        waveform = T.Resample(sr, target_sr)(waveform)
        sr = target_sr
    
    # Peak normalization
    if peak_clip:
        waveform = torch.clamp(waveform, -1.0, 1.0)
    
    # Optional global gain
    if global_gain is not None:
        waveform = waveform * global_gain
    
    return waveform, sr

# ═════════════════════════════════════════════════════════════════════════
# 2. AS-NORM MODULE
# ═════════════════════════════════════════════════════════════════════════

class ASNorm:
    """
    Adaptive Symmetric Score Normalization (AS-Norm).
    """
    def __init__(self, cohort_embeddings, top_k=200):
        self.cohort = cohort_embeddings  # [N, D]
        self.top_k = top_k
        self.device = self.cohort.device

    def normalize(self, enrollment_emb, test_emb):
        """
        Apply AS-Norm to a verification score.
        Supports both single pair (1D) and batch (2D) inputs.
        """
        # Ensure inputs are 2D [B, D] for consistent matrix ops
        if enrollment_emb.dim() == 1:
            enrollment_emb = enrollment_emb.unsqueeze(0)
        if test_emb.dim() == 1:
            test_emb = test_emb.unsqueeze(0)
            
        # 1. Raw Cosine Score [B]
        raw_score = torch.nn.functional.cosine_similarity(enrollment_emb, test_emb, dim=1)
        
        # 2. Enrollment vs Cohort
        # [B, D] @ [D, N_cohort] -> [B, N_cohort]
        e_scores = torch.mm(enrollment_emb, self.cohort.T)
        e_topk = torch.topk(e_scores, self.top_k, dim=1).values
        e_mean = e_topk.mean(dim=1)
        e_std = e_topk.std(dim=1)
        
        # 3. Test vs Cohort
        t_scores = torch.mm(test_emb, self.cohort.T)
        t_topk = torch.topk(t_scores, self.top_k, dim=1).values
        t_mean = t_topk.mean(dim=1)
        t_std = t_topk.std(dim=1)
        
        # 4. AS-Norm Formula (with epsilon for stability)
        norm_score = 0.5 * ((raw_score - e_mean) / (e_std + 1e-6) + (raw_score - t_mean) / (t_std + 1e-6))
        
        # Return scalar if input was 1D (batch=1), else tensor
        if norm_score.numel() == 1:
            return norm_score.item()
        return norm_score

# ═════════════════════════════════════════════════════════════════════════
# 3. PRODUCTION ENGINE
# ═════════════════════════════════════════════════════════════════════════

class SpeakerEncoder:
    def __init__(self, repo_id="your-username/viEar-V1.0", device=None, local_path=None, top_k=200):
        self.device = device if device else ("cuda" if torch.cuda.is_available() else "cpu")
        print(f"🚀 Initializing Production Engine on {self.device}...")

        # 1. Load Config & Weights
        if local_path:
            config_path = os.path.join(local_path, "config.json")
            model_path = os.path.join(local_path, "pytorch_model.bin")
            cohort_path = os.path.join(local_path, "cohort_embeddings_balanced.pt")
            
        else:
            config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
            model_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
            cohort_path = hf_hub_download(repo_id=repo_id, filename="cohort_embeddings_balanced.pt")
        
        self.load_cohort(cohort_path, top_k=top_k)

        with open(config_path, "r") as f:
            config = json.load(f)

        # Cleanup config params
        config.pop("author", None)
        config.pop("email", None)
        
        # 2. Init Backbone
        self.model = ECAPA_TDNN(**config).to(self.device)
        
        # 🛑 FORCE FLOAT32
        self.model.float() 
        self.model.eval()
        
        # 3. Load State Dict (strip training-wrapper "backbone." prefixes if present)
        state_dict = torch.load(model_path, map_location=self.device)
        if next(iter(state_dict)).startswith("backbone."):
            state_dict = {
                k.replace("backbone.", "", 1): v
                for k, v in state_dict.items() if k.startswith("backbone.")
            }
            
        self.model.load_state_dict(state_dict)
        self.model.eval()
        
        # 4. Features (Stateless)
        self.fbank_extractor = Fbank(sample_rate=16000, n_mels=80).to(self.device)
        self.feature_norm = InputNormalization(norm_type="sentence", std_norm=False).to(self.device)
        
        print("✅ Model ready.")

    def load_cohort(self, cohort_path, top_k=200):
        """
        Loads cohort embeddings for AS-Norm scoring.
        """
        print(f"   Loading cohort from {cohort_path}...")
        try:
            cohort_emb = torch.load(cohort_path, map_location=self.device)
            # Ensure cohort is float32 for scoring precision
            cohort_emb = cohort_emb.float()
            self.as_norm = ASNorm(cohort_emb, top_k=top_k)
            print(f"✅ AS-Norm enabled (Cohort size: {cohort_emb.size(0)})")
        except Exception as e:
            print(f"❌ Failed to load cohort: {e}")

    def _prepare_batch(self, inputs: Union[str, torch.Tensor, List[Union[str, torch.Tensor]]]):
        """
        Internal: Handles loading, mono-conversion, resampling, and padding.
        """
        if not isinstance(inputs, list):
            inputs = [inputs]

        processed_wavs = []
        
        for item in inputs:
            # A. LOAD
            if isinstance(item, str):
                wav, sr = torchaudio.load(item)
            elif isinstance(item, torch.Tensor):
                wav = item
                sr = 16000 # Assumption
            else:
                continue 

            # B. PREPROCESS (mono, resample, peak clip)
            wav, _ = safe_load_audio(
                waveform=wav, 
                sr=sr, 
                target_sr=16000, 
                peak_clip=True, 
            )
            
            if wav.dim() == 1:
                wav = wav.unsqueeze(0)
            
            processed_wavs.append(wav.squeeze(0)) 

        if not processed_wavs:
            return None, None

        # C. PAD TO MAX LENGTH
        orig_lengths = [w.size(0) for w in processed_wavs]
        max_len = max(orig_lengths)
        
        padded_wavs = []
        for w in processed_wavs:
            if w.size(0) < max_len:
                pad_amt = max_len - w.size(0)
                w = F.pad(w, (0, pad_amt), value=0.0)
            padded_wavs.append(w)
            
        batch_tensor = torch.stack(padded_wavs)
        
        # D. RELATIVE LENGTHS (For InputNormalization)
        lengths_tensor = torch.tensor(
            [l / max_len for l in orig_lengths], 
            dtype=torch.float32, 
            device=self.device
        )
        
        return batch_tensor.to(self.device), lengths_tensor

    def extract_fbank_features(self, waveforms: torch.Tensor, lengths: torch.Tensor):
        # 1. Extract
        feats = self.fbank_extractor(waveforms)
        # 2. Safety Pad
        if feats.size(1) < 5:
            pad = 5 - feats.size(1)
            feats = F.pad(feats, (0, 0, 0, pad))
        # 3. Input Norm (Masked)
        feats = self.feature_norm(feats, lengths)
        return feats

    @torch.no_grad()
    def embed(self, inputs: Union[str, torch.Tensor, List[Union[str, torch.Tensor]]]):
        # 1. Prepare Batch
        waveforms, lengths = self._prepare_batch(inputs)
        if waveforms is None: return None

        # 2. Extract Features
        feats = self.extract_fbank_features(waveforms, lengths)
        
        # 3. Backbone
        emb = self.model(feats)
        if emb.dim() == 3: emb = emb.squeeze(1)
            
        # 4. Safe Normalize
        emb = safe_normalize(emb, p=2.0, dim=1, eps=1e-6)
        
        return emb

    def compute_score(self, emb1, emb2, use_asnorm=False):
        """
        Automatic Scoring with Dimension Safety.
        """
        # 1. Fix Dimensions (The Logic Patch)
        # If input is 1D [192] (from slicing a batch), make it 2D [1, 192]
        if emb1.dim() == 1:
            emb1 = emb1.unsqueeze(0)
        if emb2.dim() == 1:
            emb2 = emb2.unsqueeze(0)

        # 2. Normalize (Now safe because shape is [N, D])
        emb1 = safe_normalize(emb1.to(self.device), eps=1e-6)
        emb2 = safe_normalize(emb2.to(self.device), eps=1e-6)

        # 3. AS-Norm
        if self.as_norm is not None and use_asnorm:
            return self.as_norm.normalize(emb1, emb2)
        
        # 4. Fallback Raw Cosine
        # Returns float if single pair, tensor if batch
        score = F.cosine_similarity(emb1, emb2, dim=1)
        if score.numel() == 1:
            return score.item()
        return score

# ═════════════════════════════════════════════════════════════════════════
# 4. HOW TO RUN (EXAMPLE)
# ═════════════════════════════════════════════════════════════════════════

# 1. INIT
engine = SpeakerEncoder(repo_id="thusinh1969/viEar-V1.0", device="cuda")

# 2. INFERENCE
path1 = "./samples/vi_female_north_01.wav"
path2 = "./samples/vi_female_north_02.wav"

embeddings = engine.embed([path1, path2])

# 3. SCORE (uses AS-Norm if the cohort loaded successfully)
use_asnorm = True

if embeddings is not None:
    e1, e2 = embeddings[0], embeddings[1]
    score = engine.compute_score(e1, e2, use_asnorm=use_asnorm)

    print(f"\nFinal Score: {score:.4f}")
    if engine.as_norm is not None and use_asnorm:
        # AS-Norm scores are z-scores; see the threshold guide below
        print(f"Verdict: {'✅ ACCEPT' if score > 3.5 else '❌ REJECT'}")
    else:
        # Raw cosine scores live in [-1, 1]
        print(f"Verdict: {'✅ ACCEPT' if score > 0.75 else '❌ REJECT'}")
```

⚙️ Production Deployment

Recommended Settings

| Parameter | Value | Notes |
|---|---|---|
| Minimum Audio Length | 2.0 seconds | Optimal: 3-5 seconds |
| Sample Rate | 16 kHz | Resample if different |
| Normalization | Sentence-level mean | Required |
| Embedding L2-Norm | Yes | Critical for scoring |

Score Thresholds

Raw Cosine (without AS-Norm), range [-1.0, 1.0]:

  • Score > 0.85: High Confidence
  • Score 0.75 - 0.85: Standard Threshold
  • Score < 0.75: Rejection

For maximum accuracy (1.67% EER), implementing AS-Norm is STRONGLY RECOMMENDED:

AS-Norm Z-Score, range (-inf, +inf):

  • Score > 4.0: High Confidence
  • Score 3.0 - 4.0: Standard Threshold
  • Score 2.0 - 3.0: Uncertain
  • Score < 0.0: Strong Rejection (the pair is less similar than random impostors)

High Confidence is appropriate for Root Access / Financial Transfer; Standard Threshold is appropriate for Device Unlock / 2FA Bypass. A helper that maps scores to these bands is sketched below.
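A small helper that maps a score to the bands above (band edges copied from the lists; the function name is ours, and scores between 0 and 2 are treated as rejections in this sketch):

```python
def verdict(score: float, use_asnorm: bool = True) -> str:
    """Map a verification score to the confidence bands documented above."""
    if use_asnorm:  # AS-Norm z-score, range (-inf, +inf)
        if score > 4.0:
            return "HIGH_CONFIDENCE"   # e.g. Root Access / Financial Transfer
        if score > 3.0:
            return "ACCEPT"            # e.g. Device Unlock / 2FA Bypass
        if score > 2.0:
            return "UNCERTAIN"         # consider a retry or a second factor
        return "REJECT"
    # Raw cosine, range [-1.0, 1.0]
    if score > 0.85:
        return "HIGH_CONFIDENCE"
    if score >= 0.75:
        return "ACCEPT"
    return "REJECT"
```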


🗺️ Roadmap

v1.0 (Current Release)

  • ✅ Pure SupCon training with Differentiable Global Gather
  • ✅ 1.67% EER with AS-Norm
  • ✅ 11+ language support

v2.0 (Planned)

  • 🔄 Sub-Center ArcFace fine-tuning (Target: EER < 1.0%)
  • 🔄 Integrated anti-spoofing module (TTS/Deepfake detection)
  • 🔄 Extended Asian language support (Japanese, Korean, Thai, Indonesian)

Anti-Spoofing (v2.0)

With the proliferation of high-quality TTS systems, anti-spoofing is critical for production deployment. v2.0 will include:

  • Bonafide/Spoof Classifier: Trained on ASVspoof 2019/2021 + internal TTS samples
  • Joint Pipeline: Speaker verification + spoofing detection in unified inference
  • TTS-Aware Training: Negative samples generated from internal TTS models

📜 Citation

If you use this model in your research or production systems, please cite:

```
@misc{nguyen2025viear,
  title={viEar-V1.0: State-of-the-Art Multilingual Speaker Verification via Pure Supervised Contrastive Learning},
  author={Nguyễn Anh Nguyên},
  organization={BRIGHTO JS Company},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/thusinh1969/viEar-V1.0}
}
```

📧 Contact


📄 License

This model is released under the AFL-3.0 (Academic Free License v3.0).


Built with ❤️ by BRIGHTO