🎯 WAVe: 1B Multimodal Embedding Model for Word-Level Speech Quality
Multimodal embeddings for speech + transcript that verify quality at the word level, not just sentence level. Catches mispronunciations, timing errors, and prosody issues that sentence-level filters miss.
📊 Impact on Portuguese ASR:
• 34% reduction in training steps
• 50% better cross-domain generalization
• 30% less synthetic data needed
• Word-aligned attention finds errors other methods miss
🏗️ Architecture:
• Text: XLM-RoBERTa (278M params)
• Audio: Wav2Vec2-BERT 2.0 (581M params)
• Word Alignment: Multi-head attention + GLU (14M params)
• Total: ~1B parameters
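
As a quick sanity check on the parameter budget, the three component sizes listed above can be summed directly. This is only arithmetic on the figures in this post; the dictionary keys are labels chosen for illustration, and the actual checkpoint may contain parameters not listed here.

```python
# Component sizes in millions of parameters, copied from the list above.
# The key names are illustrative labels, not identifiers from the model code.
components_m = {
    "text_encoder_xlm_roberta": 278,
    "audio_encoder_wav2vec2_bert": 581,
    "word_alignment_attention_glu": 14,
}

total_m = sum(components_m.values())

# Show how much of the listed budget each component accounts for.
for name, size_m in components_m.items():
    share = 100 * size_m / total_m
    print(f"{name}: {size_m}M ({share:.1f}% of listed parameters)")

print(f"Listed components sum to {total_m}M, i.e. about {total_m / 1000:.1f}B")
```

The listed components add up to 873M, so the 1B figure is a rounded total, and the word-alignment module is a small fraction of it.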
from transformers import AutoModel, AutoProcessor

# Load the processor and model (both use custom code from the Hub, hence trust_remote_code)
processor = AutoProcessor.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True
)

# Assess speech-transcript alignment
# audio_array: the waveform for the utterance, sampled at 16 kHz (load it beforehand)
inputs = processor(text="Olá, como está?", audio=audio_array, sampling_rate=16000, return_tensors="pt")
quality = model(**inputs).quality_score.item()

Perfect for filtering synthetic speech datasets before ASR training.
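
A minimal sketch of how such a score could drive dataset filtering, assuming higher scores mean better speech-transcript alignment. `filter_by_quality`, `score_fn`, the sample fields, and the 0.5 threshold are all hypothetical; in practice `score_fn` would wrap the `model(**inputs).quality_score.item()` call shown above, and the threshold would be tuned on held-out data. A stand-in scorer with made-up scores is used here so the sketch runs without downloading the model.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, object]


def filter_by_quality(
    samples: List[Sample],
    score_fn: Callable[[Sample], float],
    threshold: float = 0.5,  # hypothetical cutoff, not a value from the post
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (kept, dropped) by comparing each score to the threshold."""
    kept: List[Sample] = []
    dropped: List[Sample] = []
    for sample in samples:
        score = score_fn(sample)
        (kept if score >= threshold else dropped).append(sample)
    return kept, dropped


def stand_in_score(sample: Sample) -> float:
    """Placeholder scorer: reads a precomputed score instead of calling the model."""
    return float(sample["precomputed_score"])


# Toy data with invented scores, purely to exercise the function.
samples = [
    {"text": "Olá, como está?", "precomputed_score": 0.91},
    {"text": "Bom dia.", "precomputed_score": 0.32},
    {"text": "Até logo!", "precomputed_score": 0.50},
]

kept, dropped = filter_by_quality(samples, stand_in_score, threshold=0.5)
print(f"kept {len(kept)} of {len(samples)} samples")
```

Keeping the scorer as a callable makes it easy to swap the placeholder for the real model without touching the filtering logic.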
Model: yuriyvnv/WAVe-1B-Multimodal-PT
Code to create WAVe: https://github.com/yuriyvnv/WAVe
#multimodal #speech #embeddings #asr
#syntheticdata #qualityassessment