🎯 WAVe: 1B Multimodal Embedding Model for Word-Level Speech Quality
Multimodal embeddings for speech + transcript that verify quality at the word level, not just sentence level. Catches mispronunciations, timing errors, and prosody issues that sentence-level filters miss.
📊 Impact on Portuguese ASR: • 34% reduction in training steps • 50% better cross-domain generalization • 30% less synthetic data needed • Word-aligned attention finds errors other methods miss