Hello everyone, yesterday there were minor problems that prevented use of the embedding model, mainly caused by the processor class.
Posting here that the team has already fixed the bugs.
If you still run into problems, first delete the cached model (the .cache folder used by Hugging Face) and re-download it; if the issue persists, open a thread on the model page.
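If you prefer to force a clean copy from Python rather than deleting the cache folder by hand, a minimal sketch looks like this (the repo id is shown as an example; use the model you need):
```
from huggingface_hub import snapshot_download

# force_download=True ignores cached files and fetches fresh ones
snapshot_download(
    "yuriyvnv/WAVe-1B-Multimodal-PT",  # example repo id
    force_download=True,
)
```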
🎯 WAVe: 1B Multimodal Embedding Model for Word-Level Speech Quality
Multimodal embeddings for speech + transcript that verify quality at the word level, not just sentence level. Catches mispronunciations, timing errors, and prosody issues that sentence-level filters miss.
📊 Impact on Portuguese ASR:
• 34% reduction in training steps
• 50% better cross-domain generalization
• 30% less synthetic data needed
• Word-aligned attention finds errors other methods miss
🏗️ Architecture:
• Text: XLM-RoBERTa (278M params)
• Audio: Wav2Vec2-BERT 2.0 (581M params)
• Word Alignment: Multi-head attention + GLU (14M params); see the sketch after this list
• Total: 1B parameters
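The repository has the authoritative implementation; purely as an illustration, here is a minimal sketch of what an alignment block of this shape could look like, assuming word-level text embeddings attend over frame-level audio embeddings and a GLU gates the fused result. Module names and dimensions are assumptions (in practice the two encoders' hidden sizes would need projecting to a shared width):
```
import torch
import torch.nn as nn

class WordAlignmentBlock(nn.Module):
    """Hypothetical word-alignment fusion: text tokens query audio frames."""
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # GLU halves the channel dim, so project to 2*dim first
        self.glu_proj = nn.Linear(dim, 2 * dim)
        self.glu = nn.GLU(dim=-1)

    def forward(self, word_emb, audio_emb):
        # word_emb: (batch, n_words, dim) from the text encoder
        # audio_emb: (batch, n_frames, dim) from the audio encoder
        aligned, attn_weights = self.cross_attn(word_emb, audio_emb, audio_emb)
        # Gate the aligned audio evidence per word
        fused = self.glu(self.glu_proj(aligned))
        return fused, attn_weights  # attn_weights localize each word in time
```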
```
from transformers import AutoModel, AutoProcessor

# The repo ships custom processor/model code, hence trust_remote_code=True
processor = AutoProcessor.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True,
)
model = AutoModel.from_pretrained(
    "yuriyvnv/WAVe-1B-Multimodal-PT",
    trust_remote_code=True,
)

# Assess speech-transcript alignment
# (audio_array: 16 kHz mono waveform, e.g. loaded with soundfile or librosa)
inputs = processor(
    text="Olá, como está?",
    audio=audio_array,
    sampling_rate=16000,
    return_tensors="pt",
)
quality = model(**inputs).quality_score.item()
```
Perfect for filtering synthetic speech datasets before ASR training.
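As a usage illustration, a filtering loop could look like the sketch below; the 0.8 cutoff and the `samples` structure (dicts with a transcript and a 16 kHz waveform) are assumptions, not values from the model card:
```
# Keep only samples whose speech-transcript quality clears a threshold.
THRESHOLD = 0.8  # assumed cutoff; tune on a held-out labeled set

filtered = []
for sample in samples:  # each: {"text": str, "audio": np.ndarray at 16 kHz}
    inputs = processor(
        text=sample["text"],
        audio=sample["audio"],
        sampling_rate=16000,
        return_tensors="pt",
    )
    score = model(**inputs).quality_score.item()
    if score >= THRESHOLD:
        filtered.append(sample)
```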
Model: https://huggingface.co/yuriyvnv/WAVe-1B-Multimodal-PT
Code to create WAVe: https://github.com/yuriyvnv/WAVe
#multimodal #speech #embeddings #asr
#syntheticdata #qualityassessment