Real-world data point: VibeVoice-ASR on Twitter/X Spaces audio (crypto-AI vocabulary)
Sharing some real-world observations from running VibeVoice-ASR-7B on Twitter/X Spaces audio (multi-speaker live audio rooms, frequently with crypto/AI/Web3 vocabulary). May be useful as an out-of-distribution data point alongside the standard benchmarks. Not a bug report; the model performs well overall.
## What I tested
- 4 Twitter/X Spaces captures, ~3.5 hours of audio total
- Topics: Web3/crypto/agentic-AI conversations
- Audio characteristics: AAC-encoded m4a from yt-dlp, multi-speaker, music intros, occasional packet-loss artifacts
- 24 GB GPU (RTX 4090), `--attn_implementation sdpa`, `microsoft/VibeVoice-ASR` weights
## What worked well
- Diarization: clean speaker boundaries within each segment, distinct speaker IDs assigned reliably
- Music handling: `[Music]` tags emitted for music intros instead of attempting to transcribe them or generating filler text. Notably better than what whisper-large-v3 tends to do in similar conditions (which often produces "thank you for watching" / random hallucinations during silence).
- Punctuation and casing: sentence boundaries and capitalization look plausible, with no obvious all-caps or all-lowercase artifacts.
- Timestamps: sensible ~30-second granularity, with breaks at natural sentence pauses.
## Hallucination / substitution patterns observed
The model's failure modes on this domain seem concentrated in named-entity substitution for AI / crypto / Web3 vocabulary, including handles. Concrete examples from two Spaces:
### Space 1: "The Rise of Agentic AI" (~30 min) - AI vocabulary
"Anthropic" β "Entropic" (consistent across multiple occurrences in the same Space):
"Mythos from Entropic literally got out of their own little secure environment and bragged about it afterwards."
The speaker is referring to Anthropic (the company that makes Claude); context makes this unambiguous.
"OpenClaude" / "Claude Code" β "Open Claw" (multiple occurrences):
"We're actually leveraging Open Claw so that we can use literally any model that we want."
Likely refers to a Claude Code / OpenClaude integration; the speaker discussed Anthropic's Claude product earlier.
"agentic" β "GenTech" (in the topic title, even):
"We got one, two, three, four guests supposed to be here today talking about the GenTech rise of AI."
The Space's actual title is "The Rise of Agentic AI." The model misheard the topic word.
### Space 2: "The Alpha Den Ep. 22" / Saga DAO Alpha Club (~1h 47m) - crypto / Web3 vocabulary + handles
"DAO" β "Dell" (one occurrence among multiple correct ones β inconsistent within the same Space):
"As we mentioned, the Saga Dell and the Alpha Club have been absolutely cooking."
The project is "Saga DAO Alpha Club", and the same transcript correctly produces "Saga DAO" three other times. Inconsistent treatment of an in-context, repeated noun.
"the DAO and" β "the Down the" (multi-word substitution):
"...both the Down the Alpha Club, I've been simultaneously cooking..."
The speaker said "both the DAO and the Alpha Club"; the model dropped "DAO and" and replaced it with "Down the." More involved than a single-token swap.
`oldmanstillcan` (Twitter handle) → "Old Man Still Can" (6 occurrences in a single Space):
"...check in with our brother Old Man Still Can..."
The host's Twitter/X handle is `@oldmanstillcan`, a single concatenated string. The model consistently splits it into 4 separate capitalized words. This may be an interesting failure mode in its own right: social-media handles are pronounced as a continuous string, but the model is segmenting on word boundaries that don't actually exist in the source string.
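For pipelines where the speaker roster is known up front (true for a Space you choose to transcribe), a post-processing pass can rejoin these splits. A minimal sketch, assuming a hand-maintained handle list; `KNOWN_HANDLES` and the matching approach are my own illustration, not anything VibeVoice provides:

```python
import re

# Hypothetical roster of handles for this Space -- an assumption,
# not something the model outputs.
KNOWN_HANDLES = ["oldmanstillcan"]

def rejoin_handles(text: str, handles=KNOWN_HANDLES) -> str:
    """Collapse spoken-handle splits like 'Old Man Still Can' back to
    '@oldmanstillcan' when the letters (ignoring spaces/case) match a
    known handle."""
    for handle in handles:
        # Allow optional whitespace between every character of the handle,
        # so 'Old Man Still Can' matches 'oldmanstillcan'.
        pattern = re.compile(
            r"\b" + r"\s*".join(map(re.escape, handle)) + r"\b",
            re.IGNORECASE,
        )
        text = pattern.sub("@" + handle, text)
    return text

print(rejoin_handles("...check in with our brother Old Man Still Can..."))
# -> "...check in with our brother @oldmanstillcan..."
```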
These look like the standard ASR failure mode where novel domain terminology is mapped to phonetically nearby in-distribution words. None of the substitutions are absurd; they're all plausible-but-wrong, which is the harder failure mode. The handle-segmentation pattern in particular feels like a specific subset worth thinking about for social-audio domains where handles get said aloud constantly.
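One cheap way to surface these near-misses in bulk is to fuzzy-match transcript tokens against a domain glossary. A minimal sketch using `difflib` string similarity as a stand-in for a proper phonetic metric; the glossary, cutoff, and helper name are all my own illustration:

```python
import difflib

# Hypothetical domain glossary -- an assumption for illustration.
GLOSSARY = ["Anthropic", "agentic", "DAO", "Claude"]

def flag_near_misses(transcript: str, glossary=GLOSSARY, cutoff=0.7):
    """Yield (token, candidate) pairs where a transcript token looks like a
    near-miss of a glossary term, e.g. 'Entropic' ~ 'Anthropic'."""
    for word in transcript.split():
        token = word.strip('.,?!"\'')
        if token in glossary:
            continue  # exact hit, nothing to flag
        match = difflib.get_close_matches(token, glossary, n=1, cutoff=cutoff)
        if match:
            yield token, match[0]

line = "Mythos from Entropic literally got out of their own secure environment."
print(list(flag_near_misses(line)))
# -> [('Entropic', 'Anthropic')]
```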
## Why this might be useful
Twitter/X Spaces are a substantial corpus of multi-speaker conversational audio in technical domains (crypto, AI, startups, tech opinion shows), and to my knowledge audio of this character isn't represented in the standard ASR benchmarks (LibriSpeech, AMI, AISHELL, MLC-Challenge per the docs). Real-world deployment on this content surfaces vocabulary failure modes that benchmark numbers wouldn't predict.
If Microsoft Research is interested in building out evaluation coverage for live audio rooms or crypto/AI conversational content, happy to share more samples. The audio is publicly accessible (these are public Twitter Spaces), so it's straightforward to share via URL or audio file.
## Setup details for reproducibility
```bash
git clone https://github.com/microsoft/VibeVoice
cd VibeVoice
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

yt-dlp -x --audio-format m4a "https://x.com/i/spaces/<id>"

python demo/vibevoice_asr_inference_from_file.py \
  --model_path microsoft/VibeVoice-ASR \
  --audio_files <space>.m4a \
  --device cuda --attn_implementation sdpa
```
Thanks for the open-source release. VibeVoice-ASR is the strongest single-pass long-form ASR model I've seen with built-in diarization. The fact that the failure modes are interesting at all (rather than catastrophic) is a credit to the underlying work.