Papers
arxiv:2605.00777

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Published on May 1
· Submitted by
Venkata Pushpak Teja Menta
on May 4

Abstract

A language-adversarial speaker encoder (LASE) is proposed to address cross-script voice cloning issues by training with contrastive loss and gradient-reversal learning to produce language-uninformative yet speaker-informative embeddings.

AI-generated summary

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.

Community

Paper author Paper submitter

LASE is a language-adversarial speaker encoder for cross-script identity preservation in Indic TTS. The problem: when a multilingual TTS clones the same voice across scripts (English → Hindi → Telugu → Tamil), speaker identity drifts measurably (within-script SECS 0.93 → cross-script 0.83, vs the across-speaker noise floor of 0.64). LASE wraps a frozen speaker-encoder backbone (WavLM-base-plus or ECAPA-TDNN) with a gradient-reversal-layer language classifier that strips language-specific signal from the speaker embedding while preserving identity. On a 1118-pair Indian-voice held-out, LASE r1 cuts cross-script drift 5× and shows 3× headroom over the within-script ceiling. Both the WavLM backbone and the GRL ablation contribute; ECAPA+GRL replicates the finding. Paper, code, and weights are open-source.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2605.00777
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.00777 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.00777 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.00777 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.