PASE: Phonologically Anchored Speech Enhancer

PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.


Model Details

Model Description

High-level system design

PASE contains two main components:

  • Denoising WavLM (DeWavLM)
    Fine‑tuned from WavLM‑Large using denoising representation distillation (DRD).
    Performs robust noise suppression while mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.

  • Dual‑Stream Vocoder
    Reconstructs audio using DeWavLM's dual-stream representations:

    • Phonetic representation: high-level linguistic structure
    • Acoustic representation: speaker identity and prosody
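
The two-stage flow above can be sketched with toy stand-ins. The function names, frame rate, and feature dimensions below are illustrative assumptions, not the actual PASE API; the sketch only shows how the two streams travel from DeWavLM to the vocoder:

```python
import numpy as np

def dewavlm(noisy: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Toy stand-in for DeWavLM: maps a 16 kHz waveform to frame-level
    phonetic and acoustic streams (20 ms hop assumed, i.e. 320 samples)."""
    n_frames = max(1, len(noisy) // 320)
    phonetic = np.zeros((n_frames, 1024))   # high-level linguistic structure
    acoustic = np.zeros((n_frames, 1024))   # speaker identity and prosody
    return phonetic, acoustic

def vocoder(phonetic: np.ndarray, acoustic: np.ndarray) -> np.ndarray:
    """Toy stand-in for the dual-stream vocoder: fuses both streams and
    upsamples frames back to a 16 kHz waveform."""
    fused = phonetic + acoustic
    return np.repeat(fused[:, 0], 320)      # 1 frame -> 320 samples

def enhance(noisy: np.ndarray) -> np.ndarray:
    ph, ac = dewavlm(noisy)
    return vocoder(ph, ac)

clean_estimate = enhance(np.random.randn(16000))  # 1 s of noisy audio
```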

Developed by: Cisco Systems, Inc.
Cisco product group: Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)
Model type: Generative Speech Enhancement
License: Apache 2.0
Finetuned from: WavLM-Large


Model Sources

  • Repository: https://github.com/cisco-open/pase
  • Paper: https://doi.org/10.1609/aaai.v40i39.40562

Uses

Direct Use

  • Enhance noisy or reverberant speech recordings
  • Improve perceptual quality and intelligibility
  • Preserve speaker identity and linguistic content
  • Supports 16 kHz mono audio

Out-of-Scope Use

  • Medical, legal, or safety‑critical decisions
  • Voice conversion or identity manipulation
  • Non‑speech audio enhancement

How to Get Started

Refer to the repository for quick-start code and examples:
https://github.com/cisco-open/pase


Training Details

Training Data

We release a PASE checkpoint that has been trained on an updated list of datasets. For this release, training used:

  • Clean speech:
    • DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
    • LibriTTS
    • VCTK
  • Noise:
    • DNS5 Challenge noise resources
  • Room impulse responses:
    • OpenSLR26 and OpenSLR28
These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.

Dataset Attribution

  • DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from LibriVox through the DNS Challenge. The LibriVox recordings used for this portion are public domain and were used as clean-speech training data for the released checkpoint.
  • LibriTTS: LibriTTS by Heiga Zen et al., licensed under CC BY 4.0. It was used as clean-speech training data for the released checkpoint.
  • VCTK Corpus: the VCTK dataset from the Centre for Speech Technology Research, University of Edinburgh, licensed under CC BY 4.0. It was used as clean-speech training data for the released checkpoint.
  • DNS5 Challenge noise resources: noise data prepared through the DNS Challenge and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on AudioSet material licensed under CC BY 4.0, selected Freesound files licensed under CC0 1.0, and DEMAND environmental recordings licensed under CC BY-SA 3.0.
  • OpenSLR26 and OpenSLR28: OpenSLR26 and OpenSLR28 room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.

All audio was resampled to 16 kHz.
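
As a minimal illustration of the 16 kHz mono requirement, the sketch below downmixes and resamples by linear interpolation. This is only for illustration; a production pipeline would use a band-limited polyphase resampler (e.g. from torchaudio or scipy) to avoid aliasing:

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix (channels, samples) to mono and resample via np.interp."""
    if audio.ndim == 2:
        audio = audio.mean(axis=0)          # average channels to mono
    if sr == target_sr:
        return audio
    n_out = int(round(len(audio) / sr * target_sr))
    t_in = np.arange(len(audio)) / sr       # original sample times (s)
    t_out = np.arange(n_out) / target_sr    # target sample times (s)
    return np.interp(t_out, t_in, audio)

y = to_16k_mono(np.random.randn(2, 48000), sr=48000)  # 1 s stereo at 48 kHz
```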

Training Procedure

Preprocessing

  • Mixtures generated dynamically
  • SNR sampled from –5 to 15 dB
  • Reverberation applied with 50% probability

Training Hyperparameters

  • DeWavLM: 100k steps, LR 1e‑4, batch size 4
  • Vocoder: 200k steps, LR 2e‑4, batch size 12
  • Optimizer: AdamW with warmup + cosine decay
  • Hardware: 4 × NVIDIA RTX 4090 GPUs
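
The warmup-plus-cosine schedule can be written as a function of the step index. The peak LRs and step counts match the table above; the warmup length is an assumption, since the card does not state it:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int = 5000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# DeWavLM: 100k steps at peak 1e-4; vocoder would use 200k steps at peak 2e-4
lrs = [lr_at(s, 100_000, 1e-4) for s in (0, 5_000, 50_000, 100_000)]
```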

Speeds, Sizes, Times

  • Total parameters: ~382M
  • Inference compute: ~21.4 GMAC/s

Evaluation

Testing Data

Metrics

  • DNSMOS, UTMOS
  • LPS, SpeechBERTScore (SBS)
  • Speaker Similarity (RawNet3)
  • WER (OWSM v3.1)
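
WER as reported here is the standard word-level edit distance normalized by reference length. A minimal computation is sketched below (the actual evaluation transcribes with OWSM v3.1 before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / max(1, len(ref))
```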

Results

Performance of the released checkpoint compared with the results reported in the paper:

Model                   DNSMOS  UTMOS  SBS   LPS   SpkSim  WER (%)
Vocoder-L24 (paper)     3.23    3.40   0.94  0.97  0.65    2.86
Vocoder-L24 (released)  3.29    3.30   0.94  0.96  0.59    3.46
DeWavLM (paper)         3.26    3.42   0.88  0.93  0.57    7.62
DeWavLM (released)      3.31    3.39   0.88  0.93  0.52    7.25
PASE (paper)            3.12    3.09   0.90  0.93  0.80    7.49
PASE (released)         3.08    3.21   0.91  0.94  0.80    6.76

On our simulated test set, the released checkpoint performs very close to the results reported in the paper.

Overall, PASE achieves:

  • Lowest WER among evaluated generative and discriminative baselines
  • Highest speaker similarity (SpkSim)
  • Strong perceptual quality with low hallucination rates
  • Consistent performance across noisy and reverberant conditions

Bias, Risks, and Limitations

  • Model trained primarily on English speech; performance may degrade for other languages.
  • Very strong noise or mismatched reverberation conditions can introduce artifacts.
  • Speaker characteristics are preserved but not guaranteed perfectly.

Recommendations

Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.


Citation

If you use PASE in your research, please cite:

@article{PASE, 
    title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
    volume={40},
    DOI={10.1609/aaai.v40i39.40562}, 
    number={39}, 
    journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 
    author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing}, 
    year={2026},
    month={Mar.}, 
    pages={32826-32834}
}

Copyright © 2026 by Cisco Systems, Inc. All rights reserved.

Model Card Authorship & Contact
