PASE: Phonologically Anchored Speech Enhancer
PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.
Model Details
Model Description
PASE contains two main components:
Denoising WavLM (DeWavLM)
Fine‑tuned from WavLM‑Large using denoising representation distillation (DRD).
Performs robust noise suppression while effectively mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.
Dual‑Stream Vocoder
Reconstructs audio using DeWavLM's dual-stream representations:
- Phonetic representation: high-level linguistic structure
- Acoustic representation: speaker identity and prosody
Developed by: Cisco Systems, Inc. (Copyright © 2026 by Cisco Systems, Inc. All rights reserved.)
Cisco product group: Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)
Model type: Generative Speech Enhancement
License: Apache 2.0
Finetuned from: WavLM-Large
Model Sources
- Repository: https://github.com/cisco-open/pase
- Paper: https://arxiv.org/abs/2511.13300
- Demo: https://xiaobin-rong.github.io/pase_demo/
Uses
Direct Use
- Enhance noisy or reverberant speech recordings
- Improve perceptual quality and intelligibility
- Preserve speaker identity and linguistic content
- Supports 16 kHz mono audio
Out-of-Scope Use
- Medical, legal, or safety‑critical decisions
- Voice conversion or identity manipulation
- Non‑speech audio enhancement
How to Get Started
Refer to the repository for quick-start code and examples:
https://github.com/cisco-open/pase
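The repository documents the supported inference API. As a rough illustration of the input format the model expects (16 kHz mono), the sketch below downmixes and resamples an arbitrary recording with NumPy. `to_16k_mono` is a hypothetical helper, not part of the PASE codebase, and linear interpolation is only a stand-in for a proper polyphase resampler.

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sample_rate: int, target_rate: int = 16000) -> np.ndarray:
    """Downmix to mono and resample via linear interpolation.

    `audio` is float32 with shape (samples,) or (samples, channels).
    Linear interpolation is a rough stand-in; use a proper polyphase
    resampler (e.g. torchaudio or soxr) in practice.
    """
    if audio.ndim == 2:  # average channels to mono
        audio = audio.mean(axis=1)
    if sample_rate == target_rate:
        return audio.astype(np.float32)
    duration = audio.shape[0] / sample_rate
    n_out = int(round(duration * target_rate))
    t_in = np.arange(audio.shape[0]) / sample_rate
    t_out = np.arange(n_out) / target_rate
    return np.interp(t_out, t_in, audio).astype(np.float32)

# Example: 1 second of 48 kHz stereo becomes 16000 mono samples.
stereo = np.random.randn(48000, 2).astype(np.float32)
mono16k = to_16k_mono(stereo, 48000)
print(mono16k.shape)  # (16000,)
```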
Training Details
Training Data
We release a PASE checkpoint trained on an updated list of datasets. For this release, training used:
- Clean speech: DNS5 Challenge clean speech (LibriVox subset), LibriTTS, VCTK
- Noise: DNS5 Challenge noise resources
- Room impulse responses: OpenSLR26 and OpenSLR28
These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.
Dataset Attribution
- DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from LibriVox through the DNS Challenge. The LibriVox recordings used for this portion are public domain and were used as clean-speech training data for the released checkpoint.
- LibriTTS: LibriTTS by Heiga Zen et al., licensed under CC BY 4.0. It was used as clean-speech training data for the released checkpoint.
- VCTK Corpus: the VCTK dataset from the Centre for Speech Technology Research, University of Edinburgh, licensed under CC BY 4.0. It was used as clean-speech training data for the released checkpoint.
- DNS5 Challenge noise resources: noise data prepared through the DNS Challenge and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on AudioSet material licensed under CC BY 4.0, selected Freesound files licensed under CC0 1.0, and DEMAND environmental recordings licensed under CC BY-SA 3.0.
- OpenSLR26 and OpenSLR28: OpenSLR26 and OpenSLR28 room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.
All audio was resampled to 16 kHz.
Training Procedure
Preprocessing
- Mixtures generated dynamically
- SNR sampled from –5 to 15 dB
- Reverberation applied with 50% probability
Training Hyperparameters
- DeWavLM: 100k steps, LR 1e‑4, batch size 4
- Vocoder: 200k steps, LR 2e‑4, batch size 12
- Optimizer: AdamW with warmup + cosine decay
- Hardware: 4 × NVIDIA RTX 4090 GPUs
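The warmup-plus-cosine schedule named above can be sketched as a simple function of the step index. The warmup length below is an assumption for illustration; the model card does not state it.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# DeWavLM stage: 100k steps at a peak LR of 1e-4 (warmup length assumed).
total, peak, warmup = 100_000, 1e-4, 5_000
print(lr_at_step(0, total, peak, warmup))        # 0.0
print(lr_at_step(warmup, total, peak, warmup))   # 1e-4 (peak)
print(lr_at_step(total, total, peak, warmup))    # 0.0
```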
Speeds, Sizes, Times
- Total parameters: ~382M
- Inference compute: ~21.4 GMAC/s
Evaluation
Testing Data
- Simulated test set built from the LibriTTS test split
- DNS1 test set with/without reverberation
Metrics
- DNSMOS, UTMOS
- LPS, SpeechBERTScore (SBS)
- Speaker Similarity (RawNet3)
- WER (OWSM v3.1)
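For reference, the WER metric reported above is the standard word-level edit distance normalized by reference length (transcripts come from OWSM v3.1 in our evaluation; the function below only illustrates the metric's definition).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(1, len(ref))

print(word_error_rate("the cat sat", "the cat sat"))   # 0.0
print(word_error_rate("the cat sat", "the bat sat"))   # 0.333...
```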
Results
Performance of the released checkpoints compared with the paper's results:
| Model | DNSMOS | UTMOS | SBS | LPS | SpkSim | WER (%) |
|---|---|---|---|---|---|---|
| Vocoder-L24 (paper) | 3.23 | 3.40 | 0.94 | 0.97 | 0.65 | 2.86 |
| Vocoder-L24 (released) | 3.29 | 3.30 | 0.94 | 0.96 | 0.59 | 3.46 |
| DeWavLM (paper) | 3.26 | 3.42 | 0.88 | 0.93 | 0.57 | 7.62 |
| DeWavLM (released) | 3.31 | 3.39 | 0.88 | 0.93 | 0.52 | 7.25 |
| PASE (paper) | 3.12 | 3.09 | 0.90 | 0.93 | 0.80 | 7.49 |
| PASE (released) | 3.08 | 3.21 | 0.91 | 0.94 | 0.80 | 6.76 |
The released checkpoints closely match the paper's results on our simulated test set.
Overall, PASE achieves:
- Lowest WER among evaluated generative and discriminative baselines
- Highest speaker similarity (SpkSim)
- Strong perceptual quality with low hallucination rates
- Consistent performance across noisy and reverberant conditions
Bias, Risks, and Limitations
- Model trained primarily on English speech; performance may degrade for other languages.
- Very strong noise or mismatched reverberation conditions can introduce artifacts.
- Speaker characteristics are preserved but not guaranteed perfectly.
Recommendations
Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.
Citation
If you use PASE in your research, please cite:
@article{PASE,
  title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
  volume={40},
  DOI={10.1609/aaai.v40i39.40562},
  number={39},
  journal={Proceedings of the AAAI Conference on Artificial Intelligence},
  author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing},
  year={2026},
  month={Mar.},
  pages={32826--32834}
}
Model Card Authorship & Contact
- Mansur Yesilbursa: myesilbu@cisco.com