PASE: Phonologically Anchored Speech Enhancer

PASE is a state-of-the-art generative speech enhancement model trained to remove noise and reverberation while preserving linguistic content and speaker identity. It operates on 16 kHz mono audio.


Model Details

Model Description

High-level system design

PASE contains two main components:

  • Denoising WavLM (DeWavLM)
    Fine‑tuned from WavLM‑Large using denoising representation distillation (DRD).
    Performs robust noise suppression while mitigating linguistic hallucinations by leveraging the phonological prior of self-supervised WavLM.

  • Dual‑Stream Vocoder
    Reconstructs audio using DeWavLM's dual-stream representations:

    • Phonetic representation: high-level linguistic structure
    • Acoustic representation: speaker identity and prosody
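
The two-stage flow above can be sketched with toy stand-ins. The function names, frame rate, and feature dimensions below are illustrative assumptions, not the actual PASE API; the sketch only shows how the two streams travel from DeWavLM to the vocoder:

```python
import numpy as np

def dewavlm(noisy: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Toy stand-in for DeWavLM: maps a 16 kHz waveform to frame-level
    phonetic and acoustic streams (20 ms hop assumed, i.e. 320 samples)."""
    n_frames = max(1, len(noisy) // 320)
    phonetic = np.zeros((n_frames, 1024))   # high-level linguistic structure
    acoustic = np.zeros((n_frames, 1024))   # speaker identity and prosody
    return phonetic, acoustic

def vocoder(phonetic: np.ndarray, acoustic: np.ndarray) -> np.ndarray:
    """Toy stand-in for the dual-stream vocoder: fuses both streams and
    upsamples frames back to a 16 kHz waveform."""
    fused = phonetic + acoustic
    return np.repeat(fused[:, 0], 320)      # 1 frame -> 320 samples

def enhance(noisy: np.ndarray) -> np.ndarray:
    ph, ac = dewavlm(noisy)
    return vocoder(ph, ac)

clean_estimate = enhance(np.random.randn(16000))  # 1 s of noisy audio
```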

Developed by: Cisco Systems, Inc.
Cisco product group: Collaboration AI (Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki)
Model type: Generative Speech Enhancement
License: Apache 2.0
Finetuned from: WavLM-Large


Model Sources

  • Repository: https://github.com/cisco-open/pase
  • Paper: https://doi.org/10.1609/aaai.v40i39.40562

Uses

Direct Use

  • Enhance noisy or reverberant speech recordings
  • Improve perceptual quality and intelligibility
  • Preserve speaker identity and linguistic content
  • Supports 16 kHz mono audio

Out-of-Scope Use

  • Medical, legal, or safety‑critical decisions
  • Voice conversion or identity manipulation
  • Non‑speech audio enhancement

How to Get Started

Refer to the repository for quick-start code and examples:
https://github.com/cisco-open/pase


Training Details

Training Data

We release a PASE checkpoint that has been trained on an updated list of datasets. For this release, training used:

  • Clean speech:
    • DNS5 Challenge clean-speech resources derived from the LibriVox public-domain subset
    • LibriTTS
    • VCTK
  • Noise:
    • DNS5 Challenge noise resources
  • Room impulse responses:
    • OpenSLR26 and OpenSLR28
These source datasets were used to prepare training mixtures and train the released model. The model card and repository do not redistribute the underlying dataset contents; please refer to the original dataset pages and licenses below.

Dataset Attribution

  • DNS5 Challenge clean speech (LibriVox subset): clean-speech material prepared from LibriVox through the DNS Challenge. The LibriVox recordings used for this portion are public domain and were used as clean-speech training data for the released checkpoint.
  • LibriTTS: LibriTTS by Heiga Zen et al., licensed under CC BY 4.0. It was used as clean-speech training data for the released checkpoint.
  • VCTK Corpus: the VCTK dataset from the Centre for Speech Technology Research, University of Edinburgh, licensed under CC BY 4.0. It was used as clean-speech training data for the released checkpoint.
  • DNS5 Challenge noise resources: noise data prepared through the DNS Challenge and used to synthesize noisy training mixtures for the released checkpoint. For this release, the DNS5 noise resources draw on AudioSet material licensed under CC BY 4.0, selected Freesound files licensed under CC0 1.0, and DEMAND environmental recordings licensed under CC BY-SA 3.0.
  • OpenSLR26 and OpenSLR28: OpenSLR26 and OpenSLR28 room impulse response resources, both licensed under Apache 2.0, were used to add reverberation during training.

All audio was resampled to 16 kHz.
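
As a minimal illustration of the 16 kHz mono requirement, the sketch below downmixes and resamples by linear interpolation. This is only for illustration; a production pipeline would use a band-limited polyphase resampler (e.g. from torchaudio or scipy) to avoid aliasing:

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix (channels, samples) to mono and resample via np.interp."""
    if audio.ndim == 2:
        audio = audio.mean(axis=0)          # average channels to mono
    if sr == target_sr:
        return audio
    n_out = int(round(len(audio) / sr * target_sr))
    t_in = np.arange(len(audio)) / sr       # original sample times (s)
    t_out = np.arange(n_out) / target_sr    # target sample times (s)
    return np.interp(t_out, t_in, audio)

y = to_16k_mono(np.random.randn(2, 48000), sr=48000)  # 1 s stereo at 48 kHz
```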

Training Procedure

Preprocessing

  • Mixtures generated dynamically
  • SNR sampled from –5 to 15 dB
  • Reverberation applied with 50% probability

Training Hyperparameters

  • DeWavLM: 100k steps, LR 1e‑4, batch size 4
  • Vocoder: 200k steps, LR 2e‑4, batch size 12
  • Optimizer: AdamW with warmup + cosine decay
  • Hardware: 4 × NVIDIA RTX 4090 GPUs
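
The warmup-plus-cosine schedule can be written as a function of the step index. The peak LRs and step counts match the table above; the warmup length is an assumption, since the card does not state it:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int = 5000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# DeWavLM: 100k steps at peak 1e-4; vocoder would use 200k steps at peak 2e-4
lrs = [lr_at(s, 100_000, 1e-4) for s in (0, 5_000, 50_000, 100_000)]
```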

Speeds, Sizes, Times

  • Total parameters: ~382M
  • Inference compute: ~21.4 GMAC/s

Evaluation

Testing Data

Metrics

  • DNSMOS, UTMOS
  • LPS, SpeechBERTScore (SBS)
  • Speaker Similarity (RawNet3)
  • WER (OWSM v3.1)
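
WER as reported here is the standard word-level edit distance normalized by reference length. A minimal computation is sketched below (the actual evaluation transcribes with OWSM v3.1 before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / max(1, len(ref))
```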

Results

Performance of the released checkpoint compared with the results reported in the paper:

Model                   DNSMOS  UTMOS  SBS   LPS   SpkSim  WER (%)
Vocoder-L24 (paper)     3.23    3.40   0.94  0.97  0.65    2.86
Vocoder-L24 (released)  3.29    3.30   0.94  0.96  0.59    3.46
DeWavLM (paper)         3.26    3.42   0.88  0.93  0.57    7.62
DeWavLM (released)      3.31    3.39   0.88  0.93  0.52    7.25
PASE (paper)            3.12    3.09   0.90  0.93  0.80    7.49
PASE (released)         3.08    3.21   0.91  0.94  0.80    6.76

On our simulated test set, the released checkpoint performs very close to the results reported in the paper.

Overall, PASE achieves:

  • Lowest WER among evaluated generative and discriminative baselines
  • Highest speaker similarity (SpkSim)
  • Strong perceptual quality with low hallucination rates
  • Consistent performance across noisy and reverberant conditions

Bias, Risks, and Limitations

  • Model trained primarily on English speech; performance may degrade for other languages.
  • Very strong noise or mismatched reverberation conditions can introduce artifacts.
  • Speaker characteristics are preserved but not guaranteed perfectly.

Recommendations

Evaluate outputs for your specific use case. Avoid deployments where misunderstanding enhanced speech could have safety or legal consequences.


Citation

If you use PASE in your research, please cite:

@article{PASE, 
    title={{PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}},
    volume={40},
    DOI={10.1609/aaai.v40i39.40562}, 
    number={39}, 
    journal={Proceedings of the AAAI Conference on Artificial Intelligence}, 
    author={Rong, Xiaobin and Hu, Qinwen and Yesilbursa, Mansur and Wojcicki, Kamil and Lu, Jing}, 
    year={2026},
    month={Mar.}, 
    pages={32826-32834}
}

Copyright © 2026 by Cisco Systems, Inc. All rights reserved.

Model Card Authorship & Contact
