This repository provides the official PyTorch implementation of the following paper:
Robust One-step Speech Enhancement via Consistency Distillation (IEEE WASPAA 2025, Oral Presentation)
Liang Xu, Longfei Felix Yan, W. Bastiaan Kleijn
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025
🔗 Project Website | 📄 arXiv Preprint | 📄 IEEE Xplore
- Real-Time Efficiency: Proposes a one-step consistency training (CT) framework for highly efficient, real-time speech enhancement.
- Improved Robustness: Mitigates the accumulation of teacher-induced biases via randomized trajectory training and auxiliary time-domain constraints.
- Superior Performance: Runs inference 54× faster than the foundational 30-step teacher model while surpassing its performance.
- Strong Generalization: Demonstrates robust generalization capabilities across out-of-domain and dynamic real-world acoustic scenarios.
The table below compares the proposed 1-step Consistency Training (CT) model with the baseline 30-step Teacher model on the VoiceBank-DEMAND test set. CT cuts sampling from 30 steps to 1 and scores higher on every reported objective metric, with the largest gain in PESQ (2.89 to 3.47) and the smallest in ESTOI (0.86 to 0.87).
| Model | Steps | PESQ (↑) | ESTOI (↑) | SI-SDR (↑) | SI-SIR (↑) | SI-SAR (↑) |
|---|---|---|---|---|---|---|
| Teacher | 30 | 2.89 ± 0.67 | 0.86 ± 0.10 | 16.7 ± 3.7 | 26.7 ± 5.8 | 17.6 ± 3.4 |
| CT (Ours) | 1 | 3.47 ± 0.67 | 0.87 ± 0.10 | 19.2 ± 3.6 | 29.2 ± 5.4 | 20.0 ± 3.7 |
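For readers unfamiliar with the SI-SDR column, the sketch below computes the standard scale-invariant signal-to-distortion ratio between an enhanced signal and its clean reference. It is not the repository's evaluation code: the function name `si_sdr` and the plain-list interface are assumptions chosen to keep the example dependency-free, and a real evaluation would typically operate on NumPy arrays or tensors.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for two equal-length 1-D sequences (illustrative)."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("estimate and reference must be non-empty and equal in length")

    # Remove the mean so a DC offset does not count as signal or distortion.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]

    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0.0:
        raise ValueError("reference has zero energy after mean removal")

    # Project the estimate onto the reference: this is the optimally scaled target.
    alpha = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    if residual_energy == 0.0:
        return math.inf   # estimate is an exact scaled copy of the reference
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / residual_energy)


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    enhanced = [1.1, -0.9, 0.9, -1.1]  # clean signal plus a small orthogonal error
    print(round(si_sdr(enhanced, clean), 2))  # about 20 dB
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the enhanced signal leaves the score unchanged, which is what distinguishes SI-SDR from plain SDR.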
We release the pre-trained checkpoints alongside corresponding enhanced audio outputs for both the 30-step teacher model and our 1-step CT model.
- Checkpoints: Download via Google Drive
- Enhanced Audio Outputs: Download via Google Drive
Usage instructions:
Extract the downloaded checkpoints into the logs/ directory (./logs/). To reproduce the reported benchmark results, update the checkpoint paths in the evaluation scripts (scripts/eval_CT.sh or scripts/eval_teacher.sh) so that they point to the extracted files.
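Before editing the evaluation scripts, it can help to list which checkpoint files actually ended up under logs/. The helper below is a minimal sketch, not part of the repository: the name `find_checkpoints` is hypothetical, and the `*.ckpt` pattern is an assumption about the checkpoint file extension, so adjust it to match the downloaded files.

```python
from pathlib import Path


def find_checkpoints(log_dir="./logs", pattern="*.ckpt"):
    """Return checkpoint paths under log_dir, newest first (hypothetical helper).

    The "*.ckpt" default is an assumption about the file extension.
    """
    root = Path(log_dir)
    if not root.is_dir():
        return []
    # rglob searches all subdirectories; sort by modification time, newest first.
    return sorted(root.rglob(pattern), key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in find_checkpoints():
        print(path)
```

Any path it prints can then be copied into the checkpoint variable of the relevant evaluation script.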
We recommend utilizing an isolated virtual environment with Python 3.11. To initialize the environment and install dependencies, execute:
# Clone the repository
git clone https://github.com/liangxu123/rosecd.git
cd rosecd
# Install required packages
pip install -r requirements.txt

Note: For experiment tracking via Weights & Biases (W&B), please configure your environment using wandb login prior to initiating training.
Our data preprocessing pipeline is adapted from the established SGMSE+ framework. To configure the dataset directories, please update the corresponding paths in path_config.sh. By default, the configuration points to the VoiceBank-DEMAND corpus paths.
Consistency Models inherently support two distinct training paradigms:
- Consistency Distillation (CD): Distilling knowledge from a pre-trained teacher diffusion model.
- Consistency Training (CT): Direct training on the empirical data distribution without the necessity of a teacher model.
Although our paper formalizes and evaluates the method with Consistency Distillation (CD), subsequent experiments showed that Consistency Training (CT), run in the same codebase, yields the same performance. Because CT needs no pre-trained teacher model, it simplifies the training procedure and avoids teacher-induced approximation errors.
This repository therefore releases training scripts for both the Teacher model and the CT model: CT achieves the same state-of-the-art one-step enhancement performance as CD with a much simpler pipeline.
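The practical difference between the two paradigms lies in how the pair of adjacent noisy points fed to the consistency loss is built. The toy sketch below illustrates this under loud assumptions: it uses the variance-exploding parameterization x_t = x + t·z from the original consistency-models formulation, which is not necessarily the SDE used in this repository, and all function names are hypothetical. CD obtains the lower-noise point from one teacher-driven Euler step of the probability-flow ODE, while CT reuses the same noise sample with the clean signal.

```python
def euler_step(x_high, t_high, t_low, denoised):
    """One Euler step of dx/dt = (x - D(x, t)) / t from t_high down to t_low."""
    return [xh + (t_low - t_high) * (xh - d) / t_high for xh, d in zip(x_high, denoised)]


def cd_pair(clean, noise, t_low, t_high, teacher_denoiser):
    """Consistency Distillation: the teacher's estimate defines the lower-noise point."""
    x_high = [c + t_high * z for c, z in zip(clean, noise)]
    x_low = euler_step(x_high, t_high, t_low, teacher_denoiser(x_high, t_high))
    return x_low, x_high


def ct_pair(clean, noise, t_low, t_high):
    """Consistency Training: both points reuse the same noise; no teacher is needed."""
    x_high = [c + t_high * z for c, z in zip(clean, noise)]
    x_low = [c + t_low * z for c, z in zip(clean, noise)]
    return x_low, x_high


if __name__ == "__main__":
    clean, noise = [0.5, -0.25], [1.0, -2.0]

    def oracle(x, t):
        # A perfect teacher that returns the clean signal exactly.
        return clean

    def biased(x, t):
        # An imperfect teacher with a constant error of 0.2 (made up for illustration).
        return [c + 0.2 for c in clean]

    print(ct_pair(clean, noise, 0.5, 1.0)[0])          # CT lower-noise point
    print(cd_pair(clean, noise, 0.5, 1.0, oracle)[0])  # identical with a perfect teacher
    print(cd_pair(clean, noise, 0.5, 1.0, biased)[0])  # shifted by the teacher's error
```

With a perfect teacher the two constructions coincide, and any teacher error shifts the CD training pair, which is one way to read the claim that CT avoids teacher-induced approximation errors.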
To train the one-step model from scratch using CT, execute:
bash ./scripts/train_CT.sh <GPU_ID> # e.g., bash ./scripts/train_CT.sh 0

To train the multi-step Teacher model (if you wish to reproduce the baseline or teacher pipeline), execute:

bash ./scripts/train_teacher.sh <GPU_ID> # e.g., bash ./scripts/train_teacher.sh 0

To benchmark the one-step consistency model on the test corpus, utilize the evaluation scripts provided in the scripts/ directory.
Prior to execution, verify that the checkpoint path inside scripts/eval_CT.sh correctly points to your trained model weights.
bash ./scripts/eval_CT.sh <GPU_ID> # e.g., bash ./scripts/eval_CT.sh 0

This procedure generates the enhanced audio files within out/onestep_pesq5e-4_CT/N_1 and subsequently computes standard objective metrics (e.g., PESQ, ESTOI, SI-SDR).
Evaluating the Baseline Teacher Model:
If a baseline multi-step Teacher model was trained (via scripts/train_teacher.sh), it can be evaluated by updating the checkpoint path within scripts/eval_teacher.sh and running:
bash ./scripts/eval_teacher.sh <GPU_ID> # e.g., bash ./scripts/eval_teacher.sh 0

Enhanced outputs will be saved to out/teacher/N_30 and evaluated automatically.
If this codebase or methodology proves useful in your research, please consider citing our work:
@inproceedings{xu2025robust,
title={Robust One-step Speech Enhancement via Consistency Distillation},
author={Xu, Liang and Yan, Longfei Felix and Kleijn, W Bastiaan},
booktitle={2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
pages={1--5},
year={2025},
organization={IEEE},
keywords={Noise;Speech enhancement;Robustness;Real-time systems;Trajectory;Recording;Noise measurement;Iterative methods;Time-domain analysis;Optimization},
doi={10.1109/WASPAA66052.2025.11230988}
}

We express our gratitude to the authors of the SGMSE+ repository, upon whose exemplary foundational work this codebase is built.