A Composite Predictive-Generative Approach to Monaural Universal Speech Enhancement

Times Cited: 0
Authors
Zhang, Jie [1 ]
Yan, Haoyin [1 ]
Li, Xiaofei [2 ]
Affiliations
[1] Univ Sci & Technol China USTC, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei 230027, Peoples R China
[2] Westlake Univ, Sch Engn, Hangzhou 310030, Peoples R China
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2025 / Vol. 33
Keywords
Predictive models; Diffusion models; Computational modeling; Speech enhancement; Training; Diffusion processes; Stochastic processes; Standards; Noise reduction; Image reconstruction; Computational complexity; diffusion model; generative-predictive modeling; universal speech enhancement (USE); SEPARATION; MODELS; PHASE;
DOI
10.1109/TASLPRO.2025.3577387
CLC number
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Designing a single model that can suppress diverse distortions and improve speech quality, i.e., universal speech enhancement (USE), is a promising direction. Compared with supervised learning-based predictive methods, diffusion-based generative models have shown greater potential owing to their ability to regenerate clean speech from degraded signals whose information is severely damaged. However, they may introduce artifacts under highly adverse conditions, and diffusion models often carry a heavy computational burden due to the many inference steps. To jointly leverage the strengths of prediction and generation while overcoming their respective defects, in this work we propose a universal speech enhancement model, PGUSE, that combines predictive and generative modeling. The model consists of two branches: the predictive branch directly predicts clean samples from degraded signals, while the generative branch optimizes the denoising objective of diffusion models. We employ an output-fusion and a truncated-diffusion scheme to integrate the two branches: the former directly combines the results of both branches, and the latter modifies the reverse diffusion process to start from the initial estimate produced by the predictive branch. Extensive experiments on several datasets verify the superiority of the proposed model over state-of-the-art baselines, demonstrating the complementarity and benefits of combining predictive and generative modeling.
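The abstract's two integration mechanisms can be sketched in outline. The following is a minimal, illustrative NumPy sketch, not the paper's implementation: `predictive_branch` and `denoiser` are hypothetical stand-ins for the actual networks, and the step count `tau`, noise scale, and fusion weight `alpha` are assumed placeholders. It only shows the control flow: truncated diffusion starts the reverse process from the predictive estimate (plus a little noise) at an intermediate step rather than from pure noise, and output fusion directly combines the two branches' results.

```python
import numpy as np


def predictive_branch(y):
    # Hypothetical stand-in for the predictive network: a simple
    # attenuation mimics direct clean-speech estimation from noisy input.
    return 0.8 * y


def denoiser(x_t, t):
    # Hypothetical stand-in for the diffusion denoising network at step t.
    return 0.9 * x_t


def truncated_reverse_diffusion(y, tau=5, noise_scale=0.1, rng=None):
    """Run only tau reverse steps, initialized from the predictive
    estimate plus mild noise, instead of T steps from pure noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = predictive_branch(y) + noise_scale * rng.standard_normal(y.shape)
    for t in range(tau, 0, -1):  # truncated: tau << full step count T
        x = denoiser(x, t)
    return x


def output_fusion(y, alpha=0.5):
    """Directly combine predictive and generative outputs (assumed
    fixed-weight fusion; the paper's fusion rule may differ)."""
    x_pred = predictive_branch(y)
    x_gen = truncated_reverse_diffusion(y)
    return alpha * x_pred + (1 - alpha) * x_gen
```

The design point illustrated is that a good predictive initialization lets the generative branch skip most of the reverse trajectory, which is what reduces the inference cost the abstract mentions.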
Pages: 2312-2325
Page count: 14