Dual-stream Noise and Speech Information Perception based Speech Enhancement

Cited by: 1
Authors
Li, Nan [1 ]
Wang, Longbiao [1 ]
Zhang, Qiquan [2 ]
Dang, Jianwu [1 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China
[2] Univ New South Wales, Sydney, Australia
Funding
National Natural Science Foundation of China;
Keywords
Speech enhancement; Dual-stream; Attention; MMSE-LSA; ROBUST;
DOI
10.1016/j.eswa.2024.125432
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In real-world scenarios, dynamic ambient noise often degrades speech quality, highlighting the need for advanced speech enhancement techniques. Traditional methods, which rely on static embeddings as auxiliary features, struggle to address the complexities of varying noise conditions. To overcome this, we propose a Dual-stream Noise and Speech Information Perception (DNSIP) approach that dynamically detects and processes both noise and speech through dedicated information extraction and suppression mechanisms. Typically, non-speech segments predominantly contain environmental noise, while speech segments carry information about the intended speaker. To handle this dynamic nature, real-time voice activity detection (VAD) is employed to accurately differentiate between speech and noise components. Building on the VAD estimates, we propose an information extraction framework that selectively extracts the relevant noise and speech features from the noisy input, establishing a dual-stream network for concurrent noise and speech learning. To account for the temporal and spectral variability of noise and speech, a frequency-sequence attention mechanism is integrated, enhancing the model's ability to learn contextual and spectral dependencies. Additionally, an information suppression module is introduced to minimize cross-stream interference by attenuating noise within the speech stream and suppressing speech content within the noise stream. The derived noise and speech spectrograms are then used to formulate a minimum mean square error log-spectral amplitude (MMSE-LSA) estimator for robust speech enhancement. Experimental evaluations on the WSJ0 and VCTK+DEMAND datasets demonstrate that our DNSIP approach surpasses existing state-of-the-art methods, underscoring its efficacy in challenging acoustic environments.
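For context, the MMSE-LSA estimator referenced above goes back to Ephraim and Malah (1985). In its standard form (using the conventional symbols rather than the paper's own notation, which is not given in this record), the enhanced spectral amplitude in frequency bin k is obtained by scaling the noisy amplitude R_k with a gain driven by the a priori SNR xi_k and the a posteriori SNR gamma_k; in the DNSIP setting these SNR terms would presumably be derived from the estimated speech and noise spectrograms of the two streams:

\[
\hat{A}_k \;=\; \frac{\xi_k}{1+\xi_k}\,
\exp\!\left(\frac{1}{2}\int_{v_k}^{\infty}\frac{e^{-t}}{t}\,dt\right) R_k,
\qquad
v_k \;=\; \frac{\xi_k}{1+\xi_k}\,\gamma_k .
\]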
Pages: 12