Multi-stage temporal representation learning via global and local perspectives for real-time speech enhancement

Cited by: 1
Authors
Chau, Hoang Ngoc [1]
Linh, Nguyen Thi Nhat [1]
Doan, Tuan Kiet [1]
Nguyen, Quoc Cuong [1]
Affiliation
[1] Hanoi Univ Sci & Technol, Sch Elect & Elect Engn, Hanoi 100000, Vietnam
Keywords
Speech enhancement; Deep learning-based; Global and local modeling; Self-attention; Graph convolution; Neural network; Domain; Beamformer; Attention
DOI
10.1016/j.apacoust.2024.110067
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Deep learning-based speech enhancement algorithms have developed rapidly over the past few years. Although numerous approaches have been proposed, the global and local information in speech features has not been thoroughly investigated. In this paper, we introduce a novel and highly effective speech enhancement network called the Multi-stage Global-Local Network (MSGLN), which exploits both local and global information via temporal self-attention, temporal graph convolution, and 1D convolution. Local modeling blocks capture rapid changes in speech signals, while global modeling blocks learn long-term trends in noise or speech signals reflected in factors such as pitch, tone, resonance, timbre, and rhythm. In addition, we propose a multi-stage temporal processing module as the bottleneck of a complex convolutional encoder-decoder structure to guide the network to learn acoustic structures at different scales. A dual-path RNN post-processing module then reconstructs the speech spectrum mask using a frequency-wise temporal refinement block followed by a frame-wise spectral refinement block. Experimental results demonstrate that the proposed method outperforms other state-of-the-art methods on both real-time single-channel and multi-channel speech enhancement tasks.
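The abstract pairs local modeling (1D convolution) with global modeling (temporal self-attention and temporal graph convolution). The pure-Python sketch below illustrates that pairing on a sequence of T frames with D features each. It is not the authors' MSGLN implementation: the function names, the identity query/key/value projections, the causal masking (assumed because the target task is real-time), the fixed temporal graph linking each frame to its previous `hops` frames, and the sum-plus-residual fusion are all illustrative assumptions.

```python
import math
from typing import List

Seq = List[List[float]]  # T frames x D features (hypothetical feature layout)


def local_conv1d(x: Seq, kernel: List[float]) -> Seq:
    """Local branch: causal 1D convolution over time, shared across features.

    Output frame t mixes only frames t, t-1, ..., t-len(kernel)+1.
    """
    out = []
    for t in range(len(x)):
        row = [0.0] * len(x[0])
        for k, w in enumerate(kernel):
            src = t - k
            if src < 0:
                break
            for d in range(len(row)):
                row[d] += w * x[src][d]
        out.append(row)
    return out


def global_self_attention(x: Seq) -> Seq:
    """Global branch: causal scaled dot-product self-attention over time.

    Identity query/key/value projections are an assumption made for brevity.
    """
    dim = len(x[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for t in range(len(x)):
        # Frame t attends to every past frame 0..t (causal mask).
        scores = [scale * sum(a * b for a, b in zip(x[t], x[s])) for s in range(t + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(weights[s] * x[s][d] for s in range(t + 1)) / total
                    for d in range(dim)])
    return out


def temporal_graph_conv(x: Seq, weight: Seq, hops: int) -> Seq:
    """Graph branch: one GCN-style layer H' = A_norm @ H @ W on a temporal graph.

    Each frame is a node linked to itself and its previous `hops` frames;
    row-normalising the adjacency makes aggregation a mean over neighbours.
    """
    out = []
    for t in range(len(x)):
        nbrs = range(max(0, t - hops), t + 1)
        agg = [sum(x[s][d] for s in nbrs) / len(nbrs) for d in range(len(x[0]))]
        # Project the aggregated features with the (D_in x D_out) weight matrix.
        out.append([sum(agg[i] * weight[i][j] for i in range(len(agg)))
                    for j in range(len(weight[0]))])
    return out


def global_local_block(x: Seq, kernel: List[float], weight: Seq, hops: int) -> Seq:
    """Fuse the three branches by summation plus a residual (assumed fusion rule).

    `weight` must be D x D so that every branch keeps the input feature size.
    """
    branches = [local_conv1d(x, kernel), global_self_attention(x),
                temporal_graph_conv(x, weight, hops)]
    return [[x[t][d] + sum(b[t][d] for b in branches) for d in range(len(x[0]))]
            for t in range(len(x))]
```

For example, `local_conv1d([[2.0], [4.0], [6.0]], [0.5, 0.5])` gives `[[1.0], [3.0], [5.0]]`: each output frame depends only on a short window of recent frames, whereas every output of `global_self_attention` can draw on the whole past sequence. Keeping all branches causal means no output frame depends on future input, which is the property a real-time enhancer needs.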
Pages: 10