Real-Time Single Channel Speech Enhancement Using Triple Attention and Stacked Squeeze-TCN

Cited by: 2
Authors
Jannu, Chaitanya [1]
Burra, Manaswini [2]
Vanambathina, Sunny Dayal [3]
Parisae, Veeraswamy [1]
Affiliations
[1] NRI Inst Technol Autonomous, Dept CSE Data Sci, Agiripalli, India
[2] Potti Sriramulu Chalavadhi Mallikarjuna Rao Coll E, Dept CSE, Vijayawada, India
[3] VIT AP Univ, Sch Elect Engn, Amaravati, India
Keywords
deep neural network (DNN); ideal ratio mask (IRM); triple attention block (TAB); neural network; self-attention; noise; dereverberation; recognition; separation
DOI
10.1111/coin.70016
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Speech enhancement is crucial in many speech processing applications. Recently, researchers have explored ways to improve performance by effectively capturing the long-term contextual relationships within speech signals. Multi-stage learning, in which several deep learning modules are activated one after another, has proven to be an effective approach. Attention mechanisms have also been explored for improving speech quality and have shown significant gains. Attention modules are typically designed to improve the performance of CNN backbone networks; however, they often rely on fully connected (FC) and convolution layers, which increase the model's parameter count and computational cost. The present study applies multi-stage learning to speech enhancement. The proposed model uses a multi-stage structure in which, at each stage, a triple attention block (TAB) is followed by a sequence of squeeze temporal convolutional modules (STCM) with successively doubled dilation rates. An estimate is generated at each stage and refined in the subsequent stage. To reintroduce the original information, a feature fusion module (FFM) is inserted at the beginning of each following stage. By repeatedly unfolding STCMs, the intermediate output undergoes several phases of step-by-step refinement, ultimately yielding a precise estimate of the spectrum. The TAB is designed to improve model performance by allowing it to concentrate simultaneously on regions of interest along the channel, spatial, and time-frequency dimensions. More specifically, the channel-spatial attention (CSA) consists of two parallel branches combining channel and spatial attention, so that the channel and spatial dimensions are captured simultaneously. The signal is then emphasized as a function of time and frequency by aggregating the feature maps along these dimensions, which improves the model's ability to capture the temporal dependencies of speech signals. Using the VCTK and LibriSpeech datasets, the proposed speech enhancement system is evaluated against state-of-the-art deep learning techniques and yields better results in terms of PESQ, STOI, CSIG, CBAK, and COVL.
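The abstract describes the architecture only at a high level. The following is a minimal, illustrative PyTorch sketch of two of the described components: a stack of squeeze-style temporal convolutional modules whose dilation rates double from layer to layer, and a triple attention block combining parallel channel and spatial attention with time-frequency gating. All class names, channel sizes, and layer layouts here are assumptions made for illustration; this is not the authors' implementation.

```python
# Illustrative sketch only; layer choices and dimensions are assumptions,
# not the implementation described in the paper.
import torch
import torch.nn as nn


class SqueezeTCMBlock(nn.Module):
    """One squeeze-TCN block: 1x1 squeeze -> causal dilated depthwise conv -> 1x1 expand."""

    def __init__(self, channels: int, squeeze: int, dilation: int, kernel: int = 3):
        super().__init__()
        self.pad = (kernel - 1) * dilation          # causal padding for real-time use
        self.squeeze = nn.Conv1d(channels, squeeze, 1)
        self.dconv = nn.Conv1d(squeeze, squeeze, kernel, dilation=dilation,
                               padding=self.pad, groups=squeeze)
        self.expand = nn.Conv1d(squeeze, channels, 1)
        self.act = nn.PReLU()

    def forward(self, x):                             # x: (batch, channels, time)
        y = self.act(self.squeeze(x))
        y = self.act(self.dconv(y)[..., :-self.pad])  # trim future frames (causality)
        return x + self.expand(y)                     # residual connection


class STCMStack(nn.Module):
    """Stack of squeeze-TCN blocks whose dilation rate doubles: 1, 2, 4, 8, ..."""

    def __init__(self, channels: int = 64, squeeze: int = 32, depth: int = 6):
        super().__init__()
        self.blocks = nn.Sequential(
            *[SqueezeTCMBlock(channels, squeeze, dilation=2 ** i) for i in range(depth)]
        )

    def forward(self, x):
        return self.blocks(x)


class TripleAttentionBlock(nn.Module):
    """Parallel channel and spatial attention, followed by time/frequency gating."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: global average pool -> bottleneck -> sigmoid gate.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention: mean/max maps over channels -> 7x7 conv -> sigmoid gate.
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        # Time and frequency gates computed from features averaged over the other axis.
        self.time_gate = nn.Sequential(nn.Conv1d(channels, channels, 3, padding=1), nn.Sigmoid())
        self.freq_gate = nn.Sequential(nn.Conv1d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, x):                             # x: (batch, channels, freq, time)
        # Channel-spatial attention: two parallel branches, combined by addition.
        ca = x * self.channel(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        sa = x * self.spatial(pooled)
        y = ca + sa
        # Time-frequency attention: aggregate along freq (resp. time), gate, broadcast back.
        t = self.time_gate(y.mean(dim=2))             # (batch, channels, time)
        f = self.freq_gate(y.mean(dim=3))             # (batch, channels, freq)
        return y * t.unsqueeze(2) * f.unsqueeze(3)


if __name__ == "__main__":
    feats = torch.randn(1, 64, 161, 100)              # (batch, channels, freq, time)
    attended = TripleAttentionBlock(64)(feats)
    tcn_out = STCMStack(64)(attended.mean(dim=2))     # collapse freq before the 1-D TCN
    print(attended.shape, tcn_out.shape)
```

In a full multi-stage model of the kind the abstract describes, each stage would pass its input through a TAB followed by an STCM stack, with a feature fusion module reinjecting the original spectral features at the start of the next stage; that wiring is omitted here for brevity.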
Pages: 13