RAT: RNN-Attention Transformer for Speech Enhancement

Cited: 0
Authors
Zhang, Tailong [1 ]
He, Shulin [1 ]
Li, Hao [1 ]
Zhang, Xueliang [1 ]
Affiliations
[1] Inner Mongolia Univ, Coll Comp Sci, Hohhot, Peoples R China
Source
2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2022
Keywords
Speech enhancement; Transformer; Self-Attention; Noise
DOI
10.1109/ISCSLP57327.2022.10037952
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Benefiting from the global modeling capability of the self-attention mechanism, Transformer-based models have seen increasing use in natural language processing (NLP) and automatic speech recognition. The Transformer's long-range receptive field overcomes the catastrophic forgetting that affects Recurrent Neural Networks (RNNs). However, unlike NLP and speech recognition tasks, which rely on global information, speech enhancement depends more on local information, so the original Transformer is not optimally suited to it. In this paper, we propose an improved Transformer model called the RNN-Attention Transformer (RAT), which applies multi-head self-attention (MHSA) along the temporal dimension. The input sequence is split into chunks, and different models are applied within and across chunks: since RNNs model local information better than self-attention does, an RNN models intra-chunk information while self-attention models inter-chunk information. Experiments show that RAT significantly reduces the parameter count and improves performance compared to the baseline.
Pages: 463-467
Page count: 5
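The chunked intra/inter split described in the abstract can be sketched in PyTorch. This is an illustrative reconstruction, not the paper's implementation: the class name `RATBlock`, the `chunk_size` parameter, the BiLSTM choice for the intra-chunk RNN, and the residual/LayerNorm placement are all assumptions.

```python
# Hypothetical sketch of one RAT-style block: an RNN models frames
# within each chunk, and MHSA models dependencies across chunks.
# Names and layer choices are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RATBlock(nn.Module):
    def __init__(self, dim: int, chunk_size: int, heads: int = 4):
        super().__init__()
        self.chunk_size = chunk_size
        # BiLSTM whose two directions concatenate back to `dim`
        self.intra_rnn = nn.LSTM(dim, dim // 2, batch_first=True,
                                 bidirectional=True)
        self.inter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); pad time so it divides into chunks
        b, t, d = x.shape
        c = self.chunk_size
        pad = (-t) % c
        x = F.pad(x, (0, 0, 0, pad))
        n = x.shape[1] // c  # number of chunks
        # Intra-chunk: RNN over the frames inside each chunk
        intra = x.reshape(b * n, c, d)
        intra = self.norm1(self.intra_rnn(intra)[0] + intra)
        # Inter-chunk: self-attention across corresponding frame
        # positions of different chunks
        inter = intra.reshape(b, n, c, d).transpose(1, 2).reshape(b * c, n, d)
        inter = self.norm2(self.inter_attn(inter, inter, inter)[0] + inter)
        out = inter.reshape(b, c, n, d).transpose(1, 2).reshape(b, n * c, d)
        return out[:, :t]  # drop the padding
```

Because the RNN only ever sees `chunk_size` frames and attention only ever sees `time / chunk_size` chunk positions, each sub-module operates on a short sequence, which is one way the parameter and compute savings claimed in the abstract could arise.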