Convolutional Transformer based Local and Global Feature Learning for Speech Enhancement

Cited by: 0
Authors
Jannu, Chaitanya [1 ]
Vanambathina, Sunny Dayal [1 ]
Affiliations
[1] VIT AP Univ, Sch Elect Engn, Amaravati, India
Keywords
Convolutional neural network; recurrent neural network; speech enhancement; multi-head attention; two-stage convolutional transformer; feed-forward network; neural network; dilated convolutions; recognition
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Speech enhancement (SE) is an important method for improving speech quality and intelligibility in noisy environments, where the received speech is severely distorted by noise. An efficient speech enhancement system relies on accurately modelling the long-term dependencies of noisy speech. Deep learning has greatly benefited from the use of transformers, where long-term dependencies can be modelled more efficiently with multi-head attention (MHA), which exploits sequence similarity. Transformers frequently outperform recurrent neural network (RNN) and convolutional neural network (CNN) models on many tasks while supporting parallel processing. In this paper, we propose a two-stage convolutional transformer for speech enhancement in the time domain. The transformer captures global information and allows parallel computation, which reduces long-term noise. Unlike the two-stage transformer neural network (TSTNN), the proposed work uses different transformer structures for the intra- and inter-transformers to extract both the local and the global features of noisy speech. Moreover, a CNN module is added to the transformer so that short-term noise can be reduced more effectively, exploiting the ability of CNNs to extract local information. The experimental findings demonstrate that the proposed model outperforms existing models in terms of STOI (short-time objective intelligibility) and PESQ (perceptual evaluation of speech quality).
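The abstract describes two mechanisms that lend themselves to a sketch: a transformer block augmented with a CNN module (global features via MHA, local features via convolution) and the two-stage intra/inter chunk processing. Below is a minimal PyTorch sketch of both ideas; all class names, layer sizes, the chunk length, and the kernel width are illustrative assumptions, not the paper's published architecture.

```python
# Minimal sketch only: names, layer sizes, chunk length, and kernel width
# are illustrative assumptions, not the paper's actual specification.
import torch
import torch.nn as nn


class ConvTransformerBlock(nn.Module):
    """Transformer block with a CNN module: MHA models long-term (global)
    dependencies, a depthwise 1-D convolution models short-term (local)
    structure, and a position-wise feed-forward network follows."""

    def __init__(self, d_model=64, n_heads=4, kernel_size=7):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(  # depthwise conv over the time axis
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),
            nn.GELU(),
        )
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                      # x: (batch, time, d_model)
        h = self.norm1(x)
        attn, _ = self.mha(h, h, h)            # global context via MHA
        x = x + attn
        h = self.norm2(x).transpose(1, 2)      # (batch, d_model, time)
        x = x + self.conv(h).transpose(1, 2)   # local context via CNN module
        return x + self.ffn(self.norm3(x))


class TwoStageModule(nn.Module):
    """Two-stage processing: an intra-transformer attends within short
    chunks (local features), then an inter-transformer attends across
    chunks at the same frame position (global features)."""

    def __init__(self, d_model=64, chunk_len=25):
        super().__init__()
        self.chunk_len = chunk_len
        self.intra = ConvTransformerBlock(d_model)
        self.inter = ConvTransformerBlock(d_model)

    def forward(self, x):  # x: (batch, time, d_model), time % chunk_len == 0
        b, t, d = x.shape
        k = t // self.chunk_len
        x = x.reshape(b * k, self.chunk_len, d)   # fold chunks into batch
        x = self.intra(x)                         # attend within chunks
        x = x.reshape(b, k, self.chunk_len, d).transpose(1, 2)
        x = x.reshape(b * self.chunk_len, k, d)   # group across chunks
        x = self.inter(x)                         # attend across chunks
        x = x.reshape(b, self.chunk_len, k, d).transpose(1, 2)
        return x.reshape(b, t, d)


# Usage: 100 frames of 64-dim features split into 4 chunks of 25 frames.
module = TwoStageModule()
frames = torch.randn(2, 100, 64)
print(module(frames).shape)  # torch.Size([2, 100, 64])
```

Folding the chunk axis into the batch axis lets the same block implementation serve both stages, while using separate ConvTransformerBlock instances for the intra and inter stages mirrors the abstract's point that, unlike TSTNN, the two stages need not share one transformer structure.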
Pages: 731-743 (13 pages)