Convolutional Transformer based Local and Global Feature Learning for Speech Enhancement

Cited by: 0
Authors
Jannu, Chaitanya [1 ]
Vanambathina, Sunny Dayal [1 ]
Affiliations
[1] VIT AP Univ, Sch Elect Engn, Amaravati, India
Keywords
Convolutional neural network; recurrent neural network; speech enhancement; multi-head attention; two-stage convolutional transformer; feed-forward network; neural network; dilated convolutions; recognition
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods]
Discipline code
081202
Abstract
Speech enhancement (SE) is an important method for improving speech quality and intelligibility in noisy environments, where the received speech is severely distorted by noise. An efficient speech enhancement system relies on accurately modelling the long-term dependencies of noisy speech. Deep learning has greatly benefited from the use of transformers, where long-term dependencies can be modelled more efficiently with multi-head attention (MHA), which exploits sequence similarity. Transformers frequently outperform recurrent neural network (RNN) and convolutional neural network (CNN) models on many tasks while supporting parallel processing. In this paper, we propose a two-stage convolutional transformer for speech enhancement in the time domain. The transformer captures global information and allows parallel computation, which reduces long-term noise. Unlike the two-stage transformer neural network (TSTNN), the proposed work uses different transformer structures for the intra- and inter-transformers to extract both the local and the global features of noisy speech. Moreover, a CNN module is added to the transformer so that short-term noise can be reduced more effectively, exploiting the ability of CNNs to extract local information. The experimental findings demonstrate that the proposed model outperforms existing models in terms of STOI (short-time objective intelligibility) and PESQ (perceptual evaluation of speech quality).
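The abstract describes two mechanisms that lend themselves to a sketch: a transformer block augmented with a CNN module (global features via MHA, local features via convolution) and the two-stage intra/inter chunk processing. Below is a minimal PyTorch sketch of both ideas; all class names, layer sizes, the chunk length, and the kernel width are illustrative assumptions, not the paper's published architecture.

```python
# Minimal sketch only: names, layer sizes, chunk length, and kernel width
# are illustrative assumptions, not the paper's actual specification.
import torch
import torch.nn as nn


class ConvTransformerBlock(nn.Module):
    """Transformer block with a CNN module: MHA models long-term (global)
    dependencies, a depthwise 1-D convolution models short-term (local)
    structure, and a position-wise feed-forward network follows."""

    def __init__(self, d_model=64, n_heads=4, kernel_size=7):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(  # depthwise conv over the time axis
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),
            nn.GELU(),
        )
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                      # x: (batch, time, d_model)
        h = self.norm1(x)
        attn, _ = self.mha(h, h, h)            # global context via MHA
        x = x + attn
        h = self.norm2(x).transpose(1, 2)      # (batch, d_model, time)
        x = x + self.conv(h).transpose(1, 2)   # local context via CNN module
        return x + self.ffn(self.norm3(x))


class TwoStageModule(nn.Module):
    """Two-stage processing: an intra-transformer attends within short
    chunks (local features), then an inter-transformer attends across
    chunks at the same frame position (global features)."""

    def __init__(self, d_model=64, chunk_len=25):
        super().__init__()
        self.chunk_len = chunk_len
        self.intra = ConvTransformerBlock(d_model)
        self.inter = ConvTransformerBlock(d_model)

    def forward(self, x):  # x: (batch, time, d_model), time % chunk_len == 0
        b, t, d = x.shape
        k = t // self.chunk_len
        x = x.reshape(b * k, self.chunk_len, d)   # fold chunks into batch
        x = self.intra(x)                         # attend within chunks
        x = x.reshape(b, k, self.chunk_len, d).transpose(1, 2)
        x = x.reshape(b * self.chunk_len, k, d)   # group across chunks
        x = self.inter(x)                         # attend across chunks
        x = x.reshape(b, self.chunk_len, k, d).transpose(1, 2)
        return x.reshape(b, t, d)


# Usage: 100 frames of 64-dim features split into 4 chunks of 25 frames.
module = TwoStageModule()
frames = torch.randn(2, 100, 64)
print(module(frames).shape)  # torch.Size([2, 100, 64])
```

Folding the chunk axis into the batch axis lets the same block implementation serve both stages, while using separate ConvTransformerBlock instances for the intra and inter stages mirrors the abstract's point that, unlike TSTNN, the two stages need not share one transformer structure.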
Pages: 731-743 (13 pages)