Dense CNN With Self-Attention for Time-Domain Speech Enhancement

Cited by: 6
Authors
Pandey, Ashutosh [1]
Wang, DeLiang [1,2]
Affiliations
[1] Ohio State University, Department of Computer Science and Engineering, Columbus, OH 43210, USA
[2] Ohio State University, Center for Cognitive and Brain Sciences, Columbus, OH 43210, USA
Keywords
Speech enhancement; convolution; time-domain analysis; signal-to-noise ratio; noise measurement; training; feature extraction; self-attention network; time-domain enhancement; dense convolutional network; frequency-domain loss; convolutional neural network
DOI
10.1109/TASLP.2021.3064421
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Speech enhancement in the time domain has become increasingly popular in recent years because it can jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder-decoder architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Dense blocks and attention modules aid feature extraction through a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of the enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of the enhanced speech and a predicted noise. Even though the proposed loss uses magnitudes only, the constraint imposed by noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that a DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
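To make the loss concrete, the sketch below shows one plausible reading of the magnitude-plus-noise objective described in the abstract: an L1 distance between STFT magnitudes of the enhanced and clean speech, plus the same distance for the noise. The assumption that the predicted noise is formed as the noisy mixture minus the enhanced output, the equal weighting of the two terms, and the STFT settings are illustrative choices not specified in this record.

import torch

def stft_magnitude(x, n_fft=512, hop=256):
    # STFT magnitude of a batch of waveforms, shape (batch, samples).
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return spec.abs()

def magnitude_plus_noise_loss(enhanced, clean, noisy):
    # Sketch of a loss with magnitude terms on both the enhanced speech
    # and a predicted noise. Defining the predicted noise as
    # noisy - enhanced is an assumption; the time-domain subtraction is
    # what couples the otherwise phase-blind magnitude terms to phase.
    pred_noise = noisy - enhanced          # assumed noise prediction
    true_noise = noisy - clean
    speech_term = (stft_magnitude(enhanced) - stft_magnitude(clean)).abs().mean()
    noise_term = (stft_magnitude(pred_noise) - stft_magnitude(true_noise)).abs().mean()
    return 0.5 * (speech_term + noise_term)  # equal weighting assumed

Note that a magnitude loss on the enhanced speech alone would leave its phase unconstrained; under the subtraction assumption above, matching the noise magnitude as well forces the enhanced waveform's phase toward that of the target, consistent with the abstract's claim that noise prediction enhances both magnitude and phase.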
Pages: 1270-1279 (10 pages)