Dense CNN With Self-Attention for Time-Domain Speech Enhancement

Cited by: 6
Authors
Pandey, Ashutosh [1]
Wang, DeLiang [1,2]
Affiliations
[1] Ohio State University, Department of Computer Science and Engineering, Columbus, OH 43210, USA
[2] Ohio State University, Center for Cognitive and Brain Sciences, Columbus, OH 43210, USA
Keywords
Speech enhancement; convolution; time-domain analysis; signal-to-noise ratio; noise measurement; training; feature extraction; self-attention network; time-domain enhancement; dense convolutional network; frequency-domain loss; convolutional neural network
DOI
10.1109/TASLP.2021.3064421
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Speech enhancement in the time domain has become increasingly popular in recent years because it can jointly enhance both the magnitude and the phase of speech. In this work, we propose a dense convolutional network (DCN) with self-attention for speech enhancement in the time domain. DCN is an encoder-decoder architecture with skip connections. Each layer in the encoder and the decoder comprises a dense block and an attention module. Dense blocks and attention modules aid feature extraction through a combination of feature reuse, increased network depth, and maximum context aggregation. Furthermore, we reveal previously unknown problems with a loss based on the spectral magnitude of the enhanced speech. To alleviate these problems, we propose a novel loss based on the magnitudes of the enhanced speech and a predicted noise. Even though the proposed loss uses magnitudes only, the constraint imposed by noise prediction ensures that the loss enhances both magnitude and phase. Experimental results demonstrate that a DCN trained with the proposed loss substantially outperforms other state-of-the-art approaches to causal and non-causal speech enhancement.
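To make the loss concrete, the sketch below shows one plausible reading of the magnitude-plus-noise objective described in the abstract: an L1 distance between STFT magnitudes of the enhanced and clean speech, plus the same distance for the noise. The assumption that the predicted noise is formed as the noisy mixture minus the enhanced output, the equal weighting of the two terms, and the STFT settings are illustrative choices not specified in this record.

import torch

def stft_magnitude(x, n_fft=512, hop=256):
    # STFT magnitude of a batch of waveforms, shape (batch, samples).
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)
    return spec.abs()

def magnitude_plus_noise_loss(enhanced, clean, noisy):
    # Sketch of a loss with magnitude terms on both the enhanced speech
    # and a predicted noise. Defining the predicted noise as
    # noisy - enhanced is an assumption; the time-domain subtraction is
    # what couples the otherwise phase-blind magnitude terms to phase.
    pred_noise = noisy - enhanced          # assumed noise prediction
    true_noise = noisy - clean
    speech_term = (stft_magnitude(enhanced) - stft_magnitude(clean)).abs().mean()
    noise_term = (stft_magnitude(pred_noise) - stft_magnitude(true_noise)).abs().mean()
    return 0.5 * (speech_term + noise_term)  # equal weighting assumed

Note that a magnitude loss on the enhanced speech alone would leave its phase unconstrained; under the subtraction assumption above, matching the noise magnitude as well forces the enhanced waveform's phase toward that of the target, consistent with the abstract's claim that noise prediction enhances both magnitude and phase.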
Pages: 1270-1279 (10 pages)