Phase-Aware Speech Enhancement Based on Deep Neural Networks

Cited by: 72
Authors
Zheng, Naijun [1]
Zhang, Xiao-Lei [2]
Affiliations
[1] Xidian Univ, Sch Telecommun Engn, State Key Lab Integrated Serv Networks, Xian 710071, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Ctr Intelligent Acoust & Immers Commun, Sch Marine Sci & Technol, Xian 710072, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Deep neural network (DNN); phase estimation; speech enhancement; instantaneous frequency; harmonic model; INTELLIGIBILITY; MASKS; NOISE;
DOI
10.1109/TASLP.2018.2870742
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
The short-time Fourier transform (STFT) is fundamental in speech processing. Because the highly unstructured STFT phase is difficult to process, most speech-processing algorithms operate only on the STFT magnitude, leaving the STFT phase largely unexplored. However, with the recent development of deep neural network (DNN) based speech processing, e.g., speech enhancement and recognition, phase processing is becoming more important than ever as a new growth point for DNN-based methods. In this paper, we propose a phase-aware speech enhancement algorithm based on DNNs. Specifically, in the training stage, when incorporating phase as a target, our core idea is to transform the unstructured phase spectrogram into its derivative along the time axis, i.e., the instantaneous frequency deviation (IFD), which has a structure similar to that of the corresponding magnitude spectrogram. We further propose to optimize IFD and magnitude jointly in a multiobjective learning framework. In the test stage, we propose a postprocessing method to recover the phase spectrogram from the estimated IFD. Experimental results demonstrate the effectiveness of the proposed method.
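The abstract's central transformation, from phase spectrogram to a time-axis derivative, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. It assumes a common discrete definition of IFD: the frame-to-frame phase difference minus the phase advance expected at each bin's center frequency, wrapped to the principal interval. The function names (wrap, stft, ifd, phase_from_ifd), the Hann window, and the parameters n_fft=512 and hop=128 are all illustrative assumptions. The recovery step is a plain cumulative sum that assumes the first frame's phase is known; it stands in for, and is not, the postprocessing method the abstract mentions.

```python
import numpy as np


def wrap(x):
    """Wrap angles to the principal interval [-pi, pi)."""
    return (x + np.pi) % (2 * np.pi) - np.pi


def stft(signal, n_fft=512, hop=128):
    """Plain Hann-windowed STFT; returns an array of shape (bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


def ifd(phase, n_fft, hop):
    """Assumed IFD definition: wrapped time difference of the phase minus
    the phase advance expected at each bin's center frequency."""
    k = np.arange(phase.shape[0])[:, None]
    expected = 2 * np.pi * k * hop / n_fft
    return wrap(np.diff(phase, axis=1) - expected)


def phase_from_ifd(ifd_spec, first_frame_phase, n_fft, hop):
    """Illustrative inverse: integrate the IFD along time, starting from a
    known first-frame phase (an assumption, not the paper's method)."""
    k = np.arange(ifd_spec.shape[0])[:, None]
    expected = 2 * np.pi * k * hop / n_fft
    increments = np.cumsum(ifd_spec + expected, axis=1)
    start = first_frame_phase[:, None]
    return wrap(np.concatenate([start, start + increments], axis=1))


# Demo: a 1 kHz tone sampled at 16 kHz sits exactly on bin 32 of a 512-point FFT.
fs, n_fft, hop = 16000, 512, 128
t = np.arange(fs) / fs
tone = np.cos(2 * np.pi * 1000 * t)

phase = np.angle(stft(tone, n_fft, hop))
d = ifd(phase, n_fft, hop)
rec = phase_from_ifd(d, phase[:, 0], n_fft, hop)
```

For a stationary tone at a bin's center frequency, the IFD at that bin is close to zero, while the raw phase in general keeps rotating from frame to frame. This loosely illustrates why the abstract describes IFD as more structured than phase itself.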
Pages: 63-76
Page count: 14