END-TO-END SOUND SOURCE ENHANCEMENT USING DEEP NEURAL NETWORK IN THE MODIFIED DISCRETE COSINE TRANSFORM DOMAIN

Cited by: 0
Authors
Koizumi, Yuma [1]
Harada, Noboru [1]
Haneda, Yoichi [2]
Hioka, Yusuke [3]
Kobayashi, Kazunori [1]
Affiliations
[1] Nippon Telegraph & Tel Corp, Media Intelligence Labs, Tokyo, Japan
[2] Univ Electrocommun, Tokyo, Japan
[3] Univ Auckland, Dept Mech Engn, Auckland, New Zealand
Source
2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2018
Keywords
Sound source enhancement; modified discrete cosine transform (MDCT); deep learning; end-to-end
DOI
Not available
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper presents an end-to-end deep neural network (DNN)-based source enhancement method built on time-frequency (T-F) mask processing in the modified discrete cosine transform (MDCT) domain. To retrieve the target signal perfectly in the discrete Fourier transform (DFT) domain, both the amplitude and the phase of the spectrum need to be manipulated. However, since it is difficult for a neural network to handle complex values in a straightforward way, a real-valued T-F mask is commonly estimated and only the amplitude spectrum is manipulated. In this study, we use the MDCT instead of the DFT and estimate real-valued T-F masks in the MDCT domain, where perfect retrieval can be achieved by manipulating only the real-valued MDCT spectra. To reduce the time-domain aliasing that arises from manipulating the MDCT spectrum, we build an end-to-end DNN-based source enhancement system using a T-F mask and train the DNN to minimize an objective function defined in the time domain. In experiments using several kinds of objective sound quality scores, we observed that the scores were significantly improved.
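The claim that a real-valued mask suffices in the MDCT domain can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the helper names (`mdct`, `imdct`, `time_domain_mse`), the frame size `M = 64`, the sine window, the synthetic signals, and the mean-squared-error objective. An oracle mask computed from the known target stands in for the DNN output.

```python
import numpy as np

def sine_window(M):
    # Sine window of length 2M; satisfies w[n]^2 + w[n+M]^2 = 1 (Princen-Bradley).
    n = np.arange(2 * M)
    return np.sin(np.pi * (n + 0.5) / (2 * M))

def mdct_basis(M):
    # Real-valued cosine basis of shape (M, 2M).
    n = np.arange(2 * M)
    k = np.arange(M)
    return np.cos(np.pi / M * np.outer(k + 0.5, n + 0.5 + M / 2))

def mdct(x, M):
    # len(x) is assumed to be a multiple of M; pad so every sample lies in two frames.
    xp = np.concatenate([np.zeros(M), x, np.zeros(M)])
    n_frames = len(xp) // M - 1
    frames = np.stack([xp[t * M:(t + 2) * M] for t in range(n_frames)])
    return (frames * sine_window(M)) @ mdct_basis(M).T  # shape (n_frames, M)

def imdct(X, M):
    # Windowed inverse transform followed by overlap-add, which cancels
    # the time-domain aliasing when the spectrum is left unmodified.
    frames = (2.0 / M) * (X @ mdct_basis(M)) * sine_window(M)
    out = np.zeros((X.shape[0] + 1) * M)
    for t in range(X.shape[0]):
        out[t * M:(t + 2) * M] += frames[t]
    return out[M:-M]

M = 64                                             # hypothetical frame size
rng = np.random.default_rng(0)
s = np.sin(2 * np.pi * 0.01 * np.arange(32 * M))   # synthetic target signal
v = 0.5 * rng.standard_normal(s.shape)             # synthetic noise
x = s + v                                          # observed mixture

S, X = mdct(s, M), mdct(x, M)

# Oracle real-valued mask (stand-in for a DNN estimate): G = S / X.
oracle = np.divide(S, X, out=np.zeros_like(S), where=np.abs(X) > 1e-12)
# A mask restricted to [0, 1], as is typical for amplitude masks.
clipped = np.clip(oracle, 0.0, 1.0)

def time_domain_mse(mask):
    # Objective evaluated after the inverse MDCT, i.e. in the time domain.
    return np.mean((imdct(mask * X, M) - s) ** 2)

print("oracle mask MSE :", time_domain_mse(oracle))
print("clipped mask MSE:", time_domain_mse(clipped))
```

With the unconstrained oracle mask the target is recovered up to floating-point error, because the MDCT is linear and real-valued. Once the mask deviates from that oracle, the aliasing terms of adjacent frames no longer cancel, which is one way to see why an objective measured on the reconstructed waveform is a natural fit for this setting.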
Pages: 706-710
Number of pages: 5