END-TO-END SOUND SOURCE ENHANCEMENT USING DEEP NEURAL NETWORK IN THE MODIFIED DISCRETE COSINE TRANSFORM DOMAIN

Cited by: 0

Authors:

Koizumi, Yuma [1]

Harada, Noboru [1]

Haneda, Yoichi [2]

Hioka, Yusuke [3]

Kobayashi, Kazunori [1]

Affiliations:

[1] Nippon Telegraph & Tel Corp, Media Intelligence Labs, Tokyo, Japan

[2] Univ Electrocommun, Tokyo, Japan

[3] Univ Auckland, Dept Mech Engn, Auckland, New Zealand

Source:

2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2018

Keywords:

Sound source enhancement; modified discrete cosine transform (MDCT); deep learning; end-to-end

DOI:

Not available

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

This paper presents an end-to-end deep neural network (DNN)-based source enhancement method built on time-frequency (T-F) mask processing in the modified discrete cosine transform (MDCT) domain. To retrieve the target signal perfectly in the discrete Fourier transform (DFT) domain, both the amplitude and the phase of the spectrum need to be manipulated. However, because complex values are difficult to handle in a straightforward way with a neural network, a real-valued T-F mask is commonly estimated and only the amplitude spectrum is manipulated. In this study, we use the MDCT instead of the DFT and estimate real-valued T-F masks in the MDCT domain. Perfect retrieval can then be achieved by manipulating only the real-valued MDCT spectra. To reduce the time-domain aliasing that arises from manipulating the MDCT spectrum, we build an end-to-end DNN-based source enhancement system using a T-F mask and train the DNN to minimize an objective function defined in the time domain. In experiments using several kinds of objective sound quality scores, we observed that the scores were significantly improved.
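The masking idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sine window, the frame size `N`, the zero-padding scheme, and every function name below are assumptions made for illustration, and the "ideal" mask is computed from a known target rather than estimated by a DNN. The sketch shows that a purely real-valued mask applied to the MDCT spectrum of a mixture can recover the target, and it evaluates the error in the time domain, which is where the abstract says the training objective is defined.

```python
import numpy as np


def mdct_basis(N):
    """Cosine basis C[k, n] mapping a 2N-sample frame to N MDCT coefficients."""
    n = np.arange(2 * N)[None, :]
    k = np.arange(N)[:, None]
    return np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))


def sine_window(N):
    """Sine window (assumed here); satisfies w[n]^2 + w[n + N]^2 = 1."""
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))


def mdct(x, N):
    """Real-valued MDCT spectrogram of shape (len(x) // N + 1, N)."""
    if len(x) % N != 0:
        raise ValueError("signal length must be a multiple of N")
    # Pad N zeros on each side so every sample is covered by two frames.
    xp = np.concatenate([np.zeros(N), x, np.zeros(N)])
    frames = np.stack([xp[t * N:(t + 2) * N] for t in range(len(xp) // N - 1)])
    return (frames * sine_window(N)) @ mdct_basis(N).T


def imdct(X, N):
    """Inverse MDCT with windowed overlap-add; aliasing cancels between frames."""
    frames = (X @ mdct_basis(N)) * (2.0 / N) * sine_window(N)
    out = np.zeros((X.shape[0] + 1) * N)
    for t, frame in enumerate(frames):
        out[t * N:(t + 2) * N] += frame
    return out[N:-N]  # drop the padding added in mdct()


def apply_mask(mixture, mask, N):
    """Enhance a mixture by scaling its MDCT coefficients with a real mask."""
    return imdct(mask * mdct(mixture, N), N)


def time_domain_mse(estimate, target):
    """Mean squared error computed directly on waveforms."""
    return float(np.mean((estimate - target) ** 2))


# Toy data: random signals stand in for the target source and the noise.
rng = np.random.default_rng(0)
N = 64
target = rng.standard_normal(16 * N)
mixture = target + rng.standard_normal(16 * N)

S, X = mdct(target, N), mdct(mixture, N)
# Oracle real-valued mask: ratio of target to mixture MDCT coefficients.
ideal_mask = np.divide(S, X, out=np.zeros_like(S), where=np.abs(X) > 1e-12)
estimate = apply_mask(mixture, ideal_mask, N)

print("MSE of mixture :", time_domain_mse(mixture, target))
print("MSE of estimate:", time_domain_mse(estimate, target))
```

With an arbitrary (non-oracle) mask, adjacent frames are no longer scaled consistently, so the aliasing terms in the overlap-add generally do not cancel. That is one way to read the abstract's motivation for measuring the loss on the reconstructed waveform instead of on the masked spectrum.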


Pages: 706-710

Page count: 5
