Robust Voice Activity Detection Based on Concept of Modulation Transfer Function in Noisy Reverberant Environments

Cited by: 6
Authors
Morita, Shota [1]
Unoki, Masashi [1]
Lu, Xugang [2]
Akagi, Masato [1]
Affiliations
[1] Japan Adv Inst Sci & Technol, Sch Informat Sci, 1-1 Asahidai, Nomi, Ishikawa 9231292, Japan
[2] Natl Inst Informat & Commun Technol, Universal Commun Res Inst, 3-5 Hikaridai, Seika, Kyoto 6190289, Japan
Source
JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY | 2016 / Vol. 82 / No. 02
Funding
Japan Society for the Promotion of Science;
Keywords
Voice activity detection; Modulation transfer function; Noisy reverberant conditions; SNR estimation; Power thresholding; SPEECH; FEATURES;
DOI
10.1007/s11265-015-1014-4
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Voice activity detection (VAD) is used to detect speech and non-speech periods in observed speech signals. It is an important front-end technique for many speech technology applications. Many VAD methods have been proposed; however, most of them have been applied under clean or noisy conditions. Only a few methods have been proposed for reverberant conditions, particularly for noisy reverberant conditions. We therefore need to understand the ill effects of noise and reverberation on speech in order to design an accurate and robust VAD method for noisy reverberant conditions. These ill effects can be described by the modulation transfer function (MTF) under noisy and reverberant conditions. Our study therefore uses the MTF concept to reduce the ill effects of noise and reverberation on speech, and we propose a robust VAD method based on it. In this method, noise reduction and dereverberation are first applied to the temporal power envelope of the speech signal to restore it. Power thresholding is then applied to the restored temporal power envelope as the VAD decision. A method of estimating the signal-to-noise ratio (SNR) is also proposed so that the SNR can be estimated accurately in the noise reduction stage. Experiments under both artificial and realistic noisy reverberant conditions were carried out to evaluate the proposed VAD method and to compare it with conventional VAD methods. The results revealed that the proposed method significantly outperformed the conventional methods under both artificial and realistic noisy reverberant conditions.
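The two ideas named in the abstract, the MTF and power thresholding on a temporal power envelope, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it omits their noise reduction, dereverberation, and SNR estimation stages, and the function names (mtf, power_envelope, threshold_vad), the 20 ms frame length, the 6 dB margin, and the assumption that the first ten frames contain only noise are all hypothetical choices made for illustration. The mtf function uses the classical Houtgast-Steeneken form for exponentially decaying reverberation plus stationary noise, which may differ from the exact model used in the paper.

```python
import numpy as np


def mtf(f_m, t60, snr_db):
    """Classical MTF for exponential reverberation plus stationary noise.

    f_m: modulation frequency (Hz), t60: reverberation time (s),
    snr_db: signal-to-noise ratio (dB). Assumed model, not taken from the paper.
    """
    reverb = 1.0 / np.sqrt(1.0 + (2.0 * np.pi * f_m * t60 / 13.8) ** 2)
    noise = 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
    return reverb * noise


def power_envelope(x, fs, frame_ms=20.0):
    """Frame-wise mean power, a crude stand-in for the temporal power envelope."""
    n = int(fs * frame_ms / 1000.0)
    n_frames = len(x) // n
    frames = np.reshape(x[: n_frames * n], (n_frames, n))
    return np.mean(frames ** 2, axis=1)


def threshold_vad(env, noise_frames=10, margin_db=6.0):
    """Flag frames whose power exceeds the noise floor by margin_db.

    The noise floor is the mean power of the first noise_frames frames,
    which are assumed (hypothetically) to contain no speech.
    """
    noise_floor = np.mean(env[:noise_frames])
    threshold = noise_floor * 10.0 ** (margin_db / 10.0)
    return env > threshold


# Synthetic test signal: 1 s of weak noise with a 200 Hz tone from 0.4 s to 0.6 s.
fs = 8000
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(fs)
x[3200:4800] += np.sin(2.0 * np.pi * 200.0 * np.arange(1600) / fs)

decisions = threshold_vad(power_envelope(x, fs))
```

Here decisions is a boolean array with one entry per 20 ms frame. On real noisy reverberant speech, a fixed threshold on the raw envelope like this would be expected to degrade, which is the motivation the abstract gives for restoring the envelope before thresholding.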
Pages: 163-173
Page count: 11
Related papers
20 in total
[1]  
[Anonymous], 1999, 301 ETSI EN
[2]  
Architectural Institute of Japan, 2004, SOUND LIB ARCH ENV
[3]   ITU-T recommendation G.729 Annex B: A silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications [J].
Benyassine, A ;
Shlomot, E ;
Su, HY ;
Massaloux, D ;
Lamblin, C ;
Petit, JP .
IEEE COMMUNICATIONS MAGAZINE, 1997, 35 (09) :64-73
[4]   Long-Term Spectro-Temporal and Static Harmonic Features for Voice Activity Detection [J].
Fukuda, Takashi ;
Ichikawa, Osamu ;
Nishimura, Masafumi .
IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2010, 4 (05) :834-844
[5]  
Hirsch H.G, 2000, P ASR2000 AUT SPEECH
[6]  
HOUTGAST T, 1973, ACUSTICA, V28, P66
[7]  
Kanai Y, 2013, INTERSPEECH, P742
[8]  
Kawai K., 2004, P ICA, P1561
[9]   CENSREC-1-C: An evaluation framework for voice activity detection under noisy environments [J].
Kitaoka, Norihide ;
Yamada, Takeshi ;
Tsuge, Satoru ;
Miyajima, Chiyomi ;
Yamamoto, Kazumasa ;
Nishiura, Takanobu ;
Nakayama, Masato ;
Denda, Yuki ;
Fujimoto, Masakiyo ;
Takiguchi, Tetsuya ;
Tamura, Satoshi ;
Matsuda, Shigeki ;
Ogawa, Tetsuji ;
Kuroiwa, Shingo ;
Takeda, Kazuya ;
Nakamura, Satoshi .
ACOUSTICAL SCIENCE AND TECHNOLOGY, 2009, 30 (05) :363-371
[10]  
Lu X., 2011, P INTERSPEECH2011, P2653