Long-Term Spectro-Temporal and Static Harmonic Features for Voice Activity Detection

Cited by: 38
Authors
Fukuda, Takashi [1]
Ichikawa, Osamu [1]
Nishimura, Masafumi [1]
Affiliations
[1] IBM Res Tokyo, Yamato 2428502, Japan
Keywords
Average phoneme duration; harmonic structure; long-term temporal information; voice activity detection (VAD); SPEECH
DOI
10.1109/JSTSP.2010.2069750
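Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract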
Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a statistical-model-based noise-robust VAD algorithm using long-term temporal information and harmonic-structure-based features in speech. Long-term temporal information has recently become an ASR focus, but has not yet been deeply investigated for VAD. In this paper, we first consider the temporal features in a cepstral domain calculated over the average phoneme duration. In contrast, the harmonic structures are well-known bearers of acoustic information in human voices, but that information is difficult to exploit statistically. This paper further describes a new method to exploit the harmonic structure information with statistical models, providing additional noise robustness. The proposed method including both the long-term temporal and the static harmonic features led to considerable improvements under low SNR conditions, with 77.7% error reduction on average as compared with the ETSI AFE-VAD in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system.
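The abstract mentions temporal features computed in the cepstral domain over a window matched to the average phoneme duration. As a rough illustration only, the sketch below computes a regression-slope (delta) feature over a long context window. The function name `long_term_delta`, the `half_window` parameter, and the choice of a linear-regression slope are assumptions made for this example; the record does not state the paper's actual feature definition or window length.

```python
import numpy as np

def long_term_delta(cepstra, half_window):
    """Regression slope of each cepstral coefficient over 2*half_window + 1 frames.

    cepstra: array of shape (num_frames, num_coeffs).
    half_window: hypothetical context size in frames (not taken from the paper).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]
    # Repeat the edge frames so every frame has a full context window.
    padded = np.pad(cepstra, ((half_window, half_window), (0, 0)), mode="edge")
    offsets = np.arange(-half_window, half_window + 1)
    denom = np.sum(offsets ** 2)
    delta = np.zeros_like(cepstra)
    for k in offsets:
        # Frame t sits at index t + half_window in the padded array.
        start = half_window + k
        delta += k * padded[start:start + num_frames]
    return delta / denom
```

With a 10 ms frame shift, for instance, `half_window=4` would span 90 ms of context; both numbers are illustrative and would need to be set from the phoneme-duration statistics of the target data.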
Pages: 834-844
Page count: 11