Long-Term Spectro-Temporal and Static Harmonic Features for Voice Activity Detection

Cited by: 38

Authors:

Fukuda, Takashi [1]

Ichikawa, Osamu [1]

Nishimura, Masafumi [1]

Affiliations:

[1] IBM Res Tokyo, Yamato 2428502, Japan

Source:

IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING | 2010 / Vol. 4 / No. 5

Keywords:

Average phoneme duration; harmonic structure; long-term temporal information; voice activity detection (VAD); SPEECH;

DOI:

10.1109/JSTSP.2010.2069750

Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline classification code:

0808; 0809

Abstract:

Accurate voice activity detection (VAD) is important for robust automatic speech recognition (ASR) systems. This paper proposes a statistical-model-based noise-robust VAD algorithm using long-term temporal information and harmonic-structure-based features in speech. Long-term temporal information has recently become an ASR focus, but has not yet been deeply investigated for VAD. In this paper, we first consider the temporal features in a cepstral domain calculated over the average phoneme duration. In contrast, the harmonic structures are well-known bearers of acoustic information in human voices, but that information is difficult to exploit statistically. This paper further describes a new method to exploit the harmonic structure information with statistical models, providing additional noise robustness. The proposed method including both the long-term temporal and the static harmonic features led to considerable improvements under low SNR conditions, with 77.7% error reduction on average as compared with the ETSI AFE-VAD in our VAD testing. In addition, the word error rate was reduced by 29.1% in a test that included a full ASR system.

Citation

Pages: 834-844

Page count: 11

26 references in total

[1] [Anonymous], 2002, SPEECH PROCESSING TR

[2] Cho YD, 2001, IEEE SIGNAL PROC LET, V8, P276, DOI 10.1109/97.957270

[3] EFFECT OF REDUCING SLOW TEMPORAL MODULATIONS ON SPEECH RECEPTION [J]. DRULLMAN, R; FESTEN, JM; PLOMP, R. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1994, 95 (05): 2670-2680

[4] EFFECT OF TEMPORAL ENVELOPE SMEARING ON SPEECH RECEPTION [J]. DRULLMAN, R; FESTEN, JM; PLOMP, R. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1994, 95 (02): 1053-1064

[5] SPEECH ENHANCEMENT USING A MINIMUM MEAN-SQUARE ERROR SHORT-TIME SPECTRAL AMPLITUDE ESTIMATOR [J]. EPHRAIM, Y; MALAH, D. IEEE TRANSACTIONS ON ACOUSTICS SPEECH AND SIGNAL PROCESSING, 1984, 32 (06): 1109-1121

[6] FUJIMOTO M, 2008, P 10 INT C SPOK LANG, P2008

[7] FUKUDA T, 2008, P INT, P2262

[8] GU L, 2001, P ICASSP, V1, P125

[9] Guo Y., 2007, Proceedings of Interspeech, P2949

[10] Temporal patterns (TRAPs) in ASR of noisy speech [J]. Hermansky, H; Sharma, S. ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999: 289-292
