Acoustic Feature Analysis and Discriminative Modeling of Filled Pauses for Spontaneous Speech Recognition

Cited by: 0
Authors
Chung-Hsien Wu
Gwo-Lang Yan
Affiliation
[1] National Cheng Kung University, Department of Computer Science and Information Engineering
Source
Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology | 2004 / Vol. 36
Keywords
filled pause; disfluency; Gaussian mixture model; speech recognition; Karhunen-Loève transform; linear discriminant analysis
DOI
Not available
Abstract
Most automatic speech recognizers (ASRs) concentrate on read speech, which differs from spontaneous speech containing disfluencies. ASRs cannot deal with speech that has a high rate of disfluencies such as filled pauses, repetitions, lengthening, repairs, false starts and silent pauses. In this paper, we focus on the feature analysis and modeling of the filled pauses “ah,” “ung,” “um,” “em,” and “hem” in spontaneous speech. The Karhunen-Loève transform (KLT) and linear discriminant analysis (LDA) were adopted to select discriminant features for filled pause detection. Bartlett hypothesis testing was used to determine a suitable number of discriminant features, and twenty-six features were selected. Gaussian mixture models (GMMs), trained with a gradient descent algorithm, were used to improve the filled pause detection performance. The experimental results show that the filled pause detection rates using KLT and LDA were 84.4% and 86.8%, respectively. A significant improvement was obtained in the filled pause detection rate using the discriminative GMM with KLT and LDA. In addition, the LDA features outperformed the KLT features in the detection of filled pauses.
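As a rough illustration of GMM-based detection as summarized in the abstract, the sketch below scores a feature vector against two diagonal-covariance GMMs and flags a filled pause when the log-likelihood ratio exceeds a threshold. This is only a minimal sketch under stated assumptions, not the authors' method: the function names, the two-model likelihood-ratio decision rule, the diagonal covariances, and the 2-D toy parameters are all hypothetical, and the gradient-descent discriminative training described in the abstract is not shown. In the setting of the abstract, the input would presumably be the 26 KLT or LDA features rather than 2-D vectors.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under a GMM, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def is_filled_pause(x, fp_gmm, other_gmm, threshold=0.0):
    """Return True when the log-likelihood ratio favors the filled-pause model."""
    llr = gmm_log_likelihood(x, *fp_gmm) - gmm_log_likelihood(x, *other_gmm)
    return llr > threshold


# Hypothetical 2-D models as (weights, means, variances); all values are invented.
FP_GMM = ([0.6, 0.4], [[1.0, 1.0], [2.0, 1.5]], [[0.5, 0.5], [0.5, 0.5]])
OTHER_GMM = ([0.5, 0.5], [[-1.0, -1.0], [-2.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]])

if __name__ == "__main__":
    for frame in ([1.2, 1.1], [-1.1, -0.9]):
        print(frame, is_filled_pause(frame, FP_GMM, OTHER_GMM))
```

The threshold trades missed filled pauses against false alarms; a value of 0.0 corresponds to picking whichever model assigns the higher likelihood.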
Pages: 91-104
Page count: 13