Multi-stream parameterization for structural speech recognition

Cited by: 0
Authors
Asakawa, Satoshi [1]
Minematsu, Nobuaki [1]
Hirose, Keikichi [2]
Affiliations
[1] Univ Tokyo, Grad Sch Frontier Sci, Tokyo 1138654, Japan
[2] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo 1138654, Japan
Source
2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12 | 2008
Keywords
speech recognition; robust invariance; the structural representation; multiple stream structuralization;
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Recently, a novel structural representation of speech was proposed [1, 2], in which the inevitable acoustic variations caused by non-linguistic factors are effectively removed from speech. This structural representation captures only microphone- and speaker-invariant speech contrasts or dynamics and does not directly use absolute or static acoustic properties such as spectra. In our previous study, the new representation was applied to recognizing a sequence of isolated vowels [3]. The structural models trained with a single speaker outperformed conventional HMMs trained with more than four thousand speakers, even in the case of noisy speech. We also applied the new models to recognizing utterances of connected vowels [4]. In the current paper, a multiple stream structuralization method is proposed to improve the performance of the structural recognition framework. With only 8 training speakers, the proposed method achieves performance comparable to that of the conventional 4,130-speaker triphone-based HMMs.
Pages: 4097 / +
Number of pages: 2
Related papers
12 in total
[1] Asakawa S., 2007, P 8 ANN C INT SPEECH, P890
[2] Emori T., 2001, P EUR, P1649
[3] Gleason H.A., 1961, INTRO DESCRIPTIVE LI
[4] Kawahara T., 2004, P ICSLP, P3069
[5] Minematsu N., 2004, 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P585
[6] Minematsu N., 2005, INT CONF ACOUST SPEE, P889
[7] Minematsu N., 2007, P SPRING M AC SOC JP, P147
[8] Minematsu N., 2006, P SRIV, P47
[9] Murakami T., 2005, P ASRU, P203
[10] Pitz M., Ney H., 2005, Vocal tract normalization equals linear transformation in cepstral space [J], IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 13 (05), P930-944