Robust Distant Speech Recognition by Combining Multiple Microphone-Array Processing with Position-Dependent CMN

Authors
Longbiao Wang
Norihide Kitaoka
Seiichi Nakagawa
Affiliations
[1] Toyohashi University of Technology, Department of Information and Computer Sciences
Source
EURASIP Journal on Advances in Signal Processing, Volume 2006
Keywords
Word Recognition; Speech Recognition; Recognition Performance; Multiple Channel; Lower Computational Cost
Abstract
We propose robust distant speech recognition that combines multiple microphone-array processing with position-dependent cepstral mean normalization (CMN). In the recognition stage, the system estimates the speaker's position and adopts compensation parameters estimated a priori for that position. The system then applies CMN to the speech (i.e., position-dependent CMN) and performs speech recognition for each channel. The features obtained from the multiple channels are integrated by one of two types of processing. The first method uses the maximum vote or the maximum summed likelihood of the recognition results from the multiple channels to obtain the final result; we call this multiple-decoder processing. The second method calculates the output probability of each input at the frame level, and a single decoder uses these output probabilities to perform speech recognition; we call this single-decoder processing, and it has a lower computational cost. We combine delay-and-sum beamforming with multiple-decoder or single-decoder processing, which we term multiple microphone-array processing. We evaluated the proposed method on a limited-vocabulary (100-word) distant isolated word recognition task in a real environment. The proposed multiple microphone-array processing using multiple decoders with position-dependent CMN achieved a 3.2% improvement (a 50% relative error reduction) over delay-and-sum beamforming with conventional CMN (i.e., the conventional method). Multiple microphone-array processing using a single decoder requires about one third of the computational time of that using multiple decoders without degrading recognition performance.
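
To make the front end concrete, here is a minimal sketch (not the authors' implementation) of delay-and-sum beamforming followed by position-dependent CMN. The microphone coordinates, the position quantisation grid, and the mean_table of a-priori cepstral means are illustrative assumptions; integer-sample alignment and circular shifting are simplifications.

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air

def delay_and_sum(signals, mic_positions, src_position, fs):
    # Align each channel by its propagation delay from the estimated
    # source position, then average the aligned channels.
    # signals: (num_mics, num_samples); positions in metres; fs in Hz.
    dists = np.linalg.norm(mic_positions - src_position, axis=1)
    delays = (dists - dists.min()) / SPEED_OF_SOUND   # delays relative to nearest mic (s)
    shifts = np.round(delays * fs).astype(int)        # integer-sample approximation
    # np.roll wraps at the signal edges; acceptable for a sketch only.
    aligned = np.stack([np.roll(sig, -shift)
                        for sig, shift in zip(signals, shifts)])
    return aligned.mean(axis=0)

def position_dependent_cmn(cepstra, est_position, mean_table):
    # Subtract the cepstral mean estimated a priori for the position
    # closest to the estimated speaker position. mean_table is a
    # hypothetical dict mapping a quantised position to a mean vector.
    key = tuple(np.round(est_position, 1))  # assumed 0.1 m quantisation grid
    return cepstra - mean_table[key]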
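
The two channel-integration strategies can be sketched in the same style. The abstract does not state how the single decoder combines per-channel frame-level output probabilities, so taking the per-frame maximum over channels below is an assumption; all function and variable names are illustrative.

from collections import Counter
import numpy as np

def integrate_multiple_decoders_vote(channel_results):
    # channel_results: list of (word, log_likelihood) pairs, one per
    # channel. Majority vote over the per-channel recognition results;
    # ties are broken by the summed log-likelihood of each tied word.
    votes = Counter(word for word, _ in channel_results)
    top = max(votes.values())
    tied = [w for w, c in votes.items() if c == top]
    score = {w: sum(ll for word, ll in channel_results if word == w)
             for w in tied}
    return max(score, key=score.get)

def integrate_multiple_decoders_sumlik(channel_results):
    # Pick the word whose log-likelihoods summed over channels are maximal.
    score = {}
    for word, ll in channel_results:
        score[word] = score.get(word, 0.0) + ll
    return max(score, key=score.get)

def integrate_single_decoder(channel_log_probs):
    # channel_log_probs: (num_channels, num_frames, num_states) array of
    # frame-level log output probabilities. Collapsing the channel axis
    # lets one decoder run instead of one per channel, which is where the
    # roughly threefold saving in decoding time comes from.
    return np.asarray(channel_log_probs).max(axis=0)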