Simultaneous Recognition of Distant-Talking Speech of Multiple Talkers Based on the 3-D N-Best Search Method

Cited by: 0
Authors
Panikos Heracleous
Satoshi Nakamura
Kiyohiro Shikano
Affiliations
[1] ATR Spoken Language Translation Research Labs
[2] Graduate School of Information Science, Nara Institute of Science and Technology
Source
Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology | 2004 / Vol. 36
Keywords
speech recognition; distant-talking speech; multiple sound sources; microphone array
DOI
Not available
Abstract
This paper describes a novel method for hands-free speech recognition, and in particular for the simultaneous recognition of distant-talking speech from multiple sound sources (talkers or noise sources). The method extends the 3-D Viterbi search to a 3-D N-best search so that multiple talkers can be recognized simultaneously. The baseline system integrates two existing technologies, the 3-D Viterbi search and the conventional N-best search, into a complete system. However, initial evaluation of this baseline showed that additional techniques were needed to recognize multiple sound sources simultaneously. Two factors were found to play an important role in system performance: the different likelihood ranges of the talkers and the direction-based separation of the hypotheses. First, hypotheses originating from different talkers have to be compared, but the talkers' likelihoods have different dynamic ranges, so the comparison is not accurate. Second, the hypotheses originate from talkers located in different directions, so separating them by direction provides an efficient means of accurate recognition. To address these two issues, we added a likelihood normalization technique and a path distance-based clustering technique to the baseline 3-D N-best search-based system. The system was evaluated in experiments on recognizing the distant-talking speech of two talkers. The experiments were carried out on simulated data (with time delay only) and on reverberated data (simulated and real), covering several reverberation times and a real environment. They showed that implementing the two techniques produced significant improvements.
The best results for simulated data were obtained by implementing both techniques and using a 32-channel microphone array. In that case, the Simultaneous Word Accuracy (where both talkers are correctly recognized simultaneously) was 72.49% for the ‘top 1’ hypothesis and 86.25% for the ‘top 3’ hypotheses, which are very promising results.
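The Simultaneous Word Accuracy reported above can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes, hypothetically, that each test item carries one reference word per talker and that the recognizer returns a ranked N-best list of word pairs; the function name, the data layout, and the sample words are all invented for illustration.

```python
from typing import Sequence, Tuple

# Assumed layout: one hypothesis = (word for talker A, word for talker B).
Hypothesis = Tuple[str, str]


def simultaneous_word_accuracy(
    nbest_lists: Sequence[Sequence[Hypothesis]],
    references: Sequence[Hypothesis],
    top_n: int = 1,
) -> float:
    """Percentage of test items whose reference pair appears among the
    first top_n hypotheses, i.e. both talkers are correct at the same time."""
    if len(nbest_lists) != len(references):
        raise ValueError("need exactly one N-best list per reference pair")
    if not references:
        return 0.0
    hits = 0
    for nbest, ref in zip(nbest_lists, references):
        # A hit requires the whole pair to match, not just one talker.
        candidates = [tuple(h) for h in nbest[:top_n]]
        if tuple(ref) in candidates:
            hits += 1
    return 100.0 * hits / len(references)


# Hypothetical toy data: four test items, ranked best-first.
nbest_lists = [
    [("tokyo", "osaka"), ("kyoto", "osaka")],
    [("nara", "kobe"), ("nara", "kyoto"), ("nagoya", "kyoto")],
    [("kobe", "nara"), ("kobe", "tokyo")],
    [("osaka", "tokyo")],
]
references = [
    ("tokyo", "osaka"),
    ("nara", "kyoto"),
    ("tokyo", "nara"),
    ("osaka", "tokyo"),
]

top1 = simultaneous_word_accuracy(nbest_lists, references, top_n=1)
top3 = simultaneous_word_accuracy(nbest_lists, references, top_n=3)
print(f"top 1: {top1:.2f}%  top 3: {top3:.2f}%")
```

On this toy data the second item is only recovered when the top 3 hypotheses are considered, which is the kind of gap between the ‘top 1’ and ‘top 3’ figures that the abstract reports.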
Pages: 105-116
Number of pages: 11