Local normalization and delayed decision making in speaker detection and tracking

Cited by: 5
Authors
Koolwaaij, J [1]
Boves, L [1]
Affiliation
[1] Univ Nijmegen, NL-6500 HD Nijmegen, Netherlands
Keywords
decision making; segmentation; speaker detection; speaker tracking
DOI
10.1006/dspr.1999.0357
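Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract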
This paper describes A2RT's speaker detection and tracking system and its performance on the 1999 NIST speaker recognition evaluation data. The system does not consist of concatenated modules such as, for instance, silence-speech detection, handset and gender detection, and finally speaker detection or tracking, where each module builds on the hard decisions from previous modules. Instead, it applies the principle of delayed decision making and postpones all hard decisions until the final stage of the detection process. This paper focuses on two important locality issues in detecting or tracking speakers in a telephone conversation, for which the speaker change frequency is usually high. First, channel estimation needs sufficiently long but homogeneous segments. Several kinds of local channel normalization are compared in this paper. Second, local estimation of speaker likelihoods critically depends on the segmentation of the conversation. Our experiments show that a global level of segmentation improves speaker tracking performance, whereas a more detailed segmentation is needed for speaker detection, because likelihood computation over clusters of segments depends on the purity of the segments. Furthermore, choosing the appropriate type of channel normalization can give a small but consistent improvement in speaker tracking performance. (C) 2000 Academic Press.
Pages: 113-132
Page count: 20