Integration strategies for audio-visual speech processing: Applied to text-dependent speaker recognition

Cited by: 21
Authors
Lucey, S [1 ]
Chen, TH
Sridharan, S
Chandran, V
Affiliations
[1] Carnegie Mellon Univ, Dept Elect & Comp Engn, Adv Multimedia Proc Lab, Pittsburgh, PA 15213 USA
[2] Queensland Univ Technol, Sch Elect & Elect Syst Engn, RCSAVT, Speech Res Lab, Brisbane, Qld 4001, Australia
Keywords
audio-visual speech processing (AVSP); classifier combination; integration strategies; multistream hidden Markov model (HMM); speaker recognition;
DOI
10.1109/TMM.2005.846777
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
In this paper, an in-depth analysis is undertaken into effective strategies for integrating the audio-visual speech modalities with respect to two major questions. First, at what level should integration occur? Second, given a level of integration, how should it be implemented? Our work is based on the well-known hidden Markov model (HMM) classifier framework for modeling speech. A novel framework for modeling the mismatch between train and test observation sets is proposed, so as to provide effective classifier combination performance between the acoustic and visual HMM classifiers. From this framework, it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, naturally emerge depending on the influence of the mismatch. Based on the assumption that poor performance in most audio-visual speech processing applications can be attributed to train/test mismatches, we propose that the main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than to model any bimodal speech dependencies. To this end, a strategy is recommended, based on theory and empirical evidence, that uses a hybrid of the weighted product and weighted sum rules in the presence of varying acoustic noise for the task of text-dependent speaker recognition.
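The weighted product and weighted sum rules mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weight `alpha`, the per-speaker scores, and the function names are hypothetical, and the product rule is applied in the log domain, as is standard when combining HMM likelihoods.

```python
import numpy as np

def weighted_product_rule(log_p_audio, log_p_visual, alpha):
    # Product rule in the log domain: a weighted sum of log-likelihoods,
    # equivalent to p_audio**alpha * p_visual**(1 - alpha).
    return alpha * log_p_audio + (1.0 - alpha) * log_p_visual

def weighted_sum_rule(p_audio, p_visual, alpha):
    # Sum rule: a weighted average of the likelihoods themselves.
    return alpha * p_audio + (1.0 - alpha) * p_visual

# Illustrative per-speaker scores from two independent HMM classifiers.
p_audio = np.array([0.7, 0.2, 0.1])   # acoustic HMM scores
p_visual = np.array([0.5, 0.3, 0.2])  # visual HMM scores
alpha = 0.6  # weight toward audio, e.g. under low acoustic noise

prod_scores = weighted_product_rule(np.log(p_audio), np.log(p_visual), alpha)
sum_scores = weighted_sum_rule(p_audio, p_visual, alpha)

# Decision: pick the speaker with the highest combined score.
print(int(np.argmax(prod_scores)), int(np.argmax(sum_scores)))
```

The paper's hybrid strategy interpolates between these two rules as acoustic noise varies; in practice `alpha` (and the choice of rule) would be tuned on held-out data reflecting the expected train/test mismatch.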
Pages: 495-506
Number of pages: 12
References
31 items
[1]  
Adjondani A, 1995, Proc. Eurospeech '95, Madrid, Spain, p. 1563
[2]   Audio-visual integration in multimodal communication [J].
Chen, T ;
Rao, RR .
PROCEEDINGS OF THE IEEE, 1998, 86 (05) :837-852
[3]   A review of speech-based bimodal recognition [J].
Chibelushi, CC ;
Deravi, F ;
Mason, JSD .
IEEE TRANSACTIONS ON MULTIMEDIA, 2002, 4 (01) :23-37
[4]  
Chibelushi CC, 1993, Proc. Eur. Conf. Speech Communication (Eurospeech), p. 157
[5]  
Cox S, 1997, Proc. Audio-Visual Speech Processing (AVSP)
[6]   Audio-Visual Speech Modeling for Continuous Speech Recognition [J].
Dupont, Stephane ;
Luettin, Juergen .
IEEE TRANSACTIONS ON MULTIMEDIA, 2000, 2 (03) :141-151
[7]  
Fukunaga K, 1990, Introduction to Statistical Pattern Recognition
[8]  
Hart, 2006, Pattern Classification
[9]   Acoustic-labial speaker verification [J].
Jourlin, P ;
Luettin, J ;
Genoud, D ;
Wassner, H .
PATTERN RECOGNITION LETTERS, 1997, 18 (09) :853-858
[10]  
Kamel MS, 2003, LECT NOTES COMPUT SC, V2709, P1