A Bayesian approach to audio-visual speaker identification

Cited by: 0
Authors
Nefian, AV [1]
Liang, LH
Fu, TY
Liu, XX
Affiliations
[1] Intel Corp, Microprocessor Res Labs, Santa Clara, CA 95051 USA
[2] Natl Tsing Hua Univ, Comp Sci & Technol Dept, Hsinchu, Taiwan
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper we describe a text-dependent audio-visual speaker identification approach that combines face recognition and audio-visual speech-based identification systems. The temporal sequence of audio and visual observations obtained from the acoustic speech and the shape of the mouth are modeled using a set of coupled hidden Markov models (CHMM), one for each phoneme-viseme pair and for each person in the database. The use of CHMM in our system is justified by the capability of this model to describe the natural audio and visual state asynchrony as well as their conditional dependence over time. Next, the likelihood obtained for each person in the database is combined with the face recognition likelihood obtained using an embedded hidden Markov model (EHMM). Experimental results on the XM2VTS database show that our system improves on the accuracy of audio-only or video-only speaker identification at all levels of acoustic signal-to-noise ratio (SNR) from 5 to 30 dB.
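The abstract does not say how the CHMM and EHMM likelihoods are combined, so the following is only a minimal sketch of one common option: a weighted sum of per-speaker log-likelihoods followed by an argmax. The function names, the `face_weight` parameter, the speaker labels, and all score values are hypothetical and are not taken from the paper.

```python
def fuse_scores(av_loglik, face_loglik, face_weight=0.5):
    """Weighted sum of per-speaker log-likelihoods (assumed fusion rule).

    av_loglik   -- dict: speaker id -> log-likelihood from the audio-visual model
    face_loglik -- dict: speaker id -> log-likelihood from the face model
    face_weight -- weight in [0, 1] given to the face score (hypothetical parameter)
    """
    if av_loglik.keys() != face_loglik.keys():
        raise ValueError("both models must score the same set of speakers")
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    return {
        spk: (1.0 - face_weight) * av_loglik[spk] + face_weight * face_loglik[spk]
        for spk in av_loglik
    }


def identify(av_loglik, face_loglik, face_weight=0.5):
    """Return the speaker id with the highest fused score."""
    fused = fuse_scores(av_loglik, face_loglik, face_weight)
    return max(fused, key=fused.get)


# Made-up scores for three hypothetical enrolled speakers.
av = {"alice": -120.0, "bob": -118.0, "carol": -130.0}
face = {"alice": -40.0, "bob": -55.0, "carol": -42.0}

best_fused = identify(av, face, face_weight=0.5)
best_av_only = identify(av, face, face_weight=0.0)
```

With these illustrative numbers the audio-visual score alone favors a different speaker than the fused score does, which is the kind of disagreement a second modality is meant to resolve. In practice the weight would likely be tuned on held-out data, possibly as a function of the acoustic SNR.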
Pages: 761-769
Page count: 9
Related papers
50 in total
  • [31] Dynamic dependency tests for audio-visual speaker association
    Siracusa, Michael R.
    Fisher, John W., III
    2007 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL II, PTS 1-3, 2007, : 457 - +
  • [32] Audio-visual speaker recognition for video broadcast news
    Maison, B
    Neti, C
    Senior, A
    JOURNAL OF VLSI SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2001, 29 (1-2): : 71 - 79
  • [33] Audio-visual speaker tracking with importance particle filters
    Gatica-Perez, D
    Lathoud, G
    McCowan, I
    Odobez, JM
    Moore, D
    2003 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOL 3, PROCEEDINGS, 2003, : 25 - 28
  • [34] Audio-Visual Multilevel Fusion for Speech and Speaker Recognition
    Chetty, Girija
    Wagner, Michael
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 379 - 382
  • [35] Audio-Visual Speaker Recognition for Video Broadcast News
    Benoît Maison
    Chalapathy Neti
    Andrew Senior
    Journal of VLSI signal processing systems for signal, image and video technology, 2001, 29 : 71 - 79
  • [36] Target Active Speaker Detection with Audio-visual Cues
    Jiang, Yidi
    Tao, Ruijie
    Pan, Zexu
    Li, Haizhou
    INTERSPEECH 2023, 2023, : 3152 - 3156
  • [37] RETHINKING AUDIO-VISUAL SYNCHRONIZATION FOR ACTIVE SPEAKER DETECTION
    Wuerkaixi, Abudukelimu
    Zhang, You
    Duan, Zhiyao
    Zhang, Changshui
    2022 IEEE 32ND INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING (MLSP), 2022,
  • [38] Audio-visual speaker identification using dynamic facial movements and utterance phonetic content
    Asadpour, Vahid
    Homayounpour, Mohammad Mehdi
    Towhidkhah, Farzad
    APPLIED SOFT COMPUTING, 2011, 11 (02) : 2083 - 2093
  • [39] Weight estimation for audio-visual multi-level fusion in bimodal speaker identification
    Wu, Zhiyong
    Cai, Lianhong
    Meng, Helen M.
    INTELLIGENT COMPUTING IN SIGNAL PROCESSING AND PATTERN RECOGNITION, 2006, 345 : 1107 - 1112
  • [40] A JOINT AUDIO-VISUAL APPROACH TO AUDIO LOCALIZATION
    Jensen, Jesper Rindom
    Christensen, Mads Graesboll
    2015 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP), 2015, : 454 - 458