Dynamic Bayesian networks for audio-visual speech recognition

Cited by: 124
Authors
Nefian, AV
Liang, LH
Pi, XB
Liu, XX
Murphy, K
Affiliations
[1] Intel Corp, Microprocessor Res Labs, Santa Clara, CA 95052 USA
[2] Intel Corp, Microcomp Res Labs, Beijing 100020, Chaoyang Dist, Peoples R China
[3] Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720 USA
Keywords
audio-visual speech recognition; hidden Markov models; coupled hidden Markov models; factorial hidden Markov models; dynamic Bayesian networks
DOI
10.1155/S1110865702206083
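Chinese Library Classification (CLC) codes
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809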
Abstract
The use of visual features in audio-visual speech recognition (AVSR) is justified both by the speech generation mechanism, which is essentially bimodal in audio and visual representation, and by the need for features that are invariant to acoustic noise perturbation. As a result, current AVSR systems demonstrate significant accuracy improvements in environments affected by acoustic noise. In this paper, we describe the use of two statistical models for audio-visual integration, the coupled HMM (CHMM) and the factorial HMM (FHMM), and compare the performance of these models with the existing models used in speaker-dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and the FHMM make it possible to model the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming all the existing models and the FHMM.
Pages: 1274-1288
Page count: 15