Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

Authors
Koji Iwano
Tomoaki Yoshinaga
Satoshi Tamura
Sadaoki Furui
Affiliation
Department of Computer Science, Tokyo Institute of Technology
Source
EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007
Keywords
Visual Information; Acoustics; Visual Feature; Recognition Accuracy; Recognition Method
DOI
Not available
Abstract
This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. The proposed method assumes that lip images can be captured by a small camera installed in a handset. Two kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used either individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted on Japanese connected digit speech contaminated with white noise under various SNR conditions show the effectiveness of the proposed method: recognition accuracy is improved by using the visual information in all SNR conditions. The visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.
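In a multistream HMM, each state's output probability is the product of per-stream likelihoods raised to stream weights, so in the log domain the audio and visual scores are combined as a weighted sum. The sketch below illustrates this combination; the function name and the default weight value are illustrative assumptions, not taken from the paper, which tunes stream weights empirically per noise condition.

```python
def multistream_log_likelihood(log_b_audio: float,
                               log_b_visual: float,
                               lam_audio: float = 0.7) -> float:
    """Combine per-stream state log-likelihoods with exponential
    stream weights, as in a two-stream (audio + visual) HMM.

    lam_audio is the audio stream weight; the visual weight is its
    complement so the two weights sum to 1. A higher audio weight
    suits clean conditions; lower values favor the visual stream
    as the SNR drops.
    """
    lam_visual = 1.0 - lam_audio
    return lam_audio * log_b_audio + lam_visual * log_b_visual
```

With equal weights, the combined score is simply the average of the two stream scores, e.g. `multistream_log_likelihood(-2.0, -4.0, 0.5)` gives `-3.0`.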