A real-time prototype for small-vocabulary audio-visual ASR

Authors
Connell, JH [1]
Haas, N [1]
Marcheret, E [1]
Neti, C [1]
Potamianos, G [1]
Velipasalar, S [1]
Affiliations
[1] IBM Corp, TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA
Source
2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL II, PROCEEDINGS | 2003
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present a prototype for the automatic recognition of audio-visual speech, developed to augment the IBM ViaVoice(TM) speech recognition system. Frontal-face, full-frame video is captured through a USB 2.0 interface by means of an inexpensive PC camera, and processed to obtain appearance-based visual features. Subsequently, these are combined with audio features, synchronously extracted from the acoustic signal, using a simple discriminant feature fusion technique. On average, the required computations utilize approximately 67% of a Pentium(TM) 4, 1.8 GHz processor, leaving the remaining resources available to hidden Markov model-based speech recognition. Real-time performance is therefore achieved for small-vocabulary tasks, such as connected-digit recognition. In the paper, we discuss the prototype architecture based on the ViaVoice(TM) engine, the basic algorithms employed, and their necessary modifications to ensure real-time performance and causality of the visual front-end processing. We benchmark the resulting system performance on stored videos against prior research experiments, and we report a close match between the two.
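The abstract does not give the details of the fusion step, so the following is only a minimal sketch of feature-level audio-visual fusion under stated assumptions. It holds the most recent video frame for each audio frame (a causal alignment), concatenates the two feature vectors, and applies a linear projection. The frame rates (100 Hz audio, 30 Hz video), the function names, and the projection matrix are all hypothetical; the matrix stands in for a discriminant transform learned offline and is not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def align_visual_to_audio(visual: Sequence[Vector], n_audio: int,
                          audio_rate: int = 100,
                          video_rate: int = 30) -> List[Vector]:
    """For each audio frame, pick the latest visual frame at or before it.

    The default rates are illustrative assumptions, not values from the paper.
    """
    if not visual:
        raise ValueError("at least one visual frame is required")
    aligned = []
    for t in range(n_audio):
        # Integer arithmetic: index of the video frame covering audio frame t.
        idx = min((t * video_rate) // audio_rate, len(visual) - 1)
        aligned.append(list(visual[idx]))
    return aligned


def project(matrix: Sequence[Vector], vec: Vector) -> Vector:
    """Multiply a (rows x len(vec)) matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def fuse_features(audio: Sequence[Vector], visual: Sequence[Vector],
                  projection: Sequence[Vector]) -> List[Vector]:
    """Concatenate time-aligned audio and visual features, then project.

    `projection` is a hypothetical stand-in for a discriminant transform
    (for example, one estimated by LDA on training data).
    """
    aligned = align_visual_to_audio(visual, len(audio))
    return [project(projection, list(a) + v) for a, v in zip(audio, aligned)]
```

Holding the last available video frame keeps the alignment causal, since no future video frame is needed to produce the fused vector for the current audio frame. An interpolating alignment would be smoother but would add latency.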
Pages: 469-472
Page count: 4