A real-time prototype for small-vocabulary audio-visual ASR

Cited by: 0

Authors:

Connell, JH [1]
Haas, N [1]
Marcheret, E [1]
Neti, C [1]
Potamianos, G [1]
Velipasalar, S [1]

Affiliation:

[1] IBM Corp, TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA

Source:

2003 INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOL II, PROCEEDINGS | 2003

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

We present a prototype for the automatic recognition of audiovisual speech, developed to augment the IBM ViaVoice(TM) speech recognition system. Frontal face, full frame video is captured through a USB 2.0 interface by means of an inexpensive PC camera, and processed to obtain appearance-based visual features. Subsequently, these are combined with audio features, synchronously extracted from the acoustic signal, using a simple discriminant feature fusion technique. On the average, the required computations utilize approximately 67% of a Pentium(TM) 4, 1.8 GHz processor, leaving the remaining resources available to hidden Markov model based speech recognition. Real-time performance is therefore achieved for small-vocabulary tasks, such as connected-digit recognition. In the paper, we discuss the prototype architecture based on the ViaVoice(TM) engine, the basic algorithms employed, and their necessary modifications to ensure real-time performance and causality of the visual front end processing. We benchmark the resulting system performance on stored videos against prior research experiments, and we report a close match between the two.
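
The fusion step described in the abstract (visual features combined with synchronously extracted audio features, then passed through a discriminant projection) can be illustrated with a minimal sketch. The abstract gives no frame rates, feature dimensions, or projection matrix, so everything concrete here is an assumption made for illustration: the function names `upsample_linear` and `fuse`, the use of linear interpolation to align the video rate with the audio rate, and a projection matrix supplied as if it had already been trained offline (for example by linear discriminant analysis). It is not the authors' implementation.

```python
from typing import List, Sequence

Vector = List[float]


def upsample_linear(frames: Sequence[Vector], src_rate: float,
                    dst_rate: float, n_out: int) -> List[Vector]:
    """Linearly interpolate frames sampled at src_rate onto n_out points at dst_rate.

    Hypothetical helper: aligns slower visual features with the audio frame rate.
    """
    if not frames:
        raise ValueError("need at least one frame")
    last = len(frames) - 1
    out: List[Vector] = []
    for i in range(n_out):
        # Position of output frame i on the source time axis, clamped to the end.
        pos = min(i * src_rate / dst_rate, float(last))
        lo = int(pos)
        hi = min(lo + 1, last)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b
                    for a, b in zip(frames[lo], frames[hi])])
    return out


def fuse(audio: Sequence[Vector], visual: Sequence[Vector],
         projection: Sequence[Vector]) -> List[Vector]:
    """Concatenate each audio/visual frame pair, then apply a linear projection.

    `projection` is assumed to be a pre-trained discriminant matrix with one
    row per output dimension; its values are not taken from the paper.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual streams must have equal length")
    fused: List[Vector] = []
    for a, v in zip(audio, visual):
        joint = list(a) + list(v)
        fused.append([sum(w * x for w, x in zip(row, joint))
                      for row in projection])
    return fused


# Toy example with made-up numbers: 2-dim audio at 100 Hz, 1-dim visual at 50 Hz.
audio = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
visual = upsample_linear([[0.0], [1.0]], src_rate=50.0, dst_rate=100.0, n_out=3)
projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
fused = fuse(audio, visual, projection)
```

Note that linear interpolation needs the next video frame before it can emit an output, so a causal real-time front end built this way would have to buffer one video frame or use a hold-last-value scheme instead.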

Pages: 469-472

Page count: 4
