Audiovisual Information Fusion in Human-Computer Interfaces and Intelligent Environments: A Survey

Cited by: 78
Authors
Shivappa, Shankar T. [1 ]
Trivedi, Mohan Manubhai [1 ]
Rao, Bhaskar D. [1 ]
Affiliations
[1] Univ Calif San Diego, Dept Elect & Comp Engn, La Jolla, CA 92093 USA
Funding
U.S. National Science Foundation;
Keywords
Audiovisual fusion; dynamic Bayesian networks (DBNs); hidden Markov models; human activity analysis; human activity modeling; information fusion; machine learning; multimodal systems; SPEECH; RECOGNITION; IDENTIFICATION; COMBINATION; TRACKING; AUTHENTICATION; VISION; MODEL;
DOI
10.1109/JPROC.2010.2057231
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Microphones and cameras have been extensively used to observe and detect human activity and to facilitate natural modes of interaction between humans and intelligent systems. The human brain processes the audio and video modalities, extracting complementary and robust information from them. Intelligent systems with audiovisual sensors should be capable of achieving similar goals. The audiovisual information fusion strategy is a key component in designing such systems. In this paper, we exclusively survey the fusion techniques used in various audiovisual information fusion tasks. The fusion strategy used tends to depend mainly on the model, probabilistic or otherwise, used in the particular task to process sensory information and obtain higher level semantic information; the models themselves are task oriented. In this paper, we describe the fusion strategies and the corresponding models used in audiovisual tasks such as speech recognition, tracking, biometrics, affective state recognition, and meeting scene analysis. We also review the challenges, existing solutions, and unresolved or partially resolved issues in these fields. Specifically, we discuss established and upcoming work in hierarchical fusion strategies and cross-modal learning techniques, identifying these as critical areas of research in the future development of intelligent systems.
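As a concrete illustration of the decision-level (late) fusion strategy that recurs across the tasks surveyed in the paper, the following is a minimal sketch, not taken from the paper itself: the function name, the stream weight, and the example scores are all hypothetical. It combines per-class log-likelihoods from independent audio and video classifiers with a convex stream weight, of the kind that is often tuned to the acoustic SNR in the surveyed systems.

```python
import numpy as np

def late_fusion(audio_scores, video_scores, alpha=0.7):
    """Weighted decision-level fusion of per-class log-likelihoods.

    alpha weights the audio stream and (1 - alpha) the video stream.
    All names and values here are illustrative assumptions, not the
    paper's method.
    """
    audio = np.asarray(audio_scores, dtype=float)
    video = np.asarray(video_scores, dtype=float)
    fused = alpha * audio + (1.0 - alpha) * video
    # Return the index of the winning class and the fused scores.
    return int(np.argmax(fused)), fused

# Example: three candidate classes (e.g., words in audiovisual speech
# recognition). Audio favors class 0, video favors class 1; with a noisy
# audio channel we down-weight audio and the fused decision follows video.
audio_ll = [-1.0, -2.5, -4.0]
video_ll = [-3.0, -0.5, -2.0]
print(late_fusion(audio_ll, video_ll, alpha=0.3))  # -> class 1
```

With alpha=0.7 the same scores yield class 0, showing how the stream weight arbitrates between modalities; more elaborate strategies in the survey (e.g., multistream HMMs and DBNs) move this arbitration inside the model rather than applying it to final scores.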
Pages: 1692-1715
Page count: 24