Learning Bimodal Structure in Audio-Visual Data

Cited by: 25
Authors
Monaci, Gianluca [1]
Vandergheynst, Pierre [2]
Sommer, Friedrich T. [1]
Affiliations
[1] Univ Calif Berkeley, Redwood Ctr Theoret Neurosci, Berkeley, CA 94720 USA
[2] Ecole Polytech Fed Lausanne, Inst Elect Engn, CH-1015 Lausanne, Switzerland
Source
IEEE Transactions on Neural Networks | 2009 / Vol. 20 / No. 12
Funding
Swiss National Science Foundation; U.S. National Science Foundation
Keywords
Audio-visual source localization; dictionary learning; matching pursuit (MP); multimodal data processing; sparse representation; SOURCE SEPARATION; SPARSE; REPRESENTATIONS; APPROXIMATION; RECOGNITION; EXTRACTION; SEQUENCES; SOUNDS; LEVEL;
DOI
10.1109/TNN.2009.2032182
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A novel model is presented to learn bimodally informative structures from audio-visual signals. The signal is represented as a sparse sum of audio-visual kernels. Each kernel is a bimodal function consisting of synchronous snippets of an audio waveform and a spatio-temporal visual basis function. To represent an audio-visual signal, the kernels can be positioned independently and arbitrarily in space and time. The proposed algorithm uses unsupervised learning to form dictionaries of bimodal kernels from audio-visual material. The basis functions that emerge during learning capture salient audio-visual data structures. In addition, it is demonstrated that the learned dictionary can be used to locate sources of sound in the movie frame. Specifically, in sequences containing two speakers, the algorithm can robustly localize a speaker even in the presence of severe acoustic and visual distracters.
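As a rough illustration of the sparse, shift-invariant representation described in the abstract, the following is a minimal sketch of matching pursuit (MP, listed in the keywords) on a 1-D signal. It is not the authors' algorithm: it assumes a single modality, a small hand-made dictionary of unit-norm kernels, and no dictionary-learning step. The names `matching_pursuit` and `unit` are hypothetical and chosen only for this example.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit Euclidean norm (hypothetical helper)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def matching_pursuit(signal, kernels, n_atoms):
    """Greedy sparse decomposition with shiftable, unit-norm 1-D kernels.

    Returns a list of (kernel index, shift, coefficient) and the residual.
    """
    residual = np.asarray(signal, dtype=float).copy()
    atoms = []
    for _ in range(n_atoms):
        best = None
        for k, kernel in enumerate(kernels):
            # Inner product of the residual with the kernel at every valid shift.
            corr = np.correlate(residual, kernel, mode="valid")
            shift = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[shift]) > abs(best[2]):
                best = (k, shift, float(corr[shift]))
        k, shift, coef = best
        # Subtract the selected, shifted kernel from the residual.
        residual[shift:shift + len(kernels[k])] -= coef * kernels[k]
        atoms.append(best)
    return atoms, residual

# Toy signal: two kernels placed at independent positions in time.
kernels = [unit([1, 2, 1]), unit([1, -1])]
signal = np.zeros(20)
signal[3:6] += 2.0 * kernels[0]
signal[12:14] += -1.5 * kernels[1]

atoms, residual = matching_pursuit(signal, kernels, n_atoms=2)
for k, shift, coef in atoms:
    print(f"kernel {k} at shift {shift}, coefficient {coef:.3f}")
```

In the model described in the abstract, each kernel would instead pair an audio snippet with a spatio-temporal visual function, and the dictionary would be learned from data rather than fixed by hand.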
Pages: 1898-1910
Page count: 13