USING COMPRESSED AUDIO-VISUAL WORDS FOR MULTI-MODAL SCENE CLASSIFICATION

Cited by: 0
Authors
Kurcius, Jan J. [1 ]
Breckon, Toby P. [2 ]
Affiliations
[1] Cranfield Univ, Cranfield MK43 0AL, Beds, England
[2] Univ Durham, Durham, England
Source
2014 INTERNATIONAL WORKSHOP ON COMPUTATIONAL INTELLIGENCE FOR MULTIMEDIA UNDERSTANDING (IWCIM) | 2014
Keywords
multi-resolution; bag of words; MFCC; compressed sensing; audio-visual; multi-modal; RECOGNITION; FEATURES;
DOI
Not available
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
We present a novel approach to scene classification using combined audio signal and video image features and compare this methodology to scene classification results using each modality in isolation. Each modality is represented using summary features, namely Mel-frequency Cepstral Coefficients (MFCC) (audio) and Scale Invariant Feature Transform (SIFT) (video) within a multi-resolution bag-of-features model. Uniquely, we extend the classical bag-of-words approach over both audio and video feature spaces, whereby we introduce the concept of compressive sensing as a novel methodology for multi-modal fusion via audio-visual feature dimensionality reduction. We perform evaluation over a range of environments showing performance that is both comparable to the state of the art (86%, over ten scene classes) and invariant to a ten-fold dimensionality reduction within the audio-visual feature space using our compressive representation approach.
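The fusion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each modality is quantised against its own codebook into a bag-of-words histogram, that the two histograms are concatenated, and that the fused vector is reduced ten-fold with a fixed random Gaussian measurement matrix, a common compressive-sensing choice that the record itself does not confirm. All names, array sizes, and the random stand-in descriptors and codebooks are hypothetical; in practice the descriptors would be MFCC and SIFT features and the codebooks would be learned (for example by k-means). The multi-resolution aspect is omitted.

```python
import numpy as np


def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword; return an L1-normalised histogram."""
    # Squared Euclidean distances, shape (n_descriptors, n_codewords).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / hist.sum()


def make_projection(n_features, ratio, rng):
    """Random Gaussian matrix mapping n_features to n_features // ratio dimensions.

    Assumption: a Gaussian measurement matrix; the same matrix must be reused
    for every sample (training and test) so that compressed vectors are comparable.
    """
    m = n_features // ratio
    return rng.standard_normal((m, n_features)) / np.sqrt(m)


rng = np.random.default_rng(0)

# Hypothetical stand-ins for real features and learned codebooks.
audio_desc = rng.standard_normal((200, 13))       # 13-dim MFCC-like frames
video_desc = rng.standard_normal((100, 128))      # 128-dim SIFT-like descriptors
audio_codebook = rng.standard_normal((50, 13))    # 50 audio words
video_codebook = rng.standard_normal((150, 128))  # 150 visual words

audio_hist = bow_histogram(audio_desc, audio_codebook)
video_hist = bow_histogram(video_desc, video_codebook)

# Multi-modal fusion by concatenation, then ten-fold dimensionality reduction.
fused = np.concatenate([audio_hist, video_hist])          # 200 dimensions
phi = make_projection(fused.shape[0], ratio=10, rng=rng)  # 20 x 200
compressed = phi @ fused                                  # 20 dimensions
```

The compressed vector would then be passed to a classifier in place of the full fused histogram; the choice of classifier is left open here because the record does not name one.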
Pages: 5