Auxiliary audio-textual modalities for better action recognition on vision-specific annotated videos

Cited by: 2
Authors
Alfasly, Saghir [1 ,2 ]
Lu, Jian [1 ,3 ]
Xu, Chen [1 ,2 ]
Li, Yu [1 ]
Zou, Yuru [3 ]
Affiliations
[1] Shenzhen Univ, Coll Math & Stat, Shenzhen Key Lab Adv Machine Learning & Applicat, Shenzhen 518060, Peoples R China
[2] Guangdong Key Lab Intelligent Informat Proc, Shenzhen 518060, Peoples R China
[3] Natl Ctr Appl Math Shenzhen NCAMS, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Action recognition; Multimodal training; Large language models; Video transformer; Audio-visual training;
DOI
10.1016/j.patcog.2024.110808
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Code
081104; 0812; 0835; 1405;
Abstract
Most current audio-visual datasets are class-relevant, with both the audio and visual modalities annotated. Consequently, current audio-visual recognition methods rely on cross-modality attention or modality fusion. However, leveraging the audio modality effectively in vision-specific videos for human activity recognition is particularly challenging. We address this challenge by proposing a novel audio-visual recognition framework that effectively leverages the audio modality in any vision-specific annotated dataset. The proposed framework employs language models (e.g., GPT-3, CPT-text, BERT) to build a semantic audio-video label dictionary (SAVLD) that serves as a bridge between audio and video datasets by mapping each video label to its K most relevant audio labels. SAVLD, together with a pre-trained multi-label audio model, is then used to estimate the audio-visual modality relevance. Accordingly, we propose a novel learnable irrelevant modality dropout (IMD) that completely drops the audio modality when it is irrelevant and fuses only the relevant modalities. Finally, to keep the multimodal framework efficient, we present an efficient two-stream video Transformer to process the visual modalities (i.e., RGB frames and optical flow). The final predictions are re-ranked with GPT-3 recommendations of the human activity classes, which GPT-3 generates from the labels of the detected visual objects and the audio predictions for the input video. Our framework demonstrates strong performance on the vision-specific annotated datasets Kinetics400 and UCF-101, outperforming most related human activity recognition methods.
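To make the label-mapping step concrete, the following is a minimal sketch of how a SAVLD-style dictionary could be built from label-text embeddings and cosine similarity. The embedding model (sentence-transformers' all-MiniLM-L6-v2), the example label lists, and K=2 are illustrative assumptions standing in for the GPT-3/CPT-text/BERT embeddings and the full video/audio label sets described in the abstract.

# Minimal sketch (not the paper's implementation): map each video label
# to its K most semantically similar audio labels.
# The embedding model and label lists below are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

video_labels = ["playing violin", "mowing lawn", "surfing water"]          # e.g., Kinetics-style classes
audio_labels = ["violin, fiddle", "lawn mower", "waves, surf", "speech"]   # e.g., AudioSet-style classes

# Normalized embeddings make the dot product equal to cosine similarity.
v_emb = model.encode(video_labels, normalize_embeddings=True)
a_emb = model.encode(audio_labels, normalize_embeddings=True)
similarity = v_emb @ a_emb.T   # shape: (len(video_labels), len(audio_labels))

K = 2  # number of audio labels kept per video label (assumed value)
savld = {
    v: [audio_labels[j] for j in np.argsort(-similarity[i])[:K]]
    for i, v in enumerate(video_labels)
}
print(savld)

At inference time, such a dictionary could be paired with a pre-trained multi-label audio classifier: if none of the audio labels mapped to a clip's video class are detected in its soundtrack, the audio modality is likely irrelevant, which is the kind of relevance signal the paper's learnable irrelevant modality dropout builds on.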
Pages: 11