Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

Cited by: 14
Authors
Alfasly, Saghir [1 ,2 ]
Lu, Jian [1 ,3 ]
Xu, Chen [1 ,2 ]
Zou, Yuru [1 ]
Affiliations
[1] Shenzhen University, Shenzhen Key Laboratory of Advanced Machine Learning and Applications, Shenzhen, China
[2] Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen, China
[3] Pazhou Lab, Guangzhou, China
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022) | 2022
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR52688.2022.01957
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Discipline Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Under the assumption that a video dataset is annotated in both modalities, i.e., that the auditory and visual streams are both labeled or class-relevant, current multimodal methods apply modality fusion or cross-modal attention. However, effectively leveraging the audio modality for action recognition on videos with vision-specific annotations is particularly challenging. To tackle this challenge, we propose a novel audio-visual framework that effectively leverages the audio modality in any solely vision-annotated dataset. We adopt a language model (e.g., BERT) to build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels, so that SAVLD serves as a bridge between the audio and video datasets. SAVLD, together with a pretrained multi-label audio model, is then used to estimate the audio-visual modality relevance during training. Accordingly, we propose a novel learnable irrelevant modality dropout (IMD) that completely drops out the irrelevant audio modality and fuses only the relevant modalities. Moreover, we present a new two-stream video Transformer for efficiently modeling the visual modalities. Results on several vision-specific annotated datasets, including Kinetics-400 and UCF-101, validate our framework, which outperforms most relevant action recognition methods.
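As a reading aid, the following minimal Python sketch illustrates the two mechanisms the abstract describes, under stated assumptions: the function names (embed_label, build_savld, imd_fuse) are hypothetical, the label embedding is a deterministic stand-in for BERT, and the relevance score is taken as a given scalar rather than estimated by the paper's pretrained multi-label audio model. It is not the authors' implementation.

```python
# Hypothetical sketch, NOT the authors' released code: (1) build a semantic
# audio-video label dictionary (SAVLD) by mapping each video label to its K
# most similar audio labels, and (2) gate the audio stream with an
# irrelevant-modality-dropout (IMD) rule.
import hashlib
import numpy as np

def embed_label(label: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a language-model (e.g., BERT) sentence embedding:
    a deterministic random unit vector seeded by the label text."""
    seed = int.from_bytes(hashlib.sha256(label.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def build_savld(video_labels, audio_labels, k=3):
    """Map each video label to its K most relevant audio labels by cosine
    similarity of label embeddings (unit vectors, so a dot product suffices)."""
    A = np.stack([embed_label(a) for a in audio_labels])  # (num_audio, dim)
    return {v: [audio_labels[i] for i in np.argsort(-(A @ embed_label(v)))[:k]]
            for v in video_labels}

def imd_fuse(video_feat, audio_feat, relevance, threshold=0.5):
    """Irrelevant modality dropout: fuse audio only when the estimated
    audio-visual relevance (assumed given here; the paper derives it from
    SAVLD plus a pretrained multi-label audio model) clears a threshold,
    which the paper learns rather than fixes."""
    if relevance < threshold:
        return video_feat                            # drop audio entirely
    return np.concatenate([video_feat, audio_feat])  # naive fusion by concat

savld = build_savld(["playing guitar", "mowing lawn"],
                    ["guitar", "lawn mower", "speech", "wind", "silence"])
print(savld)
```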
Pages: 20176-20185
Page count: 10