Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Cited by: 47
Authors
Hong, Fa-Ting [1 ,3 ,4 ,6 ]
Feng, Jia-Chang [1 ,3 ,4 ,7 ]
Xu, Dan [5 ]
Shan, Ying [3 ]
Zheng, Wei-Shi [1 ,2 ,4 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Tencent PCG, Appl Res Ctr ARC, Shenzhen, Peoples R China
[4] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China
[5] HKUST, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[6] Pazhou Lab, Guangzhou, Peoples R China
[7] Sun Yat Sen Univ, Guangdong Key Lab Informat Secur Technol, Guangzhou, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Keywords
Weakly supervised learning; Temporal action localization; Feature re-calibration; Mutual learning
DOI
10.1145/3474085.3475298
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion. In this work, we argue that features extracted from pre-trained extractors such as I3D, which are trained for trimmed-video action classification rather than for the WS-TAL task, inevitably carry task-irrelevant redundancy and lead to sub-optimal results. Feature re-calibration is therefore needed to reduce this task-irrelevant redundancy. To this end, we propose a cross-modal consensus network (CO2-Net) to tackle the problem. CO2-Net introduces two identical cross-modal consensus modules (CCMs), each of which applies a cross-modal attention mechanism that filters out task-irrelevant redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we further exploit inter-modality consistency: the attention weights derived from each CCM serve as pseudo targets for the attention weights derived from the other CCM, keeping the predictions of the two CCMs consistent in a mutual-learning manner. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, on which our method achieves state-of-the-art results. The experimental results show that the proposed cross-modal consensus module produces more representative features for temporal action localization.
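For a concrete picture of the two components the abstract describes, below is a minimal PyTorch sketch. The module and function names, the (B, T, D) snippet-feature shapes, the sigmoid channel gating, and the use of MSE with detached pseudo targets for the mutual-learning consistency loss are all illustrative assumptions read off the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalConsensusModule(nn.Module):
    """Sketch of a CCM: re-calibrate the main modality's features
    channel-wise, combining a global descriptor of the main modality
    with local (per-snippet) context from the auxiliary modality."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Global descriptor of the main modality (temporal average pool -> FC).
        self.global_proj = nn.Linear(dim, dim)
        # Local descriptor of the auxiliary modality, computed per snippet.
        self.local_proj = nn.Linear(dim, dim)

    def forward(self, main: torch.Tensor, aux: torch.Tensor):
        # main, aux: (B, T, D) snippet-level features, e.g. I3D RGB and flow.
        g = self.global_proj(main.mean(dim=1, keepdim=True))  # (B, 1, D)
        l = self.local_proj(aux)                              # (B, T, D)
        # Channel-wise attention in [0, 1]: which channels to keep.
        attn = torch.sigmoid(g * l)                           # (B, T, D)
        return main * attn, attn  # re-calibrated features, attention weights


def consistency_loss(attn_rgb: torch.Tensor, attn_flow: torch.Tensor):
    """Mutual learning: each CCM's attention acts as the (detached)
    pseudo target for the other's. MSE is an assumption here; the
    paper may use a different distance."""
    return (F.mse_loss(attn_rgb, attn_flow.detach())
            + F.mse_loss(attn_flow, attn_rgb.detach()))


# Usage: two identical CCMs, each treating one modality as "main".
ccm_rgb, ccm_flow = CrossModalConsensusModule(), CrossModalConsensusModule()
rgb, flow = torch.randn(2, 100, 1024), torch.randn(2, 100, 1024)
rgb_hat, a_rgb = ccm_rgb(rgb, flow)     # RGB is main, flow is auxiliary
flow_hat, a_flow = ccm_flow(flow, rgb)  # flow is main, RGB is auxiliary
loss_cc = consistency_loss(a_rgb, a_flow)
```

Because each modality takes a turn as the "main" stream, the two CCMs are symmetric, which is what allows their attention maps to supervise each other.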
Pages: 1591-1599
Page count: 9
Related References
54 in total
  • [1] Afouras, Triantafyllos, 2020, SELF SUPERVISED LEAR
  • [2] Alwassel, Humam, 2020, arXiv:2011.11479
  • [3] Rao, Anyi, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 10143, DOI 10.1109/CVPR42600.2020.01016
  • [4] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
    Carreira, Joao
    Zisserman, Andrew
    [J]. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4724-4733
  • [5] Rethinking the Faster R-CNN Architecture for Temporal Action Localization
    Chao, Yu-Wei
    Vijayanarasimhan, Sudheendra
    Seybold, Bryan
    Ross, David A.
    Deng, Jia
    Sukthankar, Rahul
    [J]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 1130-1139
  • [6] Choe, J., 2019, CVPR
  • [7] Deng, Cheng, 2018, TIP
  • [8] Feng, Jia-Chang, 2021, CVPR
  • [9] Gong, G., 2020, CVPR
  • [10] Heilbron, F. C., 2015, Proc. CVPR IEEE, p. 961, DOI 10.1109/CVPR.2015.7298698