Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Cited by: 47
Authors
Hong, Fa-Ting [1 ,3 ,4 ,6 ]
Feng, Jia-Chang [1 ,3 ,4 ,7 ]
Xu, Dan [5 ]
Shan, Ying [3 ]
Zheng, Wei-Shi [1 ,2 ,4 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Tencent PCG, Appl Res Ctr ARC, Shenzhen, Peoples R China
[4] Minist Educ, Key Lab Machine Intelligence & Adv Comp, Beijing, Peoples R China
[5] HKUST, Dept Comp Sci & Engn, Hong Kong, Peoples R China
[6] Pazhou Lab, Guangzhou, Peoples R China
[7] Sun Yat Sen Univ, Guangdong Key Lab Informat Secur Technol, Guangzhou, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Keywords
Weakly supervised learning; Temporal action localization; Feature re-calibration; Mutual learning
DOI
10.1145/3474085.3475298
Chinese Library Classification (CLC) Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video using only video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion. In this work, we argue that features extracted from pre-trained extractors such as I3D, which are trained for trimmed-video action classification rather than for the WS-TAL task, inevitably carry task-irrelevant redundancy and lead to sub-optimal results. Feature re-calibration is therefore needed to reduce this task-irrelevant redundancy. To this end, we propose a cross-modal consensus network (CO2-Net) to tackle the problem. CO2-Net introduces two identical cross-modal consensus modules (CCMs), each of which applies a cross-modal attention mechanism that filters out task-irrelevant redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we further exploit inter-modality consistency: the attention weights derived from each CCM serve as pseudo targets for the attention weights derived from the other CCM, keeping the predictions of the two CCMs consistent in a mutual-learning manner. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, on which our method achieves state-of-the-art results. The experimental results show that the proposed cross-modal consensus module produces more representative features for temporal action localization.
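For a concrete picture of the two components the abstract describes, below is a minimal PyTorch sketch. The module and function names, the (B, T, D) snippet-feature shapes, the sigmoid channel gating, and the use of MSE with detached pseudo targets for the mutual-learning consistency loss are all illustrative assumptions read off the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalConsensusModule(nn.Module):
    """Sketch of a CCM: re-calibrate the main modality's features
    channel-wise, combining a global descriptor of the main modality
    with local (per-snippet) context from the auxiliary modality."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Global descriptor of the main modality (temporal average pool -> FC).
        self.global_proj = nn.Linear(dim, dim)
        # Local descriptor of the auxiliary modality, computed per snippet.
        self.local_proj = nn.Linear(dim, dim)

    def forward(self, main: torch.Tensor, aux: torch.Tensor):
        # main, aux: (B, T, D) snippet-level features, e.g. I3D RGB and flow.
        g = self.global_proj(main.mean(dim=1, keepdim=True))  # (B, 1, D)
        l = self.local_proj(aux)                              # (B, T, D)
        # Channel-wise attention in [0, 1]: which channels to keep.
        attn = torch.sigmoid(g * l)                           # (B, T, D)
        return main * attn, attn  # re-calibrated features, attention weights


def consistency_loss(attn_rgb: torch.Tensor, attn_flow: torch.Tensor):
    """Mutual learning: each CCM's attention acts as the (detached)
    pseudo target for the other's. MSE is an assumption here; the
    paper may use a different distance."""
    return (F.mse_loss(attn_rgb, attn_flow.detach())
            + F.mse_loss(attn_flow, attn_rgb.detach()))


# Usage: two identical CCMs, each treating one modality as "main".
ccm_rgb, ccm_flow = CrossModalConsensusModule(), CrossModalConsensusModule()
rgb, flow = torch.randn(2, 100, 1024), torch.randn(2, 100, 1024)
rgb_hat, a_rgb = ccm_rgb(rgb, flow)     # RGB is main, flow is auxiliary
flow_hat, a_flow = ccm_flow(flow, rgb)  # flow is main, RGB is auxiliary
loss_cc = consistency_loss(a_rgb, a_flow)
```

Because each modality takes a turn as the "main" stream, the two CCMs are symmetric, which is what allows their attention maps to supervise each other.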
Pages: 1591-1599
Page count: 9
Related References
54 in total
  • [1] Afouras, Triantafyllos, 2020, SELF SUPERVISED LEAR
  • [2] Alwassel, Humam, 2020, arXiv:2011.11479
  • [3] Rao, Anyi, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p. 10143, DOI 10.1109/CVPR42600.2020.01016
  • [4] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
    Carreira, Joao
    Zisserman, Andrew
    [J]. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4724-4733
  • [5] Rethinking the Faster R-CNN Architecture for Temporal Action Localization
    Chao, Yu-Wei
    Vijayanarasimhan, Sudheendra
    Seybold, Bryan
    Ross, David A.
    Deng, Jia
    Sukthankar, Rahul
    [J]. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 1130-1139
  • [6] Choe, J., 2019, CVPR
  • [7] Deng, Cheng, 2018, TIP
  • [8] Feng, Jia-Chang, 2021, CVPR
  • [9] Gong, G., 2020, CVPR
  • [10] Heilbron, F. C., 2015, Proc. CVPR IEEE, p. 961, DOI 10.1109/CVPR.2015.7298698