Weakly-Supervised Temporal Action Localization with Multi-Head Cross-Modal Attention

Cited: 3
Authors
Ren, Hao [1 ]
Ren, Haoran [1 ]
Ran, Wu [1 ]
Lu, Hong [1 ]
Jin, Cheng [1 ]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch Comp Sci, Shanghai, Peoples R China
Source
PRICAI 2022: TRENDS IN ARTIFICIAL INTELLIGENCE, PT III | 2022 / Vol. 13631
Funding
National Natural Science Foundation of China;
Keywords
Video analysis; Temporal action localization; Weakly-supervised learning; Attention;
DOI
10.1007/978-3-031-20868-3_21
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Weakly-supervised temporal action localization seeks to localize temporal boundaries of actions while concurrently identifying their categories using only video-level category labels during training. Among the existing methods, the modal cooperation methods have achieved great success by providing pseudo supervision signals to RGB and Flow features. However, most of these methods ignore the cross-correlation between modal characteristics which can help them learn better features. By considering the cross-correlation, we propose a novel multi-head cross-modal attention mechanism to explicitly model the cross-correlation of modal features. The proposed method collaboratively enhances RGB and Flow features through a cross-correlation matrix. In this way, the enhanced features for each modality encode the inter-modal information, while preserving the exclusive and meaningful intra-modal characteristics. Experimental results on three recent methods demonstrate that the proposed Multi-head Cross-modal Attention (MCA) mechanism can significantly improve the performance of these methods, and even achieve state-of-the-art results on the THUMOS14 and ActivityNet1.2 datasets.
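The abstract describes enhancing RGB and Flow features collaboratively through a multi-head cross-correlation matrix, while a residual path preserves each modality's intra-modal characteristics. The record does not give the paper's exact formulation, so the following is only a minimal numpy sketch of one plausible reading: each modality's features attend to the other via scaled-dot-product attention per head, and the attended output is added back to the original features. All projection matrices here are random stand-ins for learned weights, and the function name and shapes are illustrative assumptions, not the authors' API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feat, kv_feat, num_heads=4, seed=0):
    """One direction of multi-head cross-modal attention (sketch).

    query_feat, kv_feat: (T, D) snippet features from two modalities
    (e.g. RGB queries attending to Flow keys/values, or vice versa).
    Returns enhanced (T, D) features for the query modality.
    """
    T, D = query_feat.shape
    assert D % num_heads == 0
    dh = D // num_heads
    rng = np.random.default_rng(seed)
    # Hypothetical learned projections; random matrices stand in here.
    Wq = rng.standard_normal((D, D)) / np.sqrt(D)
    Wk = rng.standard_normal((D, D)) / np.sqrt(D)
    Wv = rng.standard_normal((D, D)) / np.sqrt(D)

    def heads(x):  # split (T, D) into (num_heads, T, dh)
        return x.reshape(T, num_heads, dh).transpose(1, 0, 2)

    Q = heads(query_feat @ Wq)
    K = heads(kv_feat @ Wk)
    V = heads(kv_feat @ Wv)
    # Per-head cross-correlation matrix between the two modalities: (H, T, T).
    corr = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))
    out = (corr @ V).transpose(1, 0, 2).reshape(T, D)  # concatenate heads
    # Residual connection keeps the exclusive intra-modal characteristics.
    return query_feat + out

# Collaboratively enhance both modalities.
T, D = 8, 16
rng = np.random.default_rng(1)
rgb, flow = rng.standard_normal((T, D)), rng.standard_normal((T, D))
rgb_enhanced = cross_modal_attention(rgb, flow)
flow_enhanced = cross_modal_attention(flow, rgb, seed=1)
print(rgb_enhanced.shape, flow_enhanced.shape)  # (8, 16) (8, 16)
```

In a trained model the projections would be learned end-to-end and the same mechanism applied symmetrically to both streams, so each modality's enhanced features encode inter-modal information while the residual keeps its own.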
Pages: 281-295
Page count: 15