Weakly-Supervised Temporal Action Localization via Cross-Stream Collaborative Learning

Cited by: 22
Authors
Ji, Yuan [1 ]
Jia, Xu [1 ]
Lu, Huchuan [1 ,2 ]
Ruan, Xiang [3 ]
Affiliations
[1] Dalian Univ Technol, Dalian, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Tiwaki Co Ltd, Kusatsu, Japan
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Funding
National Key R&D Program of China;
Keywords
Temporal action localization; Weakly supervised learning; Two modalities; Collaborative learning
DOI
10.1145/3474085.3475261
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Weakly supervised temporal action localization (WTAL) is a challenging task, as only video-level category labels are available during the training stage. Without precise temporal annotations, most approaches rely on complementary RGB and optical flow features to predict the start and end frame of each action category in a video. However, existing approaches simply resort to either concatenation or a weighted sum to learn how to take advantage of these two modalities for accurate action localization, which ignores the substantial variance between them. In this paper, we present Cross-Stream Collaborative Learning (CSCL) to address these issues. The proposed CSCL introduces a cross-stream weighting module that identifies which modality is more robust during training and uses the robust modality to guide the weaker one. Furthermore, we suppress snippets that have high action-ness scores in both modalities to further exploit the complementary property between the two modalities. In addition, we bring the concept of co-training to WTAL and take both modalities into account for pseudo-label generation, helping to train a stronger model. Extensive experiments conducted on the THUMOS14 and ActivityNet datasets demonstrate that CSCL achieves favorable performance against state-of-the-art methods.
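The abstract describes two ideas that can be illustrated concretely: weighting per-snippet action-ness scores toward the more confident stream, and suppressing snippets on which both streams already agree, so that each stream is pushed to mine complementary evidence. The sketch below is an assumption-laden toy illustration of that fusion logic, not the paper's actual module; the confidence proxy (distance from the 0.5 decision boundary) and the suppression factor are hypothetical choices made here for clarity.

```python
import numpy as np

def cross_stream_weighting(rgb_scores, flow_scores, suppress_thresh=0.8):
    """Toy fusion of per-snippet action-ness scores from two streams.

    rgb_scores, flow_scores: arrays of shape (T,), values in [0, 1].
    Returns fused scores and per-snippet RGB weights that favor the
    stream that is more confident on each snippet.
    """
    rgb = np.asarray(rgb_scores, dtype=float)
    flow = np.asarray(flow_scores, dtype=float)

    # Confidence proxy (an assumption): distance from the uncertain
    # midpoint 0.5 -- scores near 0 or 1 count as more decisive.
    rgb_conf = np.abs(rgb - 0.5)
    flow_conf = np.abs(flow - 0.5)
    w_rgb = rgb_conf / (rgb_conf + flow_conf + 1e-8)

    # Weighted fusion: the more confident stream guides the weaker one.
    fused = w_rgb * rgb + (1.0 - w_rgb) * flow

    # Suppress snippets that BOTH streams already score as actions,
    # encouraging the streams to discover complementary snippets.
    agree = (rgb > suppress_thresh) & (flow > suppress_thresh)
    fused = np.where(agree, fused * 0.5, fused)
    return fused, w_rgb
```

In this toy setting, a snippet scored 0.9 by RGB and 0.95 by flow is damped (both streams agree it is an action), while a snippet scored 0.6 by RGB and 0.5 by flow keeps the RGB score almost unchanged, since the flow stream is maximally uncertain there.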
Pages: 853-861
Page count: 9