Multi-Scale Structure-Aware Network for Weakly Supervised Temporal Action Detection

Cited by: 25
Authors
Yang, Wenfei [1]
Zhang, Tianzhu [1]
Mao, Zhendong [1]
Zhang, Yongdong [1]
Tian, Qi [2]
Wu, Feng [1]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci, Hefei 230027, Peoples R China
[2] Huawei, Cloud BU, Shenzhen 518129, Peoples R China
Keywords
Proposals; Feature extraction; Image segmentation; Scalability; Noise measurement; Graph neural networks; Weakly supervised; action detection; multi-scale; structure-aware
DOI
10.1109/TIP.2021.3089361
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Weakly supervised temporal action detection offers better scalability and practicality than fully supervised action detection in real-world deployment. However, it is difficult to learn a robust model without temporal action boundary annotations. In this paper, we propose an end-to-end Multi-Scale Structure-Aware Network (MSA-Net) for weakly supervised temporal action detection that exploits both the global structure information of a video and the local structure information of actions. The proposed MSA-Net enjoys several merits. First, to localize actions with different durations, each video is encoded into feature representations at different temporal scales. Second, on top of the multi-scale feature representations, we design two effective structure modeling mechanisms, global structure modeling and local structure modeling, which learn discriminative structure-aware representations for robust and complete action detection. To the best of our knowledge, this is the first work to fully explore both global and local structure information in a unified deep model for weakly supervised action detection. Extensive experimental results on two benchmark datasets demonstrate that the proposed MSA-Net performs favorably against state-of-the-art methods.
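The abstract only sketches the architecture at a high level. The minimal PyTorch sketch below illustrates the general recipe it describes: snippet features are pooled to several temporal scales, a global branch and a local branch model structure at each scale, and the resulting class activation sequences are trained with video-level labels via top-k pooling. The module names, the use of self-attention for the global branch, the temporal convolution for the local branch, and all hyperparameters are illustrative assumptions, not the authors' implementation.

# Sketch of the idea only: multi-scale encoding + global/local structure modeling
# + top-k pooling for weak (video-level) supervision. Not MSA-Net's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleStructureSketch(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, num_classes=20, scales=(1, 2, 4), k=8):
        super().__init__()
        self.scales = scales          # temporal pooling strides -> multiple scales
        self.k = k                    # top-k snippets pooled per class
        self.embed = nn.Conv1d(feat_dim, hidden, kernel_size=1)
        # "global structure" (assumed): self-attention over all snippets at one scale
        self.global_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # "local structure" (assumed): temporal convolution over a small neighbourhood
        self.local_conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.classifier = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):
        # x: (B, T, feat_dim) snippet features, e.g. from a pretrained I3D backbone
        x = self.embed(x.transpose(1, 2))              # (B, hidden, T)
        cas_per_scale = []
        for s in self.scales:
            xs = F.avg_pool1d(x, kernel_size=s, stride=s) if s > 1 else x
            g, _ = self.global_attn(xs.transpose(1, 2), xs.transpose(1, 2), xs.transpose(1, 2))
            l = self.local_conv(xs)                    # (B, hidden, T//s)
            fused = F.relu(g.transpose(1, 2) + l)      # combine global and local cues
            cas = self.classifier(fused)               # (B, C, T//s) class activation sequence
            # upsample back to the original temporal length so the scales can be averaged
            cas = F.interpolate(cas, size=x.shape[-1], mode="linear", align_corners=False)
            cas_per_scale.append(cas)
        cas = torch.stack(cas_per_scale).mean(0)       # (B, C, T)
        # weak supervision: video-level score = mean of the top-k snippet scores per class
        topk = torch.topk(cas, k=min(self.k, cas.shape[-1]), dim=-1).values.mean(-1)
        return cas, topk                               # snippet-level and video-level scores

if __name__ == "__main__":
    model = MultiScaleStructureSketch()
    feats = torch.randn(2, 64, 2048)                   # 2 videos, 64 snippets each
    cas, video_scores = model(feats)
    loss = F.cross_entropy(video_scores, torch.tensor([3, 7]))  # video-level labels only
    print(cas.shape, video_scores.shape, loss.item())

At test time, action proposals would be obtained by thresholding the snippet-level class activation sequence, which is the standard weakly supervised localization pipeline this sketch assumes.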
Pages: 5848-5861
Page count: 14