Entropy guided attention network for weakly-supervised action localization

Cited by: 10
Authors
Cheng, Yi [1]
Sun, Ying [1,2]
Fan, Hehe [4]
Zhuo, Tao [5]
Lim, Joo-Hwee [1,2,3]
Kankanhalli, Mohan [4]
Affiliations
[1] Agency for Science, Technology and Research, Institute for Infocomm Research, Singapore 138632, Singapore
[2] Agency for Science, Technology and Research, Centre for Frontier AI Research, Singapore 138632, Singapore
[3] Nanyang Technological University, School of Computer Science and Engineering, Singapore 639798, Singapore
[4] National University of Singapore, School of Computing, Singapore 117417, Singapore
[5] Qilu University of Technology, Shandong Artificial Intelligence Institute, Shandong Academy of Sciences, Jinan 250014, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Temporal action localization; Weakly-supervised learning; Entropy guided loss; Global similarity loss
DOI
10.1016/j.patcog.2022.108718
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
One major challenge of Weakly-supervised Temporal Action Localization (WTAL) is to handle the diverse backgrounds in videos. To model background frames, most existing methods treat them as an additional action class. However, because background frames usually do not share common semantics, squeezing all of the different background frames into a single class hinders network optimization. Moreover, the network is easily confused and tends to fail when tested on videos with unseen background frames. To address this problem, we propose an Entropy Guided Attention Network (EGA-Net) that treats background frames as out-of-domain samples. Specifically, we design a two-branch module, where a domain branch detects whether a frame belongs to an action by learning a class-agnostic attention map, and an action branch recognizes the action category of the frame by learning a class-specific attention map. By aggregating the two attention maps to model the joint domain-class distribution of frames, our EGA-Net can handle varying backgrounds. To train the class-agnostic attention map with only video-level class labels, we propose an Entropy Guided Loss (EGL), which employs entropy as the supervision signal to distinguish action from background. Moreover, we propose a Global Similarity Loss (GSL) to enhance the class-specific attention map via action class centers. Extensive experiments on the THUMOS14, ActivityNet1.2 and ActivityNet1.3 datasets demonstrate the effectiveness of our EGA-Net. (C) 2022 Elsevier Ltd. All rights reserved.
Pages: 11
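The abstract describes the two-branch design and the two losses only at a high level. Below is a minimal PyTorch sketch of one plausible reading of that description; the class name EGANetSketch, the feature dimension, the MLP sizes, and the exact forms of entropy_guided_loss and global_similarity_loss are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of the two-branch attention idea from the abstract.
# All names, dimensions, and loss forms are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EGANetSketch(nn.Module):
    """Domain branch (class-agnostic attention) + action branch
    (class-specific scores) over per-snippet features of shape (B, T, D)."""

    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        # Domain branch: class-agnostic attention in [0, 1] per snippet.
        self.domain_branch = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 1), nn.Sigmoid(),
        )
        # Action branch: per-snippet class logits; the class-specific
        # attention is read here as the softmax over classes.
        self.action_branch = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                      # x: (B, T, D)
        attn = self.domain_branch(x)           # (B, T, 1), class-agnostic
        logits = self.action_branch(x)         # (B, T, C), class-specific
        # Aggregate the two maps: attention-weighted temporal pooling gives
        # video-level class scores, trainable from video-level labels only.
        video_logits = (attn * logits).sum(1) / attn.sum(1).clamp(min=1e-6)
        return attn, logits, video_logits


def entropy_guided_loss(attn, logits):
    """One plausible reading of the Entropy Guided Loss (EGL): snippets with
    low class-distribution entropy (confident actions) should get high
    class-agnostic attention; high-entropy (background-like) snippets low."""
    probs = F.softmax(logits, dim=-1)                            # (B, T, C)
    ent = -(probs * probs.clamp(min=1e-8).log()).sum(-1)         # (B, T)
    ent = ent / torch.log(torch.tensor(float(logits.size(-1))))  # scale to [0, 1]
    target = 1.0 - ent                                           # low entropy -> attend
    return F.binary_cross_entropy(attn.squeeze(-1), target.detach())


def global_similarity_loss(x, attn, labels, class_centers):
    """A hedged stand-in for the Global Similarity Loss (GSL): pull the
    attention-weighted video feature toward its action class center.
    class_centers (C, D) is assumed to be maintained externally."""
    video_feat = (attn * x).sum(1) / attn.sum(1).clamp(min=1e-6)  # (B, D)
    centers = class_centers[labels]                               # (B, D)
    return (1.0 - F.cosine_similarity(video_feat, centers, dim=-1)).mean()


# Illustrative usage on random tensors (shapes only).
model = EGANetSketch()
x = torch.randn(4, 100, 2048)                  # 4 videos, 100 snippets each
labels = torch.randint(0, 20, (4,))
centers = torch.randn(20, 2048)
attn, logits, video_logits = model(x)
loss = (F.cross_entropy(video_logits, labels)
        + entropy_guided_loss(attn, logits)
        + global_similarity_loss(x, attn, labels, centers))
loss.backward()

The sketch only reflects the structure stated in the abstract (class-agnostic and class-specific attention maps aggregated for video-level classification, an entropy-based supervision signal, and a class-center similarity term); loss weights, the backbone features, and any snippet-sampling details are not specified in this record.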