Enhanced Attention Tracking With Multi-Branch Network for Egocentric Activity Recognition

Cited by: 9
Authors
Liu, Tianshan [1 ]
Lam, Kin-Man [1 ]
Zhao, Rui [1 ]
Kong, Jun [2 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Elect & Informat Engn, Hong Kong, Peoples R China
[2] Jiangnan Univ, Key Lab Adv Proc Control Light Ind, Minist Educ, Wuxi 214122, Jiangsu, Peoples R China
Keywords
Activity recognition; CAMs; Videos; Feature extraction; Optical imaging; Three-dimensional displays; Semantics; Egocentric activity recognition; attention tracking; multi-branch network; fine-grained hand-object interactions; convolutional networks
DOI
10.1109/TCSVT.2021.3104651
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline classification codes
0808; 0809;
Abstract
The emergence of wearable devices has opened up new potential for egocentric activity recognition. Although some methods integrate attention mechanisms into deep neural networks to capture fine-grained human-object interactions in a weakly supervised manner, they either fail to exploit temporal consistency or generate attention from appearance cues alone. To address these limitations, in this paper, we propose an enhanced attention-tracking method, combined with a multi-branch network (EAT-MBNet), for egocentric activity recognition. Specifically, we propose class-aware attention maps (CAAMs), which employ a self-attention-based module to refine the class activation maps (CAMs). The proposed method strengthens the semantic dependency between the activity categories and the feature maps. To highlight discriminative features from the regions of interest across frames, we propose a flow-guided attention-tracking (F-AT) module that simultaneously leverages historical attention and motion patterns. Furthermore, we propose a cross-modality modeling branch based on an interactive GRU module, which captures the time-synchronized long-term relationships between the appearance and motion branches. Experimental results on four egocentric activity benchmarks demonstrate that the proposed method achieves state-of-the-art performance.
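The CAAM idea summarized in the abstract, i.e. refining classifier-derived CAMs with a self-attention-style module, can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's actual formulation: the function names, the single-head dot-product attention over spatial positions, and the tensor shapes are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_activation_maps(features, class_weights):
    """Standard CAMs: weight each feature channel by the classifier weights.
    features: (C, H, W) conv feature maps; class_weights: (K, C) for K classes.
    Returns (K, H, W), one spatial activation map per class."""
    C, H, W = features.shape
    return (class_weights @ features.reshape(C, H * W)).reshape(-1, H, W)

def self_attention_refine(cams):
    """Toy self-attention refinement: each spatial position re-weights its
    class evidence by aggregating from correlated positions."""
    K, H, W = cams.shape
    x = cams.reshape(K, H * W)                          # K class maps over N = H*W positions
    affinity = softmax(x.T @ x / np.sqrt(K), axis=-1)   # (N, N) position-to-position affinity
    refined = x @ affinity.T                            # each position mixes in its peers
    return refined.reshape(K, H, W)
```

In this sketch, positions that activate for the same classes reinforce each other, which is one plausible reading of how a self-attention module could sharpen the class-to-feature-map dependency; the paper's module may differ in projections, normalization, and where it sits in the network.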
Pages: 3587-3602
Number of pages: 16
Related references
63 in total
[1] [Anonymous]. Proc. Int. Conf. on Learning Representations (ICLR).
[2] Cao, Congqi; Zhang, Yifan; Wu, Yi; Lu, Hanqing; Cheng, Jian. Egocentric Gesture Recognition Using Recurrent 3D Convolutional Neural Networks with Spatiotemporal Transformer Modules. IEEE International Conference on Computer Vision (ICCV), 2017: 3783-3791.
[3] Carreira, Joao; Zisserman, Andrew. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 4724-4733.
[4] Damen, Dima; Doughty, Hazel; Farinella, Giovanni Maria; Fidler, Sanja; Furnari, Antonino; Kazakos, Evangelos; Moltisanti, Davide; Munro, Jonathan; Perrett, Toby; Price, Will; Wray, Michael. The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(11): 4125-4141.
[5] Dollar, P. Proc. 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), 2005: 65.
[6] Donahue, Jeff; Hendricks, Lisa Anne; Rohrbach, Marcus; Venugopalan, Subhashini; Guadarrama, Sergio; Saenko, Kate; Darrell, Trevor. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 677-691.
[7] Tran, Du; Bourdev, Lubomir; Fergus, Rob; Torresani, Lorenzo; Paluri, Manohar. Learning Spatiotemporal Features with 3D Convolutional Networks. IEEE International Conference on Computer Vision (ICCV), 2015: 4489-4497.
[8] Feichtenhofer, Christoph; Pinz, Axel; Wildes, Richard P. Spatiotemporal Multiplier Networks for Video Action Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 7445-7454.
[9] Feichtenhofer, Christoph; Pinz, Axel; Zisserman, Andrew. Convolutional Two-Stream Network Fusion for Video Action Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 1933-1941.
[10] Gao, Jiyang; Yang, Zhenheng; Sun, Chen; Chen, Kan; Nevatia, Ram. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. IEEE International Conference on Computer Vision (ICCV), 2017: 3648-3656.