Non-Local Temporal Difference Network for Temporal Action Detection

Cited by: 3
Authors
He, Yilong [1,2]
Han, Xiao [1,2]
Zhong, Yong [1,2]
Wang, Lishun [1,2]
Affiliations
[1] Chinese Acad Sci, Chengdu Inst Comp Applicat, Chengdu 610081, Peoples R China
[2] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100049, Peoples R China
Keywords
temporal action detection; deep learning; convolutional neural networks; computer vision; video understanding
DOI
10.3390/s22218396
Chinese Library Classification number
O65 [Analytical Chemistry]
Subject classification codes
070302; 081704
Abstract
As an important part of video understanding, temporal action detection (TAD) has a wide range of applications. It aims to simultaneously predict the boundary positions and class label of every action instance in an untrimmed video. Most existing TAD methods stack convolutional blocks to model long temporal structures. However, most of the information shared between adjacent frames is redundant, and information from distant frames is weakened after multiple convolution operations. In addition, the durations of action instances vary widely, which makes it difficult for single-scale modeling to fit complex video structures. To address these issues, we propose a non-local temporal difference network (NTD) comprising a chunk convolution (CC) module, a multiple temporal coordination (MTC) module, and a temporal difference (TD) module. The TD module adaptively enhances motion information and boundary features using temporal attention weights. The CC module evenly divides the input sequence into N chunks and uses multiple independent convolution blocks to simultaneously extract features from neighboring chunks. It thereby propagates information from distant frames while avoiding being confined to local convolution. The MTC module adopts a cascaded residual architecture that aggregates multiscale temporal features without introducing additional parameters. NTD achieves state-of-the-art performance on two large-scale datasets: 36.2% mAP@avg on ActivityNet-v1.3 and 71.6% mAP@0.5 on THUMOS-14.
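The chunking and temporal-difference ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no layer definitions, so the function names, the scalar per-frame features, and the sigmoid-based weighting below are all hypothetical choices made only to show the general shape of the two operations.

```python
import math
from typing import List


def split_into_chunks(seq: List[float], n_chunks: int) -> List[List[float]]:
    """Evenly divide a sequence into n_chunks contiguous chunks.

    Assumption: the sequence length is a multiple of n_chunks.
    """
    if n_chunks <= 0 or len(seq) % n_chunks != 0:
        raise ValueError("sequence length must be a positive multiple of n_chunks")
    size = len(seq) // n_chunks
    return [seq[i * size:(i + 1) * size] for i in range(n_chunks)]


def temporal_difference(seq: List[float]) -> List[float]:
    """First-order difference between adjacent time steps; the first step gets 0."""
    return [0.0] + [seq[t] - seq[t - 1] for t in range(1, len(seq))]


def enhance_with_difference(seq: List[float]) -> List[float]:
    """Reweight each step by a weight derived from its temporal difference.

    Hypothetical weighting: sigmoid(|difference|), applied as a residual
    scaling x * (1 + w). Steps with larger change get a larger weight.
    """
    diffs = temporal_difference(seq)
    weights = [1.0 / (1.0 + math.exp(-abs(d))) for d in diffs]
    return [x * (1.0 + w) for x, w in zip(seq, weights)]


if __name__ == "__main__":
    features = [1.0, 1.0, 1.0, 4.0, 4.0, 4.0]  # toy per-frame features
    print(split_into_chunks(features, 3))
    print(temporal_difference(features))
    print(enhance_with_difference(features))
```

In the toy sequence, the only nonzero difference occurs at the step where the value jumps from 1.0 to 4.0, so that step receives the largest weight. This mirrors, in a very simplified form, the idea of emphasizing motion and boundary positions.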
Pages: 15