Coarse-to-Fine Localization of Temporal Action Proposals

Cited by: 28
Authors
Long, Fuchen [1 ]
Yao, Ting [2 ]
Qiu, Zhaofan [1 ]
Tian, Xinmei [1 ]
Mei, Tao [2 ]
Luo, Jiebo [3 ]
Affiliations
[1] Univ Sci & Technol China, Elect Engn & Informat Sci, Hefei 230027, Peoples R China
[2] JD AI Res, Vis & Multimedia Lab, Beijing 100105, Peoples R China
[3] Univ Rochester, Dept Comp Sci, Rochester, NY 14604 USA
Keywords
Proposals; Videos; Painting; Brushes; Microsoft Windows; Task analysis; Feature extraction; Action Proposals; Action Recognition; Action Detection; Video Captioning;
DOI
10.1109/TMM.2019.2943204
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology];
Discipline Classification Code
0812
Abstract
Localizing temporal action proposals from long videos is a fundamental challenge in video analysis (e.g., action detection and recognition, or dense video captioning). Most existing approaches overlook the hierarchical granularities of actions and thus fail to discriminate fine-grained action proposals (e.g., "hand washing" within doing laundry, or "changing a tire" within repairing a vehicle). In this paper, we propose a novel coarse-to-fine temporal proposal (CFTP) approach that localizes temporal action proposals by exploring different action granularities. Our proposed CFTP consists of three stages: a coarse proposal network (CPN) to generate long action proposals, a temporal convolutional anchor network (CAN) to localize finer proposals, and a proposal reranking network (PRN) to further refine the proposals from the previous stages. Specifically, CPN explores three complementary actionness curves (namely pointwise, pairwise, and recurrent curves) that represent actions at different levels for generating coarse proposals, while CAN refines these proposals with a multiscale cascaded 1D-convolutional anchor network. In contrast to existing works, our coarse-to-fine approach progressively localizes fine-grained action proposals. We conduct extensive experiments on two action benchmarks (THUMOS14 and ActivityNet v1.3) and demonstrate the superior performance of our approach compared to state-of-the-art techniques on various video understanding tasks.
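As an illustration of the coarse stage described above, the sketch below groups contiguous high-actionness frames into candidate temporal proposals. This is a minimal, hypothetical simplification assuming a single per-frame actionness score; the function name, thresholding scheme, and parameters are illustrative assumptions, not the paper's actual CPN, which fuses three actionness curves (pointwise, pairwise, and recurrent).

```python
# Illustrative sketch (not the authors' exact algorithm): group
# contiguous frames whose actionness score stays above a threshold
# into coarse (start, end) temporal proposals.
def coarse_proposals(actionness, threshold=0.5, min_len=2):
    """Return (start, end) frame-index pairs (end exclusive) for runs
    of at least `min_len` frames with actionness >= `threshold`."""
    proposals = []
    start = None  # start index of the current above-threshold run
    for i, score in enumerate(actionness):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins
        else:
            if start is not None and i - start >= min_len:
                proposals.append((start, i))  # close a long-enough run
            start = None
    # close a run that extends to the end of the video
    if start is not None and len(actionness) - start >= min_len:
        proposals.append((start, len(actionness)))
    return proposals

curve = [0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.65, 0.1]
print(coarse_proposals(curve))  # [(1, 4), (5, 7)]
```

In the full CFTP pipeline, such coarse segments would then be passed to the anchor network (CAN) for finer localization and to the reranking network (PRN) for final scoring.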
Pages: 1577-1590
Number of pages: 14