Adaptive Inattentional Framework for Video Object Detection With Reward-Conditional Training

Cited by: 18
Authors
Rodriguez-Ramos, Alejandro [1 ]
Rodriguez-Vazquez, Javier [1 ]
Sampedro, Carlos [1 ]
Campoy, Pascual [1 ]
Affiliations
[1] Univ Politecn Madrid UPM CSIC, Comp Vis & Aerial Robot Grp, Ctr Automat & Robot, Madrid 28006, Spain
Keywords
Inattention; YOTO; reward-conditional training; deep learning; video object detection; reinforcement learning; CNN; LSTM; loss-conditional training;
DOI
10.1109/ACCESS.2020.3006191
CLC number
TP [Automation and Computer Technology]
Subject classification code
0812
Abstract
Recent object detection research has focused on video sequences, largely driven by demand from industrial applications. Although single-image architectures achieve remarkable accuracy, they do not exploit the particular properties of video sequences and usually require highly parallel computational resources, such as desktop GPUs. In this work, an inattentional framework is proposed in which the object context in video frames is dynamically reused in order to reduce the computation overhead. The context features corresponding to keyframes are fused into a synthetic feature map, which is further refined using temporal aggregation with ConvLSTMs. Furthermore, an inattentional policy is learned to adaptively balance accuracy against the amount of context reused. The policy is trained under the reinforcement learning paradigm using our novel reward-conditional training scheme, which allows policy training over a whole distribution of reward functions and enables the selection of a single reward function at inference time. Our framework shows outstanding results on platforms with reduced parallelization capabilities, such as CPUs, achieving an average latency reduction of up to 2.09x and obtaining FPS rates similar to the equivalent GPU platform, at the cost of a 1.11x mAP reduction.
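The reward-conditional training scheme described in the abstract can be illustrated with a minimal sketch. This is a toy illustration only, not the paper's actual network, action space, or reward: the two-action policy ("reuse cached context features" vs. "recompute the backbone"), the scalar reward weight `lam`, and the reward definition below are all hypothetical stand-ins. The key idea it demonstrates is the conditioning: the reward-function parameter is sampled per episode during training and fed to the policy as an input, so a single trained policy covers a whole family of reward functions, and a specific trade-off point can be selected at inference time simply by fixing `lam`.

```python
import math
import random

random.seed(0)

# Logistic policy over two actions -- reuse cached context (1) vs. recompute (0) --
# conditioned on the reward weight `lam` via the logit z = w0 + w1 * lam.
w0, w1 = 0.0, 0.0

def p_reuse(lam):
    """Probability of choosing 'reuse context' given the reward weight lam."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * lam)))

lr = 0.5
for _ in range(5000):
    lam = random.random()            # sample a reward weight per episode (conditioning input)
    p = p_reuse(lam)
    reuse = random.random() < p      # sample an action from the policy
    # Toy reward: reusing saves latency (weight lam) but costs accuracy (weight 1 - lam);
    # recomputing is the neutral baseline with reward 0.
    r = (lam - (1.0 - lam)) if reuse else 0.0
    # REINFORCE update: r * d(log pi(a)) / d(logit)
    g = (1.0 - p) if reuse else -p
    w0 += lr * r * g
    w1 += lr * r * g * lam
```

After training, picking `lam` at inference selects the operating point without retraining: a high `lam` (latency-oriented reward) makes the policy prefer reusing context, a low `lam` (accuracy-oriented reward) makes it prefer recomputing, mirroring how the paper selects a unique reward function at inference time.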
Pages: 124451-124466
Page count: 16