Dividing and Aggregating Network for Multi-view Action Recognition

Cited by: 43
Authors
Wang, Dongang [1]
Ouyang, Wanli [1, 2]
Li, Wen [3]
Xu, Dong [1]
Affiliations
[1] Univ Sydney, Sch Elect & Informat Engn, Camperdown, NSW, Australia
[2] Univ Sydney, SenseTime Comp Vis Res Grp, Camperdown, NSW, Australia
[3] Swiss Fed Inst Technol, Comp Vis Lab, Zurich, Switzerland
Source
COMPUTER VISION - ECCV 2018, PT IX | 2018 / Vol. 11213
Keywords
Dividing and Aggregating Network; Multi-view action recognition; Large-scale action recognition;
DOI
10.1007/978-3-030-01240-3_28
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose a new Dividing and Aggregating Network (DA-Net) for multi-view action recognition. In our DA-Net, we learn view-independent representations shared by all views at lower layers, while we learn one view-specific representation for each view at higher layers. We then train view-specific action classifiers based on the view-specific representation for each view, and a view classifier based on the shared representation at lower layers. The view classifier is used to predict how likely it is that each video belongs to each view. Finally, the predicted view probabilities from multiple views are used as the weights when fusing the prediction scores of the view-specific action classifiers. We also propose a new approach based on the conditional random field (CRF) formulation to pass messages among the view-specific representations from different branches so that they can help each other. Comprehensive experiments on two benchmark datasets clearly demonstrate the effectiveness of our proposed DA-Net for multi-view action recognition.
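The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single video, assumes the view classifier emits one logit per view that is turned into probabilities with a softmax, and assumes each view-specific branch emits one score per action class. The function names (softmax, fuse_action_scores) and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw view-classifier logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_action_scores(view_logits, branch_scores):
    """Weight each view-specific branch by the predicted view probability.

    view_logits:   one logit per view (assumed view-classifier output).
    branch_scores: one list of per-action scores per view-specific branch.
    Returns (view_probs, fused), where fused holds one score per action.
    """
    view_probs = softmax(view_logits)
    num_actions = len(branch_scores[0])
    fused = [
        sum(p * scores[c] for p, scores in zip(view_probs, branch_scores))
        for c in range(num_actions)
    ]
    return view_probs, fused


# Hypothetical example: 3 views, 2 action classes.
view_logits = [2.0, 0.0, 0.0]  # the view classifier favours view 0
branch_scores = [
    [0.9, 0.1],  # action scores from the branch for view 0
    [0.2, 0.8],  # action scores from the branch for view 1
    [0.4, 0.6],  # action scores from the branch for view 2
]
view_probs, fused = fuse_action_scores(view_logits, branch_scores)
predicted_action = max(range(len(fused)), key=fused.__getitem__)
```

Because the view classifier strongly favours view 0 in this made-up example, the fused scores are dominated by that branch, even though the other two branches prefer the second action.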
Pages: 457-473
Number of pages: 17