Multi-Person Pose Estimation Using Bounding Box Constraint and LSTM

Cited by: 55
Authors
Li, Miaopeng [1]
Zhou, Zimeng [1]
Liu, Xinguo [1]
Affiliation
[1] Zhejiang Univ, State Key Lab CAD & CG, Hangzhou 310058, Zhejiang, Peoples R China
Keywords
Human pose estimation; convolutional neural network; bottom-up; bounding box constraint; long short-term memory; REPRESENTATION;
DOI
10.1109/TMM.2019.2903455
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
This paper presents a new method for single-image, multi-person pose estimation that combines the traditional bottom-up and top-down approaches. Specifically, we extract features from the input image with a residual network and use a multistage CNN to learn both the confidence maps of joints and the connection relationships between joints. During testing, we run the network's feedforward pass in a bottom-up manner, and then use the predicted confidence maps, the connection relationships, and the corresponding bounding boxes to parse the poses of all people in a top-down manner. Compared with previous top-down methods, our method is robust to bounding box shift and tightness, works well for heavily overlapping people, and runs faster. Compared with bottom-up methods, our method avoids error propagation across different people and handles disconnected joints effectively. To estimate human pose from videos, we impose a weight-sharing scheme on the multistage CNN and rewrite it as a recurrent neural network. This lets us reuse the predictions from previous frames to reduce the total number of stages, which makes invoking the network on videos significantly faster. We also adopt LSTM units between frames to capture the temporal correlation among video frames. We found that the LSTM handles input-quality degradation in videos well and successfully stabilizes the sequential outputs.
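The abstract describes a top-down parsing step that assigns joints found by the bottom-up pass to individual people using their bounding boxes. Below is a minimal Python sketch of that idea, built on several assumptions. First, joint candidates are assumed to have already been extracted as peaks of the confidence maps. Second, the paper's use of the predicted connection relationships is left out entirely. Third, the greedy rule (highest confidence first, with ties between overlapping boxes broken by normalized distance to the box centre) is a hypothetical stand-in, not the authors' algorithm. The names `Candidate`, `parse_poses`, `margin`, and `min_score` are likewise illustrative only.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Hypothetical sketch: NOT the paper's parser. It ignores the predicted
# connection relationships and uses a simple greedy assignment instead.

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Candidate:
    joint: str    # joint type, e.g. "head" (illustrative labels)
    x: float
    y: float
    score: float  # peak value from that joint's confidence map


def expand_box(box: Box, margin: float) -> Box:
    """Grow a box on every side by `margin` times its width/height."""
    x0, y0, x1, y1 = box
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return (x0 - dx, y0 - dy, x1 + dx, y1 + dy)


def inside(c: Candidate, box: Box) -> bool:
    x0, y0, x1, y1 = box
    return x0 <= c.x <= x1 and y0 <= c.y <= y1


def center_distance(c: Candidate, box: Box) -> float:
    """Distance from a candidate to the box centre, normalised by the box diagonal."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    diag = math.hypot(x1 - x0, y1 - y0) or 1.0
    return math.hypot(c.x - cx, c.y - cy) / diag


def parse_poses(
    candidates: Sequence[Candidate],
    boxes: Sequence[Box],
    margin: float = 0.1,
    min_score: float = 0.1,
) -> List[Dict[str, Candidate]]:
    """Assign each person (box) at most one candidate per joint type."""
    expanded = [expand_box(b, margin) for b in boxes]
    # Every admissible (candidate, box) pair: strongest confidence first,
    # then, for the same candidate, the box whose centre it is closest to.
    pairs = sorted(
        (-c.score, ci, center_distance(c, box), bi)
        for ci, c in enumerate(candidates)
        if c.score >= min_score
        for bi, box in enumerate(expanded)
        if inside(c, box)
    )
    poses: List[Dict[str, Candidate]] = [{} for _ in boxes]
    used = set()
    for _, ci, _, bi in pairs:
        c = candidates[ci]
        if ci in used or c.joint in poses[bi]:
            continue  # candidate already taken, or this person already has that joint
        poses[bi][c.joint] = c
        used.add(ci)
    return poses
```

Enlarging each box by `margin` is one simple way to tolerate the box shift and tightness mentioned in the abstract. Letting each candidate be used only once keeps overlapping people from sharing a joint.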
Pages: 2653-2663
Page count: 11
Cited references: 43