Understanding Atomic Hand-Object Interaction With Human Intention

Cited by: 16
Authors
Fan, Hehe [1]
Zhuo, Tao [1]
Yu, Xin [2]
Yang, Yi [2]
Kankanhalli, Mohan [1]
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[2] Univ Technol Sydney, Ctr Artificial Intelligence, Ultimo, NSW 2007, Australia
Keywords
Videos; Cognition; Pattern recognition; Three-dimensional displays; Fans; Task analysis; Neural networks; Hand-object interaction reasoning; action recognition; video analysis; deep neural networks
DOI
10.1109/TCSVT.2021.3058688
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Hand-object interaction plays an essential role when humans manipulate objects. While existing methods focus on improving hand-object recognition with fully automatic pipelines, human intention has been largely neglected in the recognition process, leading to undesirable interaction descriptions. To interpret hand-object interaction in a way that is aligned with human intention, we argue that a reference specifying that intention should be taken into account. We therefore propose a new approach that represents interactions while reflecting human purpose with three key factors, i.e., hand, object and reference. Specifically, we design a ⟨hand-object, object-reference, hand, object, reference⟩ (HOR) pattern to recognize intention-based atomic hand-object interactions. This pattern models interactions via the states of the hand, object and reference, together with the relationships among them. Furthermore, we design a simple yet effective Spatially Part-based (3+1)D convolutional neural network, namely SP(3+1)D, which, based on our HOR, leverages 3D convolutions to model visual dynamics and 1D convolutions to model object position changes. With the help of our SP(3+1)D network, the recognition results are able to indicate human purposes accurately. To evaluate the proposed method, we annotate a Something-1.3k dataset, which contains 10 atomic hand-object interactions and about 130 videos per interaction. Experimental results on Something-1.3k demonstrate the effectiveness of our SP(3+1)D network.
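
To make the (3+1)D design described above concrete, the following is a minimal PyTorch sketch of the two-branch idea: a 3D convolutional branch over RGB clips for visual dynamics, and a 1D convolutional branch over per-frame object box coordinates for position changes. The class name, layer widths and input shapes are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class SP3Plus1DSketch(nn.Module):
    # Hypothetical sketch of an SP(3+1)D-style model; names and sizes are assumptions.
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # 3D branch: visual dynamics from RGB clips shaped (B, 3, T, H, W).
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        # 1D branch: object position changes from box-coordinate tracks shaped (B, 4, T).
        self.position = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, clip, boxes):
        v = self.visual(clip).flatten(1)      # (B, 32) visual features
        p = self.position(boxes).flatten(1)   # (B, 32) positional features
        return self.classifier(torch.cat([v, p], dim=1))

# Usage: a 16-frame 112x112 clip with per-frame object box coordinates.
model = SP3Plus1DSketch(num_classes=10)
logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 4, 16))
print(logits.shape)  # torch.Size([2, 10])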
Pages: 275-285
Number of pages: 11