Understanding Atomic Hand-Object Interaction With Human Intention

Cited by: 16
Authors
Fan, Hehe [1]
Zhuo, Tao [1]
Yu, Xin [2]
Yang, Yi [2]
Kankanhalli, Mohan [1]
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[2] Univ Technol Sydney, Ctr Artificial Intelligence, Ultimo, NSW 2007, Australia
Keywords
Videos; Cognition; Pattern recognition; Three-dimensional displays; Fans; Task analysis; Neural networks; Hand-object interaction reasoning; action recognition; video analysis; deep neural networks
DOI
10.1109/TCSVT.2021.3058688
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Subject classification codes
0808; 0809
Abstract
Hand-object interaction plays an important role when humans manipulate objects. While existing methods focus on improving hand-object recognition with fully automatic pipelines, human intention has been largely neglected in the recognition process, leading to interaction descriptions that do not match what the person is trying to do. To interpret hand-object interaction in a way that is aligned with human intention, we argue that a reference specifying human intention should be taken into account. We therefore propose a new approach that represents interactions, while reflecting human purpose, with three key factors: hand, object and reference. Specifically, we design a <hand-object, object-reference, hand, object, reference> (HOR) pattern to recognize intention-based atomic hand-object interactions. This pattern models an interaction through the states of the hand, object and reference, together with the relationships among them. Furthermore, we design a simple yet effective Spatially Part-based (3+1)D convolutional neural network, named SP(3+1)D, which, based on our HOR, leverages 3D and 1D convolutions to model visual dynamics and object-position changes, respectively. With the help of our SP(3+1)D network, the recognition results accurately indicate human purposes. To evaluate the proposed method, we annotate a Something-1.3k dataset, which contains 10 atomic hand-object interactions with about 130 videos per interaction. Experimental results on Something-1.3k demonstrate the effectiveness of our SP(3+1)D network.
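The abstract specifies only the high-level design of SP(3+1)D: a 3D-convolution branch over video frames to model visual dynamics and a 1D-convolution branch over object-position sequences to model position changes. The PyTorch sketch below illustrates that (3+1)D split under stated assumptions; the module name ThreePlusOneD, the layer sizes, the per-frame (x, y, w, h) box encoding, and the concatenation-based fusion are all hypothetical choices for illustration, not the authors' SP(3+1)D architecture.

```python
# Minimal sketch of the (3+1)D idea: a 3D branch for visual dynamics plus a
# 1D branch for object-position changes, fused for interaction classification.
# All layer sizes and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class ThreePlusOneD(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # 3D branch: video clip (B, 3, T, H, W) -> appearance/motion features.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # -> (B, 32, 1, 1, 1)
        )
        # 1D branch: per-frame box track (B, 4, T) -> position-change features.
        self.position = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),  # -> (B, 32, 1)
        )
        # Late fusion by concatenation, then a linear classifier.
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, clip: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        v = self.visual(clip).flatten(1)     # (B, 32)
        p = self.position(boxes).flatten(1)  # (B, 32)
        return self.classifier(torch.cat([v, p], dim=1))

# Usage: two 8-frame 112x112 RGB clips with matching (x, y, w, h) box tracks.
if __name__ == "__main__":
    model = ThreePlusOneD(num_classes=10)
    clip = torch.randn(2, 3, 8, 112, 112)
    boxes = torch.randn(2, 4, 8)
    print(model(clip, boxes).shape)  # torch.Size([2, 10])
```

One appeal of this split is cost: the 1D stream over a length-T box sequence is nearly free, so the trajectory cue comes at little overhead while the 3D stream carries the heavy visual computation.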
Pages: 275-285
Page count: 11