Understanding Atomic Hand-Object Interaction With Human Intention

Cited by: 16
Authors
Fan, Hehe [1]
Zhuo, Tao [1]
Yu, Xin [2]
Yang, Yi [2]
Kankanhalli, Mohan [1]
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[2] Univ Technol Sydney, Ctr Artificial Intelligence, Ultimo, NSW 2007, Australia
Keywords
Videos; Cognition; Pattern recognition; Three-dimensional displays; Fans; Task analysis; Neural networks; Hand-object interaction reasoning; action recognition; video analysis; deep neural networks
DOI
10.1109/TCSVT.2021.3058688
Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Subject classification codes
0808; 0809
Abstract
Hand-object interaction plays an important role when humans manipulate objects. While existing methods focus on improving hand-object recognition with fully automatic pipelines, human intention has been largely neglected in the recognition process, leading to interaction descriptions that do not match what the person is trying to do. To interpret hand-object interaction in a way that is aligned with human intention, we argue that a reference specifying human intention should be taken into account. We therefore propose a new approach that represents interactions, while reflecting human purpose, with three key factors: hand, object and reference. Specifically, we design a <hand-object, object-reference, hand, object, reference> (HOR) pattern to recognize intention-based atomic hand-object interactions. This pattern models an interaction through the states of the hand, object and reference, together with the relationships among them. Furthermore, we design a simple yet effective Spatially Part-based (3+1)D convolutional neural network, named SP(3+1)D, which, based on our HOR, leverages 3D and 1D convolutions to model visual dynamics and object-position changes, respectively. With the help of our SP(3+1)D network, the recognition results accurately indicate human purposes. To evaluate the proposed method, we annotate a Something-1.3k dataset, which contains 10 atomic hand-object interactions with about 130 videos per interaction. Experimental results on Something-1.3k demonstrate the effectiveness of our SP(3+1)D network.
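The abstract specifies only the high-level design of SP(3+1)D: a 3D-convolution branch over video frames to model visual dynamics and a 1D-convolution branch over object-position sequences to model position changes. The PyTorch sketch below illustrates that (3+1)D split under stated assumptions; the module name ThreePlusOneD, the layer sizes, the per-frame (x, y, w, h) box encoding, and the concatenation-based fusion are all hypothetical choices for illustration, not the authors' SP(3+1)D architecture.

```python
# Minimal sketch of the (3+1)D idea: a 3D branch for visual dynamics plus a
# 1D branch for object-position changes, fused for interaction classification.
# All layer sizes and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class ThreePlusOneD(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # 3D branch: video clip (B, 3, T, H, W) -> appearance/motion features.
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # -> (B, 32, 1, 1, 1)
        )
        # 1D branch: per-frame box track (B, 4, T) -> position-change features.
        self.position = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),  # -> (B, 32, 1)
        )
        # Late fusion by concatenation, then a linear classifier.
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, clip: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        v = self.visual(clip).flatten(1)     # (B, 32)
        p = self.position(boxes).flatten(1)  # (B, 32)
        return self.classifier(torch.cat([v, p], dim=1))

# Usage: two 8-frame 112x112 RGB clips with matching (x, y, w, h) box tracks.
if __name__ == "__main__":
    model = ThreePlusOneD(num_classes=10)
    clip = torch.randn(2, 3, 8, 112, 112)
    boxes = torch.randn(2, 4, 8)
    print(model(clip, boxes).shape)  # torch.Size([2, 10])
```

One appeal of this split is cost: the 1D stream over a length-T box sequence is nearly free, so the trajectory cue comes at little overhead while the 3D stream carries the heavy visual computation.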
Pages: 275-285
Page count: 11