Understanding Atomic Hand-Object Interaction With Human Intention

Cited by: 16
Authors
Fan, Hehe [1]
Zhuo, Tao [1]
Yu, Xin [2]
Yang, Yi [2]
Kankanhalli, Mohan [1]
Affiliations
[1] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[2] Univ Technol Sydney, Ctr Artificial Intelligence, Ultimo, NSW 2007, Australia
Keywords
Videos; Cognition; Pattern recognition; Three-dimensional displays; Fans; Task analysis; Neural networks; Hand-object interaction reasoning; action recognition; video analysis; deep neural networks
DOI
10.1109/TCSVT.2021.3058688
CLC Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Hand-object interaction plays an essential role when humans manipulate objects. While existing methods focus on improving hand-object recognition with fully automatic pipelines, human intention has been largely neglected in the recognition process, leading to undesirable interaction descriptions. To interpret hand-object interaction in a way that is aligned with human intention, we argue that a reference specifying that intention should be taken into account. We therefore propose a new approach that represents interactions while reflecting human purpose with three key factors, i.e., hand, object and reference. Specifically, we design a ⟨hand-object, object-reference, hand, object, reference⟩ (HOR) pattern to recognize intention-based atomic hand-object interactions. This pattern models interactions via the states of the hand, object and reference, together with the relationships among them. Furthermore, we design a simple yet effective Spatially Part-based (3+1)D convolutional neural network, namely SP(3+1)D, which, based on our HOR, leverages 3D convolutions to model visual dynamics and 1D convolutions to model object position changes. With the help of our SP(3+1)D network, the recognition results are able to indicate human purposes accurately. To evaluate the proposed method, we annotate a Something-1.3k dataset, which contains 10 atomic hand-object interactions and about 130 videos per interaction. Experimental results on Something-1.3k demonstrate the effectiveness of our SP(3+1)D network.
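
To make the (3+1)D design described above concrete, the following is a minimal PyTorch sketch of the two-branch idea: a 3D convolutional branch over RGB clips for visual dynamics, and a 1D convolutional branch over per-frame object box coordinates for position changes. The class name, layer widths and input shapes are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class SP3Plus1DSketch(nn.Module):
    # Hypothetical sketch of an SP(3+1)D-style model; names and sizes are assumptions.
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # 3D branch: visual dynamics from RGB clips shaped (B, 3, T, H, W).
        self.visual = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        # 1D branch: object position changes from box-coordinate tracks shaped (B, 4, T).
        self.position = nn.Sequential(
            nn.Conv1d(4, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, clip, boxes):
        v = self.visual(clip).flatten(1)      # (B, 32) visual features
        p = self.position(boxes).flatten(1)   # (B, 32) positional features
        return self.classifier(torch.cat([v, p], dim=1))

# Usage: a 16-frame 112x112 clip with per-frame object box coordinates.
model = SP3Plus1DSketch(num_classes=10)
logits = model(torch.randn(2, 3, 16, 112, 112), torch.randn(2, 4, 16))
print(logits.shape)  # torch.Size([2, 10])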
Pages: 275-285
Number of pages: 11