ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Cited by: 19
Authors
Chiou, Meng-Jiun [1 ]
Liao, Chun-Yu [2 ]
Wang, Li-Wei [2 ]
Zimmermann, Roger [1 ]
Feng, Jiashi [1 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] ASUS Intelligent Cloud Serv, Taipei, Taiwan
Source
ICDAR '21: PROCEEDINGS OF THE 2021 WORKSHOP ON INTELLIGENT CROSS-DATA ANALYSIS AND RETRIEVAL | 2021
Keywords
Human-Object Interaction; Action Detection; Video Understanding; Affordances
DOI
10.1145/3463944.3469097
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Detecting human-object interactions (HOI) is an important step toward comprehensive visual understanding in machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporal-related HOIs (e.g., opening/closing a door) from a single video frame; the neighboring frames play an essential role. Nevertheless, conventional HOI methods that operate only on static images have been used to predict temporal-related interactions, which is essentially guessing without temporal context and may lead to suboptimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI), which utilizes temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masked pose features. We construct a new video HOI benchmark dubbed VidHOI, on which our proposed approach serves as a solid baseline.
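Not part of the record, but the "correctly-localized visual features" the abstract describes boil down to pooling features per frame along a tracked trajectory, rather than reusing one keyframe box for every frame. A minimal NumPy sketch of that idea (all names and shapes are hypothetical, not the authors' implementation):

```python
import numpy as np

def roi_avg_pool(frame_feat, box):
    """Average-pool a (C, H, W) feature map inside an integer box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return frame_feat[:, y1:y2, x1:x2].mean(axis=(1, 2))

def trajectory_features(video_feat, trajectory):
    """One RoI feature per frame, each pooled at that frame's own box.

    video_feat: (T, C, H, W) array; trajectory: list of T boxes following the
    tracked person/object, so features stay aligned as the target moves.
    """
    return np.stack([roi_avg_pool(f, b) for f, b in zip(video_feat, trajectory)])

# Toy example: 4 frames, 8 channels; the box drifts rightward over time.
T, C, H, W = 4, 8, 16, 16
video = np.random.rand(T, C, H, W)
traj = [(2, 2, 6, 6), (3, 2, 7, 6), (4, 3, 8, 7), (5, 3, 9, 7)]
feats = trajectory_features(video, traj)
print(feats.shape)  # (4, 8): one pooled feature vector per frame
```

Replicating a single keyframe box across all T frames instead would pool features from the wrong locations once the target moves, which is the feature-inconsistency issue the abstract refers to.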
Pages: 9-17
Page count: 9