ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

Cited by: 19
Authors
Chiou, Meng-Jiun [1 ]
Liao, Chun-Yu [2 ]
Wang, Li-Wei [2 ]
Zimmermann, Roger [1 ]
Feng, Jiashi [1 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] ASUS Intelligent Cloud Serv, Taipei, Taiwan
Source
ICDAR '21: PROCEEDINGS OF THE 2021 WORKSHOP ON INTELLIGENT CROSS-DATA ANALYSIS AND RETRIEVAL | 2021
Keywords
Human-Object Interaction; Action Detection; Video Understanding; Affordances
DOI
10.1145/3463944.3469097
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Detecting human-object interactions (HOI) is an important step toward comprehensive visual understanding in machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporal-related HOIs (e.g., opening/closing a door) from a single video frame; the neighboring frames play an essential role. Nevertheless, conventional HOI methods that operate only on static images have been used to predict temporal-related interactions, which is essentially guessing without temporal context and may lead to suboptimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI), which utilizes temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masked pose features. We construct a new video HOI benchmark dubbed VidHOI, on which our proposed approach serves as a solid baseline.
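Not part of the record, but the "correctly-localized visual features" the abstract describes boil down to pooling features per frame along a tracked trajectory, rather than reusing one keyframe box for every frame. A minimal NumPy sketch of that idea (all names and shapes are hypothetical, not the authors' implementation):

```python
import numpy as np

def roi_avg_pool(frame_feat, box):
    """Average-pool a (C, H, W) feature map inside an integer box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return frame_feat[:, y1:y2, x1:x2].mean(axis=(1, 2))

def trajectory_features(video_feat, trajectory):
    """One RoI feature per frame, each pooled at that frame's own box.

    video_feat: (T, C, H, W) array; trajectory: list of T boxes following the
    tracked person/object, so features stay aligned as the target moves.
    """
    return np.stack([roi_avg_pool(f, b) for f, b in zip(video_feat, trajectory)])

# Toy example: 4 frames, 8 channels; the box drifts rightward over time.
T, C, H, W = 4, 8, 16, 16
video = np.random.rand(T, C, H, W)
traj = [(2, 2, 6, 6), (3, 2, 7, 6), (4, 3, 8, 7), (5, 3, 9, 7)]
feats = trajectory_features(video, traj)
print(feats.shape)  # (4, 8): one pooled feature vector per frame
```

Replicating a single keyframe box across all T frames instead would pool features from the wrong locations once the target moves, which is the feature-inconsistency issue the abstract refers to.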
Pages: 9-17
Page count: 9