HOTR: End-to-End Human-Object Interaction Detection with Transformers

Cited by: 206
Authors
Kim, Bumsoo [1, 2]
Lee, Junhyun [2 ]
Kang, Jaewoo [2 ]
Kim, Eun-Sol [1 ]
Kim, Hyunwoo J. [2 ]
Affiliations
[1] Kakao Brain, Seongnam, South Korea
[2] Korea Univ, Seoul, South Korea
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
Funding
National Research Foundation, Singapore
DOI
10.1109/CVPR46437.2021.00014
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human-Object Interaction (HOI) detection is the task of identifying "a set of interactions" in an image, which involves i) the localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels. Most existing methods have addressed this task indirectly, by detecting human and object instances and then individually inferring every pair of the detected instances. In this paper, we present a novel framework, referred to as HOTR, which directly predicts a set of <human, object, interaction> triplets from an image based on a transformer encoder-decoder architecture. Through set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing, which is the main bottleneck of existing methods. Our proposed algorithm achieves state-of-the-art performance on two HOI detection benchmarks with an inference time under 1 ms after object detection.
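The contrast drawn in the abstract, between inferring every detected human-object pair and predicting a set of triplets directly, can be illustrated with a small sketch. This is not the paper's implementation: the `Triplet` container, the `pairwise_candidates` helper, the box coordinates, and the "ride" label are all hypothetical and serve only to show how the pairwise candidate count grows with the number of detections.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Triplet:
    """Hypothetical container for one <human, object, interaction> prediction."""
    human_box: tuple
    object_box: tuple
    interaction: str


def pairwise_candidates(human_boxes, object_boxes):
    """Pair every detected human with every detected object.

    In a pairwise pipeline each candidate needs its own interaction
    inference, so the work grows with len(human_boxes) * len(object_boxes).
    """
    return list(product(human_boxes, object_boxes))


# Made-up detections as (x1, y1, x2, y2) boxes.
humans = [(0, 0, 10, 20), (30, 0, 40, 20)]
objects = [(5, 10, 15, 25), (35, 10, 45, 25), (60, 60, 70, 70)]

# Pairwise style: 2 humans x 3 objects give 6 candidates to classify.
candidates = pairwise_candidates(humans, objects)

# Set-prediction style: the output is itself a set of triplets,
# with no enumeration of pairs. The content here is invented.
predicted = {Triplet(humans[0], objects[0], "ride")}
```

Here `len(candidates)` is 6 while `predicted` holds a single triplet, which is the kind of gap the abstract refers to when it calls per-pair inference indirect.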
Pages: 74-83
Page count: 10