Human-Object Interaction Prediction with Natural Language Supervision

Cited by: 1

Authors:

Li, Zhengxue [1,2]

An, Gaoyun [1,2]

Affiliations:

[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China

[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China

Source:

2022 16TH IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP2022), VOL 1 | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

Human-Object Interaction; Zero-Shot Learning; Natural Language Supervision; Precise Position Embedding;

DOI:

10.1109/ICSP56322.2022.9965210

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Subject classification codes:

081202 ; 0835 ;

Abstract:

Although datasets for the Human-Object Interaction (HOI) task already contain a rich set of interaction types, it is impractical to label and learn every (object, interaction) combination, since the same object can take part in different categories of interaction with humans. When uncommon interaction combinations occur in real application scenarios, existing models struggle to make correct predictions. To address these issues, we propose a novel Transformer-based HOI prediction model. The model converts the triplet labels (human, interaction, object) of HOI tasks into natural language descriptions of images and uses the converted sentences as new image labels, predicting interactions in a joint space of natural language and HOI features. This approach transforms the image-to-triplet mapping problem into an image-to-natural-language mapping problem, so it can handle uncommon HOI combinations. In addition, we use a new Precise Relative Position Embedding method for images, which enhances distance perception between image instances and improves instance relevance detection in the joint space. Because the model can identify new interaction combinations, it can also be applied to zero-shot learning experiments. Extensive experiments on the SWIG-HOI and HICO-DET datasets show that our model improves noticeably over previous methods.
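The label conversion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sentence template, the helper names `triplet_to_sentence` and `predict`, and the two-dimensional toy vectors are all assumptions made for illustration. In the paper's setting the vectors would come from learned image and text encoders; here they are hard-coded so that the matching step in a joint space can be shown on its own.

```python
import math
from typing import List, Sequence, Tuple

Triplet = Tuple[str, str, str]  # (human, interaction, object)


def triplet_to_sentence(triplet: Triplet) -> str:
    """Render a triplet label as a short caption (hypothetical template)."""
    human, interaction, obj = triplet
    article = "an" if obj and obj[0].lower() in "aeiou" else "a"
    return f"a {human} is {interaction} {article} {obj}"


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict(image_vec: Sequence[float],
            sentence_vecs: List[Sequence[float]]) -> int:
    """Return the index of the candidate sentence closest to the image."""
    scores = [cosine(image_vec, s) for s in sentence_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


# Candidate labels, including combinations that may be rare in training data.
candidates: List[Triplet] = [
    ("person", "riding", "bicycle"),
    ("person", "eating", "apple"),
]
sentences = [triplet_to_sentence(t) for t in candidates]

# Toy stand-ins for encoder outputs in the joint space (assumed values).
toy_sentence_vecs = [[1.0, 0.0], [0.0, 1.0]]
toy_image_vec = [0.9, 0.1]

best = predict(toy_image_vec, toy_sentence_vecs)
print(sentences[best])
```

Because candidates are expressed as sentences rather than fixed class indices, an unseen combination can be scored simply by adding its sentence to the candidate list, which is the property the abstract relies on for zero-shot prediction.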


Pages: 124-128

Page count: 5
