Human-Object Interaction Detection with Ratio-Transformer

Times cited: 0

Authors:

Wang, Tianlang [1]

Lu, Tao [1]

Fang, Wenhua [1]

Zhang, Yanduo [1]

Affiliations:

[1] Wuhan Inst Technol, Hubei Key Lab Intelligent Robot, Sch Comp Sci & Engn, Wuhan 430000, Peoples R China

Source:

SYMMETRY-BASEL | 2022 / Vol. 14 / Issue 08

Funding:

National Natural Science Foundation of China;

Keywords:

human-object interaction; end-to-end; attention mechanism; transformer; symmetry; sampler; VCOCO;

DOI:

10.3390/sym14081666

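Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification codes: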

07 ; 0710 ; 09 ;

Abstract:

Human-object interaction (HOI) detection is a human-centered object detection task that aims to identify the interactions between persons and objects in an image. Previous end-to-end methods have used the attention mechanism of a transformer to identify the associations between persons and objects automatically, which effectively improved detection accuracy; however, a transformer increases computational demands and slows down detection. In addition, end-to-end methods suffer from an asymmetry between foreground and background information: the foreground data may be far smaller in volume than the background data, yet the background consumes more computational resources without significantly improving detection accuracy. Therefore, we proposed an input-controlled transformer, the "ratio-transformer", for the HOI task. By setting a sampling ratio, it limited the amount of information fed into the transformer and significantly reduced the computational demands while preserving detection accuracy. The ratio-transformer consisted of a sampling module and a transformer network. The sampling module divided the input feature map into foreground and background features. The irrelevant background features were passed through a pooling sampler and then fused with the foreground features as input data for the transformer. As a result, the valid data input into the transformer network remained constant, while irrelevant information was significantly reduced, which maintained the symmetry between foreground and background information. The proposed network was able to learn the feature information of the target itself and the association features between persons and objects, so it could query to obtain the complete HOI triplet. Experiments on the VCOCO dataset showed that the proposed method reduced the computational demand of the transformer by 57% without any loss of accuracy, as compared to other current HOI methods.
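The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ratio_sample`, the per-token foreground `scores`, the `pooled_bg` parameter, and the choice of average pooling over contiguous groups are all assumptions made for illustration, since the abstract does not specify how foreground is scored or how the background is pooled.

```python
import numpy as np

def ratio_sample(tokens, scores, ratio=0.5, pooled_bg=4):
    """Keep a fixed ratio of tokens as foreground and pool the rest.

    tokens:    (N, C) flattened feature-map tokens.
    scores:    (N,) hypothetical foreground scores (higher = more relevant).
    ratio:     fraction of tokens kept unchanged as foreground.
    pooled_bg: maximum number of tokens the background is pooled into.
    """
    n = tokens.shape[0]
    n_fg = max(1, int(round(n * ratio)))

    # Rank tokens by score; the top n_fg are treated as foreground.
    order = np.argsort(-scores, kind="stable")
    fg_idx = np.sort(order[:n_fg])   # restore spatial order
    bg_idx = np.sort(order[n_fg:])

    fg = tokens[fg_idx]
    if bg_idx.size == 0:
        return fg

    # Assumed pooling: average contiguous groups of background tokens.
    groups = np.array_split(tokens[bg_idx], min(pooled_bg, bg_idx.size))
    bg = np.stack([g.mean(axis=0) for g in groups])

    # Fuse foreground and pooled background as the transformer input.
    return np.concatenate([fg, bg], axis=0)
```

With 16 input tokens, `ratio=0.25`, and `pooled_bg=2`, the function returns 6 tokens (4 foreground plus 2 pooled background), so the sequence length seen by the transformer is set by the ratio rather than by the full feature-map size.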

Pages: 10
