Human-object interaction detection based on disentangled axial attention transformer

Cited by: 0

Authors:

Xia, Limin [1]

Xiao, Qiyue [1]

Affiliations:

[1] Cent South Univ, Sch Automat, Changsha 410083, Peoples R China

Source:

MACHINE VISION AND APPLICATIONS | 2024 / Vol. 35 / No. 04

Keywords:

Human-object interaction detection; Transformer; Disentanglement strategy; Axial attention

DOI:

10.1007/s00138-024-01558-8

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Human-object interaction (HOI) detection aims to localize humans and objects in an image and infer the interactions between them. Recent works have proposed transformer encoder-decoder architectures for HOI detection with exceptional performance, but these architectures have certain drawbacks: they do not employ a complete disentanglement strategy to learn more discriminative features for the different sub-tasks; they cannot achieve sufficient contextual exchange within each branch, which is crucial for accurate relational reasoning; and their transformer models suffer from high computational cost and large memory usage due to complex attention calculations. In this work, we propose a disentangled transformer network that disentangles both the encoder and the decoder into three branches for human detection, object detection, and interaction classification. We then propose a novel feature unify decoder to associate the predictions of each disentangled decoder, and introduce a multiplex relation embedding module and an attentive fusion module to perform sufficient contextual information exchange among branches. Additionally, to reduce the model's computational cost, a position-sensitive axial attention is incorporated into the encoder, allowing our model to achieve a better accuracy-complexity trade-off. Extensive experiments are conducted on two public HOI benchmarks to demonstrate the effectiveness of our approach. The results indicate that our model outperforms other methods, achieving state-of-the-art performance.
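
Note on the accuracy-complexity trade-off mentioned in the abstract: the sketch below is not taken from the paper. It only illustrates, under simple assumptions, why axial attention is generally cheaper than full 2D self-attention, by counting the query-key pairs each scheme scores on an H x W feature map. The function names and the 32 x 32 map size are hypothetical choices for illustration; the paper's position-sensitive variant and its actual feature-map sizes are not described in this record.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every position attends to every other position."""
    n = height * width
    return n * n


def axial_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs for one row-wise pass followed by one column-wise pass.

    In the row pass each position attends to the `width` positions in its row;
    in the column pass each position attends to the `height` positions in its
    column.
    """
    n = height * width
    return n * width + n * height


if __name__ == "__main__":
    h, w = 32, 32  # hypothetical feature-map size, not from the paper
    full = full_attention_pairs(h, w)
    axial = axial_attention_pairs(h, w)
    print(f"full: {full}, axial: {axial}, ratio: {full / axial:.1f}")
```

Under these assumptions the pair count drops from (HW)^2 to HW(H + W), which is the usual motivation for axial attention; the savings the authors actually report may differ.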


Pages: 17

