Human-object interaction detection based on disentangled axial attention transformer

Citations: 0
Authors
Xia, Limin [1]
Xiao, Qiyue [1]
Affiliations
[1] Cent South Univ, Sch Automat, Changsha 410083, Peoples R China
Keywords
Human-object interaction detection; Transformer; Disentanglement strategy; Axial attention
DOI
10.1007/s00138-024-01558-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human-object interaction (HOI) detection aims to localize humans and objects in an image and infer the interactions between them. Recent work has proposed transformer encoder-decoder architectures for HOI detection that achieve exceptional performance, but these architectures have certain drawbacks: they do not employ a complete disentanglement strategy to learn more discriminative features for the different sub-tasks; they cannot achieve sufficient contextual exchange within each branch, which is crucial for accurate relational reasoning; and their transformer models suffer from high computational cost and large memory usage due to complex attention calculations. In this work, we propose a disentangled transformer network that disentangles both the encoder and the decoder into three branches for human detection, object detection, and interaction classification. We then propose a novel feature unify decoder to associate the predictions of each disentangled decoder, and introduce a multiplex relation embedding module and an attentive fusion module to perform sufficient contextual information exchange among branches. Additionally, to reduce the model's computational cost, position-sensitive axial attention is incorporated into the encoder, allowing our model to achieve a better accuracy-complexity trade-off. Extensive experiments on two public HOI benchmarks demonstrate the effectiveness of our approach. The results indicate that our model outperforms other methods, achieving state-of-the-art performance.
Pages: 17