Human-object interaction detection based on disentangled axial attention transformer

Cited by: 0
Authors
Xia, Limin [1]
Xiao, Qiyue [1]
Affiliations
[1] Cent South Univ, Sch Automat, Changsha 410083, Peoples R China
Keywords
Human-object interaction detection; Transformer; Disentanglement strategy; Axial attention
DOI
10.1007/s00138-024-01558-8
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human-object interaction (HOI) detection aims to localize humans and objects in an image and infer the interactions between them. Recent work has proposed transformer encoder-decoder architectures for HOI detection that perform exceptionally well, but these architectures have certain drawbacks: they do not employ a complete disentanglement strategy to learn more discriminative features for the different sub-tasks; they cannot achieve sufficient contextual exchange within each branch, which is crucial for accurate relational reasoning; and their transformer models suffer from high computational cost and large memory usage due to complex attention calculations. In this work, we propose a disentangled transformer network that splits both the encoder and the decoder into three branches for human detection, object detection, and interaction classification. We then propose a novel feature unify decoder to associate the predictions of the disentangled decoders, and introduce a multiplex relation embedding module and an attentive fusion module to perform sufficient contextual information exchange among branches. Additionally, to reduce the model's computational cost, position-sensitive axial attention is incorporated into the encoder, allowing our model to achieve a better accuracy-complexity trade-off. Extensive experiments on two public HOI benchmarks demonstrate the effectiveness of our approach. The results indicate that our model outperforms other methods, achieving state-of-the-art performance.
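The computational-cost claim can be illustrated with a rough count of the query-key pairs an attention layer has to score. The sketch below assumes the generic formulation of axial attention (one pass along the height axis followed by one pass along the width axis) and ignores the position-sensitive terms. It is not the authors' implementation, and the 32 x 32 feature-map size and both function names are hypothetical, since this record gives no such details.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs scored by full 2D self-attention on an H x W map."""
    n = height * width          # every position attends to every position
    return n * n


def axial_attention_pairs(height: int, width: int) -> int:
    """Pairs scored by one height-axis pass plus one width-axis pass."""
    # Height pass: each of the W columns attends within its own H positions.
    height_pass = width * height * height
    # Width pass: each of the H rows attends within its own W positions.
    width_pass = height * width * width
    return height_pass + width_pass


if __name__ == "__main__":
    h, w = 32, 32               # hypothetical feature-map size (assumption)
    full = full_attention_pairs(h, w)
    axial = axial_attention_pairs(h, w)
    print(f"full: {full}, axial: {axial}, ratio: {full / axial:.1f}")
```

Under these assumptions the count drops from (HW)^2 to HW(H + W), which is where the lower cost and memory use of an axial encoder would come from.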
Pages: 17