Human-object interaction detection based on cascade multi-scale transformer

Cited by: 6
Authors
Xia, Limin [1 ]
Ding, Xiaoyue [1 ]
Affiliations
[1] Cent South Univ, Sch Automat, Changsha 410083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Human-object interaction; Transformer; Multi-scale; Cascade decoders
DOI
10.1007/s10489-024-05324-1
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Human-object interaction (HOI) detection is a high-level computer vision task that detects the relationships between humans and the objects around them. Several methods have achieved impressive results on this task, yet they still have notable limitations. We analyze in detail the strengths and weaknesses of the existing paradigms and propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder, which are responsible for extracting contextual features, localizing human-object pairs, and classifying the specific interaction of each pair, respectively. CMST decouples object detection from interaction classification while still maintaining an end-to-end detection pipeline. To address the high computational complexity and slow convergence associated with the transformer architecture, we further propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. These attentions introduce multi-scale features, making the model well suited to complex scenes with instances of varying scales. The effectiveness of our approach is demonstrated on widely used benchmarks, where it achieves clear improvements over existing methods. The experimental results show that CMST has great potential for real-time applications and complex-scene detection.
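The abstract outlines a three-stage cascade (shared encoder, human-object pair decoder, interaction decoder) trained end-to-end. The following is a minimal structural sketch of that cascade, assuming a DETR-style query-based pipeline; all module choices, head names, layer counts, and dimensions (d_model=256, 100 queries, HICO-DET-style class counts) are illustrative assumptions, and the paper's multi-scale human-object pair attention and multi-scale interaction attention are not reproduced here: standard transformer layers stand in for them.

# Minimal structural sketch of the cascade described in the abstract.
# Assumptions: a DETR-style query-based pipeline with standard nn.Transformer
# layers standing in for the paper's multi-scale attention mechanisms; all
# names, layer counts, and dimensions below are illustrative, not the authors'.
import torch
import torch.nn as nn


class CMSTSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100,
                 num_obj_classes=80, num_verb_classes=117):
        super().__init__()
        # Shared encoder: turns flattened backbone features into contextual memory.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # First cascade stage: localizes human-object pairs from learned queries.
        pair_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.pair_decoder = nn.TransformerDecoder(pair_layer, num_layers=3)
        # Second cascade stage: classifies the interaction of each localized pair,
        # taking the pair decoder's output as its queries (the cascade link).
        inter_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.interaction_decoder = nn.TransformerDecoder(inter_layer, num_layers=3)
        self.queries = nn.Embedding(num_queries, d_model)
        # Prediction heads (hypothetical): boxes for the human and the object,
        # object class, and interaction (verb) class.
        self.human_box_head = nn.Linear(d_model, 4)
        self.object_box_head = nn.Linear(d_model, 4)
        self.object_cls_head = nn.Linear(d_model, num_obj_classes + 1)
        self.verb_cls_head = nn.Linear(d_model, num_verb_classes)

    def forward(self, backbone_tokens):
        # backbone_tokens: (batch, num_tokens, d_model) flattened image features.
        memory = self.shared_encoder(backbone_tokens)
        queries = self.queries.weight.unsqueeze(0).expand(backbone_tokens.size(0), -1, -1)
        pair_feat = self.pair_decoder(queries, memory)             # pair localization branch
        inter_feat = self.interaction_decoder(pair_feat, memory)   # interaction classification branch
        return {
            "human_boxes": self.human_box_head(pair_feat).sigmoid(),
            "object_boxes": self.object_box_head(pair_feat).sigmoid(),
            "object_logits": self.object_cls_head(pair_feat),
            "verb_logits": self.verb_cls_head(inter_feat),
        }


# Example: one forward pass on dummy features from a hypothetical backbone.
if __name__ == "__main__":
    model = CMSTSketch()
    dummy = torch.randn(2, 550, 256)   # batch of 2, 550 spatial tokens, d_model=256
    out = model(dummy)
    print({k: tuple(v.shape) for k, v in out.items()})

Feeding the pair decoder's output to the interaction decoder as its query set is what makes the two stages a cascade: interaction classification is conditioned on already-localized pairs rather than on raw queries, which is how the sketch reflects the decoupling of detection and classification described in the abstract.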
Pages: 2831-2850 (20 pages)