Human-object interaction detection based on cascade multi-scale transformer

Cited by: 6
Authors
Xia, Limin [1]
Ding, Xiaoyue [1]
Affiliations
[1] Cent South Univ, Sch Automat, Changsha 410083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Human-object interaction; Transformer; Multi-scale; Cascade decoders
DOI
10.1007/s10489-024-05324-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Human-object interaction (HOI) detection is an advanced computer vision task that detects the relationships between humans and surrounding objects. Existing methods accomplish this task with impressive results but have certain limitations. We analyze in detail the advantages and disadvantages of the different paradigms and propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder. These three components are responsible for extracting contextual features, localizing the human-object pairs, and classifying the specific interactions within each pair, respectively. CMST decouples the tasks of object detection and interaction classification while still maintaining an end-to-end detection pipeline. Furthermore, we aim to address the high computational complexity and slow convergence associated with the transformer architecture. To this end, we propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. By incorporating these attention mechanisms, we introduce multi-scale features, making our model well suited to complex scenes involving instances of varying scales. The effectiveness of our approach is demonstrated on widely used benchmarks, where it achieves improved results. The experimental results demonstrate that CMST has great potential for real-time applications and complex scene detection.
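The three-stage data flow described in the abstract (shared encoder, then human-object pair decoder, then interaction decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature dimension, the query count, and the plain single-head dot-product attention are all assumptions made for illustration, and the paper's multi-scale attention mechanisms, prediction heads, and training procedure are not modeled.

```python
import math
import random


def attention(queries, memory):
    """Single-head scaled dot-product attention; memory acts as keys and values."""
    dim = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * m[i] for w, m in zip(weights, memory))
                    for i in range(dim)])
    return out


def shared_encoder(features):
    # Hypothetical stand-in: self-attention over image features gives contextual memory.
    return attention(features, features)


def pair_decoder(pair_queries, memory):
    # Stage 1 (assumed form): each query gathers evidence for one human-object pair.
    return attention(pair_queries, memory)


def interaction_decoder(pair_embeddings, memory):
    # Stage 2 (cascade, assumed form): stage-1 outputs are reused as queries.
    return attention(pair_embeddings, memory)


def cascade_forward(features, pair_queries):
    memory = shared_encoder(features)
    pairs = pair_decoder(pair_queries, memory)
    interactions = interaction_decoder(pairs, memory)
    return pairs, interactions


rng = random.Random(0)
features = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(16)]     # 16 tokens, dim 8
pair_queries = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]  # 4 pair queries
pairs, interactions = cascade_forward(features, pair_queries)
print(len(pairs), len(interactions), len(interactions[0]))  # prints: 4 4 8
```

In this sketch both decoders read the same encoder memory, which is one plausible reading of "shared encoder"; localization and interaction classification stay in separate stages while the forward pass remains a single end-to-end call.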
Pages: 2831-2850
Page count: 20