Human-object interaction detection based on cascade multi-scale transformer

Cited by: 6
Authors
Xia, Limin [1 ]
Ding, Xiaoyue [1 ]
Affiliations
[1] Cent South Univ, Sch Automat, Changsha 410083, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Human-object interaction; Transformer; Multi-scale; Cascade decoders
DOI
10.1007/s10489-024-05324-1
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Human-object interaction (HOI) detection is a high-level computer vision task that detects the relationships between humans and the objects around them. Several methods have achieved impressive results on this task, yet they still have notable limitations. We analyze in detail the strengths and weaknesses of the existing paradigms and propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder, which are responsible for extracting contextual features, localizing human-object pairs, and classifying the specific interaction of each pair, respectively. CMST decouples object detection from interaction classification while still maintaining an end-to-end detection pipeline. To address the high computational complexity and slow convergence associated with the transformer architecture, we further propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. These attentions introduce multi-scale features, making the model well suited to complex scenes with instances of varying scales. The effectiveness of our approach is demonstrated on widely used benchmarks, where it achieves clear improvements over existing methods. The experimental results show that CMST has great potential for real-time applications and complex-scene detection.
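The abstract outlines a three-stage cascade (shared encoder, human-object pair decoder, interaction decoder) trained end-to-end. The following is a minimal structural sketch of that cascade, assuming a DETR-style query-based pipeline; all module choices, head names, layer counts, and dimensions (d_model=256, 100 queries, HICO-DET-style class counts) are illustrative assumptions, and the paper's multi-scale human-object pair attention and multi-scale interaction attention are not reproduced here: standard transformer layers stand in for them.

# Minimal structural sketch of the cascade described in the abstract.
# Assumptions: a DETR-style query-based pipeline with standard nn.Transformer
# layers standing in for the paper's multi-scale attention mechanisms; all
# names, layer counts, and dimensions below are illustrative, not the authors'.
import torch
import torch.nn as nn


class CMSTSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100,
                 num_obj_classes=80, num_verb_classes=117):
        super().__init__()
        # Shared encoder: turns flattened backbone features into contextual memory.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # First cascade stage: localizes human-object pairs from learned queries.
        pair_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.pair_decoder = nn.TransformerDecoder(pair_layer, num_layers=3)
        # Second cascade stage: classifies the interaction of each localized pair,
        # taking the pair decoder's output as its queries (the cascade link).
        inter_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.interaction_decoder = nn.TransformerDecoder(inter_layer, num_layers=3)
        self.queries = nn.Embedding(num_queries, d_model)
        # Prediction heads (hypothetical): boxes for the human and the object,
        # object class, and interaction (verb) class.
        self.human_box_head = nn.Linear(d_model, 4)
        self.object_box_head = nn.Linear(d_model, 4)
        self.object_cls_head = nn.Linear(d_model, num_obj_classes + 1)
        self.verb_cls_head = nn.Linear(d_model, num_verb_classes)

    def forward(self, backbone_tokens):
        # backbone_tokens: (batch, num_tokens, d_model) flattened image features.
        memory = self.shared_encoder(backbone_tokens)
        queries = self.queries.weight.unsqueeze(0).expand(backbone_tokens.size(0), -1, -1)
        pair_feat = self.pair_decoder(queries, memory)             # pair localization branch
        inter_feat = self.interaction_decoder(pair_feat, memory)   # interaction classification branch
        return {
            "human_boxes": self.human_box_head(pair_feat).sigmoid(),
            "object_boxes": self.object_box_head(pair_feat).sigmoid(),
            "object_logits": self.object_cls_head(pair_feat),
            "verb_logits": self.verb_cls_head(inter_feat),
        }


# Example: one forward pass on dummy features from a hypothetical backbone.
if __name__ == "__main__":
    model = CMSTSketch()
    dummy = torch.randn(2, 550, 256)   # batch of 2, 550 spatial tokens, d_model=256
    out = model(dummy)
    print({k: tuple(v.shape) for k, v in out.items()})

Feeding the pair decoder's output to the interaction decoder as its query set is what makes the two stages a cascade: interaction classification is conditioned on already-localized pairs rather than on raw queries, which is how the sketch reflects the decoupling of detection and classification described in the abstract.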
Pages: 2831-2850 (20 pages)