Human-object interaction detection based on cascade multi-scale transformer

Cited by: 6

Authors:

Xia, Limin [1]

Ding, Xiaoyue [1]

Affiliations:

[1] Cent South Univ, Sch Automat, Changsha 410083, Peoples R China

Source:

APPLIED INTELLIGENCE | 2024 / Vol. 54 / Issue 03

Funding:

National Natural Science Foundation of China;

Keywords:

Human-object interaction; Transformer; Multi-scale; Cascade decoders;

DOI:

10.1007/s10489-024-05324-1

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Human-object interaction (HOI) detection is an advanced computer vision task that detects the relationships between humans and surrounding objects. Existing methods achieve impressive results on this task but have certain limitations. We analyze in detail the advantages and disadvantages of the different paradigms and propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder. These components are responsible for extracting contextual features, localizing the human-object pairs, and classifying the specific interactions within each pair, respectively. CMST decouples the tasks of object detection and interaction classification while still maintaining an end-to-end detection pipeline. Furthermore, we aim to address the high computational complexity and slow convergence associated with the transformer architecture. To this end, we propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. These attention mechanisms introduce multi-scale features, making our model well suited to complex scenes involving instances of varying scales. The effectiveness of our approach is demonstrated on widely used benchmarks, where it achieves improved results. The experimental results demonstrate that CMST has great potential for real-time applications and complex scene detection.
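
The cascade described in the abstract (shared encoder, then human-object pair decoder, then interaction decoder) can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the function names, the sizes (16 tokens, 4 queries, dimension 8), and the use of plain single-scale dot-product attention without learned weights or prediction heads are all assumptions made for illustration. The multi-scale attention mechanisms named in the abstract are not modelled here.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    # Single-head dot-product attention with no learned projections (toy stand-in).
    dim = len(memory[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        out.append([sum(w * m[d] for w, m in zip(weights, memory)) for d in range(dim)])
    return out


def shared_encoder(tokens):
    # Hypothetical stand-in for the shared encoder: self-attention over image tokens.
    return cross_attend(tokens, tokens)


def pair_decoder(queries, memory):
    # Hypothetical stand-in for the pair decoder: one embedding per candidate pair.
    return cross_attend(queries, memory)


def interaction_decoder(pair_embeddings, memory):
    # Cascade step: the pair embeddings act as queries of the second decoder.
    return cross_attend(pair_embeddings, memory)


rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(16)]   # image tokens
queries = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]   # pair queries

memory = shared_encoder(tokens)
pairs = pair_decoder(queries, memory)
interactions = interaction_decoder(pairs, memory)
print(len(memory), len(pairs), len(interactions), len(interactions[0]))
```

In a real detector, box-regression heads would presumably read the pair embeddings and a verb classifier would read the interaction embeddings; the sketch only shows that both decoders attend to the same encoder output and that the second decoder is driven by the output of the first.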


Pages: 2831-2850

Page count: 20

