Multi-Scale Human-Object Interaction Detector

Cited by: 11
Authors
Cheng, Yamin [1 ]
Wang, Zhi [1 ]
Zhan, Wenhan [1 ]
Duan, Hancong [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
Keywords
Transformers; Detectors; Computer architecture; Task analysis; Decoding; Iterative decoding; Feature extraction; Human-object interaction; vision transformer; multi-scale; NETWORK;
DOI
10.1109/TCSVT.2022.3216663
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
Transformers are transforming the landscape of computer vision, especially for image-level recognition and instance-level detection tasks. The human-object interaction detection transformer (HOI-TR) is the first transformer-based end-to-end learning system for human-object interaction (HOI) detection, while vision transformers, the first patch-based transformer architectures for image-level recognition and instance-level detection, build a simple multi-stage structure for multi-scale representation from single-scale patches. In this paper, we build a transformer-based multi-scale human-object interaction detector (MHOI), a novel method that integrates the vision transformer and the HOI detection transformer. Rather than directly combining the two types of transformers, we redesign the integration, since the vision transformer's single-scale patch partitioning lacks the hierarchical architecture needed to handle large variations in the scale of visual entities. Specifically, MHOI uses overlapping convolutional patch embedding to simultaneously embed patches of variable scales into features of the same size (i.e., sequence length), and then introduces an efficient transformer decoder whose queries are designed around anchor points, together with essential auxiliary techniques, to boost HOI detection performance. Extensive experiments on several benchmarks demonstrate that our proposed framework consistently outperforms prior methods, achieving 29.67 mAP on HICO-DET and 58.7 mAP on V-COCO, respectively.
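The key property claimed for the patch embedding, that patches of different scales map to token sequences of identical length, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the kernel sizes, stride, random linear projection, and "same"-style padding are all illustrative choices; overlapping windows arise whenever the kernel exceeds the stride.

```python
import numpy as np

def overlapping_patch_embed(image, kernel, stride=4, embed_dim=64, seed=0):
    """Embed an image into a token sequence using overlapping patches.

    A sliding window of size `kernel` (which may exceed `stride`, so
    neighbouring patches overlap) is flattened and linearly projected.
    Padding of kernel // 2 makes the output grid depend only on `stride`,
    so different patch scales yield the same sequence length.
    """
    H, W, C = image.shape
    pad = kernel // 2  # 'same'-style padding for odd kernels
    x = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))

    # Illustrative random projection standing in for a learned conv filter.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((kernel * kernel * C, embed_dim)) * 0.02

    out_h = (x.shape[0] - kernel) // stride + 1
    out_w = (x.shape[1] - kernel) // stride + 1
    tokens = np.empty((out_h * out_w, embed_dim))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kernel,
                      j * stride:j * stride + kernel, :]
            tokens[i * out_w + j] = patch.reshape(-1) @ proj
    return tokens

img = np.random.default_rng(1).standard_normal((64, 64, 3))
# Patch scales 3, 5, 7 all produce 16 x 16 = 256 tokens of dimension 64.
lens = {k: overlapping_patch_embed(img, kernel=k).shape for k in (3, 5, 7)}
```

Because the sequence length is fixed by the stride alone, per-scale token sequences can be fused (e.g., summed or concatenated along the channel axis) before entering the transformer encoder.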
Pages: 1827-1838 (12 pages)