Cross-Domain Detection Transformer Based on Spatial-Aware and Semantic-Aware Token Alignment

Cited by: 6
Authors
Deng, Jinhong [1]
Zhang, Xiaoyue [1]
Li, Wen [2]
Duan, Lixin [1]
Xu, Dong [1,3]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 611731, Peoples R China
[2] Univ Elect Sci & Technol China, Shenzhen Inst Adv Study, Shenzhen 518110, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Training; Object detection; Feature extraction; Task analysis; Semantics; Decoding; Detection transformer; domain adaptation; object detection;
DOI
10.1109/TMM.2023.3330524
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Detection transformers such as DETR (Carion et al., 2020) have recently shown promising performance on many object detection tasks, but their generalization ability is still quite limited in cross-domain adaptation scenarios. To address the cross-domain issue, a straightforward method is to perform token alignment with adversarial training in transformers. However, its performance is often unsatisfactory because the tokens in detection transformers are quite diverse and represent different spatial and semantic information. In this paper, we propose a new method for cross-domain detection transformers called spatial-aware and semantic-aware token alignment (SSTA). Specifically, we take advantage of the characteristics of cross-attention as used in the detection transformer and propose spatial-aware token alignment (SpaTA) and semantic-aware token alignment (SemTA) strategies to guide the token alignment across domains. For spatial-aware token alignment, we extract information from the cross-attention map (CAM) to align the distribution of tokens according to their attention to object queries. For semantic-aware token alignment, we inject category information into the cross-attention map and construct domain embeddings to guide the learning of a multi-class discriminator, which models the category relationship and achieves category-level token alignment during the entire adaptation process. We conduct extensive experiments on several widely used benchmarks, and the results clearly show the effectiveness of our proposed approach over existing state-of-the-art methods.
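The abstract says that spatial-aware token alignment uses the cross-attention map to align tokens according to their attention to object queries. The record does not give the actual formulation, so the following is only a minimal sketch of one possible reading: each encoder token gets a weight derived from the cross-attention map, and that weight scales the token's contribution to a domain-discriminator loss. The function names, the max-over-queries aggregation, and the binary cross-entropy form are all assumptions made for illustration, not the authors' implementation.

```python
import math


def token_weights_from_cam(cam):
    """Turn a cross-attention map into one weight per token.

    cam: list of rows, one row per object query; each row holds that
    query's attention values over the encoder tokens.
    Aggregating with max over queries is an assumption of this sketch.
    """
    num_tokens = len(cam[0])
    raw = [max(row[j] for row in cam) for j in range(num_tokens)]
    total = sum(raw)
    if total == 0:
        # No query attends to any token: fall back to uniform weights.
        return [1.0 / num_tokens] * num_tokens
    return [r / total for r in raw]


def weighted_domain_loss(domain_probs, domain_label, weights):
    """Attention-weighted binary cross-entropy for a token-level domain
    discriminator (hypothetical loss form).

    domain_probs: predicted probability, per token, that the token comes
                  from the target domain.
    domain_label: 0 for a source-domain image, 1 for a target-domain image.
    weights:      per-token weights, e.g. from token_weights_from_cam.
    """
    eps = 1e-12  # avoids log(0)
    loss = 0.0
    for p, w in zip(domain_probs, weights):
        p_true = p if domain_label == 1 else 1.0 - p
        loss += -w * math.log(max(p_true, eps))
    return loss


# Toy example: 2 object queries attending over 3 encoder tokens.
cam = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
weights = token_weights_from_cam(cam)
loss = weighted_domain_loss([0.5, 0.5, 0.5], domain_label=1, weights=weights)
```

In an adversarial setup of this kind, the discriminator would be trained to minimize such a loss while the feature extractor is trained to increase it, commonly through a gradient reversal layer. Tokens that no query attends to contribute little, which is the intuition behind weighting the alignment by attention.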
Pages: 5234-5245
Page count: 12
Related papers
69 in total
[1]   Exploring Object Relation in Mean Teacher for Cross-Domain Detection [J].
Cai, Qi ;
Pan, Yingwei ;
Ngo, Chong-Wah ;
Tian, Xinmei ;
Duan, Lingyu ;
Yao, Ting .
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, :11449-11458
[2]   Cascade R-CNN: Delving into High Quality Object Detection [J].
Cai, Zhaowei ;
Vasconcelos, Nuno .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :6154-6162
[3]   End-to-End Object Detection with Transformers [J].
Carion, Nicolas ;
Massa, Francisco ;
Synnaeve, Gabriel ;
Usunier, Nicolas ;
Kirillov, Alexander ;
Zagoruyko, Sergey .
COMPUTER VISION - ECCV 2020, PT I, 2020, 12346 :213-229
[4]   A Comprehensive Survey of Scene Graphs: Generation and Application [J].
Chang, Xiaojun ;
Ren, Pengzhen ;
Xu, Pengfei ;
Li, Zhihui ;
Chen, Xiaojiang ;
Hauptmann, Alex .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (01) :1-26
[5]   Harmonizing Transferability and Discriminability for Adapting Object Detectors [J].
Chen, Chaoqi ;
Zheng, Zebiao ;
Ding, Xinghao ;
Huang, Yue ;
Dou, Qi .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :8866-8875
[6]   Adaptive Convolution for Object Detection [J].
Chen, Chunlin ;
Ling, Qiang .
IEEE TRANSACTIONS ON MULTIMEDIA, 2019, 21 (12) :3205-3217
[7]   Conditional Context-Aware Feature Alignment for Domain Adaptive Detection Transformer [J].
Chen, Siyuan .
MULTIMEDIA MODELING (MMM 2022), PT I, 2022, 13141 :272-283
[8]   Domain Adaptive Faster R-CNN for Object Detection in the Wild [J].
Chen, Yuhua ;
Li, Wen ;
Sakaridis, Christos ;
Dai, Dengxin ;
Van Gool, Luc .
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, :3339-3348
[9]   Deformable Convolutional Networks [J].
Dai, Jifeng ;
Qi, Haozhi ;
Xiong, Yuwen ;
Li, Yi ;
Zhang, Guodong ;
Hu, Han ;
Wei, Yichen .
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, :764-773
[10]   Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848