Explainability Enhanced Object Detection Transformer With Feature Disentanglement

Cited: 0
Authors
Yu, Wenlong [1 ,2 ]
Liu, Ruonan [1 ,2 ]
Chen, Dongyue [1 ,2 ]
Hu, Qinghua [1 ,2 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China
[2] Tianjin Univ, Tianjin Key Lab Machine Learning, Tianjin 300350, Peoples R China
Keywords
Feature extraction; Transformers; Object detection; Mathematical models; Computational modeling; Analytical models; Visualization; Semantics; Deep learning; Vectors; explainability; feature disentanglement; hybrid transformer model; object detection; representation; models
DOI
10.1109/TIP.2024.3492733
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Explainability is a pivotal factor in determining whether a deep learning model can be authorized for use in critical applications. To enhance the explainability of end-to-end DEtection TRansformer (DETR) models, we introduce a disentanglement method that constrains the feature learning process, following a divide-and-conquer decoupling paradigm similar to how people understand complex real-world problems. We first demonstrate that features are entangled between the extractor and the detector, and we find that the regression function is a key factor in the deterioration of disentangled feature activation. These highly entangled features tend to activate on local characteristics, making it difficult to cover the semantic information of an object and reducing the interpretability of single-backbone object detection models. We therefore propose an Explainability Enhanced object detection Transformer with feature Disentanglement (DETD) model, in which Tensor Singular Value Decomposition (T-SVD) is used to produce feature bases and a Batch-averaged Feature Spectral Penalization (BFSP) loss is introduced to constrain the disentanglement of the features and balance their semantic activation. The proposed method is applied across three prominent backbones, two DETR variants, and a CNN-based model. Combining two optimization techniques, extensive experiments on two datasets consistently show that DETD outperforms its counterparts in both object detection performance and feature disentanglement. Grad-CAM visualizations confirm the improved explainability of feature learning from the disentanglement perspective.
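The abstract's two core ingredients — t-SVD feature bases and a batch-averaged spectral penalty — can be sketched in a few lines. The sketch below is only an illustration of the general technique, not the paper's implementation: the standard t-SVD is computed by an FFT along the channel mode followed by a per-slice SVD, and the penalty shown here (`bfsp_loss`, a hypothetical name) encourages a flat singular-value spectrum via negative entropy; the exact BFSP formulation in the paper may differ.

```python
import numpy as np

def t_svd_spectrum(features):
    """t-SVD spectrum of a feature tensor of shape (h, w, c):
    FFT along the channel (tube) mode, then an SVD per frontal slice.
    Returns the singular values of each transformed slice."""
    f_hat = np.fft.fft(features, axis=2)  # transform along the third mode
    spectra = [np.linalg.svd(f_hat[:, :, k], compute_uv=False)
               for k in range(features.shape[2])]
    return np.stack(spectra)              # shape (c, min(h, w))

def bfsp_loss(batch_features, eps=1e-8):
    """Hypothetical batch-averaged spectral penalty: average the t-SVD
    spectra over the batch and channels, normalize to a distribution,
    and penalize a concentrated (entangled) spectrum via negative
    entropy, so minimizing the loss flattens/balances the spectrum."""
    spectra = np.stack([t_svd_spectrum(f) for f in batch_features])  # (b, c, r)
    mean_spec = spectra.mean(axis=(0, 1))                            # (r,)
    p = mean_spec / (mean_spec.sum() + eps)
    return float(np.sum(p * np.log(p + eps)))  # -H(p): lower = flatter spectrum
```

In a training loop this penalty would be added, with a weighting coefficient, to the detection losses so that gradient descent jointly optimizes detection accuracy and spectral balance of the backbone features.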
Pages: 6439-6454
Page count: 16