IMC-Det: Intra-Inter Modality Contrastive Learning for Video Object Detection

Cited by: 0

Authors:

Qi, Qiang [1]

Qiu, Zhenyu [1]

Yan, Yan [1]

Lu, Yang [1]

Wang, Hanzi [1]

Affiliations:

[1] Xiamen Univ, Sch Informat, Fujian Key Lab Sensing & Comp Smart City, Xiamen 361005, Peoples R China

Source:

INTERNATIONAL JOURNAL OF COMPUTER VISION | 2025 / Vol. 133 / Issue 02

Funding:

National Natural Science Foundation of China;

Keywords:

Video object detection; Intra-inter modality learning; Contrastive learning; Feature aggregation; AGGREGATION; NETWORK;

DOI:

10.1007/s11263-024-02201-9

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video object detection is an important yet challenging task in the computer vision field. One limitation of off-the-shelf video object detection methods is that they only explore information from the visual modality, without considering the semantic knowledge of the textual modality due to the large inter-modality discrepancies, resulting in limited detection performance. In this paper, we propose a novel intra-inter modality contrastive learning network for high-performance video object detection (IMC-Det), which includes three substantial improvements over existing methods. First, we design an intra-modality contrastive learning module to pull similar features close together while pushing dissimilar ones apart, enabling our IMC-Det to learn more discriminative feature representations. Second, we develop a graph relational feature aggregation module to effectively model the structural relations between features by leveraging cross-graph learning and residual graph convolution, which is conducive to performing more effective feature aggregation in the spatio-temporal domain. Third, we present an inter-modality contrastive learning module to enforce the visual features belonging to the same class to be compactly gathered around the corresponding textual semantic representations, endowing our IMC-Det with better object classification capability. We conduct extensive experiments on the challenging ImageNet VID dataset, and the experimental results demonstrate that our IMC-Det performs favorably against existing state-of-the-art methods. Notably, our IMC-Det achieves 85.5% mAP and 86.7% mAP with ResNet-101 and ResNeXt-101, respectively.


Pages: 890-909

Page count: 20
