IMC-Det: Intra-Inter Modality Contrastive Learning for Video Object Detection

Cited by: 0
Authors
Qi, Qiang [1]
Qiu, Zhenyu [1]
Yan, Yan [1]
Lu, Yang [1]
Wang, Hanzi [1]
Affiliations
[1] Xiamen Univ, Sch Informat, Fujian Key Lab Sensing & Comp Smart City, Xiamen 361005, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video object detection; Intra-inter modality learning; Contrastive learning; Feature aggregation; AGGREGATION; NETWORK
DOI
10.1007/s11263-024-02201-9
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video object detection is an important yet challenging task in computer vision. One limitation of existing video object detection methods is that they exploit only the visual modality and, because of the large inter-modality discrepancy, ignore the semantic knowledge available in the textual modality, which limits detection performance. In this paper, we propose a novel intra-inter modality contrastive learning network for high-performance video object detection (IMC-Det), which makes three substantial improvements over existing methods. First, we design an intra-modality contrastive learning module that pulls similar features together while pushing dissimilar ones apart, enabling IMC-Det to learn more discriminative feature representations. Second, we develop a graph relational feature aggregation module that models the structural relations between features through cross-graph learning and residual graph convolution, which supports more effective feature aggregation in the spatio-temporal domain. Third, we present an inter-modality contrastive learning module that forces visual features of the same class to gather compactly around the corresponding textual semantic representation, giving IMC-Det better object classification capability. We conduct extensive experiments on the challenging ImageNet VID dataset, and the results show that IMC-Det performs favorably against existing state-of-the-art methods. Notably, IMC-Det achieves 85.5% mAP with ResNet-101 and 86.7% mAP with ResNeXt-101.
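The abstract gives no formula for the inter-modality module, so the following is only a minimal sketch of a generic InfoNCE-style loss that pulls each visual feature toward the text embedding of its own class and away from the other classes. It is not the authors' formulation: the function names, the cosine similarity, and the temperature `tau` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def inter_modality_loss(visual_feats, labels, text_embeds, tau=0.1):
    """Hypothetical visual-to-text contrastive loss (generic InfoNCE form).

    visual_feats: list of visual feature vectors, one per object proposal
    labels:       class index of each visual feature
    text_embeds:  one text embedding per class, indexed by class
    tau:          temperature (assumed value, not taken from the paper)
    """
    total = 0.0
    for v, y in zip(visual_feats, labels):
        logits = [cosine(v, t) / tau for t in text_embeds]
        # Numerically stable log-sum-exp over all classes.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the correct class.
        total += log_denom - logits[y]
    return total / len(visual_feats)


# Toy data: two classes with orthogonal text embeddings.
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[0.9, 0.1], [0.1, 0.9]]

aligned = inter_modality_loss(visual_feats, [0, 1], text_embeds)
misaligned = inter_modality_loss(visual_feats, [1, 0], text_embeds)
```

On this toy data the loss is small when each visual feature lies near the text embedding of its labelled class and large when the labels are swapped, which is the "gather compactly around the textual representation" behaviour the abstract describes.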
Pages: 890-909
Number of pages: 20