Dynamic Pruning of Regions for Image-Sentence Matching

Cited by: 1
Authors
Wu, Jie [1]
Liu, Weifeng [2]
Wang, Leiquan [1]
Shen, Xiuxuan [1]
Wei, Yiwei [3]
Wu, Chunlei [1]
Affiliations
[1] China Univ Petr East China, Coll Comp Sci & Technol, Qingdao, Peoples R China
[2] China Univ Petr East China, Coll Control Sci & Engn, Qingdao, Peoples R China
[3] China Univ Petr Beijing Karamay, Sch Petr Engn, Karamay, Peoples R China
Keywords
Image-sentence matching; Cross-modal retrieval; Region pruning; Attention
DOI
10.1016/j.image.2023.117021
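Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract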
Image-sentence matching is becoming increasingly essential to the integrated understanding of vision and language. Prior approaches apply a pre-trained detection model to extract region features and explore fine-grained relationships between image and sentence by aggregating the similarities of all region-word pairs. However, every image is represented by the same number of regions regardless of its semantic complexity, so a large number of redundant regions interfere with semantic inference and add computational burden. To address this inflexible image representation and the resulting information redundancy, a novel method named Dynamic Pruning of Regions for Image-Sentence Matching (DPRM) is proposed to efficiently capture relationships between text and image. In particular, a dynamic region pruning module selects an appropriate number of regions according to the semantic complexity of each image, thus pruning redundant regions and reducing superfluous computation. Moreover, an inter-modality refinement module refines the fine-grained relationships of region-word pairs by retaining meaningful interaction features and suppressing interference from redundant alignments, thereby learning more accurate semantic correspondences. Extensive experiments on the MSCOCO and Flickr30K datasets demonstrate the superiority of DPRM over previous approaches.
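The abstract does not say how DPRM scores or selects regions, so the following is only an illustrative sketch of the general idea: keep a variable number of regions per image, then aggregate region-word similarities. The function names, the per-region `scores` input, the fixed `threshold`, and the max-over-regions, mean-over-words aggregation are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def prune_regions(regions, scores, threshold=0.5, min_keep=1):
    """Keep the regions whose score reaches the threshold (hypothetical rule).

    The number of kept regions therefore varies from image to image.
    If too few regions pass, the `min_keep` highest-scoring ones are kept.
    """
    order = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
    kept = [i for i in order if scores[i] >= threshold]
    if len(kept) < min_keep:
        kept = order[:min_keep]
    kept.sort()  # restore the original region order
    return [regions[i] for i in kept]


def image_sentence_similarity(regions, words):
    """For each word take its best-matching region, then average over words."""
    if not regions or not words:
        return 0.0
    return sum(max(cosine(r, w) for r in regions) for w in words) / len(words)
```

With this rule, an image whose regions mostly score low ends up with few regions, while a semantically busier image retains more, so the later region-word comparison runs over fewer pairs for simple images.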
Pages: 10