Dynamic Pruning of Regions for Image-Sentence Matching

Cited by: 1
Authors
Wu, Jie [1]
Liu, Weifeng [2]
Wang, Leiquan [1]
Shen, Xiuxuan [1]
Wei, Yiwei [3]
Wu, Chunlei [1]
Affiliations
[1] China Univ Petr East China, Coll Comp Sci & Technol, Qingdao, Peoples R China
[2] China Univ Petr East China, Coll Control Sci & Engn, Qingdao, Peoples R China
[3] China Univ Petr Beijing Karamay, Sch Petr Engn, Karamay, Peoples R China
Keywords
Image-sentence matching; Cross-modal retrieval; Region pruning; Attention
DOI
10.1016/j.image.2023.117021
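Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline classification codes
0808; 0809
Abstract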
Image-sentence matching is becoming increasingly essential in the integrated understanding of vision and language. Prior approaches apply a pre-trained detection model to extract region features and explore fine-grained relationships between image and sentence by aggregating the similarities of all region-word pairs. However, all images are represented by the same number of regions, regardless of their semantic complexity, so a large number of redundant regions interfere with semantic inference and add computational burden. To address this inflexible image representation and information redundancy, a novel method named Dynamic Pruning of Regions for Image-Sentence Matching (DPRM) is proposed to efficiently capture relationships between text and image. In particular, a dynamic region pruning module selects an appropriate number of regions according to the semantic complexity of each image, thus pruning redundant regions and reducing superfluous computation. Moreover, an inter-modality refinement module refines the fine-grained relationships of region-word pairs by retaining meaningful interaction features and suppressing interference from redundant alignments, thereby learning more accurate semantic correspondences. Extensive experiments on the MSCOCO and Flickr30K datasets demonstrate the superiority of DPRM over previous approaches.
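As a rough illustration of the two ideas named in the abstract (aggregating region-word similarities, and keeping a per-image variable number of regions), the following is a minimal sketch and not the DPRM implementation. The function names, the relative-threshold pruning rule, the `ratio` parameter, and the max-over-regions then mean-over-words aggregation are all assumptions made for this example; in the paper the region selection is a dedicated pruning module whose details are not given in this record.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_regions(regions, scores, ratio=0.5):
    """Hypothetical pruning rule: keep regions whose (non-negative)
    importance score is at least `ratio` times the best score, so the
    number of kept regions varies from image to image."""
    cutoff = ratio * max(scores)
    return [r for r, s in zip(regions, scores) if s >= cutoff]

def image_sentence_similarity(regions, words):
    """Assumed aggregation: for each word take its best-matching region,
    then average those best similarities over all words."""
    return sum(max(cosine(r, w) for r in regions) for w in words) / len(words)

# Toy data: three 2-D region features with made-up importance scores,
# and two 2-D word features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
scores = [0.9, 0.8, 0.1]
words = [[1.0, 0.0], [0.0, 2.0]]

kept = prune_regions(regions, scores)        # the low-score third region is dropped
sim = image_sentence_similarity(kept, words)
```

In this toy case the low-scoring region is dropped before any region-word similarity is computed, which is the source of the computational saving the abstract describes.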
Pages: 10