Dynamic Pruning of Regions for Image-Sentence Matching

Cited by: 1

Authors:

Wu, Jie [1]

Liu, Weifeng [2]

Wang, Leiquan [1]

Shen, Xiuxuan [1]

Wei, Yiwei [3]

Wu, Chunlei [1]

Affiliations:

[1] China Univ Petr East China, Coll Comp Sci & Technol, Qingdao, Peoples R China

[2] China Univ Petr East China, Coll Control Sci & Engn, Qingdao, Peoples R China

[3] China Univ Petr Beijing Karamay, Sch Petr Engn, Karamay, Peoples R China

Source:

SIGNAL PROCESSING-IMAGE COMMUNICATION | 2023 / Vol. 117

Keywords:

Image-sentence matching; Cross-modal retrieval; Region pruning; Attention

DOI:

10.1016/j.image.2023.117021

Chinese Library Classification (CLC) number:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]

Discipline classification codes:

0808; 0809

Abstract:

Image-sentence matching is becoming increasingly essential to the integrated understanding of vision and language. Prior approaches apply a pre-trained detection model to extract region features and explore fine-grained relationships between image and sentence by aggregating the similarities of all region-word pairs. However, every image is represented by the same number of regions, regardless of its semantic complexity, so a large number of redundant regions interfere with semantic inference and add computational burden. To address this lack of flexibility in image representation and the resulting information redundancy, a novel method named Dynamic Pruning of Regions for Image-Sentence Matching (DPRM) is proposed to efficiently capture relationships between text and image. In particular, a dynamic region pruning module selects an appropriate number of regions according to the semantic complexity of each image, thus pruning redundant regions and reducing superfluous computation. Moreover, an inter-modality refinement module refines the fine-grained relationships of region-word pairs by retaining meaningful interaction features and suppressing interference from redundant alignments, thereby learning more accurate semantic correspondences. Extensive experiments on the MSCOCO and Flickr30K datasets demonstrate the superiority of DPRM over previous approaches.

Pages: 10