ATTEND, CORRECT AND FOCUS: A BIDIRECTIONAL CORRECT ATTENTION NETWORK FOR IMAGE-TEXT MATCHING

Cited by: 4

Authors:

Liu, Yang [1]

Wang, Huaqiu [1]

Meng, Fanyang [2]

Liu, Mengyuan [3]

Liu, Hong [4]

Affiliations:

[1] Chongqing Univ Technol, Sch Artificial Intelligence, Chongqing, Peoples R China

[2] Peng Cheng Lab, Shenzhen, Peoples R China

[3] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou, Peoples R China

[4] Peking Univ, Shenzhen Grad Sch, Key Lab Machine Percept, Beijing, Peoples R China

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP) | 2021

Keywords:

Image-text matching; cross-modal retrieval; attention mechanism

DOI:

10.1109/ICIP42928.2021.9506438

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The image-text matching task aims to learn fine-grained correspondences between images and sentences. Existing methods use attention mechanisms to learn these correspondences by attending to all fragments without considering the relationship between fragments and global semantics, which inevitably leads to semantic misalignment among irrelevant fragments. To this end, we propose a Bidirectional Correct Attention Network (BCAN), which leverages global and local similarities to reassign the attention weights and avoid such semantic misalignment. Specifically, we introduce a global correct unit to correct the attention focused on relevant fragments in irrelevant semantics, and a local correct unit to correct the attention focused on irrelevant fragments in relevant semantics. Experiments on the Flickr30K and MSCOCO datasets verify the effectiveness of the proposed BCAN, which outperforms both previous attention-based methods and state-of-the-art methods.
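
The idea of reassigning attention weights from local and global similarities can be illustrated with a small sketch. This is not the BCAN formulation, which this record does not give; it is a minimal illustration under stated assumptions. The function name `corrected_attention`, the `temperature` and `threshold` parameters, the hard masking rule, and the non-negative global gate are all hypothetical choices made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def corrected_attention(local_sims, global_sim, temperature=9.0, threshold=0.0):
    """Illustrative attention reweighting (assumed rules, not the paper's).

    local_sims: similarity of one query fragment to each candidate fragment.
    global_sim: similarity between the whole image and the whole sentence.
    """
    # Plain attention: every fragment receives some weight.
    base = softmax([temperature * s for s in local_sims])

    # Assumed local correction: drop fragments whose local similarity is
    # not above the threshold, then renormalize the remaining weights.
    kept = [w if s > threshold else 0.0 for w, s in zip(base, local_sims)]
    total = sum(kept)
    if total == 0.0:
        return [0.0] * len(local_sims)
    local_corrected = [w / total for w in kept]

    # Assumed global correction: scale by a non-negative global gate so a
    # globally irrelevant image-sentence pair contributes little attention.
    gate = max(global_sim, 0.0)
    return [gate * w for w in local_corrected]


weights = corrected_attention([0.8, 0.1, -0.3], global_sim=0.5)
```

In this sketch the third fragment, whose local similarity is negative, receives zero weight, and the remaining weights sum to the global gate rather than to one, so a mismatched image-sentence pair is attended to less overall.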

Pages: 2673-2677

Page count: 5
