Adaptive Offline Quintuplet Loss for Image-Text Matching

Cited by: 44
Authors
Chen, Tianlang [1 ]
Deng, Jiajun [2 ]
Luo, Jiebo [1 ]
Affiliations
[1] Univ Rochester, Rochester, NY 14627 USA
[2] Univ Sci & Technol China, Hefei, Peoples R China
Source
COMPUTER VISION - ECCV 2020, PT XIII | 2020 / Vol. 12358
Keywords
Image-text matching; Triplet loss; Hard negative mining
DOI
10.1007/978-3-030-58601-0_33
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Existing image-text matching approaches typically train the model with a triplet loss using online hard negatives: for each image or text anchor in a training mini-batch, the model learns to distinguish the positive from the most confusing negative mined within the mini-batch (i.e., the online hard negative). This strategy improves the model's capacity to discover fine-grained correspondences and non-correspondences between image and text inputs. However, it has the following drawbacks: (1) the negative selection strategy provides limited opportunities for the model to learn from very hard-to-distinguish cases; (2) the trained model generalizes weakly from the training set to the testing set; (3) the penalty lacks hierarchy and adaptiveness for hard negatives of different "hardness" degrees. In this paper, we address these issues by sampling negatives offline from the whole training set, which yields "harder" offline negatives than online hard negatives for the model to distinguish. Based on the offline hard negatives, we propose a quintuplet loss that improves the model's ability to generalize the distinction between positives and negatives. In addition, we introduce a novel loss function that combines the knowledge of positives, offline hard negatives, and online hard negatives, leveraging the offline hard negatives as intermediaries to adaptively penalize negatives according to their distance relations to the anchor. We evaluate the proposed training approach on three state-of-the-art image-text matching models on the MS-COCO and Flickr30K datasets. Significant performance improvements are observed for all the models, demonstrating the effectiveness and generality of our approach. Code is available at https://github.com/sunnychencool/AOQ.
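To make the described objective concrete, below is a minimal, hypothetical PyTorch sketch of the ordering the abstract suggests: the positive should score above the offline hard negative, which in turn serves as an adaptive, anchor-specific reference point above the online in-batch hard negative. The function name `quintuplet_style_loss`, both margin values, and the hinge composition are illustrative assumptions, not the paper's exact formulation; the actual quintuplet construction and adaptive weighting are given in the paper and the linked repository.

```python
# Hypothetical sketch of a quintuplet-style ranking loss with offline and
# online hard negatives. Margins and composition are assumptions for
# illustration; see https://github.com/sunnychencool/AOQ for the real method.
import torch
import torch.nn.functional as F

def quintuplet_style_loss(sim_pos, sim_offline_neg, sim_online_neg,
                          margin_pos=0.2, margin_neg=0.1):
    """All inputs are anchor-candidate similarity scores of shape (B,).

    sim_pos:         similarity(anchor, ground-truth pair)
    sim_offline_neg: similarity(anchor, hard negative mined offline
                     from the whole training set)
    sim_online_neg:  similarity(anchor, hardest negative in the mini-batch)
    """
    # Hinge 1: the positive must beat the offline hard negative by margin_pos.
    loss_pos = F.relu(margin_pos + sim_offline_neg - sim_pos)
    # Hinge 2: the offline hard negative, being more confusing by
    # construction, acts as an intermediary -- each online in-batch hard
    # negative is pushed below it by margin_neg, so the penalty adapts to
    # how close the offline negative already sits to this anchor.
    loss_hier = F.relu(margin_neg + sim_online_neg - sim_offline_neg)
    return (loss_pos + loss_hier).mean()

# Example with made-up similarities for a batch of 4 anchors.
sim_pos = torch.tensor([0.80, 0.70, 0.90, 0.60])
sim_off = torch.tensor([0.60, 0.65, 0.50, 0.55])
sim_on  = torch.tensor([0.40, 0.50, 0.30, 0.45])
print(quintuplet_style_loss(sim_pos, sim_off, sim_on))
```

The design point worth noting is the second hinge: rather than a single fixed global margin, each online hard negative is penalized relative to the offline hard negative's own distance to the anchor, which is one plausible way to realize the "hierarchy and adaptiveness" the abstract calls for.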
Pages: 549-565
Page count: 17