Adversarial Representation Learning for Text-to-Image Matching

Cited by: 174
Authors
Sarafianos, Nikolaos [1 ]
Xu, Xiang [1 ]
Kakadiaris, Ioannis A. [1 ]
Affiliations
[1] Univ Houston, Computat Biomed Lab, Houston, TX 77004 USA
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
DOI
10.1109/ICCV.2019.00591
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
For many computer vision applications such as image captioning, visual question answering, and person search, learning discriminative feature representations at both the image and text levels is an essential yet challenging problem. Its challenges originate from the large word variance in the text domain as well as the difficulty of accurately measuring the distance between the features of the two modalities. Most prior work focuses on the latter challenge by introducing loss functions that help the network learn better feature representations, but fails to account for the complexity of the textual input. With that in mind, we introduce TIMAM: a Text-Image Modality Adversarial Matching approach that learns modality-invariant feature representations using adversarial and cross-modal matching objectives. In addition, we demonstrate that BERT, a publicly available language model that extracts word embeddings, can successfully be applied in the text-to-image matching domain. The proposed approach achieves state-of-the-art cross-modal matching performance on four widely used, publicly available datasets, resulting in absolute improvements ranging from 2% to 5% in terms of rank-1 accuracy.
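To make the two objectives named in the abstract concrete, below is a minimal PyTorch sketch of how an adversarial modality discriminator and a cross-modal matching loss could be combined. This is an illustration under assumptions, not the authors' TIMAM code: the layer sizes, the names ModalityDiscriminator, adversarial_losses, and matching_loss, and the InfoNCE-style matching term are hypothetical stand-ins, and img_feat / txt_feat are assumed to be paired batches of encoder outputs (the text side coming from, e.g., BERT embeddings).

```python
# Minimal sketch of adversarial modality matching (NOT the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityDiscriminator(nn.Module):
    """Classifies whether a feature vector came from the image or text encoder."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)  # raw modality logits

def adversarial_losses(img_feat, txt_feat, disc):
    """Discriminator loss (separate the modalities) and encoder loss (fool it).

    Training the encoders against the discriminator pushes the two feature
    distributions together, i.e. toward modality-invariant representations.
    """
    n = img_feat.size(0)
    img_lbl, txt_lbl = torch.ones(n), torch.zeros(n)
    # The discriminator sees detached features so gradients stop at the encoders.
    d_logits = disc(torch.cat([img_feat.detach(), txt_feat.detach()]))
    d_loss = F.binary_cross_entropy_with_logits(
        d_logits, torch.cat([img_lbl, txt_lbl]))
    # The encoders are updated with flipped labels (the usual GAN-style trick).
    g_logits = disc(torch.cat([img_feat, txt_feat]))
    g_loss = F.binary_cross_entropy_with_logits(
        g_logits, torch.cat([txt_lbl, img_lbl]))
    return d_loss, g_loss

def matching_loss(img_feat, txt_feat, temperature: float = 0.07):
    """An InfoNCE-style stand-in for the paper's cross-modal matching
    objectives: matched image-text pairs are pulled together, mismatched
    pairs in the batch are pushed apart."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    sim = img @ txt.t() / temperature        # (n, n) cosine-similarity matrix
    targets = torch.arange(img.size(0))      # i-th image matches i-th caption
    return 0.5 * (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets))

if __name__ == "__main__":
    disc = ModalityDiscriminator(dim=512)
    img, txt = torch.randn(8, 512), torch.randn(8, 512)  # dummy paired features
    d_loss, g_loss = adversarial_losses(img, txt, disc)
    m_loss = matching_loss(img, txt)
    print(d_loss.item(), g_loss.item(), m_loss.item())
```

A training loop would then alternate updates: step the discriminator on d_loss, and step the two encoders on g_loss plus the matching term, so that features from the two modalities become both aligned and indistinguishable.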
Pages: 5813 - 5823
Page count: 11
Related Papers
50 records in total
  • [41] Text-to-image synthesis with self-supervised learning
    Tan, Yong Xuan
    Lee, Chin Poo
    Neo, Mai
    Lim, Kian Ming
    PATTERN RECOGNITION LETTERS, 2022, 157 : 119 - 126
  • [42] Hybrid Attention Driven Text-to-Image Synthesis via Generative Adversarial Networks
    Cheng, Qingrong
    Gu, Xiaodong
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2019: WORKSHOP AND SPECIAL SESSIONS, 2019, 11731 : 483 - 495
  • [43] Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching
    Wei, Kaimin
    Zhou, Zhibo
IEEE ACCESS, 2020, 8 (08): 96237 - 96248
  • [44] TriMatch: Triple Matching for Text-to-Image Person Re-Identification
    Yan, Shuanglin
    Dong, Neng
    Li, Shuang
    Li, Huafeng
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 806 - 810
  • [45] Text-to-image synthesis based on modified deep convolutional generative adversarial network
Li, Y.
Zhu, M.
Ren, J.
Su, X.
Zhou, X.
Yu, H.
Beijing Hangkong Hangtian Daxue Xuebao/Journal of Beijing University of Aeronautics and Astronautics, 2023, 49 (08): 1875 - 1883
  • [46] Bi-Attention enhanced representation learning for image-text matching
    Tian, Yumin
    Ding, Aqiang
    Wang, Di
    Luo, Xuemei
    Wan, Bo
    Wang, Yifeng
    PATTERN RECOGNITION, 2023, 140
  • [47] Self-attention guided representation learning for image-text matching
    Qi, Xuefei
    Zhang, Ying
    Qi, Jinqing
    Lu, Huchuan
    NEUROCOMPUTING, 2021, 450 : 143 - 155
  • [48] RACE: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model
    Kim, Changhoon
    Min, Kyle
    Yang, Yezhou
    COMPUTER VISION - ECCV 2024, PT LXXXIII, 2025, 15141 : 461 - 478
  • [49] Multi-Semantic Fusion Generative Adversarial Network for Text-to-Image Generation
    Huang, Pingda
    Liu, Yedan
    Fu, Chunjiang
    Zhao, Liang
2023 IEEE 8TH INTERNATIONAL CONFERENCE ON BIG DATA ANALYTICS, ICBDA, 2023 : 159 - 164
  • [50] Text-to-Image Synthesis via Visual-Memory Creative Adversarial Network
    Zhang, Shengyu
    Dong, Hao
    Hu, Wei
    Guo, Yike
    Wu, Chao
    Xie, Di
    Wu, Fei
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT III, 2018, 11166 : 417 - 427